[PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure

Posted by Nhat Pham 2 days, 8 hours ago
When we virtualize the swap space, we will manage swap cache at the
virtual swap layer. To prepare for this, decouple swap cache from
physical swap infrastructure.

We also remove all the swap cache related helpers of the swap table. The
rest of the swap table infrastructure is kept, and will later be
repurposed to serve as the rmap (physical -> virtual swap mapping).

Note that with this patch, we move to a single global lock to
synchronize swap cache accesses. This is temporary, as the swap cache
will be re-partitioned into (virtual) swap clusters once we move the
swap cache to the soon-to-be-introduced virtual swap layer.
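
For reference, the caller-side change is essentially the following
(condensed from the mm/shmem.c hunk below; the other call sites follow
the same pattern):

	/* before: lock the cluster backing the entries */
	ci = swap_cluster_get_and_lock_irq(old);
	__swap_cache_replace_folio(ci, old, new);
	swap_cluster_unlock_irq(ci);

	/* after: take the single global swap cache lock */
	swap_cache_lock_irq();
	__swap_cache_replace_folio(old, new);
	swap_cache_unlock_irq();

swap_cache_add_folio() also gains a gfp argument and now returns an
error (e.g. -ENOMEM when XArray node allocation fails), so callers are
updated to handle failure.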

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 Documentation/mm/swap-table.rst |  69 -----------
 mm/huge_memory.c                |  11 +-
 mm/migrate.c                    |  13 +-
 mm/shmem.c                      |   7 +-
 mm/swap.h                       |  26 ++--
 mm/swap_state.c                 | 205 +++++++++++++++++---------------
 mm/swap_table.h                 |  78 +-----------
 mm/swapfile.c                   |  43 ++-----
 mm/vmscan.c                     |   9 +-
 9 files changed, 158 insertions(+), 303 deletions(-)
 delete mode 100644 Documentation/mm/swap-table.rst

diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
deleted file mode 100644
index da10bb7a0dc37..0000000000000
--- a/Documentation/mm/swap-table.rst
+++ /dev/null
@@ -1,69 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
-
-==========
-Swap Table
-==========
-
-Swap table implements swap cache as a per-cluster swap cache value array.
-
-Swap Entry
-----------
-
-A swap entry contains the information required to serve the anonymous page
-fault.
-
-Swap entry is encoded as two parts: swap type and swap offset.
-
-The swap type indicates which swap device to use.
-The swap offset is the offset of the swap file to read the page data from.
-
-Swap Cache
-----------
-
-Swap cache is a map to look up folios using swap entry as the key. The result
-value can have three possible types depending on which stage of this swap entry
-was in.
-
-1. NULL: This swap entry is not used.
-
-2. folio: A folio has been allocated and bound to this swap entry. This is
-   the transient state of swap out or swap in. The folio data can be in
-   the folio or swap file, or both.
-
-3. shadow: The shadow contains the working set information of the swapped
-   out folio. This is the normal state for a swapped out page.
-
-Swap Table Internals
---------------------
-
-The previous swap cache is implemented by XArray. The XArray is a tree
-structure. Each lookup will go through multiple nodes. Can we do better?
-
-Notice that most of the time when we look up the swap cache, we are either
-in a swap in or swap out path. We should already have the swap cluster,
-which contains the swap entry.
-
-If we have a per-cluster array to store swap cache value in the cluster.
-Swap cache lookup within the cluster can be a very simple array lookup.
-
-We give such a per-cluster swap cache value array a name: the swap table.
-
-A swap table is an array of pointers. Each pointer is the same size as a
-PTE. The size of a swap table for one swap cluster typically matches a PTE
-page table, which is one page on modern 64-bit systems.
-
-With swap table, swap cache lookup can achieve great locality, simpler,
-and faster.
-
-Locking
--------
-
-Swap table modification requires taking the cluster lock. If a folio
-is being added to or removed from the swap table, the folio must be
-locked prior to the cluster lock. After adding or removing is done, the
-folio shall be unlocked.
-
-Swap table lookup is protected by RCU and atomic read. If the lookup
-returns a folio, the user must lock the folio before use.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40cf59301c21a..21215ac870144 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3783,7 +3783,6 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	ds_queue = folio_split_queue_lock(folio);
 	if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {
-		struct swap_cluster_info *ci = NULL;
 		struct lruvec *lruvec;
 
 		if (old_order > 1) {
@@ -3826,7 +3825,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 				return -EINVAL;
 			}
 
-			ci = swap_cluster_get_and_lock(folio);
+			swap_cache_lock();
 		}
 
 		/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
@@ -3862,8 +3861,8 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 			 * Anonymous folio with swap cache.
 			 * NOTE: shmem in swap cache is not supported yet.
 			 */
-			if (ci) {
-				__swap_cache_replace_folio(ci, folio, new_folio);
+			if (folio_test_swapcache(folio)) {
+				__swap_cache_replace_folio(folio, new_folio);
 				continue;
 			}
 
@@ -3901,8 +3900,8 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 		if (do_lru)
 			unlock_page_lruvec(lruvec);
 
-		if (ci)
-			swap_cluster_unlock(ci);
+		if (folio_test_swapcache(folio))
+			swap_cache_unlock();
 	} else {
 		split_queue_unlock(ds_queue);
 		return -EAGAIN;
diff --git a/mm/migrate.c b/mm/migrate.c
index 4688b9e38cd2f..11d9b43dff5d8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -571,7 +571,6 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 		struct folio *newfolio, struct folio *folio, int expected_count)
 {
 	XA_STATE(xas, &mapping->i_pages, folio->index);
-	struct swap_cluster_info *ci = NULL;
 	struct zone *oldzone, *newzone;
 	int dirty;
 	long nr = folio_nr_pages(folio);
@@ -601,13 +600,13 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	newzone = folio_zone(newfolio);
 
 	if (folio_test_swapcache(folio))
-		ci = swap_cluster_get_and_lock_irq(folio);
+		swap_cache_lock_irq();
 	else
 		xas_lock_irq(&xas);
 
 	if (!folio_ref_freeze(folio, expected_count)) {
-		if (ci)
-			swap_cluster_unlock_irq(ci);
+		if (folio_test_swapcache(folio))
+			swap_cache_unlock_irq();
 		else
 			xas_unlock_irq(&xas);
 		return -EAGAIN;
@@ -640,7 +639,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	}
 
 	if (folio_test_swapcache(folio))
-		__swap_cache_replace_folio(ci, folio, newfolio);
+		__swap_cache_replace_folio(folio, newfolio);
 	else
 		xas_store(&xas, newfolio);
 
@@ -652,8 +651,8 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	folio_ref_unfreeze(folio, expected_count - nr);
 
 	/* Leave irq disabled to prevent preemption while updating stats */
-	if (ci)
-		swap_cluster_unlock(ci);
+	if (folio_test_swapcache(folio))
+		swap_cache_unlock();
 	else
 		xas_unlock(&xas);
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 79af5f9f8b908..1db97ef2d14eb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2133,7 +2133,6 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 				struct shmem_inode_info *info, pgoff_t index,
 				struct vm_area_struct *vma)
 {
-	struct swap_cluster_info *ci;
 	struct folio *new, *old = *foliop;
 	swp_entry_t entry = old->swap;
 	int nr_pages = folio_nr_pages(old);
@@ -2166,12 +2165,12 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	new->swap = entry;
 	folio_set_swapcache(new);
 
-	ci = swap_cluster_get_and_lock_irq(old);
-	__swap_cache_replace_folio(ci, old, new);
+	swap_cache_lock_irq();
+	__swap_cache_replace_folio(old, new);
 	mem_cgroup_replace_folio(old, new);
 	shmem_update_stats(new, nr_pages);
 	shmem_update_stats(old, -nr_pages);
-	swap_cluster_unlock_irq(ci);
+	swap_cache_unlock_irq();
 
 	folio_add_lru(new);
 	*foliop = new;
diff --git a/mm/swap.h b/mm/swap.h
index 1bd466da30393..8726b587a5b5d 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -199,6 +199,11 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
 
 /* linux/mm/swap_state.c */
 extern struct address_space swap_space __read_mostly;
+void swap_cache_lock_irq(void);
+void swap_cache_unlock_irq(void);
+void swap_cache_lock(void);
+void swap_cache_unlock(void);
+
 static inline struct address_space *swap_address_space(swp_entry_t entry)
 {
 	return &swap_space;
@@ -247,14 +252,12 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry);
 void *swap_cache_get_shadow(swp_entry_t entry);
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadow);
 void swap_cache_del_folio(struct folio *folio);
-/* Below helpers require the caller to lock and pass in the swap cluster. */
-void __swap_cache_del_folio(struct swap_cluster_info *ci,
-			    struct folio *folio, swp_entry_t entry, void *shadow);
-void __swap_cache_replace_folio(struct swap_cluster_info *ci,
-				struct folio *old, struct folio *new);
-void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
+/* Below helpers require the caller to lock the swap cache. */
+void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow);
+void __swap_cache_replace_folio(struct folio *old, struct folio *new);
+void swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
 
 void show_swap_cache_info(void);
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
@@ -411,21 +414,20 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
 	return NULL;
 }
 
-static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow)
+static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadow)
 {
+	return 0;
 }
 
 static inline void swap_cache_del_folio(struct folio *folio)
 {
 }
 
-static inline void __swap_cache_del_folio(struct swap_cluster_info *ci,
-		struct folio *folio, swp_entry_t entry, void *shadow)
+static inline void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow)
 {
 }
 
-static inline void __swap_cache_replace_folio(struct swap_cluster_info *ci,
-		struct folio *old, struct folio *new)
+static inline void __swap_cache_replace_folio(struct folio *old, struct folio *new)
 {
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 44d228982521e..34c9d9b243a74 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -22,8 +22,8 @@
 #include <linux/vmalloc.h>
 #include <linux/huge_mm.h>
 #include <linux/shmem_fs.h>
+#include <linux/xarray.h>
 #include "internal.h"
-#include "swap_table.h"
 #include "swap.h"
 
 /*
@@ -41,6 +41,28 @@ struct address_space swap_space __read_mostly = {
 	.a_ops = &swap_aops,
 };
 
+static DEFINE_XARRAY(swap_cache);
+
+void swap_cache_lock_irq(void)
+{
+	xa_lock_irq(&swap_cache);
+}
+
+void swap_cache_unlock_irq(void)
+{
+	xa_unlock_irq(&swap_cache);
+}
+
+void swap_cache_lock(void)
+{
+	xa_lock(&swap_cache);
+}
+
+void swap_cache_unlock(void)
+{
+	xa_unlock(&swap_cache);
+}
+
 static bool enable_vma_readahead __read_mostly = true;
 
 #define SWAP_RA_ORDER_CEILING	5
@@ -86,17 +108,22 @@ void show_swap_cache_info(void)
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry)
 {
-	unsigned long swp_tb;
+	void *entry_val;
 	struct folio *folio;
 
 	for (;;) {
-		swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
-					swp_cluster_offset(entry));
-		if (!swp_tb_is_folio(swp_tb))
+		rcu_read_lock();
+		entry_val = xa_load(&swap_cache, entry.val);
+		if (!entry_val || xa_is_value(entry_val)) {
+			rcu_read_unlock();
 			return NULL;
-		folio = swp_tb_to_folio(swp_tb);
-		if (likely(folio_try_get(folio)))
+		}
+		folio = entry_val;
+		if (likely(folio_try_get(folio))) {
+			rcu_read_unlock();
 			return folio;
+		}
+		rcu_read_unlock();
 	}
 
 	return NULL;
@@ -112,12 +139,14 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
  */
 void *swap_cache_get_shadow(swp_entry_t entry)
 {
-	unsigned long swp_tb;
+	void *entry_val;
+
+	rcu_read_lock();
+	entry_val = xa_load(&swap_cache, entry.val);
+	rcu_read_unlock();
 
-	swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
-				swp_cluster_offset(entry));
-	if (swp_tb_is_shadow(swp_tb))
-		return swp_tb_to_shadow(swp_tb);
+	if (xa_is_value(entry_val))
+		return entry_val;
 	return NULL;
 }
 
@@ -132,46 +161,58 @@ void *swap_cache_get_shadow(swp_entry_t entry)
  * with reference count or locks.
  * The caller also needs to update the corresponding swap_map slots with
  * SWAP_HAS_CACHE bit to avoid race or conflict.
+ *
+ * Return: 0 on success, negative error code on failure.
  */
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadowp)
 {
-	void *shadow = NULL;
-	unsigned long old_tb, new_tb;
-	struct swap_cluster_info *ci;
-	unsigned int ci_start, ci_off, ci_end;
+	XA_STATE_ORDER(xas, &swap_cache, entry.val, folio_order(folio));
 	unsigned long nr_pages = folio_nr_pages(folio);
+	unsigned long i;
+	void *old;
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
 
-	new_tb = folio_to_swp_tb(folio);
-	ci_start = swp_cluster_offset(entry);
-	ci_end = ci_start + nr_pages;
-	ci_off = ci_start;
-	ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
-	do {
-		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
-		WARN_ON_ONCE(swp_tb_is_folio(old_tb));
-		if (swp_tb_is_shadow(old_tb))
-			shadow = swp_tb_to_shadow(old_tb);
-	} while (++ci_off < ci_end);
-
 	folio_ref_add(folio, nr_pages);
 	folio_set_swapcache(folio);
 	folio->swap = entry;
-	swap_cluster_unlock(ci);
 
-	node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
-	lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
+	do {
+		xas_lock_irq(&xas);
+		xas_create_range(&xas);
+		if (xas_error(&xas))
+			goto unlock;
+		for (i = 0; i < nr_pages; i++) {
+			VM_BUG_ON_FOLIO(xas.xa_index != entry.val + i, folio);
+			old = xas_load(&xas);
+			if (old && !xa_is_value(old)) {
+				VM_WARN_ON_ONCE_FOLIO(1, folio);
+				xas_set_err(&xas, -EEXIST);
+				goto unlock;
+			}
+			if (shadowp && xa_is_value(old) && !*shadowp)
+				*shadowp = old;
+			xas_store(&xas, folio);
+			xas_next(&xas);
+		}
+		node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+		lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
+unlock:
+		xas_unlock_irq(&xas);
+	} while (xas_nomem(&xas, gfp));
 
-	if (shadowp)
-		*shadowp = shadow;
+	if (!xas_error(&xas))
+		return 0;
+
+	folio_clear_swapcache(folio);
+	folio_ref_sub(folio, nr_pages);
+	return xas_error(&xas);
 }
 
 /**
  * __swap_cache_del_folio - Removes a folio from the swap cache.
- * @ci: The locked swap cluster.
  * @folio: The folio.
  * @entry: The first swap entry that the folio corresponds to.
  * @shadow: shadow value to be filled in the swap cache.
@@ -180,30 +221,23 @@ void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp
  * This won't put the folio's refcount. The caller has to do that.
  *
  * Context: Caller must ensure the folio is locked and in the swap cache
- * using the index of @entry, and lock the cluster that holds the entries.
+ * using the index of @entry, and lock the swap cache xarray.
  */
-void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
-			    swp_entry_t entry, void *shadow)
+void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow)
 {
-	unsigned long old_tb, new_tb;
-	unsigned int ci_start, ci_off, ci_end;
-	unsigned long nr_pages = folio_nr_pages(folio);
+	long nr_pages = folio_nr_pages(folio);
+	XA_STATE(xas, &swap_cache, entry.val);
+	int i;
 
-	VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) != ci);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
 
-	new_tb = shadow_swp_to_tb(shadow);
-	ci_start = swp_cluster_offset(entry);
-	ci_end = ci_start + nr_pages;
-	ci_off = ci_start;
-	do {
-		/* If shadow is NULL, we sets an empty shadow */
-		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
-		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) ||
-			     swp_tb_to_folio(old_tb) != folio);
-	} while (++ci_off < ci_end);
+	for (i = 0; i < nr_pages; i++) {
+		void *old = xas_store(&xas, shadow);
+		VM_WARN_ON_FOLIO(old != folio, folio);
+		xas_next(&xas);
+	}
 
 	folio->swap.val = 0;
 	folio_clear_swapcache(folio);
@@ -223,12 +257,11 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
  */
 void swap_cache_del_folio(struct folio *folio)
 {
-	struct swap_cluster_info *ci;
 	swp_entry_t entry = folio->swap;
 
-	ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
-	__swap_cache_del_folio(ci, folio, entry, NULL);
-	swap_cluster_unlock(ci);
+	xa_lock_irq(&swap_cache);
+	__swap_cache_del_folio(folio, entry, NULL);
+	xa_unlock_irq(&swap_cache);
 
 	put_swap_folio(folio, entry);
 	folio_ref_sub(folio, folio_nr_pages(folio));
@@ -236,7 +269,6 @@ void swap_cache_del_folio(struct folio *folio)
 
 /**
  * __swap_cache_replace_folio - Replace a folio in the swap cache.
- * @ci: The locked swap cluster.
  * @old: The old folio to be replaced.
  * @new: The new folio.
  *
@@ -246,39 +278,23 @@ void swap_cache_del_folio(struct folio *folio)
  * the starting offset to override all slots covered by the new folio.
  *
  * Context: Caller must ensure both folios are locked, and lock the
- * cluster that holds the old folio to be replaced.
+ * swap cache xarray.
  */
-void __swap_cache_replace_folio(struct swap_cluster_info *ci,
-				struct folio *old, struct folio *new)
+void __swap_cache_replace_folio(struct folio *old, struct folio *new)
 {
 	swp_entry_t entry = new->swap;
 	unsigned long nr_pages = folio_nr_pages(new);
-	unsigned int ci_off = swp_cluster_offset(entry);
-	unsigned int ci_end = ci_off + nr_pages;
-	unsigned long old_tb, new_tb;
+	XA_STATE(xas, &swap_cache, entry.val);
+	int i;
 
 	VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
 	VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
 	VM_WARN_ON_ONCE(!entry.val);
 
-	/* Swap cache still stores N entries instead of a high-order entry */
-	new_tb = folio_to_swp_tb(new);
-	do {
-		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
-		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != old);
-	} while (++ci_off < ci_end);
-
-	/*
-	 * If the old folio is partially replaced (e.g., splitting a large
-	 * folio, the old folio is shrunk, and new split sub folios replace
-	 * the shrunk part), ensure the new folio doesn't overlap it.
-	 */
-	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
-	    folio_order(old) != folio_order(new)) {
-		ci_off = swp_cluster_offset(old->swap);
-		ci_end = ci_off + folio_nr_pages(old);
-		while (ci_off++ < ci_end)
-			WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
+	for (i = 0; i < nr_pages; i++) {
+		void *old_entry = xas_store(&xas, new);
+		WARN_ON_ONCE(!old_entry || xa_is_value(old_entry) || old_entry != old);
+		xas_next(&xas);
 	}
 }
 
@@ -287,20 +303,20 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
  * @entry: The starting index entry.
  * @nr_ents: How many slots need to be cleared.
  *
- * Context: Caller must ensure the range is valid, all in one single cluster,
- * not occupied by any folio, and lock the cluster.
+ * Context: Caller must ensure the range is valid and all in one single cluster,
+ * not occupied by any folio.
  */
-void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
+void swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
 {
-	struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
-	unsigned int ci_off = swp_cluster_offset(entry), ci_end;
-	unsigned long old;
+	XA_STATE(xas, &swap_cache, entry.val);
+	int i;
 
-	ci_end = ci_off + nr_ents;
-	do {
-		old = __swap_table_xchg(ci, ci_off, null_to_swp_tb());
-		WARN_ON_ONCE(swp_tb_is_folio(old));
-	} while (++ci_off < ci_end);
+	xas_lock(&xas);
+	for (i = 0; i < nr_ents; i++) {
+		xas_store(&xas, NULL);
+		xas_next(&xas);
+	}
+	xas_unlock(&xas);
 }
 
 /*
@@ -480,7 +496,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
 		goto fail_unlock;
 
-	swap_cache_add_folio(new_folio, entry, &shadow);
+	/* May fail (-ENOMEM) if XArray node allocation failed. */
+	if (swap_cache_add_folio(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
+		goto fail_unlock;
+
 	memcg1_swapin(entry, 1);
 
 	if (shadow)
diff --git a/mm/swap_table.h b/mm/swap_table.h
index ea244a57a5b7a..ad2cb2ef46903 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -13,71 +13,6 @@ struct swap_table {
 
 #define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
 
-/*
- * A swap table entry represents the status of a swap slot on a swap
- * (physical or virtual) device. The swap table in each cluster is a
- * 1:1 map of the swap slots in this cluster.
- *
- * Each swap table entry could be a pointer (folio), a XA_VALUE
- * (shadow), or NULL.
- */
-
-/*
- * Helpers for casting one type of info into a swap table entry.
- */
-static inline unsigned long null_to_swp_tb(void)
-{
-	BUILD_BUG_ON(sizeof(unsigned long) != sizeof(atomic_long_t));
-	return 0;
-}
-
-static inline unsigned long folio_to_swp_tb(struct folio *folio)
-{
-	BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
-	return (unsigned long)folio;
-}
-
-static inline unsigned long shadow_swp_to_tb(void *shadow)
-{
-	BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
-		     BITS_PER_BYTE * sizeof(unsigned long));
-	VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
-	return (unsigned long)shadow;
-}
-
-/*
- * Helpers for swap table entry type checking.
- */
-static inline bool swp_tb_is_null(unsigned long swp_tb)
-{
-	return !swp_tb;
-}
-
-static inline bool swp_tb_is_folio(unsigned long swp_tb)
-{
-	return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
-}
-
-static inline bool swp_tb_is_shadow(unsigned long swp_tb)
-{
-	return xa_is_value((void *)swp_tb);
-}
-
-/*
- * Helpers for retrieving info from swap table.
- */
-static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
-{
-	VM_WARN_ON(!swp_tb_is_folio(swp_tb));
-	return (void *)swp_tb;
-}
-
-static inline void *swp_tb_to_shadow(unsigned long swp_tb)
-{
-	VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
-	return (void *)swp_tb;
-}
-
 /*
  * Helpers for accessing or modifying the swap table of a cluster,
  * the swap cluster must be locked.
@@ -92,17 +27,6 @@ static inline void __swap_table_set(struct swap_cluster_info *ci,
 	atomic_long_set(&table[off], swp_tb);
 }
 
-static inline unsigned long __swap_table_xchg(struct swap_cluster_info *ci,
-					      unsigned int off, unsigned long swp_tb)
-{
-	atomic_long_t *table = rcu_dereference_protected(ci->table, true);
-
-	lockdep_assert_held(&ci->lock);
-	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
-	/* Ordering is guaranteed by cluster lock, relax */
-	return atomic_long_xchg_relaxed(&table[off], swp_tb);
-}
-
 static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
 					     unsigned int off)
 {
@@ -122,7 +46,7 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
 
 	rcu_read_lock();
 	table = rcu_dereference(ci->table);
-	swp_tb = table ? atomic_long_read(&table[off]) : null_to_swp_tb();
+	swp_tb = table ? atomic_long_read(&table[off]) : 0;
 	rcu_read_unlock();
 
 	return swp_tb;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 46d2008e4b996..cacfafa9a540d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -474,7 +474,7 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
 	lockdep_assert_held(&ci->lock);
 	VM_WARN_ON_ONCE(!cluster_is_empty(ci));
 	for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++)
-		VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off)));
+		VM_WARN_ON_ONCE(__swap_table_get(ci, ci_off));
 	table = (void *)rcu_dereference_protected(ci->table, true);
 	rcu_assign_pointer(ci->table, NULL);
 
@@ -843,26 +843,6 @@ static bool cluster_scan_range(struct swap_info_struct *si,
 	return true;
 }
 
-/*
- * Currently, the swap table is not used for count tracking, just
- * do a sanity check here to ensure nothing leaked, so the swap
- * table should be empty upon freeing.
- */
-static void swap_cluster_assert_table_empty(struct swap_cluster_info *ci,
-				unsigned int start, unsigned int nr)
-{
-	unsigned int ci_off = start % SWAPFILE_CLUSTER;
-	unsigned int ci_end = ci_off + nr;
-	unsigned long swp_tb;
-
-	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
-		do {
-			swp_tb = __swap_table_get(ci, ci_off);
-			VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
-		} while (++ci_off < ci_end);
-	}
-}
-
 static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
 				unsigned int start, unsigned char usage,
 				unsigned int order)
@@ -882,7 +862,6 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 		ci->order = order;
 
 	memset(si->swap_map + start, usage, nr_pages);
-	swap_cluster_assert_table_empty(ci, start, nr_pages);
 	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
 
@@ -1275,7 +1254,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 			swap_slot_free_notify(si->bdev, offset);
 		offset++;
 	}
-	__swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries);
+	swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries);
 
 	/*
 	 * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
@@ -1423,6 +1402,7 @@ int folio_alloc_swap(struct folio *folio)
 	unsigned int order = folio_order(folio);
 	unsigned int size = 1 << order;
 	swp_entry_t entry = {};
+	int err;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
@@ -1457,19 +1437,23 @@ int folio_alloc_swap(struct folio *folio)
 	}
 
 	/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
-	if (mem_cgroup_try_charge_swap(folio, entry))
+	if (mem_cgroup_try_charge_swap(folio, entry)) {
+		err = -ENOMEM;
 		goto out_free;
+	}
 
 	if (!entry.val)
 		return -ENOMEM;
 
-	swap_cache_add_folio(folio, entry, NULL);
+	err = swap_cache_add_folio(folio, entry, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN, NULL);
+	if (err)
+		goto out_free;
 
 	return 0;
 
 out_free:
 	put_swap_folio(folio, entry);
-	return -ENOMEM;
+	return err;
 }
 
 static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
@@ -1729,7 +1713,6 @@ static void swap_entries_free(struct swap_info_struct *si,
 
 	mem_cgroup_uncharge_swap(entry, nr_pages);
 	swap_range_free(si, offset, nr_pages);
-	swap_cluster_assert_table_empty(ci, offset, nr_pages);
 
 	if (!ci->count)
 		free_cluster(si, ci);
@@ -4057,9 +4040,9 @@ static int __init swapfile_init(void)
 	swapfile_maximum_size = arch_max_swapfile_size();
 
 	/*
-	 * Once a cluster is freed, it's swap table content is read
-	 * only, and all swap cache readers (swap_cache_*) verifies
-	 * the content before use. So it's safe to use RCU slab here.
+	 * Once a cluster is freed, it's swap table content is read only, and
+	 * all swap table readers verify the content before use. So it's safe to
+	 * use RCU slab here.
 	 */
 	if (!SWP_TABLE_USE_PAGE)
 		swap_table_cachep = kmem_cache_create("swap_table",
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 614ccf39fe3fa..558ff7f413786 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -707,13 +707,12 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 {
 	int refcount;
 	void *shadow = NULL;
-	struct swap_cluster_info *ci;
 
 	BUG_ON(!folio_test_locked(folio));
 	BUG_ON(mapping != folio_mapping(folio));
 
 	if (folio_test_swapcache(folio)) {
-		ci = swap_cluster_get_and_lock_irq(folio);
+		swap_cache_lock_irq();
 	} else {
 		spin_lock(&mapping->host->i_lock);
 		xa_lock_irq(&mapping->i_pages);
@@ -758,9 +757,9 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(folio, target_memcg);
-		__swap_cache_del_folio(ci, folio, swap, shadow);
+		__swap_cache_del_folio(folio, swap, shadow);
 		memcg1_swapout(folio, swap);
-		swap_cluster_unlock_irq(ci);
+		swap_cache_unlock_irq();
 		put_swap_folio(folio, swap);
 	} else {
 		void (*free_folio)(struct folio *);
@@ -799,7 +798,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 
 cannot_free:
 	if (folio_test_swapcache(folio)) {
-		swap_cluster_unlock_irq(ci);
+		swap_cache_unlock_irq();
 	} else {
 		xa_unlock_irq(&mapping->i_pages);
 		spin_unlock(&mapping->host->i_lock);
-- 
2.47.3
[PATCH v3 00/20] Virtual Swap Space
Posted by Nhat Pham 2 days, 8 hours ago
My sincerest apologies - it seems like the cover letter (and just the
cover letter) failed to be sent out, for some reason. I'm trying to
figure out what happened - it works when I send the entire patch series
to myself...

Anyway, resending this (in-reply-to patch 1 of the series):

Changelog:
* RFC v2 -> v3:
    * Implement a cluster-based allocation algorithm for virtual swap
      slots, inspired by Kairui Song and Chris Li's implementation, as
      well as Johannes Weiner's suggestions. This eliminates the lock
      contention issues on the virtual swap layer.
    * Re-use the swap table for the reverse mapping.
    * Remove CONFIG_VIRTUAL_SWAP.
    * Reduce the size of the swap descriptor from 48 bytes to 24
      bytes, i.e. another 50% reduction in memory overhead from v2.
    * Remove the swap cache and zswap tree and use the swap descriptor
      for this.
    * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
      (one for allocated slots, and one for bad slots).
    * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449).
    * Update the cover letter to include new benchmark results and a
      discussion of overhead in various cases.
* RFC v1 -> RFC v2:
    * Use a single atomic type (swap_refs) for reference counting
      purpose. This brings the size of the swap descriptor from 64 B
      down to 48 B (25% reduction). Suggested by Yosry Ahmed.
    * Zeromap bitmap is removed in the virtual swap implementation.
      This saves one bit per physical swapfile slot.
    * Rearrange the patches and the code change to make things more
      reviewable. Suggested by Johannes Weiner.
    * Update the cover letter a bit.

This patch series implements the virtual swap space idea, based on Yosry's
proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
inputs from Johannes Weiner. The same idea (with different
implementation details) has been floated by Rik van Riel since at least
2011 (see [8]).

This patch series is based on 6.19. There are a couple more
swap-related changes in the mm-stable branch that I would need to
coordinate with, but I would like to send this out as an update, to show
that the lock contention issues that plagued earlier versions have been
resolved and performance on the kernel build benchmark is now on-par with
baseline. Furthermore, memory overhead has been substantially reduced
compared to the last RFC version.


I. Motivation

Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index to swap data structures, such
as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
backing slot in this way is performant and efficient when swap is purely
disk space and swapoff is rare.
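
(For illustration, this is roughly what the current encoding looks like;
a paraphrase of the include/linux/swapops.h helpers, not code from this
series:)

	swp_entry_t entry = swp_entry(type, offset); /* which device, which slot  */
	pte_t swp_pte = swp_entry_to_pte(entry);     /* what the page table stores */
	/* entry.val also keys the swap cache, swap_map, swap cgroup records, ... */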

However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is arguably the central shortcoming of
zswap:
* In deployments where no disk space can be afforded for swap (such as
  mobile and embedded devices), users cannot adopt zswap, and are forced
  to use zram. This is confusing for users, and creates extra burdens
  for developers, having to develop and maintain similar features for
  two separate swap backends (writeback, cgroup charging, THP support,
  etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
  we have swapfiles on the order of tens to hundreds of GBs, which are
  mostly unused and only exist to enable zswap usage and zero-filled
  pages swap optimizations.
* Tying zswap (and more generally, other in-memory swap backends) to
  the current physical swapfile infrastructure makes zswap implicitly
  statically sized. This does not make sense, as unlike disk swap, in
  which we consume a limited resource (disk space or swapfile space) to
  save another resource (memory), zswap consumes the same resource it is
  saving (memory). The more we zswap, the more memory we have available,
  not less. We are not rationing a limited resource when we limit
  the size of the zswap pool, but rather we are capping the resource
  (memory) saving potential of zswap. Under memory pressure, using
  more zswap is almost always better than the alternative (disk IOs, or
  even worse, OOMs), and dynamically sizing the zswap pool on demand
  allows the system to flexibly respond to these precarious scenarios.
* Operationally, statically provisioning the swapfile for zswap poses
  significant challenges, because the sysadmin has to prescribe how
  much swap is needed a priori, for each combination of
  (memory size x disk space x workload usage). It is even more
  complicated when we take into account the variance of memory
  compression, which changes the reclaim dynamics (and as a result,
  swap space size requirement). The problem is further exacerbated for
  users who rely on swap utilization (and exhaustion) as an OOM signal.

  All of these factors make it very difficult to configure the swapfile
  for zswap: too small of a swapfile and we risk preventable OOMs and
  limit the memory saving potentials of zswap; too big of a swapfile
  and we waste disk space and memory due to swap metadata overhead.
  This dilemma becomes more drastic in high memory systems, which can
  have up to TBs worth of memory.

Past attempts to decouple disk and compressed swap backends, namely the
ghost swapfile approach (see [13]), as well as the alternative
compressed swap backend zram, have mainly focused on eliminating the
disk space usage of compressed backends. We want a solution that not
only tackles that same problem, but also makes the swap space dynamic,
maximizing its memory saving potential while reducing operational and
static memory overhead.

Finally, any swap redesign should support efficient backend transfer,
i.e. without having to perform the expensive page table walk to
update all the PTEs that refer to the swap entry:
* The main motivation for this requirement is zswap writeback. To quote
  Johannes (from [14]): "Combining compression with disk swap is
  extremely powerful, because it dramatically reduces the worst aspects
  of both: it reduces the memory footprint of compression by shedding
  the coldest data to disk; it reduces the IO latencies and flash wear
  of disk swap through the writeback cache. In practice, this reduces
  *average event rates of the entire reclaim/paging/IO stack*."
* Another motivation is to simplify swapoff, which is both complicated
  and expensive in the current design, precisely because we are storing
  an encoding of the backend positional information in the page table,
  and thus requires a full page table walk to remove these references.


II. High Level Design Overview

To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
“virtualize” the swap space: swap clients will work with a dynamically
allocated virtual swap slot, storing it in page table entries, and
using it to index into various swap-related data structures. The
backing storage is decoupled from the virtual swap slot, and the newly
introduced layer will “resolve” the virtual swap slot to the actual
storage. This layer also manages other metadata of the swap entry, such
as its lifetime information (swap count), via a dynamically allocated,
per-swap-entry descriptor:

struct swp_desc {
        union {
                swp_slot_t         slot;                 /*     0     8 */
                struct zswap_entry * zswap_entry;        /*     0     8 */
        };                                               /*     0     8 */
        union {
                struct folio *     swap_cache;           /*     8     8 */
                void *             shadow;               /*     8     8 */
        };                                               /*     8     8 */
        unsigned int               swap_count;           /*    16     4 */
        unsigned short             memcgid:16;           /*    20: 0  2 */
        bool                       in_swapcache:1;       /*    22: 0  1 */

        /* Bitfield combined with previous fields */

        enum swap_type             type:2;               /*    20:17  4 */

        /* size: 24, cachelines: 1, members: 6 */
        /* bit_padding: 13 bits */
        /* last cacheline: 24 bytes */
};

(output from pahole).

This design allows us to:
* Decouple zswap (and zeromapped swap entry) from backing swapfile:
  simply associate the virtual swap slot with one of the supported
  backends: a zswap entry, a zero-filled swap page, a slot on the
  swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and
  have the virtual swap slot point to the page instead of the on-disk
  physical swap slot. No need to perform any page table walking.
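
To make the indirection concrete, below is a toy, self-contained model
of the lookup (the struct above is the real descriptor; the names and
enum values in this sketch are purely illustrative):

#include <stdio.h>

enum vswap_backing { VSWAP_ZSWAP, VSWAP_PHYS, VSWAP_ZERO, VSWAP_FOLIO };

struct vswap_desc_model {
	enum vswap_backing type;
	union {
		unsigned long phys_slot;  /* slot on a swapfile           */
		void *zswap_entry;        /* compressed object in memory  */
		void *folio;              /* still-resident page          */
	};
};

/* The page table stores only an index into a table of descriptors;
 * swapin resolves that index to whatever backend currently holds the
 * data, so changing backends never requires a page table walk. */
static const char *resolve(const struct vswap_desc_model *d)
{
	switch (d->type) {
	case VSWAP_ZSWAP: return "zswap entry (no disk slot consumed)";
	case VSWAP_PHYS:  return "physical swapfile slot";
	case VSWAP_ZERO:  return "zero-filled page (no backing at all)";
	case VSWAP_FOLIO: return "in-memory folio (e.g. during swapoff)";
	}
	return "unknown";
}

int main(void)
{
	struct vswap_desc_model d = { .type = VSWAP_ZSWAP };
	printf("backing: %s\n", resolve(&d));
	return 0;
}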

The size of the virtual swap descriptor is 24 bytes. Note that this is
not all "new" overhead, as the swap descriptor will replace:
* the swap_cgroup arrays (one per swap type) in the old design, which
  is a massive source of static memory overhead. With the new design,
  it is only allocated for used clusters.
* the swap tables, which holds the swap cache and workingset shadows.
* the zeromap bitmap, which is a bitmap of physical swap slots to
  indicate whether the swapped out page is zero-filled or not.
* a huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
  one for allocated slots, and one for bad slots, representing 3 possible
  states of a slot on the swapfile: allocated, free, and bad.
* the zswap tree.

So, in terms of additional memory overhead:
* For zswap entries, the added memory overhead is rather minimal. The
  new indirection pointer neatly replaces the existing zswap tree.
  We really only incur less than one word of overhead for swap count
  blow up (since we no longer use swap continuation) and the swap type.
* For physical swap entries, the new design will impose fewer than 3 words
  of memory overhead. However, as noted above, this overhead is only for
  actively used swap entries, whereas in the current design the overhead is
  static (including the swap cgroup array for example).

  The primary victim of this overhead will be zram users. However, as
  zswap now no longer takes up disk space, zram users can consider
  switching to zswap (which, as a bonus, has a lot of useful features
  out of the box, such as cgroup tracking, dynamic zswap pool sizing,
  LRU-ordering writeback, etc.).

For a more concrete example, suppose we have a 32 GB swapfile (i.e.
8,388,608 swap entries), and we use zswap.

0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 0.00 MB

25% usage, or 2,097,152 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 48.25 MB

50% usage, or 4,194,304 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 96.50 MB

75% usage, or 6,291,456 entries:
* Old design total overhead: 121.00 MB
* Vswap total overhead: 144.75 MB

100% usage, or 8,388,608 entries:
* Old design total overhead: 153.00 MB
* Vswap total overhead: 193.00 MB

So even in the worst case scenario for virtual swap, i.e when we
somehow have an oracle to correctly size the swapfile for zswap
pool to 32 GB, the added overhead is only 40 MB, which is a mere
0.12% of the total swapfile :)
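
(Sanity-checking these figures with my own arithmetic: the dominant
vswap term is simply 24 B per in-use entry, e.g. 2,097,152 x 24 B = 48 MB
at 25% usage and 8,388,608 x 24 B = 192 MB at 100%, with the small
remainder coming from the other vswap structures. The old design's
25 MB floor at 0% usage matches its statically allocated per-slot
metadata: roughly 3 B (swap_map byte plus swap_cgroup record) and one
zeromap bit for each of the 8,388,608 slots.)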

In practice, the overhead will be closer to the 50-75% usage case, as
systems tend to leave swap headroom for pathological events or sudden
spikes in memory requirements. The added overhead in these cases is
practically negligible. And in deployments where swapfiles for zswap
are previously sparsely used, switching over to virtual swap will
actually reduce memory overhead.

Doing the same math for the disk swap, which is the worst case for
virtual swap in terms of swap backends:

0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 2.00 MB

25% usage, or 2,097,152 entries:
* Old design total overhead: 41.00 MB
* Vswap total overhead: 66.25 MB

50% usage, or 4,194,304 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 130.50 MB

75% usage, or 6,291,456 entries:
* Old design total overhead: 73.00 MB
* Vswap total overhead: 194.75 MB

100% usage, or 8,388,608 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 259.00 MB

The added overhead is 170 MB, which is about 0.5% of the total swapfile size,
again in the worst case when we have a sizing oracle.

Please see the attached patches for more implementation details.


III. Usage and Benchmarking

This patch series introduces no new syscalls or userspace APIs. Existing
userspace setups will work as-is, except we no longer have to create a
swapfile or set memory.swap.max if we want to use zswap, as zswap is no
longer tied to physical swap. The zswap pool will be automatically and
dynamically sized based on memory usage and reclaim dynamics.

To measure the performance of the new implementation, I have run the
following benchmarks:

1. Kernel building: 52 workers (one per processor), memory.max = 3G.

Using zswap as the backend:

Baseline:
real: mean: 185.2s, stdev: 0.93s
sys: mean: 683.7s, stdev: 33.77s

Vswap:
real: mean: 184.88s, stdev: 0.57s
sys: mean: 675.14s, stdev: 32.8s

We actually see a slight improvement in systime (by 1.5%) :) This is
likely because we no longer have to perform swap charging for zswap
entries, and the virtual swap allocator is simpler than that of physical
swap.

Using SSD swap as the backend:

Baseline:
real: mean: 200.3s, stdev: 2.33s
sys: mean: 489.88s, stdev: 9.62s

Vswap:
real: mean: 201.47s, stdev: 2.98s
sys: mean: 487.36s, stdev: 5.53s

The performance is neck and neck.


IV. Future Use Cases

While the patch series focus on two applications (decoupling swap
backends and swapoff optimization/simplification), this new,
future-proof design also allows us to implement new swap features more
easily and efficiently:

* Multi-tier swapping (as mentioned in [5]), with transparent
  transferring (promotion/demotion) of pages across tiers (see [8] and
  [9]). Similar to swapoff, with the old design we would need to
  perform the expensive page table walk.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
  Huang in [6]).
* Mixed backing THP swapin (see [7]): Once you have pinned down the
  backing store of THPs, then you can dispatch each range of subpages
  to the appropriate backend swapin handler.
* Swapping a folio out with discontiguous physical swap slots
  (see [10]).
* Zswap writeback optimization: The current architecture pre-reserves
  physical swap space for pages when they enter the zswap pool, giving
  the kernel no flexibility at writeback time. With the virtual swap
  implementation, the backends are decoupled, and physical swap space
  is allocated on-demand at writeback time, at which point we can make
  much smarter decisions: we can batch multiple zswap writeback
  operations into a single IO request, allocating contiguous physical
  swap slots for that request. We can even perform compressed writeback
  (i.e writing these pages without decompressing them) (see [12]).


V. References

[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
[10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
[11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
[12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
[13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
[14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/

Nhat Pham (20):
  mm/swap: decouple swap cache from physical swap infrastructure
  swap: rearrange the swap header file
  mm: swap: add an abstract API for locking out swapoff
  zswap: add new helpers for zswap entry operations
  mm/swap: add a new function to check if a swap entry is in swap
    cached.
  mm: swap: add a separate type for physical swap slots
  mm: create scaffolds for the new virtual swap implementation
  zswap: prepare zswap for swap virtualization
  mm: swap: allocate a virtual swap slot for each swapped out page
  swap: move swap cache to virtual swap descriptor
  zswap: move zswap entry management to the virtual swap descriptor
  swap: implement the swap_cgroup API using virtual swap
  swap: manage swap entry lifecycle at the virtual swap layer
  mm: swap: decouple virtual swap slot from backing store
  zswap: do not start zswap shrinker if there is no physical swap slots
  swap: do not unnecesarily pin readahead swap entries
  swapfile: remove zeromap bitmap
  memcg: swap: only charge physical swap slots
  swap: simplify swapoff using virtual swap
  swapfile: replace the swap map with bitmaps

 Documentation/mm/swap-table.rst |   69 --
 MAINTAINERS                     |    2 +
 include/linux/cpuhotplug.h      |    1 +
 include/linux/mm_types.h        |   16 +
 include/linux/shmem_fs.h        |    7 +-
 include/linux/swap.h            |  135 ++-
 include/linux/swap_cgroup.h     |   13 -
 include/linux/swapops.h         |   25 +
 include/linux/zswap.h           |   17 +-
 kernel/power/swap.c             |    6 +-
 mm/Makefile                     |    5 +-
 mm/huge_memory.c                |   11 +-
 mm/internal.h                   |   12 +-
 mm/memcontrol-v1.c              |    6 +
 mm/memcontrol.c                 |  142 ++-
 mm/memory.c                     |  101 +-
 mm/migrate.c                    |   13 +-
 mm/mincore.c                    |   15 +-
 mm/page_io.c                    |   83 +-
 mm/shmem.c                      |  215 +---
 mm/swap.h                       |  157 +--
 mm/swap_cgroup.c                |  172 ---
 mm/swap_state.c                 |  306 +----
 mm/swap_table.h                 |   78 +-
 mm/swapfile.c                   | 1518 ++++-------------------
 mm/userfaultfd.c                |   18 +-
 mm/vmscan.c                     |   28 +-
 mm/vswap.c                      | 2025 +++++++++++++++++++++++++++++++
 mm/zswap.c                      |  142 +--
 29 files changed, 2853 insertions(+), 2485 deletions(-)
 delete mode 100644 Documentation/mm/swap-table.rst
 delete mode 100644 mm/swap_cgroup.c
 create mode 100644 mm/vswap.c


base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
-- 
2.47.3
Re: [PATCH v3 00/20] Virtual Swap Space
Posted by Chris Li 1 day, 18 hours ago
On Sun, Feb 8, 2026 at 4:15 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> My sincerest apologies - it seems like the cover letter (and just the
> cover letter) fails to be sent out, for some reason. I'm trying to figure
> out what happened - it works when I send the entire patch series to
> myself...
>
> Anyway, resending this (in-reply-to patch 1 of the series):

For the record, I did receive your original V3 cover letter from the
linux-mm mailing list.

> Changelog:
> * RFC v2 -> v3:
>     * Implement a cluster-based allocation algorithm for virtual swap
>       slots, inspired by Kairui Song and Chris Li's implementation, as
>       well as Johannes Weiner's suggestions. This eliminates the lock
>           contention issues on the virtual swap layer.
>     * Re-use swap table for the reverse mapping.
>     * Remove CONFIG_VIRTUAL_SWAP.
>     * Reducing the size of the swap descriptor from 48 bytes to 24

Is the per swap slot entry overhead 24 bytes in your implementation?
The current swap overhead is 3 static + 8 dynamic; your 24 dynamic is a
big jump. You can argue that 8 -> 24 is not a big jump, but it is an
unnecessary price compared to the alternative, which is 8 dynamic +
4 (optional redirect).

>       bytes, i.e another 50% reduction in memory overhead from v2.
>     * Remove swap cache and zswap tree and use the swap descriptor
>       for this.
>     * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
>       (one for allocated slots, and one for bad slots).
>     * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)

My git log shows 7d0a66e4bb9081d75c82ec4957c50034cb0ea449 is tag "v6.18".

>         * Update cover letter to include new benchmark results and discussion
>           on overhead in various cases.
> * RFC v1 -> RFC v2:
>     * Use a single atomic type (swap_refs) for reference counting
>       purpose. This brings the size of the swap descriptor from 64 B
>       down to 48 B (25% reduction). Suggested by Yosry Ahmed.
>     * Zeromap bitmap is removed in the virtual swap implementation.
>       This saves one bit per phyiscal swapfile slot.
>     * Rearrange the patches and the code change to make things more
>       reviewable. Suggested by Johannes Weiner.
>     * Update the cover letter a bit.
>
> This patch series implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner. The same idea (with different
> implementation details) has been floated by Rik van Riel since at least
> 2011 (see [8]).
>
> This patch series is based on 6.19. There are a couple more
> swap-related changes in the mm-stable branch that I would need to
> coordinate with, but I would like to send this out as an update, to show

Ah, you need to mention that in the first line to Andrew. Spell out
that this series is not for Andrew to consume in the MM tree. It can't
be anyway, because it does not apply to mm-unstable or mm-stable.

BTW, I have the following compile error with this series (fedora 43).
The same config compiles fine on v6.19.

In file included from ./include/linux/local_lock.h:5,
                 from ./include/linux/mmzone.h:24,
                 from ./include/linux/gfp.h:7,
                 from ./include/linux/mm.h:7,
                 from mm/vswap.c:7:
mm/vswap.c: In function ‘vswap_cpu_dead’:
./include/linux/percpu-defs.h:221:45: error: initialization from pointer to non-enclosed address space
  221 |         const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL;    \
      |                                             ^
./include/linux/local_lock_internal.h:105:40: note: in definition of macro ‘__local_lock_acquire’
  105 |                 __l = (local_lock_t *)(lock);                            \
      |                                        ^~~~
./include/linux/local_lock.h:17:41: note: in expansion of macro ‘__local_lock’
   17 | #define local_lock(lock)                __local_lock(this_cpu_ptr(lock))
      |                                         ^~~~~~~~~~~~
./include/linux/percpu-defs.h:245:9: note: in expansion of macro ‘__verify_pcpu_ptr’
  245 |         __verify_pcpu_ptr(ptr);                                          \
      |         ^~~~~~~~~~~~~~~~~
./include/linux/percpu-defs.h:256:27: note: in expansion of macro ‘raw_cpu_ptr’
  256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
      |                           ^~~~~~~~~~~
./include/linux/local_lock.h:17:54: note: in expansion of macro ‘this_cpu_ptr’
   17 | #define local_lock(lock)                __local_lock(this_cpu_ptr(lock))
      |                                                      ^~~~~~~~~~~~
mm/vswap.c:1518:9: note: in expansion of macro ‘local_lock’
 1518 |         local_lock(&percpu_cluster->lock);
      |         ^~~~~~~~~~

> that the lock contention issues that plagued earlier versions have been
> resolved and performance on the kernel build benchmark is now on-par with
> baseline. Furthermore, memory overhead has been substantially reduced
> compared to the last RFC version.
>
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> just disk space, and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is the arguably central shortcoming of
> zswap:
> * In deployments when no disk space can be afforded for swap (such as
>   mobile and embedded devices), users cannot adopt zswap, and are forced
>   to use zram. This is confusing for users, and creates extra burdens
>   for developers, having to develop and maintain similar features for
>   two separate swap backends (writeback, cgroup charging, THP support,
>   etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
>   we have swapfile in the order of tens to hundreds of GBs, which are
>   mostly unused and only exist to enable zswap usage and zero-filled
>   pages swap optimizations.
> * Tying zswap (and more generally, other in-memory swap backends) to
>   the current physical swapfile infrastructure makes zswap implicitly
>   statically sized. This does not make sense, as unlike disk swap, in
>   which we consume a limited resource (disk space or swapfile space) to
>   save another resource (memory), zswap consume the same resource it is
>   saving (memory). The more we zswap, the more memory we have available,
>   not less. We are not rationing a limited resource when we limit
>   the size of he zswap pool, but rather we are capping the resource
>   (memory) saving potential of zswap. Under memory pressure, using
>   more zswap is almost always better than the alternative (disk IOs, or
>   even worse, OOMs), and dynamically sizing the zswap pool on demand
>   allows the system to flexibly respond to these precarious scenarios.
> * Operationally, static provisioning the swapfile for zswap pose
>   significant challenges, because the sysadmin has to prescribe how
>   much swap is needed a priori, for each combination of
>   (memory size x disk space x workload usage). It is even more
>   complicated when we take into account the variance of memory
>   compression, which changes the reclaim dynamics (and as a result,
>   swap space size requirement). The problem is further exarcebated for
>   users who rely on swap utilization (and exhaustion) as an OOM signal.
>
>   All of these factors make it very difficult to configure the swapfile
>   for zswap: too small of a swapfile and we risk preventable OOMs and
>   limit the memory saving potential of zswap; too big of a swapfile
>   and we waste disk space and memory due to swap metadata overhead.
>   This dilemma becomes more drastic in high memory systems, which can
>   have up to TBs worth of memory.
>
> Past attempts to decouple disk and compressed swap backends, namely the
> ghost swapfile approach (see [13]), as well as the alternative
> compressed swap backend zram, have mainly focused on eliminating the
> disk space usage of compressed backends. We want a solution that not
> only tackles that same problem, but also achieves the dynamicization of
> swap space to maximize the memory saving potential while reducing
> operational and static memory overhead.
>
> Finally, any swap redesign should support efficient backend transfer,
> i.e. without having to perform the expensive page table walk to
> update all the PTEs that refer to the swap entry:
> * The main motivation for this requirement is zswap writeback. To quote
>   Johannes (from [14]): "Combining compression with disk swap is
>   extremely powerful, because it dramatically reduces the worst aspects
>   of both: it reduces the memory footprint of compression by shedding
>   the coldest data to disk; it reduces the IO latencies and flash wear
>   of disk swap through the writeback cache. In practice, this reduces
>   *average event rates of the entire reclaim/paging/IO stack*."
> * Another motivation is to simplify swapoff, which is both complicated
>   and expensive in the current design, precisely because we are storing
>   an encoding of the backend positional information in the page table,
>   and thus requires a full page table walk to remove these references.
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a dynamically
> allocated virtual swap slot, storing it in page table entries, and
> using it to index into various swap-related data structures. The
> backing storage is decoupled from the virtual swap slot, and the newly
> introduced layer will “resolve” the virtual swap slot to the actual
> storage. This layer also manages other metadata of the swap entry, such
> as its lifetime information (swap count), via a dynamically allocated,
> per-swap-entry descriptor:
>
> struct swp_desc {
>         union {
>                 swp_slot_t         slot;                 /*     0     8 */
>                 struct zswap_entry * zswap_entry;        /*     0     8 */
>         };                                               /*     0     8 */
>         union {
>                 struct folio *     swap_cache;           /*     8     8 */
>                 void *             shadow;               /*     8     8 */
>         };                                               /*     8     8 */
>         unsigned int               swap_count;           /*    16     4 */
>         unsigned short             memcgid:16;           /*    20: 0  2 */
>         bool                       in_swapcache:1;       /*    22: 0  1 */
>
>         /* Bitfield combined with previous fields */
>
>         enum swap_type             type:2;               /*    20:17  4 */
>
>         /* size: 24, cachelines: 1, members: 6 */
>         /* bit_padding: 13 bits */
>         /* last cacheline: 24 bytes */
> };
>
> (output from pahole).
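
(Aside, to make the indirection concrete for other readers: a minimal
sketch of a lookup through this descriptor. vswap_desc() and
SWAP_TYPE_PHYSICAL are made-up names for illustration only, not the
actual API of this series, and locking/refcounting is omitted.)

/* Sketch only: resolve a virtual swap entry to a disk slot, if any. */
static bool vswap_to_disk_slot(swp_entry_t entry, swp_slot_t *slot)
{
	struct swp_desc *desc = vswap_desc(entry);	/* assumed helper */

	/* desc->type says which union member is currently live */
	if (desc->type != SWAP_TYPE_PHYSICAL)		/* assumed enum value */
		return false;	/* zswap, zero-filled, or in-memory page */

	*slot = desc->slot;
	return true;
}
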
>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entry) from backing swapfile:
>   simply associate the virtual swap slot with one of the supported
>   backends: a zswap entry, a zero-filled swap page, a slot on the
>   swapfile, or an in-memory page.
> * Simplify and optimize swapoff: we only have to fault the page in and
>   have the virtual swap slot point to the page instead of the on-disk
>   physical swap slot. No need to perform any page table walking (see
>   the sketch below).
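
(Similarly, a sketch of what "no page table walking" means in practice
for a backend transfer such as zswap writeback or swapoff: only the
descriptor is updated, while the PTEs keep holding the same virtual
slot. The names are assumptions again, and locking is omitted.)

/* Sketch only: retarget an entry from zswap to a physical disk slot. */
static void vswap_backend_to_disk(struct swp_desc *desc, swp_slot_t slot)
{
	/* caller holds the descriptor locked; currently backed by zswap */
	desc->slot = slot;			/* union: replaces zswap_entry */
	desc->type = SWAP_TYPE_PHYSICAL;	/* assumed enum value */
}
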
>
> The size of the virtual swap descriptor is 24 bytes. Note that this is
> not all "new" overhead, as the swap descriptor will replace:
> * the swap_cgroup arrays (one per swap type) in the old design, which
>   is a massive source of static memory overhead. With the new design,
>   it is only allocated for used clusters.
> * the swap tables, which hold the swap cache and workingset shadows.
> * the zeromap bitmap, which is a bitmap of physical swap slots to
>   indicate whether the swapped out page is zero-filled or not.
> * a huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
>   one for allocated slots and one for bad slots, representing the 3 possible
>   states of a slot on the swapfile: allocated, free, and bad (see the
>   sketch after this list).
> * the zswap tree.
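
(A tiny sketch of how the two bitmaps encode the three slot states; the
bitmap parameters are placeholders, test_bit() is the generic bitops
helper:)

/* Sketch only: a slot with neither bit set is free. */
static inline bool swap_slot_is_free(const unsigned long *allocated,
				     const unsigned long *bad,
				     unsigned int off)
{
	return !test_bit(off, allocated) && !test_bit(off, bad);
}
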
>
> So, in terms of additional memory overhead:
> * For zswap entries, the added memory overhead is rather minimal. The
>   new indirection pointer neatly replaces the existing zswap tree.
>   We really only incur less than one word of overhead for swap count
>   blow up (since we no longer use swap continuation) and the swap type.
> * For physical swap entries, the new design will impose fewer than 3 words
>   of memory overhead. However, as noted above, this overhead is only for
>   actively used swap entries, whereas in the current design the overhead is
>   static (including the swap cgroup array for example).
>
>   The primary victim of this overhead will be zram users. However, as
>   zswap now no longer takes up disk space, zram users can consider
>   switching to zswap (which, as a bonus, has a lot of useful features
>   out of the box, such as cgroup tracking, dynamic zswap pool sizing,
>   LRU-ordering writeback, etc.).
>
> For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> 8,388,608 swap entries), and we use zswap.
>
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 0.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 48.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 96.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 121.00 MB
> * Vswap total overhead: 144.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 153.00 MB
> * Vswap total overhead: 193.00 MB
>
> So even in the worst case scenario for virtual swap, i.e when we
> somehow have an oracle to correctly size the swapfile for zswap
> pool to 32 GB, the added overhead is only 40 MB, which is a mere
> 0.12% of the total swapfile :)
>
> In practice, the overhead will be closer to the 50-75% usage case, as
> systems tend to leave swap headroom for pathological events or sudden
> spikes in memory requirements. The added overhead in these cases is
> practically negligible. And in deployments where swapfiles for zswap
> were previously sparsely used, switching over to virtual swap will
> actually reduce memory overhead.
>
> Doing the same math for the disk swap, which is the worst case for
> virtual swap in terms of swap backends:
>
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 2.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 41.00 MB
> * Vswap total overhead: 66.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 130.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 73.00 MB
> * Vswap total overhead: 194.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 259.00 MB
>
> The added overhead is 170MB, which is 0.5% of the total swapfile size,
> again in the worst case when we have a sizing oracle.
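
(For readers who want to sanity-check the tables: the static and
per-entry costs implied by the numbers above work out to roughly
25 MB + 16 B/entry (old, zswap), 25 MB + 8 B/entry (old, disk),
24.125 B/entry (vswap, zswap) and 2 MB + 32.125 B/entry (vswap, disk).
These figures are inferred from the tables themselves, not from the
code; a throwaway userspace check:)

#include <stdio.h>

int main(void)
{
	const double MB = 1 << 20;
	const long long entries[] = { 0, 2097152, 4194304, 6291456, 8388608 };

	for (int i = 0; i < 5; i++) {
		long long n = entries[i];

		printf("%lld entries: old-zswap %.2f  vswap-zswap %.2f  "
		       "old-disk %.2f  vswap-disk %.2f (MB)\n",
		       n,
		       25.0 + n * 16.0 / MB,	/* 25 MB static + 16 B per entry     */
		       n * 24.125 / MB,		/* 24.125 B per entry, no static part */
		       25.0 + n * 8.0 / MB,	/* 25 MB static + 8 B per entry      */
		       2.0 + n * 32.125 / MB);	/* 2 MB static + 32.125 B per entry  */
	}
	return 0;
}
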
>
> Please see the attached patches for more implementation details.
>
>
> III. Usage and Benchmarking
>
> This patch series introduces no new syscalls or userspace API. Existing
> userspace setups will work as-is, except we no longer have to create a
> swapfile or set memory.swap.max if we want to use zswap, as zswap is no
> longer tied to physical swap. The zswap pool will be automatically and
> dynamically sized based on memory usage and reclaim dynamics.
>
> To measure the performance of the new implementation, I have run the
> following benchmarks:
>
> 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
>
> Using zswap as the backend:
>
> Baseline:
> real: mean: 185.2s, stdev: 0.93s
> sys: mean: 683.7s, stdev: 33.77s
>
> Vswap:
> real: mean: 184.88s, stdev: 0.57s
> sys: mean: 675.14s, stdev: 32.8s

Can you show your user space time as well to complete the picture?

How many runs do you have for stdev 32.8s?

>
> We actually see a slight improvement in systime (by 1.5%) :) This is
> likely because we no longer have to perform swap charging for zswap
> entries, and the virtual swap allocator is simpler than that of physical
> swap.
>
> Using SSD swap as the backend:
Please include zram swap test data as well. Android heavily uses zram
for swapping.

>
> Baseline:
> real: mean: 200.3s, stdev: 2.33s
> sys: mean: 489.88s, stdev: 9.62s
>
> Vswap:
> real: mean: 201.47s, stdev: 2.98s
> sys: mean: 487.36s, stdev: 5.53s
>
> The performance is neck and neck.

I strongly suspect there is some performance difference that hasn't
been covered by your test yet. Need more confirmation by others on the
performance measurement. Swap testing is tricky. You want to push
to stress barely within the OOM limit. Need more data.

Chris

>
>
> IV. Future Use Cases
>
> While the patch series focuses on two applications (decoupling swap
> backends and swapoff optimization/simplification), this new,
> future-proof design also allows us to implement new swap features more
> easily and efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
>   transferring (promotion/demotion) of pages across tiers (see [8] and
>   [9]). Similar to swapoff, with the old design we would need to
>   perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
>   Huang in [6]).
> * Mixed backing THP swapin (see [7]): Once you have pinned down the
>   backing store of THPs, you can dispatch each range of subpages
>   to the appropriate backend swapin handler.
> * Swapping a folio out with discontiguous physical swap slots
>   (see [10]).
> * Zswap writeback optimization: The current architecture pre-reserves
>   physical swap space for pages when they enter the zswap pool, giving
>   the kernel no flexibility at writeback time. With the virtual swap
>   implementation, the backends are decoupled, and physical swap space
>   is allocated on-demand at writeback time, at which point we can make
>   much smarter decisions: we can batch multiple zswap writeback
>   operations into a single IO request, allocating contiguous physical
>   swap slots for that request. We can even perform compressed writeback
>   (i.e writing these pages without decompressing them) (see [12]).
>
>
> V. References
>
> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> [2]: https://lwn.net/Articles/932077/
> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
> [10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
> [11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
> [12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
> [13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
> [14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
>
> Nhat Pham (20):
>   mm/swap: decouple swap cache from physical swap infrastructure
>   swap: rearrange the swap header file
>   mm: swap: add an abstract API for locking out swapoff
>   zswap: add new helpers for zswap entry operations
>   mm/swap: add a new function to check if a swap entry is in swap
>     cached.
>   mm: swap: add a separate type for physical swap slots
>   mm: create scaffolds for the new virtual swap implementation
>   zswap: prepare zswap for swap virtualization
>   mm: swap: allocate a virtual swap slot for each swapped out page
>   swap: move swap cache to virtual swap descriptor
>   zswap: move zswap entry management to the virtual swap descriptor
>   swap: implement the swap_cgroup API using virtual swap
>   swap: manage swap entry lifecycle at the virtual swap layer
>   mm: swap: decouple virtual swap slot from backing store
>   zswap: do not start zswap shrinker if there is no physical swap slots
>   swap: do not unnecesarily pin readahead swap entries
>   swapfile: remove zeromap bitmap
>   memcg: swap: only charge physical swap slots
>   swap: simplify swapoff using virtual swap
>   swapfile: replace the swap map with bitmaps
>
>  Documentation/mm/swap-table.rst |   69 --
>  MAINTAINERS                     |    2 +
>  include/linux/cpuhotplug.h      |    1 +
>  include/linux/mm_types.h        |   16 +
>  include/linux/shmem_fs.h        |    7 +-
>  include/linux/swap.h            |  135 ++-
>  include/linux/swap_cgroup.h     |   13 -
>  include/linux/swapops.h         |   25 +
>  include/linux/zswap.h           |   17 +-
>  kernel/power/swap.c             |    6 +-
>  mm/Makefile                     |    5 +-
>  mm/huge_memory.c                |   11 +-
>  mm/internal.h                   |   12 +-
>  mm/memcontrol-v1.c              |    6 +
>  mm/memcontrol.c                 |  142 ++-
>  mm/memory.c                     |  101 +-
>  mm/migrate.c                    |   13 +-
>  mm/mincore.c                    |   15 +-
>  mm/page_io.c                    |   83 +-
>  mm/shmem.c                      |  215 +---
>  mm/swap.h                       |  157 +--
>  mm/swap_cgroup.c                |  172 ---
>  mm/swap_state.c                 |  306 +----
>  mm/swap_table.h                 |   78 +-
>  mm/swapfile.c                   | 1518 ++++-------------------
>  mm/userfaultfd.c                |   18 +-
>  mm/vmscan.c                     |   28 +-
>  mm/vswap.c                      | 2025 +++++++++++++++++++++++++++++++
>  mm/zswap.c                      |  142 +--
>  29 files changed, 2853 insertions(+), 2485 deletions(-)
>  delete mode 100644 Documentation/mm/swap-table.rst
>  delete mode 100644 mm/swap_cgroup.c
>  create mode 100644 mm/vswap.c
>
>
> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
> --
> 2.47.3
>
Re: [PATCH v3 00/20] Virtual Swap Space
Posted by Nhat Pham 12 hours ago
On Mon, Feb 9, 2026 at 4:20 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Sun, Feb 8, 2026 at 4:15 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > My sincerest apologies - it seems like the cover letter (and just the
> > cover letter) fails to be sent out, for some reason. I'm trying to figure
> > out what happened - it works when I send the entire patch series to
> > myself...
> >
> > Anyway, resending this (in-reply-to patch 1 of the series):
>
> For the record I did receive your original V3 cover letter from the
> linux-mm mailing list.

I have no idea what happened to be honest. It did not show up on lore
for a couple of hours, and my coworkers did not receive the cover
letter email initially. I did not receive any error message or logs
either - git send-email returned Success to me, and when I checked on
the web gmail client (since I used a gmail email account), the whole
series was there.

I tried re-sending a couple of times, to no avail. Then, after a couple
of hours, all of these attempts showed up.

Anyway, this is my bad - I'll be more patient next time. If it does
not show up for a couple of hours then I'll do some more digging.

>
> > Changelog:
> > * RFC v2 -> v3:
> >     * Implement a cluster-based allocation algorithm for virtual swap
> >       slots, inspired by Kairui Song and Chris Li's implementation, as
> >       well as Johannes Weiner's suggestions. This eliminates the lock
> >           contention issues on the virtual swap layer.
> >     * Re-use swap table for the reverse mapping.
> >     * Remove CONFIG_VIRTUAL_SWAP.
> >     * Reducing the size of the swap descriptor from 48 bytes to 24
>
> Is the per swap slot entry overhead 24 bytes in your implementation?
> The current swap overhead is 3 static + 8 dynamic, your 24 dynamic is a
> big jump. You can argue that 8->24 is not a big jump. But it is an
> unnecessary price compared to the alternative, which is 8 dynamic +
> 4 (optional redirect).

It depends on the case - you can check the memory overhead discussion below :)

>
> >       bytes, i.e another 50% reduction in memory overhead from v2.
> >     * Remove swap cache and zswap tree and use the swap descriptor
> >       for this.
> >     * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
> >       (one for allocated slots, and one for bad slots).
> >     * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
>
> My git log shows 7d0a66e4bb9081d75c82ec4957c50034cb0ea449 is tag "v6.18".

Oh yeah I forgot to update that. That was from an old cover letter of
an old version that never got sent out - I'll correct that in future
versions

(if you scroll down to the bottom of the cover letter you should see
the correct base, which should be 6.19).

>
> >         * Update cover letter to include new benchmark results and discussion
> >           on overhead in various cases.
> > * RFC v1 -> RFC v2:
> >     * Use a single atomic type (swap_refs) for reference counting
> >       purposes. This brings the size of the swap descriptor from 64 B
> >       down to 48 B (25% reduction). Suggested by Yosry Ahmed.
> >     * Zeromap bitmap is removed in the virtual swap implementation.
> >       This saves one bit per physical swapfile slot.
> >     * Rearrange the patches and the code change to make things more
> >       reviewable. Suggested by Johannes Weiner.
> >     * Update the cover letter a bit.
> >
> > This patch series implements the virtual swap space idea, based on Yosry's
> > proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> > inputs from Johannes Weiner. The same idea (with different
> > implementation details) has been floated by Rik van Riel since at least
> > 2011 (see [8]).
> >
> > This patch series is based on 6.19. There are a couple more
> > swap-related changes in the mm-stable branch that I would need to
> > coordinate with, but I would like to send this out as an update, to show
>
> Ah, you need to mention that in the first line to Andrew. Spell out
> that this series is not for Andrew to consume into the MM tree. It can't
> anyway, because it does not apply to mm-unstable or mm-stable.

Fair - I'll make sure to move this paragraph to above the changelog next time :)

>
> BTW, I have the following compile error with this series (fedora 43).
> Same config compiles fine on v6.19.
>
> In file included from ./include/linux/local_lock.h:5,
>                  from ./include/linux/mmzone.h:24,
>                  from ./include/linux/gfp.h:7,
>                  from ./include/linux/mm.h:7,
>                  from mm/vswap.c:7:
> mm/vswap.c: In function ‘vswap_cpu_dead’:
> ./include/linux/percpu-defs.h:221:45: error: initialization from
> pointer to non-enclosed address space
>   221 |         const void __percpu *__vpp_verify = (typeof((ptr) +
> 0))NULL;    \
>       |                                             ^
> ./include/linux/local_lock_internal.h:105:40: note: in definition of
> macro ‘__local_lock_acquire’
>   105 |                 __l = (local_lock_t *)(lock);
>          \
>       |                                        ^~~~
> ./include/linux/local_lock.h:17:41: note: in expansion of macro
> ‘__local_lock’
>    17 | #define local_lock(lock)                __local_lock(this_cpu_ptr(lock))
>       |                                         ^~~~~~~~~~~~
> ./include/linux/percpu-defs.h:245:9: note: in expansion of macro
> ‘__verify_pcpu_ptr’
>   245 |         __verify_pcpu_ptr(ptr);
>          \
>       |         ^~~~~~~~~~~~~~~~~
> ./include/linux/percpu-defs.h:256:27: note: in expansion of macro ‘raw_cpu_ptr’
>   256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
>       |                           ^~~~~~~~~~~
> ./include/linux/local_lock.h:17:54: note: in expansion of macro
> ‘this_cpu_ptr’
>    17 | #define local_lock(lock)
> __local_lock(this_cpu_ptr(lock))
>       |
> ^~~~~~~~~~~~
> mm/vswap.c:1518:9: note: in expansion of macro ‘local_lock’
>  1518 |         local_lock(&percpu_cluster->lock);
>       |         ^~~~~~~~~~

Ah that's strange. It compiled on all of my setups (I tested with a couple
different ones), but I must have missed some cases. Would you mind
sharing your configs so that I can reproduce this compilation error?

(although I'm sure the kernel test robot will scream at me soon, and its
reports usually include the configs that cause the compilation issue).

>
> > that the lock contention issues that plagued earlier versions have been
> > resolved and performance on the kernel build benchmark is now on-par with
> > baseline. Furthermore, memory overhead has been substantially reduced
> > compared to the last RFC version.
> >
> >
> > I. Motivation
> >
> > Currently, when an anon page is swapped out, a slot in a backing swap
> > device is allocated and stored in the page table entries that refer to
> > the original page. This slot is also used as the "key" to find the
> > swapped out content, as well as the index to swap data structures, such
> > as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> > backing slot in this way is performant and efficient when swap is purely
> > just disk space, and swapoff is rare.
> >
> > However, the advent of many swap optimizations has exposed major
> > drawbacks of this design. The first problem is that we occupy a physical
> > slot in the swap space, even for pages that are NEVER expected to hit
> > the disk: pages compressed and stored in the zswap pool, zero-filled
> > pages, or pages rejected by both of these optimizations when zswap
> > writeback is disabled. This is arguably the central shortcoming of
> > zswap:
> > * In deployments where no disk space can be afforded for swap (such as
> >   mobile and embedded devices), users cannot adopt zswap, and are forced
> >   to use zram. This is confusing for users, and creates extra burdens
> >   for developers, having to develop and maintain similar features for
> >   two separate swap backends (writeback, cgroup charging, THP support,
> >   etc.). For instance, see the discussion in [4].
> > * Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
> >   we have swapfiles on the order of tens to hundreds of GBs, which are
> >   mostly unused and only exist to enable zswap usage and zero-filled
> >   pages swap optimizations.
> > * Tying zswap (and more generally, other in-memory swap backends) to
> >   the current physical swapfile infrastructure makes zswap implicitly
> >   statically sized. This does not make sense, as unlike disk swap, in
> >   which we consume a limited resource (disk space or swapfile space) to
> >   save another resource (memory), zswap consumes the same resource it is
> >   saving (memory). The more we zswap, the more memory we have available,
> >   not less. We are not rationing a limited resource when we limit
> >   the size of the zswap pool, but rather we are capping the resource
> >   (memory) saving potential of zswap. Under memory pressure, using
> >   more zswap is almost always better than the alternative (disk IOs, or
> >   even worse, OOMs), and dynamically sizing the zswap pool on demand
> >   allows the system to flexibly respond to these precarious scenarios.
> > * Operationally, statically provisioning the swapfile for zswap poses
> >   significant challenges, because the sysadmin has to prescribe how
> >   much swap is needed a priori, for each combination of
> >   (memory size x disk space x workload usage). It is even more
> >   complicated when we take into account the variance of memory
> >   compression, which changes the reclaim dynamics (and as a result,
> >   swap space size requirement). The problem is further exacerbated for
> >   users who rely on swap utilization (and exhaustion) as an OOM signal.
> >
> >   All of these factors make it very difficult to configure the swapfile
> >   for zswap: too small of a swapfile and we risk preventable OOMs and
> >   limit the memory saving potential of zswap; too big of a swapfile
> >   and we waste disk space and memory due to swap metadata overhead.
> >   This dilemma becomes more drastic in high memory systems, which can
> >   have up to TBs worth of memory.
> >
> > Past attempts to decouple disk and compressed swap backends, namely the
> > ghost swapfile approach (see [13]), as well as the alternative
> > compressed swap backend zram, have mainly focused on eliminating the
> > disk space usage of compressed backends. We want a solution that not
> > only tackles that same problem, but also achieves the dynamicization of
> > swap space to maximize the memory saving potential while reducing
> > operational and static memory overhead.
> >
> > Finally, any swap redesign should support efficient backend transfer,
> > i.e. without having to perform the expensive page table walk to
> > update all the PTEs that refer to the swap entry:
> > * The main motivation for this requirement is zswap writeback. To quote
> >   Johannes (from [14]): "Combining compression with disk swap is
> >   extremely powerful, because it dramatically reduces the worst aspects
> >   of both: it reduces the memory footprint of compression by shedding
> >   the coldest data to disk; it reduces the IO latencies and flash wear
> >   of disk swap through the writeback cache. In practice, this reduces
> >   *average event rates of the entire reclaim/paging/IO stack*."
> > * Another motivation is to simplify swapoff, which is both complicated
> >   and expensive in the current design, precisely because we are storing
> >   an encoding of the backend positional information in the page table,
> >   and thus requires a full page table walk to remove these references.
> >
> >
> > II. High Level Design Overview
> >
> > To fix the aforementioned issues, we need an abstraction that separates
> > a swap entry from its physical backing storage. IOW, we need to
> > “virtualize” the swap space: swap clients will work with a dynamically
> > allocated virtual swap slot, storing it in page table entries, and
> > using it to index into various swap-related data structures. The
> > backing storage is decoupled from the virtual swap slot, and the newly
> > introduced layer will “resolve” the virtual swap slot to the actual
> > storage. This layer also manages other metadata of the swap entry, such
> > as its lifetime information (swap count), via a dynamically allocated,
> > per-swap-entry descriptor:
> >
> > struct swp_desc {
> >         union {
> >                 swp_slot_t         slot;                 /*     0     8 */
> >                 struct zswap_entry * zswap_entry;        /*     0     8 */
> >         };                                               /*     0     8 */
> >         union {
> >                 struct folio *     swap_cache;           /*     8     8 */
> >                 void *             shadow;               /*     8     8 */
> >         };                                               /*     8     8 */
> >         unsigned int               swap_count;           /*    16     4 */
> >         unsigned short             memcgid:16;           /*    20: 0  2 */
> >         bool                       in_swapcache:1;       /*    22: 0  1 */
> >
> >         /* Bitfield combined with previous fields */
> >
> >         enum swap_type             type:2;               /*    20:17  4 */
> >
> >         /* size: 24, cachelines: 1, members: 6 */
> >         /* bit_padding: 13 bits */
> >         /* last cacheline: 24 bytes */
> > };
> >
> > (output from pahole).
> >
> > This design allows us to:
> > * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> >   simply associate the virtual swap slot with one of the supported
> >   backends: a zswap entry, a zero-filled swap page, a slot on the
> >   swapfile, or an in-memory page.
> > * Simplify and optimize swapoff: we only have to fault the page in and
> >   have the virtual swap slot point to the page instead of the on-disk
> >   physical swap slot. No need to perform any page table walking.
> >
> > The size of the virtual swap descriptor is 24 bytes. Note that this is
> > not all "new" overhead, as the swap descriptor will replace:
> > * the swap_cgroup arrays (one per swap type) in the old design, which
> >   is a massive source of static memory overhead. With the new design,
> >   it is only allocated for used clusters.
> > * the swap tables, which hold the swap cache and workingset shadows.
> > * the zeromap bitmap, which is a bitmap of physical swap slots to
> >   indicate whether the swapped out page is zero-filled or not.
> > * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> >   one for allocated slots, and one for bad slots, representing 3 possible
> >   states of a slot on the swapfile: allocated, free, and bad.
> > * the zswap tree.
> >
> > So, in terms of additional memory overhead:
> > * For zswap entries, the added memory overhead is rather minimal. The
> >   new indirection pointer neatly replaces the existing zswap tree.
> >   We really only incur less than one word of overhead for swap count
> >   blow up (since we no longer use swap continuation) and the swap type.
> > * For physical swap entries, the new design will impose fewer than 3 words
> >   of memory overhead. However, as noted above, this overhead is only for
> >   actively used swap entries, whereas in the current design the overhead is
> >   static (including the swap cgroup array for example).
> >
> >   The primary victim of this overhead will be zram users. However, as
> >   zswap now no longer takes up disk space, zram users can consider
> >   switching to zswap (which, as a bonus, has a lot of useful features
> >   out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> >   LRU-ordering writeback, etc.).
> >
> > For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> > 8,388,608 swap entries), and we use zswap.
> >
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 0.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 48.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 96.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 121.00 MB
> > * Vswap total overhead: 144.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 153.00 MB
> > * Vswap total overhead: 193.00 MB
> >
> > So even in the worst case scenario for virtual swap, i.e when we
> > somehow have an oracle to correctly size the swapfile for zswap
> > pool to 32 GB, the added overhead is only 40 MB, which is a mere
> > 0.12% of the total swapfile :)
> >
> > In practice, the overhead will be closer to the 50-75% usage case, as
> > systems tend to leave swap headroom for pathological events or sudden
> > spikes in memory requirements. The added overhead in these cases is
> > practically negligible. And in deployments where swapfiles for zswap
> > were previously sparsely used, switching over to virtual swap will
> > actually reduce memory overhead.
> >
> > Doing the same math for the disk swap, which is the worst case for
> > virtual swap in terms of swap backends:
> >
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 2.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 41.00 MB
> > * Vswap total overhead: 66.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 130.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 73.00 MB
> > * Vswap total overhead: 194.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 259.00 MB
> >
> > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > again in the worst case when we have a sizing oracle.
> >
> > Please see the attached patches for more implementation details.
> >
> >
> > III. Usage and Benchmarking
> >
> > This patch series introduces no new syscalls or userspace API. Existing
> > userspace setups will work as-is, except we no longer have to create a
> > swapfile or set memory.swap.max if we want to use zswap, as zswap is no
> > longer tied to physical swap. The zswap pool will be automatically and
> > dynamically sized based on memory usage and reclaim dynamics.
> >
> > To measure the performance of the new implementation, I have run the
> > following benchmarks:
> >
> > 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
> >
> > Using zswap as the backend:
> >
> > Baseline:
> > real: mean: 185.2s, stdev: 0.93s
> > sys: mean: 683.7s, stdev: 33.77s
> >
> > Vswap:
> > real: mean: 184.88s, stdev: 0.57s
> > sys: mean: 675.14s, stdev: 32.8s
>
> Can you show your user space time as well to complete the picture?

Will do next time! I used to include user time as well, but I noticed
that folks (e.g. see [1]) only include systime, not even real time,
so I figure nobody cares about user time :)

(I still include real time because some of my past work improves sys
time but regresses real time, so I figure that's relevant).

[1]: https://lore.kernel.org/linux-mm/20260128-swap-table-p3-v2-0-fe0b67ef0215@tencent.com/

But yeah no big deal. I'll dig through my logs to see if I still have
the numbers, but if not I'll include it in next version.

>
> How many runs do you have for stdev 32.8s?

5 runs! I average out the result of 5 runs.

>
> >
> > We actually see a slight improvement in systime (by 1.5%) :) This is
> > likely because we no longer have to perform swap charging for zswap
> > entries, and the virtual swap allocator is simpler than that of physical
> > swap.
> >
> > Using SSD swap as the backend:
> Please include zram swap test data as well. Android heavily uses zram
> for swapping.
> >
> > Baseline:
> > real: mean: 200.3s, stdev: 2.33s
> > sys: mean: 489.88s, stdev: 9.62s
> >
> > Vswap:
> > real: mean: 201.47s, stdev: 2.98s
> > sys: mean: 487.36s, stdev: 5.53s
> >
> > The performance is neck and neck.
>
> I strongly suspect there is some performance difference that hasn't
> been covered by your test yet. Need more confirmation by others on the
> performance measurement. Swap testing is tricky. You want to push
> to stress barely within the OOM limit. Need more data.

Very fair point :) I will say though - the kernel build test, with the
memory.max limit set, does generate a sizable amount of swapping, and
does OOM if you don't set up swap. Take my word for it for now, but I will
try to include average per-run (z)swap activity stats (zswpout, zswpin,
etc.) in future versions if you're interested :)

I've been trying to run more stress tests to trigger crashes and
performance regressions. One of the big reasons why I haven't sent
anything until now is to fix obvious performance issues (the
aforementioned lock contention) and bugs. It's a complicated piece of
work.

As always, would love to receive code/design feedback from you (and
Kairui, and other swap reviewers), and I would appreciate it very much if
other swap folks could play with the patch series on their setups as well
for performance testing, or let me know if there is any particular
case that they're interested in :)

Thanks for your review, Chris!



>
> Chris
>
> >
> >
> > IV. Future Use Cases
> >
> > While the patch series focuses on two applications (decoupling swap
> > backends and swapoff optimization/simplification), this new,
> > future-proof design also allows us to implement new swap features more
> > easily and efficiently:
> >
> > * Multi-tier swapping (as mentioned in [5]), with transparent
> >   transferring (promotion/demotion) of pages across tiers (see [8] and
> >   [9]). Similar to swapoff, with the old design we would need to
> >   perform the expensive page table walk.
> > * Swapfile compaction to alleviate fragmentation (as proposed by Ying
> >   Huang in [6]).
> > * Mixed backing THP swapin (see [7]): Once you have pinned down the
> >   backing store of THPs, you can dispatch each range of subpages
> >   to the appropriate backend swapin handler.
> > * Swapping a folio out with discontiguous physical swap slots
> >   (see [10]).
> > * Zswap writeback optimization: The current architecture pre-reserves
> >   physical swap space for pages when they enter the zswap pool, giving
> >   the kernel no flexibility at writeback time. With the virtual swap
> >   implementation, the backends are decoupled, and physical swap space
> >   is allocated on-demand at writeback time, at which point we can make
> >   much smarter decisions: we can batch multiple zswap writeback
> >   operations into a single IO request, allocating contiguous physical
> >   swap slots for that request. We can even perform compressed writeback
> >   (i.e writing these pages without decompressing them) (see [12]).
> >
> >
> > V. References
> >
> > [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> > [2]: https://lwn.net/Articles/932077/
> > [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> > [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> > [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> > [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> > [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> > [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> > [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
> > [10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
> > [11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
> > [12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
> > [13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
> > [14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
> >
> > Nhat Pham (20):
> >   mm/swap: decouple swap cache from physical swap infrastructure
> >   swap: rearrange the swap header file
> >   mm: swap: add an abstract API for locking out swapoff
> >   zswap: add new helpers for zswap entry operations
> >   mm/swap: add a new function to check if a swap entry is in swap
> >     cached.
> >   mm: swap: add a separate type for physical swap slots
> >   mm: create scaffolds for the new virtual swap implementation
> >   zswap: prepare zswap for swap virtualization
> >   mm: swap: allocate a virtual swap slot for each swapped out page
> >   swap: move swap cache to virtual swap descriptor
> >   zswap: move zswap entry management to the virtual swap descriptor
> >   swap: implement the swap_cgroup API using virtual swap
> >   swap: manage swap entry lifecycle at the virtual swap layer
> >   mm: swap: decouple virtual swap slot from backing store
> >   zswap: do not start zswap shrinker if there is no physical swap slots
> >   swap: do not unnecesarily pin readahead swap entries
> >   swapfile: remove zeromap bitmap
> >   memcg: swap: only charge physical swap slots
> >   swap: simplify swapoff using virtual swap
> >   swapfile: replace the swap map with bitmaps
> >
> >  Documentation/mm/swap-table.rst |   69 --
> >  MAINTAINERS                     |    2 +
> >  include/linux/cpuhotplug.h      |    1 +
> >  include/linux/mm_types.h        |   16 +
> >  include/linux/shmem_fs.h        |    7 +-
> >  include/linux/swap.h            |  135 ++-
> >  include/linux/swap_cgroup.h     |   13 -
> >  include/linux/swapops.h         |   25 +
> >  include/linux/zswap.h           |   17 +-
> >  kernel/power/swap.c             |    6 +-
> >  mm/Makefile                     |    5 +-
> >  mm/huge_memory.c                |   11 +-
> >  mm/internal.h                   |   12 +-
> >  mm/memcontrol-v1.c              |    6 +
> >  mm/memcontrol.c                 |  142 ++-
> >  mm/memory.c                     |  101 +-
> >  mm/migrate.c                    |   13 +-
> >  mm/mincore.c                    |   15 +-
> >  mm/page_io.c                    |   83 +-
> >  mm/shmem.c                      |  215 +---
> >  mm/swap.h                       |  157 +--
> >  mm/swap_cgroup.c                |  172 ---
> >  mm/swap_state.c                 |  306 +----
> >  mm/swap_table.h                 |   78 +-
> >  mm/swapfile.c                   | 1518 ++++-------------------
> >  mm/userfaultfd.c                |   18 +-
> >  mm/vmscan.c                     |   28 +-
> >  mm/vswap.c                      | 2025 +++++++++++++++++++++++++++++++
> >  mm/zswap.c                      |  142 +--
> >  29 files changed, 2853 insertions(+), 2485 deletions(-)
> >  delete mode 100644 Documentation/mm/swap-table.rst
> >  delete mode 100644 mm/swap_cgroup.c
> >  create mode 100644 mm/vswap.c
> >
> >
> > base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
> > --
> > 2.47.3
> >
Re: [PATCH v3 00/20] Virtual Swap Space
Posted by Chris Li 7 hours ago
On Tue, Feb 10, 2026 at 10:00 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Feb 9, 2026 at 4:20 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Sun, Feb 8, 2026 at 4:15 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > My sincerest apologies - it seems like the cover letter (and just the
> > > cover letter) fails to be sent out, for some reason. I'm trying to figure
> > > out what happened - it works when I send the entire patch series to
> > > myself...
> > >
> > > Anyway, resending this (in-reply-to patch 1 of the series):
> >
> > For the record I did receive your original V3 cover letter from the
> > linux-mm mailing list.
>
> I have no idea what happened to be honest. It did not show up on lore
> for a couple of hours, and my coworkers did not receive the cover
> letter email initially. I did not receive any error message or logs
> either - git send-email returned Success to me, and when I checked on
> the web gmail client (since I used a gmail email account), the whole
> series was there.
>
> I tried re-sending a couple of times, to no avail. Then, after a couple
> of hours, all of these attempts showed up.
>
> Anyway, this is my bad - I'll be more patient next time. If it does
> not show up for a couple of hours then I'll do some more digging.

No problem. Just want to provide more data points if that helps you
debug your email issue.

> > > Changelog:
> > > * RFC v2 -> v3:
> > >     * Implement a cluster-based allocation algorithm for virtual swap
> > >       slots, inspired by Kairui Song and Chris Li's implementation, as
> > >       well as Johannes Weiner's suggestions. This eliminates the lock
> > >           contention issues on the virtual swap layer.
> > >     * Re-use swap table for the reverse mapping.
> > >     * Remove CONFIG_VIRTUAL_SWAP.
> > >     * Reducing the size of the swap descriptor from 48 bytes to 24
> >
> > Is the per swap slot entry overhead 24 bytes in your implementation?
> > The current swap overhead is 3 static + 8 dynamic, your 24 dynamic is a
> > big jump. You can argue that 8->24 is not a big jump. But it is an
> > unnecessary price compared to the alternative, which is 8 dynamic +
> > 4 (optional redirect).
>
> It depends on the case - you can check the memory overhead discussion below :)

I think the "24B dynamic" sums up the VS memory overhead pretty well
without going into the detailed tables. You can derive the case
discussion from that.

> > BTW, I have the following compile error with this series (fedora 43).
> > Same config compiles fine on v6.19.
> >
> > In file included from ./include/linux/local_lock.h:5,
> >                  from ./include/linux/mmzone.h:24,
> >                  from ./include/linux/gfp.h:7,
> >                  from ./include/linux/mm.h:7,
> >                  from mm/vswap.c:7:
> > mm/vswap.c: In function ‘vswap_cpu_dead’:
> > ./include/linux/percpu-defs.h:221:45: error: initialization from
> > pointer to non-enclosed address space
> >   221 |         const void __percpu *__vpp_verify = (typeof((ptr) +
> > 0))NULL;    \
> >       |                                             ^
> > ./include/linux/local_lock_internal.h:105:40: note: in definition of
> > macro ‘__local_lock_acquire’
> >   105 |                 __l = (local_lock_t *)(lock);
> >          \
> >       |                                        ^~~~
> > ./include/linux/local_lock.h:17:41: note: in expansion of macro
> > ‘__local_lock’
> >    17 | #define local_lock(lock)                __local_lock(this_cpu_ptr(lock))
> >       |                                         ^~~~~~~~~~~~
> > ./include/linux/percpu-defs.h:245:9: note: in expansion of macro
> > ‘__verify_pcpu_ptr’
> >   245 |         __verify_pcpu_ptr(ptr);
> >          \
> >       |         ^~~~~~~~~~~~~~~~~
> > ./include/linux/percpu-defs.h:256:27: note: in expansion of macro ‘raw_cpu_ptr’
> >   256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
> >       |                           ^~~~~~~~~~~
> > ./include/linux/local_lock.h:17:54: note: in expansion of macro
> > ‘this_cpu_ptr’
> >    17 | #define local_lock(lock)
> > __local_lock(this_cpu_ptr(lock))
> >       |
> > ^~~~~~~~~~~~
> > mm/vswap.c:1518:9: note: in expansion of macro ‘local_lock’
> >  1518 |         local_lock(&percpu_cluster->lock);
> >       |         ^~~~~~~~~~
>
> Ah that's strange. It compiled on all of my setups (I tested with a couple
> different ones), but I must have missed some cases. Would you mind
> sharing your configs so that I can reproduce this compilation error?

See attached config.gz. It is also possible the newer gcc version
contributes to that error. Anyway, that is preventing me from stress
testing your series on my setup.
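
FWIW, with the gcc named-address-space checks, the usual way this
particular error shows up is when a plain pointer (already obtained via
this_cpu_ptr()/per_cpu_ptr()) is handed to local_lock(), which applies
this_cpu_ptr() internally and therefore expects a __percpu address. I
can't tell from the snippet whether that is what mm/vswap.c:1518
actually does, but the pattern looks roughly like this (illustrative
names only, not the vswap code):

#include <linux/local_lock.h>
#include <linux/percpu.h>

struct pcp_cluster {
	local_lock_t lock;
};

static DEFINE_PER_CPU(struct pcp_cluster, pcp_clusters) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

static void example(void)
{
	struct pcp_cluster *c = this_cpu_ptr(&pcp_clusters);

	local_lock(&c->lock);		/* plain pointer: trips __verify_pcpu_ptr() */
	local_unlock(&c->lock);

	local_lock(&pcp_clusters.lock);	/* __percpu address: OK */
	local_unlock(&pcp_clusters.lock);
}
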

>
> >
> > > 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
> > >
> > > Using zswap as the backend:
> > >
> > > Baseline:
> > > real: mean: 185.2s, stdev: 0.93s
> > > sys: mean: 683.7s, stdev: 33.77s
> > >
> > > Vswap:
> > > real: mean: 184.88s, stdev: 0.57s
> > > sys: mean: 675.14s, stdev: 32.8s
> >
> > Can you show your user space time as well to complete the picture?
>
> Will do next time! I used to include user time as well, but I noticed
> that folks (e.g. see [1]) only include systime, not even real time,
> so I figure nobody cares about user time :)
>
> (I still include real time because some of my past work improves sys
> time but regresses real time, so I figure that's relevant).
>
> [1]: https://lore.kernel.org/linux-mm/20260128-swap-table-p3-v2-0-fe0b67ef0215@tencent.com/
>
> But yeah no big deal. I'll dig through my logs to see if I still have
> the numbers, but if not I'll include it in next version.

Mostly I want to get an impression of how hard you push our swap test cases.

>
> >
> > How many runs do you have for stdev 32.8s?
>
> 5 runs! I average out the result of 5 runs.

The stddev is 33 seconds. Measuring 5 times and averaging the results is
not enough samples to get you to 1.5% resolution (8 seconds), which falls
into the range of noise.
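
Back of the envelope, treating the 5 runs as independent samples, the
standard error of each reported mean is about

	SE \approx \sigma / \sqrt{n} = 33.77 / \sqrt{5} \approx 15.1\ \text{s}

so the ~8.6 s systime delta (683.7 s vs 675.14 s) is well within one
standard error of either mean.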

> > I strongly suspect there is some performance difference that hasn't
> > been covered by your test yet. Need more confirmation by others on the
> > performance measurement. Swap testing is tricky. You want to push
> > to stress barely within the OOM limit. Need more data.
>
> Very fair point :) I will say though - the kernel build test, with the
> memory.max limit set, does generate a sizable amount of swapping, and
> does OOM if you don't set up swap. Take my word for it for now, but I will
> try to include average per-run (z)swap activity stats (zswpout, zswpin,
> etc.) in future versions if you're interested :)

Including the user space time will help determine the level of swap
pressure as well. I don't need the absolute zswapout count just yet.

> I've been trying to run more stress tests to trigger crashes and
> performance regressions. One of the big reasons why I haven't sent
> anything until now is to fix obvious performance issues (the
> aforementioned lock contention) and bugs. It's a complicated piece of
> work.
>
> As always, would love to receive code/design feedback from you (and
> Kairui, and other swap reviewers), and I would appreciate it very much if
> other swap folks could play with the patch series on their setups as well
> for performance testing, or let me know if there is any particular
> case that they're interested in :)

I understand Kairui has some measurements that show regressions.

If you can fix the compile error I can do some stress testing myself
to provide more data points.

Thanks

Chris
Re: [PATCH v3 00/20] Virtual Swap Space
Posted by Johannes Weiner 1 day, 4 hours ago
Hi Chris,

On Mon, Feb 09, 2026 at 04:20:21AM -0800, Chris Li wrote:
> On Sun, Feb 8, 2026 at 4:15 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > My sincerest apologies - it seems like the cover letter (and just the
> > cover letter) fails to be sent out, for some reason. I'm trying to figure
> > out what happened - it works when I send the entire patch series to
> > myself...
> >
> > Anyway, resending this (in-reply-to patch 1 of the series):
> 
> For the record I did receive your original V3 cover letter from the
> linux-mm mailing list.
> 
> > Changelog:
> > * RFC v2 -> v3:
> >     * Implement a cluster-based allocation algorithm for virtual swap
> >       slots, inspired by Kairui Song and Chris Li's implementation, as
> >       well as Johannes Weiner's suggestions. This eliminates the lock
> >           contention issues on the virtual swap layer.
> >     * Re-use swap table for the reverse mapping.
> >     * Remove CONFIG_VIRTUAL_SWAP.
> >     * Reducing the size of the swap descriptor from 48 bytes to 24
> 
> Is the per swap slot entry overhead 24 bytes in your implementation?
> The current swap overhead is 3 static + 8 dynamic, your 24 dynamic is a
> big jump. You can argue that 8->24 is not a big jump. But it is an
> unnecessary price compared to the alternative, which is 8 dynamic +
> 4 (optional redirect).

No, this is not the net overhead.

The descriptor consolidates and eliminates several other data
structures.

Here is the more detailed breakdown:

> > The size of the virtual swap descriptor is 24 bytes. Note that this is
> > not all "new" overhead, as the swap descriptor will replace:
> > * the swap_cgroup arrays (one per swap type) in the old design, which
> >   is a massive source of static memory overhead. With the new design,
> >   it is only allocated for used clusters.
> > * the swap tables, which hold the swap cache and workingset shadows.
> > * the zeromap bitmap, which is a bitmap of physical swap slots to
> >   indicate whether the swapped out page is zero-filled or not.
> > * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> >   one for allocated slots, and one for bad slots, representing 3 possible
> >   states of a slot on the swapfile: allocated, free, and bad.
> > * the zswap tree.
> >
> > So, in terms of additional memory overhead:
> > * For zswap entries, the added memory overhead is rather minimal. The
> >   new indirection pointer neatly replaces the existing zswap tree.
> >   We really only incur less than one word of overhead for swap count
> >   blow up (since we no longer use swap continuation) and the swap type.
> > * For physical swap entries, the new design will impose fewer than 3 words
> >   memory overhead. However, as noted above this overhead is only for
> >   actively used swap entries, whereas in the current design the overhead is
> >   static (including the swap cgroup array for example).
> >
> >   The primary victim of this overhead will be zram users. However, as
> >   zswap now no longer takes up disk space, zram users can consider
> >   switching to zswap (which, as a bonus, has a lot of useful features
> >   out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> >   LRU-ordering writeback, etc.).
> >
> > For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> > 8,388,608 swap entries), and we use zswap.
> >
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 0.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 48.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 96.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 121.00 MB
> > * Vswap total overhead: 144.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 153.00 MB
> > * Vswap total overhead: 193.00 MB
> >
> > So even in the worst case scenario for virtual swap, i.e when we
> > somehow have an oracle to correctly size the swapfile for zswap
> > pool to 32 GB, the added overhead is only 40 MB, which is a mere
> > 0.12% of the total swapfile :)
> >
> > In practice, the overhead will be closer to the 50-75% usage case, as
> > systems tend to leave swap headroom for pathological events or sudden
> > spikes in memory requirements. The added overhead in these cases is
> > practically negligible. And in deployments where swapfiles for zswap
> > were previously sparsely used, switching over to virtual swap will
> > actually reduce memory overhead.
> >
> > Doing the same math for the disk swap, which is the worst case for
> > virtual swap in terms of swap backends:
> >
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 2.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 41.00 MB
> > * Vswap total overhead: 66.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 130.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 73.00 MB
> > * Vswap total overhead: 194.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 259.00 MB
> >
> > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > again in the worst case when we have a sizing oracle.
Re: [PATCH v3 00/20] Virtual Swap Space
Posted by Chris Li 9 hours ago
Hi Johannes,

On Mon, Feb 9, 2026 at 6:36 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hi Chris,
>
> On Mon, Feb 09, 2026 at 04:20:21AM -0800, Chris Li wrote:
> > Is the per swap slot entry overhead 24 bytes in your implementation?
> > The current swap overhead is 3 static + 8 dynamic, your 24 dynamic is a
> > big jump. You can argue that 8->24 is not a big jump. But it is an
> > unnecessary price compared to the alternative, which is 8 dynamic +
> > 4 (optional redirect).
>
> No, this is not the net overhead.

I am talking about the total metadata overhead per swap entry. Not net.

> The descriptor consolidates and eliminates several other data
> structures.

It adds members that were previously not there and makes some members
bigger along the way. For example, the swap_map goes from a 1-byte count
to a 4-byte count.

>
> Here is the more detailed breakdown:

It seems you did not finish your sentence before sending your reply.

Anyway, I saw the total per swap entry overhead bump to 24 bytes
dynamic. Let me know what the correct number for VS is if you
disagree.

Chris

> > > The size of the virtual swap descriptor is 24 bytes. Note that this is
> > > not all "new" overhead, as the swap descriptor will replace:
> > > * the swap_cgroup arrays (one per swap type) in the old design, which
> > >   is a massive source of static memory overhead. With the new design,
> > >   it is only allocated for used clusters.
> > > * the swap tables, which hold the swap cache and workingset shadows.
> > > * the zeromap bitmap, which is a bitmap of physical swap slots to
> > >   indicate whether the swapped out page is zero-filled or not.
> > > * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> > >   one for allocated slots, and one for bad slots, representing 3 possible
> > >   states of a slot on the swapfile: allocated, free, and bad.
> > > * the zswap tree.
> > >
> > > So, in terms of additional memory overhead:
> > > * For zswap entries, the added memory overhead is rather minimal. The
> > >   new indirection pointer neatly replaces the existing zswap tree.
> > >   We really only incur less than one word of overhead for swap count
> > >   blow up (since we no longer use swap continuation) and the swap type.
> > > * For physical swap entries, the new design will impose fewer than 3 words
> > >   memory overhead. However, as noted above this overhead is only for
> > >   actively used swap entries, whereas in the current design the overhead is
> > >   static (including the swap cgroup array for example).
> > >
> > >   The primary victim of this overhead will be zram users. However, as
> > >   zswap now no longer takes up disk space, zram users can consider
> > >   switching to zswap (which, as a bonus, has a lot of useful features
> > >   out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> > >   LRU-ordering writeback, etc.).
> > >
> > > For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> > > 8,388,608 swap entries), and we use zswap.
> > >
> > > 0% usage, or 0 entries: 0.00 MB
> > > * Old design total overhead: 25.00 MB
> > > * Vswap total overhead: 0.00 MB
> > >
> > > 25% usage, or 2,097,152 entries:
> > > * Old design total overhead: 57.00 MB
> > > * Vswap total overhead: 48.25 MB
> > >
> > > 50% usage, or 4,194,304 entries:
> > > * Old design total overhead: 89.00 MB
> > > * Vswap total overhead: 96.50 MB
> > >
> > > 75% usage, or 6,291,456 entries:
> > > * Old design total overhead: 121.00 MB
> > > * Vswap total overhead: 144.75 MB
> > >
> > > 100% usage, or 8,388,608 entries:
> > > * Old design total overhead: 153.00 MB
> > > * Vswap total overhead: 193.00 MB
> > >
> > > So even in the worst case scenario for virtual swap, i.e when we
> > > somehow have an oracle to correctly size the swapfile for zswap
> > > pool to 32 GB, the added overhead is only 40 MB, which is a mere
> > > 0.12% of the total swapfile :)
> > >
> > > In practice, the overhead will be closer to the 50-75% usage case, as
> > > systems tend to leave swap headroom for pathological events or sudden
> > > spikes in memory requirements. The added overhead in these cases are
> > > practically neglible. And in deployments where swapfiles for zswap
> > > are previously sparsely used, switching over to virtual swap will
> > > actually reduce memory overhead.
> > >
> > > Doing the same math for the disk swap, which is the worst case for
> > > virtual swap in terms of swap backends:
> > >
> > > 0% usage, or 0 entries: 0.00 MB
> > > * Old design total overhead: 25.00 MB
> > > * Vswap total overhead: 2.00 MB
> > >
> > > 25% usage, or 2,097,152 entries:
> > > * Old design total overhead: 41.00 MB
> > > * Vswap total overhead: 66.25 MB
> > >
> > > 50% usage, or 4,194,304 entries:
> > > * Old design total overhead: 57.00 MB
> > > * Vswap total overhead: 130.50 MB
> > >
> > > 75% usage, or 6,291,456 entries:
> > > * Old design total overhead: 73.00 MB
> > > * Vswap total overhead: 194.75 MB
> > >
> > > 100% usage, or 8,388,608 entries:
> > > * Old design total overhead: 89.00 MB
> > > * Vswap total overhead: 259.00 MB
> > >
> > > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > > again in the worst case when we have a sizing oracle.
Re: [PATCH v3 00/20] Virtual Swap Space
Posted by Johannes Weiner 7 hours ago
Hi Chris,

On Tue, Feb 10, 2026 at 01:24:03PM -0800, Chris Li wrote:
> Hi Johannes,
> On Mon, Feb 9, 2026 at 6:36 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Here is the more detailed breakdown:
> 
> It seems you did not finish your sentence before sending your reply.

I did. I trimmed the quote of Nhat's cover letter to the parts
addressing your questions. If you use gmail, click the three dots:

> > > > The size of the virtual swap descriptor is 24 bytes. Note that this is
> > > > not all "new" overhead, as the swap descriptor will replace:
> > > > * the swap_cgroup arrays (one per swap type) in the old design, which
> > > >   is a massive source of static memory overhead. With the new design,
> > > >   it is only allocated for used clusters.
> > > > * the swap tables, which holds the swap cache and workingset shadows.
> > > > * the zeromap bitmap, which is a bitmap of physical swap slots to
> > > >   indicate whether the swapped out page is zero-filled or not.
> > > > * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> > > >   one for allocated slots, and one for bad slots, representing 3 possible
> > > >   states of a slot on the swapfile: allocated, free, and bad.
> > > > * the zswap tree.
> > > >
> > > > So, in terms of additional memory overhead:
> > > > * For zswap entries, the added memory overhead is rather minimal. The
> > > >   new indirection pointer neatly replaces the existing zswap tree.
> > > >   We really only incur less than one word of overhead for swap count
> > > >   blow up (since we no longer use swap continuation) and the swap type.
> > > > * For physical swap entries, the new design will impose fewer than 3 words
> > > >   memory overhead. However, as noted above this overhead is only for
> > > >   actively used swap entries, whereas in the current design the overhead is
> > > >   static (including the swap cgroup array for example).
> > > >
> > > >   The primary victim of this overhead will be zram users. However, as
> > > >   zswap now no longer takes up disk space, zram users can consider
> > > >   switching to zswap (which, as a bonus, has a lot of useful features
> > > >   out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> > > >   LRU-ordering writeback, etc.).
> > > >
> > > > For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> > > > 8,388,608 swap entries), and we use zswap.
> > > >
> > > > 0% usage, or 0 entries: 0.00 MB
> > > > * Old design total overhead: 25.00 MB
> > > > * Vswap total overhead: 0.00 MB
> > > >
> > > > 25% usage, or 2,097,152 entries:
> > > > * Old design total overhead: 57.00 MB
> > > > * Vswap total overhead: 48.25 MB
> > > >
> > > > 50% usage, or 4,194,304 entries:
> > > > * Old design total overhead: 89.00 MB
> > > > * Vswap total overhead: 96.50 MB
> > > >
> > > > 75% usage, or 6,291,456 entries:
> > > > * Old design total overhead: 121.00 MB
> > > > * Vswap total overhead: 144.75 MB
> > > >
> > > > 100% usage, or 8,388,608 entries:
> > > > * Old design total overhead: 153.00 MB
> > > > * Vswap total overhead: 193.00 MB
> > > >
> > > > So even in the worst case scenario for virtual swap, i.e when we
> > > > somehow have an oracle to correctly size the swapfile for zswap
> > > > pool to 32 GB, the added overhead is only 40 MB, which is a mere
> > > > 0.12% of the total swapfile :)
> > > >
> > > > In practice, the overhead will be closer to the 50-75% usage case, as
> > > > systems tend to leave swap headroom for pathological events or sudden
> > > > spikes in memory requirements. The added overhead in these cases are
> > > > practically neglible. And in deployments where swapfiles for zswap
> > > > are previously sparsely used, switching over to virtual swap will
> > > > actually reduce memory overhead.
> > > >
> > > > Doing the same math for the disk swap, which is the worst case for
> > > > virtual swap in terms of swap backends:
> > > >
> > > > 0% usage, or 0 entries: 0.00 MB
> > > > * Old design total overhead: 25.00 MB
> > > > * Vswap total overhead: 2.00 MB
> > > >
> > > > 25% usage, or 2,097,152 entries:
> > > > * Old design total overhead: 41.00 MB
> > > > * Vswap total overhead: 66.25 MB
> > > >
> > > > 50% usage, or 4,194,304 entries:
> > > > * Old design total overhead: 57.00 MB
> > > > * Vswap total overhead: 130.50 MB
> > > >
> > > > 75% usage, or 6,291,456 entries:
> > > > * Old design total overhead: 73.00 MB
> > > > * Vswap total overhead: 194.75 MB
> > > >
> > > > 100% usage, or 8,388,608 entries:
> > > > * Old design total overhead: 89.00 MB
> > > > * Vswap total overhead: 259.00 MB
> > > >
> > > > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > > > again in the worst case when we have a sizing oracle.
[PATCH v3 00/20] Virtual Swap Space
Posted by Nhat Pham 2 days, 8 hours ago
My sincerest apologies - it seems like the cover letter (and just the
cover letter) failed to be sent out, for some reason. I'm trying to figure
out what happened - it works when I send the entire patch series to
myself...

Anyway, resending this (in-reply-to patch 1 of the series):

Changelog:
* RFC v2 -> v3:
    * Implement a cluster-based allocation algorithm for virtual swap
      slots, inspired by Kairui Song and Chris Li's implementation, as
      well as Johannes Weiner's suggestions. This eliminates the lock
      contention issues on the virtual swap layer.
    * Re-use swap table for the reverse mapping.
    * Remove CONFIG_VIRTUAL_SWAP.
    * Reduce the size of the swap descriptor from 48 bytes to 24
      bytes, i.e. another 50% reduction in memory overhead from v2.
    * Remove swap cache and zswap tree and use the swap descriptor
      for this.
    * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
      (one for allocated slots, and one for bad slots).
    * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
    * Update the cover letter to include new benchmark results and a
      discussion of overhead in various cases.
* RFC v1 -> RFC v2:
    * Use a single atomic type (swap_refs) for reference counting
      purpose. This brings the size of the swap descriptor from 64 B
      down to 48 B (25% reduction). Suggested by Yosry Ahmed.
    * Zeromap bitmap is removed in the virtual swap implementation.
      This saves one bit per physical swapfile slot.
    * Rearrange the patches and the code change to make things more
      reviewable. Suggested by Johannes Weiner.
    * Update the cover letter a bit.

This patch series implements the virtual swap space idea, based on Yosry's
proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
inputs from Johannes Weiner. The same idea (with different
implementation details) has been floated by Rik van Riel since at least
2011 (see [8]).

This patch series is based on 6.19. There are a couple more
swap-related changes in the mm-stable branch that I would need to
coordinate with, but I would like to send this out as an update, to show
that the lock contention issues that plagued earlier versions have been
resolved and performance on the kernel build benchmark is now on-par with
baseline. Furthermore, memory overhead has been substantially reduced
compared to the last RFC version.


I. Motivation

Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index to swap data structures, such
as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
backing slot in this way is performant and efficient when swap is purely
disk space and swapoff is rare.

However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is arguably the central shortcoming of
zswap:
* In deployments where no disk space can be afforded for swap (such as
  mobile and embedded devices), users cannot adopt zswap, and are forced
  to use zram. This is confusing for users, and creates extra burdens
  for developers, who have to develop and maintain similar features for
  two separate swap backends (writeback, cgroup charging, THP support,
  etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
  we have swapfiles on the order of tens to hundreds of GBs, which are
  mostly unused and only exist to enable zswap usage and zero-filled
  pages swap optimizations.
* Tying zswap (and more generally, other in-memory swap backends) to
  the current physical swapfile infrastructure makes zswap implicitly
  statically sized. This does not make sense, as unlike disk swap, in
  which we consume a limited resource (disk space or swapfile space) to
  save another resource (memory), zswap consumes the same resource it is
  saving (memory). The more we zswap, the more memory we have available,
  not less. We are not rationing a limited resource when we limit
  the size of the zswap pool, but rather we are capping the resource
  (memory) saving potential of zswap. Under memory pressure, using
  more zswap is almost always better than the alternative (disk IOs, or
  even worse, OOMs), and dynamically sizing the zswap pool on demand
  allows the system to flexibly respond to these precarious scenarios.
* Operationally, statically provisioning the swapfile for zswap poses
  significant challenges, because the sysadmin has to prescribe how
  much swap is needed a priori, for each combination of
  (memory size x disk space x workload usage). It is even more
  complicated when we take into account the variance of memory
  compression, which changes the reclaim dynamics (and as a result,
  swap space size requirement). The problem is further exacerbated for
  users who rely on swap utilization (and exhaustion) as an OOM signal.

  All of these factors make it very difficult to configure the swapfile
  for zswap: too small of a swapfile and we risk preventable OOMs and
  limit the memory saving potentials of zswap; too big of a swapfile
  and we waste disk space and memory due to swap metadata overhead.
  This dilemma becomes more drastic in high memory systems, which can
  have up to TBs worth of memory.

Past attempts to decouple disk and compressed swap backends, namely the
ghost swapfile approach (see [13]), as well as the alternative
compressed swap backend zram, have mainly focused on eliminating the
disk space usage of compressed backends. We want a solution that not
only tackles that same problem, but also makes the swap space dynamic,
to maximize the memory saving potential while reducing operational and
static memory overhead.

Finally, any swap redesign should support efficient backend transfer,
i.e. without having to perform an expensive page table walk to
update all the PTEs that refer to the swap entry:
* The main motivation for this requirement is zswap writeback. To quote
  Johannes (from [14]): "Combining compression with disk swap is
  extremely powerful, because it dramatically reduces the worst aspects
  of both: it reduces the memory footprint of compression by shedding
  the coldest data to disk; it reduces the IO latencies and flash wear
  of disk swap through the writeback cache. In practice, this reduces
  *average event rates of the entire reclaim/paging/IO stack*."
* Another motivation is to simplify swapoff, which is both complicated
  and expensive in the current design, precisely because we are storing
an encoding of the backend's positional information in the page table,
which thus requires a full page table walk to remove these references.


II. High Level Design Overview

To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
“virtualize” the swap space: swap clients will work with a dynamically
allocated virtual swap slot, storing it in page table entries, and
using it to index into various swap-related data structures. The
backing storage is decoupled from the virtual swap slot, and the newly
introduced layer will “resolve” the virtual swap slot to the actual
storage. This layer also manages other metadata of the swap entry, such
as its lifetime information (swap count), via a dynamically allocated,
per-swap-entry descriptor:

struct swp_desc {
        union {
                swp_slot_t         slot;                 /*     0     8 */
                struct zswap_entry * zswap_entry;        /*     0     8 */
        };                                               /*     0     8 */
        union {
                struct folio *     swap_cache;           /*     8     8 */
                void *             shadow;               /*     8     8 */
        };                                               /*     8     8 */
        unsigned int               swap_count;           /*    16     4 */
        unsigned short             memcgid:16;           /*    20: 0  2 */
        bool                       in_swapcache:1;       /*    22: 0  1 */

        /* Bitfield combined with previous fields */

        enum swap_type             type:2;               /*    20:17  4 */

        /* size: 24, cachelines: 1, members: 6 */
        /* bit_padding: 13 bits */
        /* last cacheline: 24 bytes */
};

(output from pahole).
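
To make the indirection concrete, here is a minimal illustrative
sketch of how a swap-in path could dispatch on the descriptor's type
field, building on the struct above. The enum values and the helpers
called in the switch are hypothetical, named purely for illustration;
the actual naming and handling in the series may differ.

#include <linux/highmem.h>
#include <linux/errno.h>

enum swap_type {                /* hypothetical names for the 4 backends */
        VSWAP_SLOT,             /* slot on a physical swap device */
        VSWAP_ZSWAP,            /* compressed copy in the zswap pool */
        VSWAP_ZERO,             /* zero-filled page, nothing stored */
        VSWAP_FOLIO,            /* in-memory folio (e.g. after swapoff) */
};

static int vswap_load(struct swp_desc *desc, struct folio *folio)
{
        switch (desc->type) {
        case VSWAP_SLOT:        /* hypothetical: read from the physical slot */
                return read_folio_from_slot(desc->slot, folio);
        case VSWAP_ZSWAP:       /* hypothetical: decompress the zswap entry */
                return load_folio_from_zswap(desc->zswap_entry, folio);
        case VSWAP_ZERO:        /* nothing stored; refill with zeroes */
                folio_zero_range(folio, 0, folio_size(folio));
                return 0;
        case VSWAP_FOLIO:       /* hypothetical: copy from the in-memory folio */
                return copy_from_folio(desc->swap_cache, folio);
        }
        return -EINVAL;
}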

This design allows us to:
* Decouple zswap (and zeromapped swap entry) from backing swapfile:
  simply associate the virtual swap slot with one of the supported
  backends: a zswap entry, a zero-filled swap page, a slot on the
  swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and
  have the virtual swap slot point to the page instead of the on-disk
  physical swap slot. No need to perform any page table walking.

The size of the virtual swap descriptor is 24 bytes. Note that this is
not all "new" overhead, as the swap descriptor will replace:
* the swap_cgroup arrays (one per swap type) in the old design, which
  is a massive source of static memory overhead. With the new design,
  it is only allocated for used clusters.
* the swap tables, which holds the swap cache and workingset shadows.
* the zeromap bitmap, which is a bitmap of physical swap slots to
  indicate whether the swapped out page is zero-filled or not.
* a huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
  one for allocated slots and one for bad slots, representing the 3 possible
  states of a slot on the swapfile: allocated, free, and bad (see the
  sketch after this list).
* the zswap tree.
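
As a toy illustration of the two-bitmap encoding mentioned in the last
item above (the names below are made up for this sketch and are not the
helpers introduced by the series):

#include <linux/bitops.h>

enum phys_slot_state { SLOT_FREE, SLOT_ALLOCATED, SLOT_BAD };

/* One bit per physical slot in each bitmap; neither bit set means free. */
static enum phys_slot_state phys_slot_get_state(const unsigned long *allocated,
                                                const unsigned long *bad,
                                                unsigned long offset)
{
        if (test_bit(offset, bad))
                return SLOT_BAD;
        if (test_bit(offset, allocated))
                return SLOT_ALLOCATED;
        return SLOT_FREE;
}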

So, in terms of additional memory overhead:
* For zswap entries, the added memory overhead is rather minimal. The
  new indirection pointer neatly replaces the existing zswap tree.
  We really only incur less than one word of overhead for swap count
  blow up (since we no longer use swap continuation) and the swap type.
* For physical swap entries, the new design will impose fewer than 3 words
  of memory overhead. However, as noted above, this overhead is only for
  actively used swap entries, whereas in the current design the overhead is
  static (including the swap cgroup array for example).

  The primary victim of this overhead will be zram users. However, as
  zswap now no longer takes up disk space, zram users can consider
  switching to zswap (which, as a bonus, has a lot of useful features
  out of the box, such as cgroup tracking, dynamic zswap pool sizing,
  LRU-ordering writeback, etc.).

For a more concrete example, suppose we have a 32 GB swapfile (i.e.
8,388,608 swap entries), and we use zswap.

0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 0.00 MB

25% usage, or 2,097,152 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 48.25 MB

50% usage, or 4,194,304 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 96.50 MB

75% usage, or 6,291,456 entries:
* Old design total overhead: 121.00 MB
* Vswap total overhead: 144.75 MB

100% usage, or 8,388,608 entries:
* Old design total overhead: 153.00 MB
* Vswap total overhead: 193.00 MB

So even in the worst case scenario for virtual swap, i.e. when we
somehow have an oracle to correctly size the swapfile for the zswap
pool to 32 GB, the added overhead is only 40 MB, which is a mere
0.12% of the total swapfile :)

In practice, the overhead will be closer to the 50-75% usage case, as
systems tend to leave swap headroom for pathological events or sudden
spikes in memory requirements. The added overhead in these cases is
practically negligible. And in deployments where swapfiles for zswap
are previously sparsely used, switching over to virtual swap will
actually reduce memory overhead.

Doing the same math for the disk swap, which is the worst case for
virtual swap in terms of swap backends:

0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 2.00 MB

25% usage, or 2,097,152 entries:
* Old design total overhead: 41.00 MB
* Vswap total overhead: 66.25 MB

50% usage, or 4,194,304 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 130.50 MB

75% usage, or 6,291,456 entries:
* Old design total overhead: 73.00 MB
* Vswap total overhead: 194.75 MB

100% usage, or 8,388,608 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 259.00 MB

The added overhead is 170 MB, which is 0.5% of the total swapfile size,
again in the worst case when we have a sizing oracle.
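
As a back-of-envelope reading of the two tables above (these are just
the per-entry slopes implied by the numbers already listed, not new
measurements):

* Zswap backing: old design ~ 25 MB static + 16 bytes per used entry;
  vswap ~ 24.1 bytes per used entry, with no static part.
* Disk backing: old design ~ 25 MB static + 8 bytes per used entry;
  vswap ~ 2 MB static (the two per-slot bitmaps, 2 bits per slot) plus
  ~32.1 bytes per used entry, which is roughly the 24-byte descriptor
  plus an 8-byte reverse-mapping (swap table) entry per physical slot.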

Please see the attached patches for more implementation details.


III. Usage and Benchmarking

This patch series introduces no new syscalls or userspace APIs. Existing
userspace setups will work as-is, except we no longer have to create a
swapfile or set memory.swap.max if we want to use zswap, as zswap is no
longer tied to physical swap. The zswap pool will be automatically and
dynamically sized based on memory usage and reclaim dynamics.

To measure the performance of the new implementation, I have run the
following benchmarks:

1. Kernel building: 52 workers (one per processor), memory.max = 3G.

Using zswap as the backend:

Baseline:
real: mean: 185.2s, stdev: 0.93s
sys: mean: 683.7s, stdev: 33.77s

Vswap:
real: mean: 184.88s, stdev: 0.57s
sys: mean: 675.14s, stdev: 32.8s

We actually see a slight improvement in systime (by 1.5%) :) This is
likely because we no longer have to perform swap charging for zswap
entries, and virtual swap allocator is simpler than that of physical
swap.

Using SSD swap as the backend:

Baseline:
real: mean: 200.3s, stdev: 2.33s
sys: mean: 489.88s, stdev: 9.62s

Vswap:
real: mean: 201.47s, stdev: 2.98s
sys: mean: 487.36s, stdev: 5.53s

The performance is neck and neck.


IV. Future Use Cases

While the patch series focuses on two applications (decoupling swap
backends and swapoff optimization/simplification), this new,
future-proof design also allows us to implement new swap features more
easily and efficiently:

* Multi-tier swapping (as mentioned in [5]), with transparent
  transferring (promotion/demotion) of pages across tiers (see [8] and
  [9]). Similar to swapoff, with the old design we would need to
  perform the expensive page table walk.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
  Huang in [6]).
* Mixed backing THP swapin (see [7]): Once you have pinned down the
  backing stores of a THP, you can dispatch each range of subpages
  to the appropriate backend swapin handler.
* Swapping a folio out with discontiguous physical swap slots
  (see [10]).
* Zswap writeback optimization: The current architecture pre-reserves
  physical swap space for pages when they enter the zswap pool, giving
  the kernel no flexibility at writeback time. With the virtual swap
  implementation, the backends are decoupled, and physical swap space
  is allocated on-demand at writeback time, at which point we can make
  much smarter decisions: we can batch multiple zswap writeback
  operations into a single IO request, allocating contiguous physical
  swap slots for that request. We can even perform compressed writeback
  (i.e. writing these pages without decompressing them) (see [12]).


V. References

[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
[10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
[11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
[12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
[13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
[14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/

Nhat Pham (20):
  mm/swap: decouple swap cache from physical swap infrastructure
  swap: rearrange the swap header file
  mm: swap: add an abstract API for locking out swapoff
  zswap: add new helpers for zswap entry operations
  mm/swap: add a new function to check if a swap entry is in swap
    cached.
  mm: swap: add a separate type for physical swap slots
  mm: create scaffolds for the new virtual swap implementation
  zswap: prepare zswap for swap virtualization
  mm: swap: allocate a virtual swap slot for each swapped out page
  swap: move swap cache to virtual swap descriptor
  zswap: move zswap entry management to the virtual swap descriptor
  swap: implement the swap_cgroup API using virtual swap
  swap: manage swap entry lifecycle at the virtual swap layer
  mm: swap: decouple virtual swap slot from backing store
  zswap: do not start zswap shrinker if there is no physical swap slots
  swap: do not unnecesarily pin readahead swap entries
  swapfile: remove zeromap bitmap
  memcg: swap: only charge physical swap slots
  swap: simplify swapoff using virtual swap
  swapfile: replace the swap map with bitmaps

 Documentation/mm/swap-table.rst |   69 --
 MAINTAINERS                     |    2 +
 include/linux/cpuhotplug.h      |    1 +
 include/linux/mm_types.h        |   16 +
 include/linux/shmem_fs.h        |    7 +-
 include/linux/swap.h            |  135 ++-
 include/linux/swap_cgroup.h     |   13 -
 include/linux/swapops.h         |   25 +
 include/linux/zswap.h           |   17 +-
 kernel/power/swap.c             |    6 +-
 mm/Makefile                     |    5 +-
 mm/huge_memory.c                |   11 +-
 mm/internal.h                   |   12 +-
 mm/memcontrol-v1.c              |    6 +
 mm/memcontrol.c                 |  142 ++-
 mm/memory.c                     |  101 +-
 mm/migrate.c                    |   13 +-
 mm/mincore.c                    |   15 +-
 mm/page_io.c                    |   83 +-
 mm/shmem.c                      |  215 +---
 mm/swap.h                       |  157 +--
 mm/swap_cgroup.c                |  172 ---
 mm/swap_state.c                 |  306 +----
 mm/swap_table.h                 |   78 +-
 mm/swapfile.c                   | 1518 ++++-------------------
 mm/userfaultfd.c                |   18 +-
 mm/vmscan.c                     |   28 +-
 mm/vswap.c                      | 2025 +++++++++++++++++++++++++++++++
 mm/zswap.c                      |  142 +--
 29 files changed, 2853 insertions(+), 2485 deletions(-)
 delete mode 100644 Documentation/mm/swap-table.rst
 delete mode 100644 mm/swap_cgroup.c
 create mode 100644 mm/vswap.c


base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
-- 
2.47.3
Re: [PATCH v3 00/20] Virtual Swap Space
Posted by Kairui Song 12 hours ago
On Mon, Feb 9, 2026 at 7:57 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Anyway, resending this (in-reply-to patch 1 of the series):

Hi Nhat,

> Changelog:
> * RFC v2 -> v3:
>     * Implement a cluster-based allocation algorithm for virtual swap
>       slots, inspired by Kairui Song and Chris Li's implementation, as
>       well as Johannes Weiner's suggestions. This eliminates the lock
>           contention issues on the virtual swap layer.
>     * Re-use swap table for the reverse mapping.
>     * Remove CONFIG_VIRTUAL_SWAP.

I really do think we better make this optional, not a replacement or
mandatory. There are many hard to evaluate effects as this
fundamentally changes the swap workflow with a lot of behavior changes
at once, e.g. it seems the folio will be reactivated instead of
split if the physical swap device is fragmented; the slot is allocated
at IO time and not at unmap time; and maybe many others. Just like zswap
is optional. Some common workloads would see an obvious performance or
memory usage regression following this design; see below.

>     * Reducing the size of the swap descriptor from 48 bytes to 24
>       bytes, i.e another 50% reduction in memory overhead from v2.

Honestly if you keep reducing that you might just end up
reimplementing the swap table format :)

> This patch series is based on 6.19. There are a couple more
> swap-related changes in the mm-stable branch that I would need to
> coordinate with, but I would like to send this out as an update, to show
> that the lock contention issues that plagued earlier versions have been
> resolved and performance on the kernel build benchmark is now on-par with
> baseline. Furthermore, memory overhead has been substantially reduced
> compared to the last RFC version.

Thanks for the effort!

> * Operationally, static provisioning the swapfile for zswap pose
>   significant challenges, because the sysadmin has to prescribe how
>   much swap is needed a priori, for each combination of
>   (memory size x disk space x workload usage). It is even more
>   complicated when we take into account the variance of memory
>   compression, which changes the reclaim dynamics (and as a result,
>   swap space size requirement). The problem is further exarcebated for
>   users who rely on swap utilization (and exhaustion) as an OOM signal.

So I thought about it again, this one seems not to be an issue. In
most cases, having a 1:1 virtual swap setup is enough, and very soon
the static overhead will be really trivial. There won't even be any
fragmentation issue either, since if the physical memory size is
identical to swap space, then you can always find a matching part. And
besides, dynamic growth of swap files is actually very doable and
useful; it would make physical swap files adjustable at runtime, so
users won't need to waste a swap type id to extend physical swap
space.

> * Another motivation is to simplify swapoff, which is both complicated
>   and expensive in the current design, precisely because we are storing
>   an encoding of the backend positional information in the page table,
>   and thus requires a full page table walk to remove these references.

The swapoff here is not really a clean swapoff: minor faults will
still be triggered afterwards, and metadata is not released. So this
new swapoff cannot really guarantee the same performance as the old
swapoff. On the other hand, we can already just read everything
into the swap cache and skip the page table walk with the older
design too; that's just not a clean swapoff.

> struct swp_desc {
>         union {
>                 swp_slot_t         slot;                 /*     0     8 */
>                 struct zswap_entry * zswap_entry;        /*     0     8 */
>         };                                               /*     0     8 */
>         union {
>                 struct folio *     swap_cache;           /*     8     8 */
>                 void *             shadow;               /*     8     8 */
>         };                                               /*     8     8 */
>         unsigned int               swap_count;           /*    16     4 */
>         unsigned short             memcgid:16;           /*    20: 0  2 */
>         bool                       in_swapcache:1;       /*    22: 0  1 */

A standalone bit for swapcache looks like the old SWAP_HAS_CACHE, which
caused many issues...

>
>         /* Bitfield combined with previous fields */
>
>         enum swap_type             type:2;               /*    20:17  4 */
>
>         /* size: 24, cachelines: 1, members: 6 */
>         /* bit_padding: 13 bits */
>         /* last cacheline: 24 bytes */
> };

Having a struct larger than 8 bytes means you can't load it
atomically, which limits your lock design. About a year ago Chris
shared with me an idea to use CAS on swap entries once they are small
and unified; that's why the swap table uses atomic_long_t and has
helpers like __swap_table_xchg, though we are not making good use of
them yet. Meanwhile, we have already consolidated the lock scope to the
folio in many places, so holding the folio lock and then doing the CAS
without touching the cluster lock at all might be feasible soon for
many swap operations.
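
To illustrate the shape of that operation (generic kernel atomics
only, not the actual __swap_table_* helpers):

#include <linux/atomic.h>

/*
 * With an 8-byte per-slot value, an update can be a single
 * compare-and-swap on the table slot, with no cluster lock held.
 */
static bool slot_cas(atomic_long_t *slot, long old, long new)
{
        return atomic_long_cmpxchg(slot, old, new) == old;
}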

E.g. we already have a cluster-lockless version of swap check in swap table p3:
https://lore.kernel.org/linux-mm/20260128-swap-table-p3-v2-11-fe0b67ef0215@tencent.com/

That might also greatly simplify the locking for IO and improve migration
performance between swap devices.

> Doing the same math for the disk swap, which is the worst case for
> virtual swap in terms of swap backends:

Actually this worst case is a very common case... see below.

> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 2.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 41.00 MB
> * Vswap total overhead: 66.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 130.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 73.00 MB
> * Vswap total overhead: 194.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 259.00 MB
>
> The added overhead is 170MB, which is 0.5% of the total swapfile size,
> again in the worst case when we have a sizing oracle.

Hmm... With the swap table we will have a stable 8 bytes per slot in
all cases: in current mm-stable we use 11 bytes (8 bytes dynamic and 3
bytes static), and in the posted p3 we already get 10 bytes (8 bytes
dynamic and 2 bytes static). P4, or a follow-up, was already demonstrated
last year with working code, and it makes everything dynamic
(8 bytes, fully dynamic; I'll rebase and send that once p3 is merged).

So with mm-stable and follow up, for 32G swap device:

0% usage, or 0/8,388,608 entries: 0.00 MB
* mm-stable total overhead: 25.50 MB (which is swap table p2)
* swap-table p3 overhead: 17.50 MB
* swap-table p4 overhead: 0.50 MB
* Vswap total overhead: 2.00 MB

100% usage, or 8,388,608/8,388,608 entries:
* mm-stable total overhead: 89.5 MB (which is swap table p2)
* swap-table p3 overhead: 81.5 MB
* swap-table p4 overhead: 64.5 MB
* Vswap total overhead: 259.00 MB
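
(Reading the deltas: the dynamic part is 8 bytes x 8,388,608 slots =
64 MB for all three swap table variants, leaving 25.5 / 17.5 / 0.5 MB
of static overhead respectively; vswap's delta is 257 MB, i.e. about
32.1 bytes per used slot on top of 2 MB static.)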

That's 3-4 times more memory usage, quite a trade-off. With a
128G device, which is not something rare, it would be 1G of memory.
Swap table p3 / p4 would be about 320M / 256M, and we do have a way to
cut that down to <1 byte or 3 bytes per page with swap table
compaction, which was discussed at LSFMM last year, or even 1 bit,
which was once suggested by Baolin; that would make it much smaller,
down to <24MB (this is just an idea for now, but the compaction is
very doable as we already have "LRU"s for swap clusters in the swap
allocator).

I don't think it looks good as a mandatory overhead. We do have a huge
user base of swap over many different kinds of devices; it was not
long ago that two new kernel bugzilla issues or bug reports were sent to
the mailing list about swap over disk, and I'm still trying to investigate
one of them, which seems to actually be a page LRU issue and not a swap
problem... OK, a little off topic. Anyway, I'm not saying that we don't
want more features; as I mentioned above, it would be better if this
could be optional and minimal. See more test info below.

> We actually see a slight improvement in systime (by 1.5%) :) This is
> likely because we no longer have to perform swap charging for zswap
> entries, and virtual swap allocator is simpler than that of physical
> swap.

Congrats! Yeah, I guess that's because vswap has a smaller lock scope
than zswap with a reduced callpath?

>
> Using SSD swap as the backend:
>
> Baseline:
> real: mean: 200.3s, stdev: 2.33s
> sys: mean: 489.88s, stdev: 9.62s
>
> Vswap:
> real: mean: 201.47s, stdev: 2.98s
> sys: mean: 487.36s, stdev: 5.53s
>
> The performance is neck-to-neck.

Thanks for the bench, but please also test with global pressure.
One mistake I made when working on the prototype of swap tables was
only focusing on cgroup memory pressure, which is really not how
everyone uses Linux; that's why I spent a long time reworking the RCU
allocation / freeing of swap table pages so there won't be any
regression even on low-end machines and under global pressure. That's
kind of critical for devices like Android.

I did an overnight bench on this with global pressure, comparing against
mainline 6.19 and swap table p3 (I include such a test for each swap
table series; p2 / p3 are close, so I just rebased the latest p3 on top of
your base commit to be fair, and that's easier for me too), and it
doesn't look that good.

Test machine setup for vm-scalability:
# lscpu | grep "Model name"
Model name:          AMD EPYC 7K62 48-Core Processor

# free -m
              total        used        free      shared  buff/cache   available
Mem:          31582         909       26388           8        4284       29989
Swap:         40959          41       40918

The swap setup follows the recommendation from Huang
(https://lore.kernel.org/linux-mm/87ed474kvx.fsf@yhuang6-desk2.ccr.corp.intel.com/).

Test (average of 18 test run):
vm-scalability/usemem --init-time -O -y -x -n 1 56G

6.19:
Throughput: 618.49 MB/s (stdev 31.3)
Free latency: 5754780.50us (stdev 69542.7)

swap-table-p3 (3.8%, 0.5% better):
Throughput: 642.02 MB/s (stdev 25.1)
Free latency: 5728544.16us (stdev 48592.51)

vswap (3.2%, 244% worse):
Throughput: 598.67 MB/s (stdev 25.1)
Free latency: 13987175.66us (stdev 125148.57)

That's a huge regression with freeing. I have a vm-scalability test
matrix; not every setup has such a significant >200% regression, but on
average the freeing time is at least about 15 - 50% slower (for
example, with /data/vm-scalability/usemem --init-time -O -y -x -n 32 1536M
the regression is about 2583221.62us vs 2153735.59us). Throughput is
all lower too.

Freeing is important, as it was causing many problems before; it's the
reason why we had a swap slot freeing cache years ago (we later
removed it, since the freeing cache caused more problems and the swap
allocator already improved freeing beyond what the cache offered). People
even tried to optimize that:
https://lore.kernel.org/linux-mm/20250909065349.574894-1-liulei.rjpt@vivo.com/
(This seems to be an already-fixed downstream issue, solved by the swap
allocator or the swap table). Some workloads might amplify the free latency
greatly and cause serious lags, as shown above.

Another thing I personally care about is how swap works on my daily
laptop :) Building the kernel in a 2G test VM using NVMe as swap is a
very practical workload I do every day, and the result is also not good
(average of 8 test runs, make -j12):
#free -m
               total        used        free      shared  buff/cache   available
Mem:            1465         216        1026           0         300        1248
Swap:           4095          36        4059

6.19 systime:
109.6s
swap-table p3:
108.9s
vswap systime:
118.7s

On a build server, it's also slower (make -j48 with a 4G memory VM and
NVMe swap, average of 10 test runs):
# free -m
               total        used        free      shared  buff/cache   available
Mem:            3877        1444        2019         737        1376        2432
Swap:          32767        1886       30881

# lscpu | grep "Model name"
Model name:                              Intel(R) Xeon(R) Platinum
8255C CPU @ 2.50GHz

6.19 systime:
435.601s
swap-table p3:
432.793s
vswap systime:
455.652s

In conclusion, it's about 4.3 - 8.3% slower for common workloads under
global pressure, and there is an up-to-200% regression on freeing. ZRAM
shows an even larger workload regression, but I'll skip that part since
your series is focusing on zswap now. Redis is also ~20% slower
compared to mm-stable (327515.00 RPS vs 405827.81 RPS); that's mostly
due to swap-table-p2 in mm-stable, so I didn't do further comparisons.

So if that's not a bug in this series, I think the double free or the
decoupling of swap / underlying slots might be the cause of the
freeing regression shown above. That's really a serious issue, and the
global pressure case might be critical too, as the metadata is much
larger and is already causing regressions for very common workloads.
Low-end users could hit the min watermark easily and could have
serious jitters or allocation failures.

That's part of the issues I've found, so I really do think we need a
flexible way to implement this rather than a mandatory layer. After
swap table p4 we should be able to figure out a way to fit all needs,
with a cleanly defined set of swap APIs, metadata and layers, as was
discussed at LSFMM last year.
Re: [PATCH v3 00/20] Virtual Swap Space
Posted by Chris Li 8 hours ago
Hi Kairui,

Thank you so much for the performance test.

I will only comment on the performance number in this sub email thread.

On Tue, Feb 10, 2026 at 10:00 AM Kairui Song <ryncsn@gmail.com> wrote:
> Actually this worst case is a very common case... see below.
>
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 2.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 41.00 MB
> > * Vswap total overhead: 66.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 130.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 73.00 MB
> > * Vswap total overhead: 194.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 259.00 MB
> >
> > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > again in the worst case when we have a sizing oracle.
>
> Hmm.. With the swap table we will have a stable 8 bytes per slot in
> all cases, in current mm-stable we use 11 bytes (8 bytes dyn and 3
> bytes static), and in the posted p3 we already get 10 bytes (8 bytes
> dyn and 2 bytes static). P4 or follow up was already demonstrated
> last year with working code, and it makes everything dynamic
> (8 bytes fully dyn, I'll rebase and send that once p3 is merged).
>
> So with mm-stable and follow up, for 32G swap device:
>
> 0% usage, or 0/8,388,608 entries: 0.00 MB
> * mm-stable total overhead: 25.50 MB (which is swap table p2)
> * swap-table p3 overhead: 17.50 MB
> * swap-table p4 overhead: 0.50 MB
> * Vswap total overhead: 2.00 MB
>
> 100% usage, or 8,388,608/8,388,608 entries:
> * mm-stable total overhead: 89.5 MB (which is swap table p2)
> * swap-table p3 overhead: 81.5 MB
> * swap-table p4 overhead: 64.5 MB
> * Vswap total overhead: 259.00 MB
>
> That 3 - 4 times more memory usage, quite a trade off. With a

Agree. My main complaint about VS has been the per-swap-entry
metadata overhead. This VS series reverts the swap table, but its memory
and CPU performance are worse than the swap table's.

> 128G device, which is not something rare, it would be 1G of memory.
> Swap table p3 / p4 is about 320M / 256M, and we do have a way to cut
> that down close to be <1 byte or 3 byte per page with swap table
> compaction, which was discussed in LSFMM last year, or even 1 bit
> which was once suggested by Baolin, that would make it much smaller
> down to <24MB (This is just an idea for now, but the compaction is
> very doable as we already have "LRU"s for swap clusters in swap
> allocator).
>
> I don't think it looks good as a mandatory overhead. We do have a huge
> user base of swap over many different kinds of devices, it was not
> long ago two new kernel bugzilla issue  or bug reported was sent to
> the maillist about swap over disk, and I'm still trying to investigate
> one of them which seems to be actually a page LRU issue and not swap
> problem..  OK a little off topic, anyway, I'm not saying that we don't
> want more features, as I mentioned above, it would be better if this
> can be optional and minimal. See more test info below.
>
> > We actually see a slight improvement in systime (by 1.5%) :) This is
> > likely because we no longer have to perform swap charging for zswap
> > entries, and virtual swap allocator is simpler than that of physical
> > swap.
>
> Congrats! Yeah, I guess that's because vswap has a smaller lock scope
> than zswap with a reduced callpath?

The whole series is too zswap-centric and penalizes the other swap backends.

>
> >
> > Using SSD swap as the backend:
> >
> > Baseline:
> > real: mean: 200.3s, stdev: 2.33s
> > sys: mean: 489.88s, stdev: 9.62s
> >
> > Vswap:
> > real: mean: 201.47s, stdev: 2.98s
> > sys: mean: 487.36s, stdev: 5.53s
> >
> > The performance is neck-to-neck.
>
> Thanks for the bench, but please also test with global pressure too.
> One mistake I made when working on the prototype of swap tables was
> only focusing on cgroup memory pressure, which is really not how
> everyone uses Linux, and that's why I reworked it for a long time to
> tweak the RCU allocation / freeing of swap table pages so there won't
> be any regression even for lowend and global pressure. That's kind of
> critical for devices like Android.
>
> I did an overnight bench on this with global pressure, comparing to
> mainline 6.19 and swap table p3 (I do include such test for each swap
> table serie, p2 / p3 is close so I just rebase and latest p3 on top of
> your base commit just to be fair and that's easier for me too) and it
> doesn't look that good.
>
> Test machine setup for vm-scalability:
> # lscpu | grep "Model name"
> Model name:          AMD EPYC 7K62 48-Core Processor
>
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:          31582         909       26388           8        4284       29989
> Swap:         40959          41       40918
>
> The swap setup follows the recommendation from Huang
> (https://lore.kernel.org/linux-mm/87ed474kvx.fsf@yhuang6-desk2.ccr.corp.intel.com/).
>
> Test (average of 18 test run):
> vm-scalability/usemem --init-time -O -y -x -n 1 56G
>
> 6.19:
> Throughput: 618.49 MB/s (stdev 31.3)
> Free latency: 5754780.50us (stdev 69542.7)
>
> swap-table-p3 (3.8%, 0.5% better):
> Throughput: 642.02 MB/s (stdev 25.1)
> Free latency: 5728544.16us (stdev 48592.51)
>
> vswap (3.2%, 244% worse):

Now that is a deal breaker for me, not the similar performance with
the baseline or swap table p3.

> Throughput: 598.67 MB/s (stdev 25.1)
> Free latency: 13987175.66us (stdev 125148.57)
>
> That's a huge regression with freeing. I have a vm-scatiliby test
> matrix, not every setup has such significant >200% regression, but on
> average the freeing time is about at least 15 - 50% slower (for
> example /data/vm-scalability/usemem --init-time -O -y -x -n 32 1536M
> the regression is about 2583221.62us vs 2153735.59us). Throughput is
> all lower too.
>
> Freeing is important as it was causing many problems before, it's the
> reason why we had a swap slot freeing cache years ago (and later we
> removed that since the freeing cache causes more problems and swap
> allocator already improved it better than having the cache). People
> even tried to optimize that:
> https://lore.kernel.org/linux-mm/20250909065349.574894-1-liulei.rjpt@vivo.com/
> (This seems a already fixed downstream issue, solved by swap allocator
> or swap table). Some workloads might amplify the free latency greatly
> and cause serious lags as shown above.
>
> Another thing I personally cares about is how swap works on my daily
> laptop :), building the kernel in a 2G test VM using NVME as swap,
> which is a very practical workload I do everyday, the result is also
> not good (average of 8 test run, make -j12):
> #free -m
>                total        used        free      shared  buff/cache   available
> Mem:            1465         216        1026           0         300        1248
> Swap:           4095          36        4059
>
> 6.19 systime:
> 109.6s
> swap-table p3:
> 108.9s
> vswap systime:
> 118.7s
>
> On a build server, it's also slower (make -j48 with 4G memory VM and
> NVME swap, average of 10 testrun):
> # free -m
>                total        used        free      shared  buff/cache   available
> Mem:            3877        1444        2019         737        1376        2432
> Swap:          32767        1886       30881
>
> # lscpu | grep "Model name"
> Model name:                              Intel(R) Xeon(R) Platinum
> 8255C CPU @ 2.50GHz
>
> 6.19 systime:
> 435.601s
> swap-table p3:
> 432.793s
> vswap systime:
> 455.652s
>
> In conclusion it's about 4.3 - 8.3% slower for common workloads under

At 4-8%, I would consider it a statistically significant performance
regression in favor of the swap table implementations.

> global pressure, and there is a up to 200% regression on freeing. ZRAM
> shows an even larger workload regression but I'll skip that part since
> your series is focusing on zswap now. Redis is also ~20% slower
> compared to mm-stable (327515.00 RPS vs 405827.81 RPS), that's mostly
> due to swap-table-p2 in mm-stable so I didn't do further comparisons.
>
> So if that's not a bug with this series, I think the double free or
> decoupling of swap / underlying slots might be the problem with the
> freeing regression shown above. That's really a serious issue, and the
> global pressure might be a critical issue too as the metadata is much
> larger, and is already causing regressions for very common workloads.
> Low end users could hit the min watermark easily and could have
> serious jitters or allocation failures.
>
> That's part of the issue I've found, so I really do think we need a
> flexible way to implementa that and not have a mandatory layer. After
> swap table P4 we should be able to figure out a way to fit all needs,
> with a clean defined set of swap API, metadata and layers, as was
> discussed at LSFMM last year.

Agree. That matches my view: get the fundamental infrastructure for
swap right first (the swap table), then do the fancier feature
enhancements like growing the swapfile size online.

Chris
Re: [PATCH v3 00/20] Virtual Swap Space
Posted by Nhat Pham 11 hours ago
On Tue, Feb 10, 2026 at 10:00 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Feb 9, 2026 at 7:57 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Anyway, resending this (in-reply-to patch 1 of the series):
>
> Hi Nhat,
>
> > Changelog:
> > * RFC v2 -> v3:
> >     * Implement a cluster-based allocation algorithm for virtual swap
> >       slots, inspired by Kairui Song and Chris Li's implementation, as
> >       well as Johannes Weiner's suggestions. This eliminates the lock
> >           contention issues on the virtual swap layer.
> >     * Re-use swap table for the reverse mapping.
> >     * Remove CONFIG_VIRTUAL_SWAP.
>
> I really do think we better make this optional, not a replacement or
> mandatory. There are many hard to evaluate effects as this
> fundamentally changes the swap workflow with a lot of behavior changes
> at once. e.g. it seems the folio will be reactivated instead of
> splitted if the physical swap device is fragmented; slot is allocated
> at IO and not at unmap, and maybe many others. Just like zswap is
> optional. Some common workloads would see an obvious performance or
> memory usage regression following this design, see below.

Ideally, if we can close the performance gap and have only one
version, then that would be the best :)

The problem with making it optional, or maintaining effectively two swap
implementations, is that it will make the patch series unreadable and
unreviewable, and the code base unmaintainable :) You'll have twice the
amount of code to reason about and test, and many more merge conflicts at
rebase and cherry-pick time. And any improvement to one version takes
extra work to graft onto the other version.

>
> >     * Reducing the size of the swap descriptor from 48 bytes to 24
> >       bytes, i.e another 50% reduction in memory overhead from v2.
>
> Honestly if you keep reducing that you might just end up
> reimplementing the swap table format :)

There's nothing wrong with that ;)

I like the swap table format (and your cluster-based swap allocator) a
lot. This patch series does not aim to remove that design - I just
want to separate the address space of physical and virtual swaps to
enable new use cases...

>
> > This patch series is based on 6.19. There are a couple more
> > swap-related changes in the mm-stable branch that I would need to
> > coordinate with, but I would like to send this out as an update, to show
> > that the lock contention issues that plagued earlier versions have been
> > resolved and performance on the kernel build benchmark is now on-par with
> > baseline. Furthermore, memory overhead has been substantially reduced
> > compared to the last RFC version.
>
> Thanks for the effort!
>
> > * Operationally, static provisioning the swapfile for zswap pose
> >   significant challenges, because the sysadmin has to prescribe how
> >   much swap is needed a priori, for each combination of
> >   (memory size x disk space x workload usage). It is even more
> >   complicated when we take into account the variance of memory
> >   compression, which changes the reclaim dynamics (and as a result,
> >   swap space size requirement). The problem is further exarcebated for
> >   users who rely on swap utilization (and exhaustion) as an OOM signal.
>
> So I thought about it again, this one seems not to be an issue. In

I mean, it is a real production issue :) We have a variety of server
machines and services. Each of the former has its own memory and drive
size. Each of the latter has its own access characteristics,
compressibility, and latency tolerance (and hence would prefer a different
swapping solution - zswap, disk swap, zswap x disk swap). Coupled
with the fact that multiple services can now co-occur on one host, and
one service can be deployed on different kinds of hosts, statically
sizing the swapfile becomes operationally impossible and leaves a lot
of wins on the table. So swap space has to be dynamic.


> most cases, having a 1:1 virtual swap setup is enough, and very soon
> the static overhead will be really trivial. There won't even be any
> fragmentation issue either, since if the physical memory size is
> identical to swap space, then you can always find a matching part. And
> besides, dynamic growth of swap files is actually very doable and
> useful, that will make physical swap files adjustable at runtime, so
> users won't need to waste a swap type id to extend physical swap
> space.

By "dynamic growth of swap files", do you mean dynamically adjusting
the size of the swapfile? then that capacity does not exist right now,
and I don't see a good design laid out for it... At the very least,
the swap allocator needs to be dynamic in nature. I assume it's going
to look something very similar to vswap's current attempt, which
relies on a tree structure (radix tree i.e xarray). Sounds familiar?
;)

I feel like each of the problems I mention in this cover letter can be
partially solved with some amount of hacks, but none of them will
solve it all. And once you slap all the hacks together, you just get
virtual swap, potentially shoved within a specific backend codebase
(zswap or zram). That's not... ideal.

>
> > * Another motivation is to simplify swapoff, which is both complicated
> >   and expensive in the current design, precisely because we are storing
> >   an encoding of the backend positional information in the page table,
> >   and thus requires a full page table walk to remove these references.
>
> The swapoff here is not really a clean swapoff, minor faults will
> still be triggered afterwards, and metadata is not released. So this
> new swapoff cannot really guarantee the same performance as the old
> swapoff. And on the other hand we can already just read everything
> into the swap cache then ignore the page table walk with the older
> design too, that's just not a clean swapoff.

I don't understand your point regarding "reading everything into the
swap cache". Yes, you can do that, but you would still lock the swap
device in place, because the page table entries still refer to slots
on the physical swap device - you cannot free the swap device, nor the
space on disk, nor even the swapfile's metadata (especially since the
swap cache is now intertwined with the physical swap layer).

>
> > struct swp_desc {
> >         union {
> >                 swp_slot_t         slot;                 /*     0     8 */
> >                 struct zswap_entry * zswap_entry;        /*     0     8 */
> >         };                                               /*     0     8 */
> >         union {
> >                 struct folio *     swap_cache;           /*     8     8 */
> >                 void *             shadow;               /*     8     8 */
> >         };                                               /*     8     8 */
> >         unsigned int               swap_count;           /*    16     4 */
> >         unsigned short             memcgid:16;           /*    20: 0  2 */
> >         bool                       in_swapcache:1;       /*    22: 0  1 */
>
> A standalone bit for swapcache looks like the old SWAP_HAS_CACHE that
> causes many issues...

Yeah this was based on 6.19, which did not have your swap cache change yet :)

I have taken a look at your latest swap table work in mm-stable, and I
think most of it can conceptually be incorporated into this line of
work as well.

Chiefly, the new swap cache synchronization scheme (i.e. whoever puts
the folio in the swap cache first gets exclusive rights) still works in
the virtual swap world (and hence allows the removal of the swap cache
pin, which is one bit in the virtual swap descriptor).
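
As a rough illustration of that scheme (a minimal sketch only - the
helper name below is hypothetical, not the API from either series),
insertion into the swap cache doubles as the exclusivity handshake:

	/*
	 * Sketch: vswap_cache_try_insert() is a hypothetical helper that
	 * atomically installs @new for @entry unless a folio is already
	 * there, returning whichever folio ends up in the cache. The
	 * first task to install its folio wins the right to do the
	 * swap-in work; everyone else works with the winner's folio.
	 */
	struct folio *swapin_begin(swp_entry_t entry, struct folio *new)
	{
		struct folio *winner;

		winner = vswap_cache_try_insert(entry, new);
		if (winner == new) {
			/* We won: we are responsible for the swap-in I/O. */
			return new;
		}
		/* Lost the race: someone else is bringing the page in. */
		folio_put(new);
		return winner;
	}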

Similarly, is there any reason we cannot hold the folio lock in place
of the cluster lock in the virtual swap world? Same for a lot of the
memory overhead reduction tricks (such as using the shadow for the
cgroup id instead of a separate swap_cgroup unsigned short field). I
think comparing the two this way is a bit apples-to-oranges (especially
given the new features enabled by vswap).

[...]

> That 3 - 4 times more memory usage, quite a trade off. With a
> 128G device, which is not something rare, it would be 1G of memory.
> Swap table p3 / p4 is about 320M / 256M, and we do have a way to cut
> that down close to be <1 byte or 3 byte per page with swap table
> compaction, which was discussed in LSFMM last year, or even 1 bit
> which was once suggested by Baolin, that would make it much smaller
> down to <24MB (This is just an idea for now, but the compaction is
> very doable as we already have "LRU"s for swap clusters in swap
> allocator).
>
> I don't think it looks good as a mandatory overhead. We do have a huge
> user base of swap over many different kinds of devices, it was not
> long ago two new kernel bugzilla issue  or bug reported was sent to
> the maillist about swap over disk, and I'm still trying to investigate
> one of them which seems to be actually a page LRU issue and not swap
> problem..  OK a little off topic, anyway, I'm not saying that we don't
> want more features, as I mentioned above, it would be better if this
> can be optional and minimal. See more test info below.

Side note - I might have missed this. If it's still ongoing, would
love to help debug this :)

>
> > We actually see a slight improvement in systime (by 1.5%) :) This is
> > likely because we no longer have to perform swap charging for zswap
> > entries, and virtual swap allocator is simpler than that of physical
> > swap.
>
> Congrats! Yeah, I guess that's because vswap has a smaller lock scope
> than zswap with a reduced callpath?

Ah yeah, that too. I neglected to mention this, but with vswap you can
merge several swap operations in the zswap code path and no longer have
to release-then-reacquire the swap locks, since zswap entries live in
the same lock scope as swap cache entries.

It's more of a side note either way, because my main goal with this
patch series is to enable new features. Getting a performance win is
always nice of course :)

>
> >
> > Using SSD swap as the backend:
> >
> > Baseline:
> > real: mean: 200.3s, stdev: 2.33s
> > sys: mean: 489.88s, stdev: 9.62s
> >
> > Vswap:
> > real: mean: 201.47s, stdev: 2.98s
> > sys: mean: 487.36s, stdev: 5.53s
> >
> > The performance is neck-to-neck.
>
> Thanks for the bench, but please also test with global pressure too.

Do you mean using memory to the point where it triggers the global watermarks?

> One mistake I made when working on the prototype of swap tables was
> only focusing on cgroup memory pressure, which is really not how
> everyone uses Linux, and that's why I reworked it for a long time to
> tweak the RCU allocation / freeing of swap table pages so there won't
> be any regression even for lowend and global pressure. That's kind of
> critical for devices like Android.
>
> I did an overnight bench on this with global pressure, comparing to
> mainline 6.19 and swap table p3 (I do include such test for each swap
> table serie, p2 / p3 is close so I just rebase and latest p3 on top of
> your base commit just to be fair and that's easier for me too) and it
> doesn't look that good.
>
> Test machine setup for vm-scalability:
> # lscpu | grep "Model name"
> Model name:          AMD EPYC 7K62 48-Core Processor
>
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:          31582         909       26388           8        4284       29989
> Swap:         40959          41       40918
>
> The swap setup follows the recommendation from Huang
> (https://lore.kernel.org/linux-mm/87ed474kvx.fsf@yhuang6-desk2.ccr.corp.intel.com/).
>
> Test (average of 18 test run):
> vm-scalability/usemem --init-time -O -y -x -n 1 56G
>
> 6.19:
> Throughput: 618.49 MB/s (stdev 31.3)
> Free latency: 5754780.50us (stdev 69542.7)
>
> swap-table-p3 (3.8%, 0.5% better):
> Throughput: 642.02 MB/s (stdev 25.1)
> Free latency: 5728544.16us (stdev 48592.51)
>
> vswap (3.2%, 244% worse):
> Throughput: 598.67 MB/s (stdev 25.1)
> Free latency: 13987175.66us (stdev 125148.57)
>
> That's a huge regression with freeing. I have a vm-scatiliby test
> matrix, not every setup has such significant >200% regression, but on
> average the freeing time is about at least 15 - 50% slower (for
> example /data/vm-scalability/usemem --init-time -O -y -x -n 32 1536M
> the regression is about 2583221.62us vs 2153735.59us). Throughput is
> all lower too.
>
> Freeing is important as it was causing many problems before, it's the
> reason why we had a swap slot freeing cache years ago (and later we
> removed that since the freeing cache causes more problems and swap
> allocator already improved it better than having the cache). People
> even tried to optimize that:
> https://lore.kernel.org/linux-mm/20250909065349.574894-1-liulei.rjpt@vivo.com/
> (This seems a already fixed downstream issue, solved by swap allocator
> or swap table). Some workloads might amplify the free latency greatly
> and cause serious lags as shown above.
>
> Another thing I personally cares about is how swap works on my daily
> laptop :), building the kernel in a 2G test VM using NVME as swap,
> which is a very practical workload I do everyday, the result is also
> not good (average of 8 test run, make -j12):

Hmm this one I don't think I can reproduce without your laptop ;)

Jokes aside, I did try to run the kernel build with disk swapping, and
the performance is on par with baseline. Swap performance with NVMe
swap tends to be dominated by IO work in my experiments. Do you think
I missed something here? Maybe it's the concurrency difference (since
I always run with -j$(nproc), i.e. the number of workers == the number
of processors).

> #free -m
>                total        used        free      shared  buff/cache   available
> Mem:            1465         216        1026           0         300        1248
> Swap:           4095          36        4059
>
> 6.19 systime:
> 109.6s
> swap-table p3:
> 108.9s
> vswap systime:
> 118.7s
>
> On a build server, it's also slower (make -j48 with 4G memory VM and
> NVME swap, average of 10 testrun):
> # free -m
>                total        used        free      shared  buff/cache   available
> Mem:            3877        1444        2019         737        1376        2432
> Swap:          32767        1886       30881
>
> # lscpu | grep "Model name"
> Model name:                              Intel(R) Xeon(R) Platinum
> 8255C CPU @ 2.50GHz
>
> 6.19 systime:
> 435.601s
> swap-table p3:
> 432.793s
> vswap systime:
> 455.652s
>
> In conclusion it's about 4.3 - 8.3% slower for common workloads under
> global pressure, and there is a up to 200% regression on freeing. ZRAM
> shows an even larger workload regression but I'll skip that part since
> your series is focusing on zswap now. Redis is also ~20% slower
> compared to mm-stable (327515.00 RPS vs 405827.81 RPS), that's mostly
> due to swap-table-p2 in mm-stable so I didn't do further comparisons.

I'll see if I can reproduce the issues! I'll start with the usemem one
first, as that seems easier to reproduce...

>
> So if that's not a bug with this series, I think the double free or

It could be a non-crashing bug that subtly regresses certain swap
operations, but yeah let me study your test case first!

> decoupling of swap / underlying slots might be the problem with the
> freeing regression shown above. That's really a serious issue, and the
> global pressure might be a critical issue too as the metadata is much
> larger, and is already causing regressions for very common workloads.
> Low end users could hit the min watermark easily and could have
> serious jitters or allocation failures.
>
> That's part of the issue I've found, so I really do think we need a
> flexible way to implementa that and not have a mandatory layer. After
> swap table P4 we should be able to figure out a way to fit all needs,
> with a clean defined set of swap API, metadata and layers, as was
> discussed at LSFMM last year.
Re: [PATCH v3 00/20] Virtual Swap Space
Posted by Nhat Pham 11 hours ago
On Tue, Feb 10, 2026 at 11:11 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Hmm this one I don't think I can reproduce without your laptop ;)
>
> Jokes aside, I did try to run the kernel build with disk swapping, and
> the performance is on par with baseline. Swap performance with NVME
> swap tends to be dominated by IO work in my experiments. Do you think
> I missed something here? Maybe it's the concurrency difference (since
> I always run with -j$(nproc), i.e the number of workers == the number
> of processors).

Ah I just noticed that your numbers include only systime. Ignore my IO
comments then.

(I still think that in a real production system with disk swapping
enabled, IO wait time is going to be really important. If you're going
to use disk swap, then this affects real time just as much as, if not
more than, kernel CPU overhead.)
Re: [PATCH v3 00/20] Virtual Swap Space
Posted by Johannes Weiner 11 hours ago
Hello Kairui,

On Wed, Feb 11, 2026 at 01:59:34AM +0800, Kairui Song wrote:
> On Mon, Feb 9, 2026 at 7:57 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >     * Reducing the size of the swap descriptor from 48 bytes to 24
> >       bytes, i.e another 50% reduction in memory overhead from v2.
> 
> Honestly if you keep reducing that you might just end up
> reimplementing the swap table format :)

Yeah, it turns out we need the same data points to describe and track
a swapped out page ;)

> > * Operationally, static provisioning the swapfile for zswap pose
> >   significant challenges, because the sysadmin has to prescribe how
> >   much swap is needed a priori, for each combination of
> >   (memory size x disk space x workload usage). It is even more
> >   complicated when we take into account the variance of memory
> >   compression, which changes the reclaim dynamics (and as a result,
> >   swap space size requirement). The problem is further exarcebated for
> >   users who rely on swap utilization (and exhaustion) as an OOM signal.
> 
> So I thought about it again, this one seems not to be an issue. In
> most cases, having a 1:1 virtual swap setup is enough, and very soon
> the static overhead will be really trivial. There won't even be any
> fragmentation issue either, since if the physical memory size is
> identical to swap space, then you can always find a matching part. And
> besides, dynamic growth of swap files is actually very doable and
> useful, that will make physical swap files adjustable at runtime, so
> users won't need to waste a swap type id to extend physical swap
> space.

The issue is address space separation. We don't want things inside the
compressed pool to consume disk space; nor do we want entries that
live on disk to take usable space away from the compressed pool.

The regression reports are fair, thanks for highlighting those. And
whether to make this optional is also a fair discussion.

But some of the number comparisons really strike me as apples to
oranges. They seem to miss the core issue this series is trying to
address.

> > * Another motivation is to simplify swapoff, which is both complicated
> >   and expensive in the current design, precisely because we are storing
> >   an encoding of the backend positional information in the page table,
> >   and thus requires a full page table walk to remove these references.
> 
> The swapoff here is not really a clean swapoff, minor faults will
> still be triggered afterwards, and metadata is not released. So this
> new swapoff cannot really guarantee the same performance as the old
> swapoff.

That seems very academic to me. The goal is to relinquish disk space,
and these patches make that a lot faster.

Let's put it the other way round: if today we had a fast swapoff read
sequence with lazy minor faults to resolve page tables, would we
accept patches that implement the expensive try_to_unuse() scans and
make it mandatory? Considering the worst-case runtime it can cause?

I don't think so. We have this scan because the page table references
are pointing to disk slots, and this is the only way to free them.

> And on the other hand we can already just read everything
> into the swap cache then ignore the page table walk with the older
> design too, that's just not a clean swapoff.

How can you relinquish the disk slot as long as the swp_entry_t is in
circulation?
[PATCH v3 00/20] Virtual Swap Space
Posted by Nhat Pham 2 days, 8 hours ago
My sincerest apologies - it seems like the cover letter (and just the
cover letter) failed to be sent out, for some reason. I'm trying to
figure out what happened - it works when I send the entire patch series
to myself...

Anyway, resending this (in-reply-to patch 1 of the series):

Changelog:
* RFC v2 -> v3:
    * Implement a cluster-based allocation algorithm for virtual swap
      slots, inspired by Kairui Song and Chris Li's implementation, as
      well as Johannes Weiner's suggestions. This eliminates the lock
      contention issues on the virtual swap layer.
    * Re-use the swap table for the reverse mapping.
    * Remove CONFIG_VIRTUAL_SWAP.
    * Reduce the size of the swap descriptor from 48 bytes to 24
      bytes, i.e. another 50% reduction in memory overhead from v2.
    * Remove the swap cache and zswap tree, and use the swap descriptor
      for both.
    * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
      (one for allocated slots, and one for bad slots).
    * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
    * Update the cover letter to include new benchmark results and a
      discussion of overhead in various cases.
* RFC v1 -> RFC v2:
    * Use a single atomic type (swap_refs) for reference counting
      purpose. This brings the size of the swap descriptor from 64 B
      down to 48 B (25% reduction). Suggested by Yosry Ahmed.
    * Zeromap bitmap is removed in the virtual swap implementation.
      This saves one bit per physical swapfile slot.
    * Rearrange the patches and the code change to make things more
      reviewable. Suggested by Johannes Weiner.
    * Update the cover letter a bit.

This patch series implements the virtual swap space idea, based on Yosry's
proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
inputs from Johannes Weiner. The same idea (with different
implementation details) has been floated by Rik van Riel since at least
2011 (see [8]).

This patch series is based on 6.19. There are a couple more
swap-related changes in the mm-stable branch that I would need to
coordinate with, but I would like to send this out as an update, to show
that the lock contention issues that plagued earlier versions have been
resolved and performance on the kernel build benchmark is now on-par with
baseline. Furthermore, memory overhead has been substantially reduced
compared to the last RFC version.


I. Motivation

Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index to swap data structures, such
as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
backing slot in this way is performant and efficient when swap is purely
disk space and swapoff is rare.

However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is arguably the central shortcoming of
zswap:
* In deployments where no disk space can be afforded for swap (such as
  mobile and embedded devices), users cannot adopt zswap, and are forced
  to use zram. This is confusing for users, and creates extra burdens
  for developers, who have to develop and maintain similar features for
  two separate swap backends (writeback, cgroup charging, THP support,
  etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
  we have swapfiles on the order of tens to hundreds of GBs, which are
  mostly unused and only exist to enable zswap usage and the zero-filled
  page swap optimization.
* Tying zswap (and more generally, other in-memory swap backends) to
  the current physical swapfile infrastructure makes zswap implicitly
  statically sized. This does not make sense: unlike disk swap, in
  which we consume a limited resource (disk space or swapfile space) to
  save another resource (memory), zswap consumes the same resource it
  is saving (memory). The more we zswap, the more memory we have
  available, not less. We are not rationing a limited resource when we
  limit the size of the zswap pool, but rather capping the resource
  (memory) saving potential of zswap. Under memory pressure, using
  more zswap is almost always better than the alternative (disk IOs, or
  even worse, OOMs), and dynamically sizing the zswap pool on demand
  allows the system to flexibly respond to these precarious scenarios.
* Operationally, statically provisioning the swapfile for zswap poses
  significant challenges, because the sysadmin has to prescribe how
  much swap is needed a priori, for each combination of
  (memory size x disk space x workload usage). It is even more
  complicated when we take into account the variance of memory
  compression, which changes the reclaim dynamics (and, as a result,
  the swap space size requirement). The problem is further exacerbated
  for users who rely on swap utilization (and exhaustion) as an OOM
  signal.

  All of these factors make it very difficult to configure the swapfile
  for zswap: too small of a swapfile and we risk preventable OOMs and
  limit the memory saving potentials of zswap; too big of a swapfile
  and we waste disk space and memory due to swap metadata overhead.
  This dilemma becomes more drastic in high memory systems, which can
  have up to TBs worth of memory.

Past attempts to decouple disk and compressed swap backends, namely the
ghost swapfile approach (see [13]), as well as the alternative
compressed swap backend zram, have mainly focused on eliminating the
disk space usage of compressed backends. We want a solution that not
only tackles that same problem, but also makes swap space dynamic, to
maximize the memory saving potential while reducing operational and
static memory overhead.

Finally, any swap redesign should support efficient backend transfer,
i.e. without having to perform the expensive page table walk to
update all the PTEs that refer to the swap entry:
* The main motivation for this requirement is zswap writeback. To quote
  Johannes (from [14]): "Combining compression with disk swap is
  extremely powerful, because it dramatically reduces the worst aspects
  of both: it reduces the memory footprint of compression by shedding
  the coldest data to disk; it reduces the IO latencies and flash wear
  of disk swap through the writeback cache. In practice, this reduces
  *average event rates of the entire reclaim/paging/IO stack*."
* Another motivation is to simplify swapoff, which is both complicated
  and expensive in the current design, precisely because we are storing
  an encoding of the backend positional information in the page table,
  and thus requires a full page table walk to remove these references.


II. High Level Design Overview

To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
“virtualize” the swap space: swap clients will work with a dynamically
allocated virtual swap slot, storing it in page table entries, and
using it to index into various swap-related data structures. The
backing storage is decoupled from the virtual swap slot, and the newly
introduced layer will “resolve” the virtual swap slot to the actual
storage. This layer also manages other metadata of the swap entry, such
as its lifetime information (swap count), via a dynamically allocated,
per-swap-entry descriptor:

struct swp_desc {
        union {
                swp_slot_t         slot;                 /*     0     8 */
                struct zswap_entry * zswap_entry;        /*     0     8 */
        };                                               /*     0     8 */
        union {
                struct folio *     swap_cache;           /*     8     8 */
                void *             shadow;               /*     8     8 */
        };                                               /*     8     8 */
        unsigned int               swap_count;           /*    16     4 */
        unsigned short             memcgid:16;           /*    20: 0  2 */
        bool                       in_swapcache:1;       /*    22: 0  1 */

        /* Bitfield combined with previous fields */

        enum swap_type             type:2;               /*    20:17  4 */

        /* size: 24, cachelines: 1, members: 6 */
        /* bit_padding: 13 bits */
        /* last cacheline: 24 bytes */
};

(output from pahole).
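
To make the role of the descriptor concrete, here is a minimal sketch
(not code from the series - the enum values and the helper are
hypothetical, for illustration only) of how the virtual swap layer can
resolve a virtual slot to its backing store by switching on the type
field:

	/* Sketch only: the actual swap_type encoding is defined by the series. */
	enum swap_type {
		VSWAP_ZSWAP,	/* compressed copy held by a zswap entry */
		VSWAP_ZERO,	/* zero-filled page, no backing data at all */
		VSWAP_SWAPFILE,	/* slot on a physical swap device */
		VSWAP_FOLIO,	/* in-memory page (e.g. after swapoff) */
	};

	static void vswap_resolve(struct swp_desc *desc)
	{
		switch (desc->type) {
		case VSWAP_ZSWAP:
			/* decompress desc->zswap_entry into a fresh folio */
			break;
		case VSWAP_ZERO:
			/* hand back a zero-filled folio, no IO needed */
			break;
		case VSWAP_SWAPFILE:
			/* issue IO against the physical slot desc->slot */
			break;
		case VSWAP_FOLIO:
			/* the data is already in memory */
			break;
		}
	}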

This design allows us to:
* Decouple zswap (and zeromapped swap entries) from the backing
  swapfile: simply associate the virtual swap slot with one of the
  supported backends: a zswap entry, a zero-filled swap page, a slot on
  the swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and
  have the virtual swap slot point to the page instead of the on-disk
  physical swap slot. No need to perform any page table walking.

The size of the virtual swap descriptor is 24 bytes. Note that this is
not all "new" overhead, as the swap descriptor will replace:
* the swap_cgroup arrays (one per swap type) in the old design, which
  are a massive source of static memory overhead. With the new design,
  this is only allocated for used clusters.
* the swap tables, which hold the swap cache and workingset shadows.
* the zeromap bitmap, which is a bitmap over physical swap slots
  indicating whether the swapped out page is zero-filled or not.
* a huge chunk of the swap_map. The swap_map is now replaced by 2
  bitmaps, one for allocated slots and one for bad slots, representing
  the 3 possible states of a slot on the swapfile: allocated, free, and
  bad (see the sketch below).
* the zswap tree.
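
As a rough illustration of the two-bitmap scheme (a standalone sketch;
the bitmap names and layout are assumptions, not the series' code), the
state of a physical slot can be derived like so:

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_SLOTS	64
	#define BITS_PER_WORD	(8 * sizeof(unsigned long))

	/* Sketch only: two per-swapfile bitmaps replacing the swap_map bytemap. */
	static unsigned long allocated_bitmap[NR_SLOTS / BITS_PER_WORD];
	static unsigned long bad_bitmap[NR_SLOTS / BITS_PER_WORD];

	static bool bit_set(const unsigned long *map, unsigned int slot)
	{
		return map[slot / BITS_PER_WORD] & (1UL << (slot % BITS_PER_WORD));
	}

	static const char *slot_state(unsigned int slot)
	{
		if (bit_set(bad_bitmap, slot))
			return "bad";		/* unusable, never handed out */
		if (bit_set(allocated_bitmap, slot))
			return "allocated";
		return "free";
	}

	int main(void)
	{
		allocated_bitmap[0] |= 1UL << 3;	/* slot 3 in use */
		bad_bitmap[0] |= 1UL << 7;		/* slot 7 marked bad */
		printf("slot 3: %s, slot 7: %s, slot 9: %s\n",
		       slot_state(3), slot_state(7), slot_state(9));
		return 0;
	}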

So, in terms of additional memory overhead:
* For zswap entries, the added memory overhead is rather minimal. The
  new indirection pointer neatly replaces the existing zswap tree.
  We really only incur less than one word of overhead for the swap
  count blow-up (since we no longer use swap continuation) and the
  swap type.
* For physical swap entries, the new design imposes fewer than 3 words
  of memory overhead per entry. However, as noted above, this overhead
  is only incurred for actively used swap entries, whereas in the
  current design the overhead is static (including the swap cgroup
  array, for example).

  The primary victims of this overhead will be zram users. However, as
  zswap no longer takes up disk space, zram users can consider
  switching to zswap (which, as a bonus, has a lot of useful features
  out of the box, such as cgroup tracking, dynamic zswap pool sizing,
  LRU-ordered writeback, etc.).

For a more concrete example, suppose we have a 32 GB swapfile (i.e.
8,388,608 swap entries), and we use zswap.

0% usage, or 0 entries:
* Old design total overhead: 25.00 MB
* Vswap total overhead: 0.00 MB

25% usage, or 2,097,152 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 48.25 MB

50% usage, or 4,194,304 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 96.50 MB

75% usage, or 6,291,456 entries:
* Old design total overhead: 121.00 MB
* Vswap total overhead: 144.75 MB

100% usage, or 8,388,608 entries:
* Old design total overhead: 153.00 MB
* Vswap total overhead: 193.00 MB

So even in the worst case scenario for virtual swap, i.e when we
somehow have an oracle to correctly size the swapfile for zswap
pool to 32 GB, the added overhead is only 40 MB, which is a mere
0.12% of the total swapfile :)

In practice, the overhead will be closer to the 50-75% usage case, as
systems tend to leave swap headroom for pathological events or sudden
spikes in memory requirements. The added overhead in these cases is
practically negligible. And in deployments where swapfiles for zswap
were previously sparsely used, switching over to virtual swap will
actually reduce memory overhead.

Doing the same math for the disk swap, which is the worst case for
virtual swap in terms of swap backends:

0% usage, or 0 entries:
* Old design total overhead: 25.00 MB
* Vswap total overhead: 2.00 MB

25% usage, or 2,097,152 entries:
* Old design total overhead: 41.00 MB
* Vswap total overhead: 66.25 MB

50% usage, or 4,194,304 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 130.50 MB

75% usage, or 6,291,456 entries:
* Old design total overhead: 73.00 MB
* Vswap total overhead: 194.75 MB

100% usage, or 8,388,608 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 259.00 MB

The added overhead is 170 MB, which is 0.5% of the total swapfile size,
again in the worst case when we have a sizing oracle.
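
For reference, both tables follow a simple linear model. The constants
below are inferred by fitting the numbers above (roughly 25 MB static
plus 16 B or 8 B per used entry for the old design with zswap or disk,
versus about 24.125 B per used entry for vswap with zswap, and 2 MB
static plus about 32.125 B per used entry for vswap with disk) - treat
this as a back-of-the-envelope sketch, not an exact accounting of the
implementation:

	#include <stdio.h>

	#define MB	(1024.0 * 1024.0)

	/* Overhead = static base + per-entry slope * number of used entries. */
	static double overhead_mb(double static_mb, double bytes_per_entry,
				  unsigned long entries)
	{
		return static_mb + bytes_per_entry * entries / MB;
	}

	int main(void)
	{
		unsigned long total = 8388608;	/* 32 GB swapfile, 4 KB pages */

		for (int pct = 0; pct <= 100; pct += 25) {
			unsigned long used = total * pct / 100;

			printf("%3d%%: old-zswap %6.2f MB, vswap-zswap %6.2f MB, "
			       "old-disk %6.2f MB, vswap-disk %6.2f MB\n", pct,
			       overhead_mb(25.0, 16.0, used),
			       overhead_mb(0.0, 24.125, used),
			       overhead_mb(25.0, 8.0, used),
			       overhead_mb(2.0, 32.125, used));
		}
		return 0;
	}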

Please see the attached patches for more implementation details.


III. Usage and Benchmarking

This patch series introduces no new syscalls or userspace API. Existing
userspace setups will work as-is, except that we no longer have to
create a swapfile or set memory.swap.max if we want to use zswap, as
zswap is no longer tied to physical swap. The zswap pool will be
automatically and dynamically sized based on memory usage and reclaim
dynamics.

To measure the performance of the new implementation, I have run the
following benchmarks:

1. Kernel building: 52 workers (one per processor), memory.max = 3G.

Using zswap as the backend:

Baseline:
real: mean: 185.2s, stdev: 0.93s
sys: mean: 683.7s, stdev: 33.77s

Vswap:
real: mean: 184.88s, stdev: 0.57s
sys: mean: 675.14s, stdev: 32.8s

We actually see a slight improvement in systime (by 1.5%) :) This is
likely because we no longer have to perform swap charging for zswap
entries, and the virtual swap allocator is simpler than that of
physical swap.

Using SSD swap as the backend:

Baseline:
real: mean: 200.3s, stdev: 2.33s
sys: mean: 489.88s, stdev: 9.62s

Vswap:
real: mean: 201.47s, stdev: 2.98s
sys: mean: 487.36s, stdev: 5.53s

The performance is neck-and-neck.


IV. Future Use Cases

While the patch series focuses on two applications (decoupling swap
backends and swapoff optimization/simplification), this new,
future-proof design also allows us to implement new swap features more
easily and efficiently:

* Multi-tier swapping (as mentioned in [5]), with transparent
  transferring (promotion/demotion) of pages across tiers (see [8] and
  [9]). Similar to swapoff, with the old design we would need to
  perform the expensive page table walk.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
  Huang in [6]).
* Mixed backing THP swapin (see [7]): once you have pinned down the
  backing stores of a THP, you can dispatch each range of subpages to
  the appropriate backend swapin handler.
* Swapping a folio out with discontiguous physical swap slots
  (see [10]).
* Zswap writeback optimization: The current architecture pre-reserves
  physical swap space for pages when they enter the zswap pool, giving
  the kernel no flexibility at writeback time. With the virtual swap
  implementation, the backends are decoupled, and physical swap space
  is allocated on-demand at writeback time, at which point we can make
  much smarter decisions: we can batch multiple zswap writeback
  operations into a single IO request, allocating contiguous physical
  swap slots for that request. We can even perform compressed writeback
  (i.e. writing these pages without decompressing them) (see [12]).


V. References

[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
[10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
[11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
[12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
[13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
[14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/

Nhat Pham (20):
  mm/swap: decouple swap cache from physical swap infrastructure
  swap: rearrange the swap header file
  mm: swap: add an abstract API for locking out swapoff
  zswap: add new helpers for zswap entry operations
  mm/swap: add a new function to check if a swap entry is in swap
    cached.
  mm: swap: add a separate type for physical swap slots
  mm: create scaffolds for the new virtual swap implementation
  zswap: prepare zswap for swap virtualization
  mm: swap: allocate a virtual swap slot for each swapped out page
  swap: move swap cache to virtual swap descriptor
  zswap: move zswap entry management to the virtual swap descriptor
  swap: implement the swap_cgroup API using virtual swap
  swap: manage swap entry lifecycle at the virtual swap layer
  mm: swap: decouple virtual swap slot from backing store
  zswap: do not start zswap shrinker if there is no physical swap slots
  swap: do not unnecesarily pin readahead swap entries
  swapfile: remove zeromap bitmap
  memcg: swap: only charge physical swap slots
  swap: simplify swapoff using virtual swap
  swapfile: replace the swap map with bitmaps

 Documentation/mm/swap-table.rst |   69 --
 MAINTAINERS                     |    2 +
 include/linux/cpuhotplug.h      |    1 +
 include/linux/mm_types.h        |   16 +
 include/linux/shmem_fs.h        |    7 +-
 include/linux/swap.h            |  135 ++-
 include/linux/swap_cgroup.h     |   13 -
 include/linux/swapops.h         |   25 +
 include/linux/zswap.h           |   17 +-
 kernel/power/swap.c             |    6 +-
 mm/Makefile                     |    5 +-
 mm/huge_memory.c                |   11 +-
 mm/internal.h                   |   12 +-
 mm/memcontrol-v1.c              |    6 +
 mm/memcontrol.c                 |  142 ++-
 mm/memory.c                     |  101 +-
 mm/migrate.c                    |   13 +-
 mm/mincore.c                    |   15 +-
 mm/page_io.c                    |   83 +-
 mm/shmem.c                      |  215 +---
 mm/swap.h                       |  157 +--
 mm/swap_cgroup.c                |  172 ---
 mm/swap_state.c                 |  306 +----
 mm/swap_table.h                 |   78 +-
 mm/swapfile.c                   | 1518 ++++-------------------
 mm/userfaultfd.c                |   18 +-
 mm/vmscan.c                     |   28 +-
 mm/vswap.c                      | 2025 +++++++++++++++++++++++++++++++
 mm/zswap.c                      |  142 +--
 29 files changed, 2853 insertions(+), 2485 deletions(-)
 delete mode 100644 Documentation/mm/swap-table.rst
 delete mode 100644 mm/swap_cgroup.c
 create mode 100644 mm/vswap.c


base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
-- 
2.47.3
Re: [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure
Posted by kernel test robot 2 days, 4 hours ago
Hi Nhat,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.19]
[cannot apply to akpm-mm/mm-everything tj-cgroup/for-next tip/smp/core next-20260205]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/swap-rearrange-the-swap-header-file/20260209-065842
base:   linus/master
patch link:    https://lore.kernel.org/r/20260208215839.87595-2-nphamcs%40gmail.com
patch subject: [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20260209/202602091044.soVrWeDA-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260209/202602091044.soVrWeDA-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602091044.soVrWeDA-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/vmscan.c:715:3: error: call to undeclared function 'swap_cache_lock_irq'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     715 |                 swap_cache_lock_irq();
         |                 ^
>> mm/vmscan.c:762:3: error: call to undeclared function 'swap_cache_unlock_irq'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     762 |                 swap_cache_unlock_irq();
         |                 ^
   mm/vmscan.c:762:3: note: did you mean 'swap_cluster_unlock_irq'?
   mm/swap.h:350:20: note: 'swap_cluster_unlock_irq' declared here
     350 | static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
         |                    ^
   mm/vmscan.c:801:3: error: call to undeclared function 'swap_cache_unlock_irq'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     801 |                 swap_cache_unlock_irq();
         |                 ^
   3 errors generated.
--
>> mm/shmem.c:2168:2: error: call to undeclared function 'swap_cache_lock_irq'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2168 |         swap_cache_lock_irq();
         |         ^
>> mm/shmem.c:2173:2: error: call to undeclared function 'swap_cache_unlock_irq'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2173 |         swap_cache_unlock_irq();
         |         ^
   2 errors generated.


vim +/swap_cache_lock_irq +715 mm/vmscan.c

   700	
   701	/*
   702	 * Same as remove_mapping, but if the folio is removed from the mapping, it
   703	 * gets returned with a refcount of 0.
   704	 */
   705	static int __remove_mapping(struct address_space *mapping, struct folio *folio,
   706				    bool reclaimed, struct mem_cgroup *target_memcg)
   707	{
   708		int refcount;
   709		void *shadow = NULL;
   710	
   711		BUG_ON(!folio_test_locked(folio));
   712		BUG_ON(mapping != folio_mapping(folio));
   713	
   714		if (folio_test_swapcache(folio)) {
 > 715			swap_cache_lock_irq();
   716		} else {
   717			spin_lock(&mapping->host->i_lock);
   718			xa_lock_irq(&mapping->i_pages);
   719		}
   720	
   721		/*
   722		 * The non racy check for a busy folio.
   723		 *
   724		 * Must be careful with the order of the tests. When someone has
   725		 * a ref to the folio, it may be possible that they dirty it then
   726		 * drop the reference. So if the dirty flag is tested before the
   727		 * refcount here, then the following race may occur:
   728		 *
   729		 * get_user_pages(&page);
   730		 * [user mapping goes away]
   731		 * write_to(page);
   732		 *				!folio_test_dirty(folio)    [good]
   733		 * folio_set_dirty(folio);
   734		 * folio_put(folio);
   735		 *				!refcount(folio)   [good, discard it]
   736		 *
   737		 * [oops, our write_to data is lost]
   738		 *
   739		 * Reversing the order of the tests ensures such a situation cannot
   740		 * escape unnoticed. The smp_rmb is needed to ensure the folio->flags
   741		 * load is not satisfied before that of folio->_refcount.
   742		 *
   743		 * Note that if the dirty flag is always set via folio_mark_dirty,
   744		 * and thus under the i_pages lock, then this ordering is not required.
   745		 */
   746		refcount = 1 + folio_nr_pages(folio);
   747		if (!folio_ref_freeze(folio, refcount))
   748			goto cannot_free;
   749		/* note: atomic_cmpxchg in folio_ref_freeze provides the smp_rmb */
   750		if (unlikely(folio_test_dirty(folio))) {
   751			folio_ref_unfreeze(folio, refcount);
   752			goto cannot_free;
   753		}
   754	
   755		if (folio_test_swapcache(folio)) {
   756			swp_entry_t swap = folio->swap;
   757	
   758			if (reclaimed && !mapping_exiting(mapping))
   759				shadow = workingset_eviction(folio, target_memcg);
   760			__swap_cache_del_folio(folio, swap, shadow);
   761			memcg1_swapout(folio, swap);
 > 762			swap_cache_unlock_irq();
   763			put_swap_folio(folio, swap);
   764		} else {
   765			void (*free_folio)(struct folio *);
   766	
   767			free_folio = mapping->a_ops->free_folio;
   768			/*
   769			 * Remember a shadow entry for reclaimed file cache in
   770			 * order to detect refaults, thus thrashing, later on.
   771			 *
   772			 * But don't store shadows in an address space that is
   773			 * already exiting.  This is not just an optimization,
   774			 * inode reclaim needs to empty out the radix tree or
   775			 * the nodes are lost.  Don't plant shadows behind its
   776			 * back.
   777			 *
   778			 * We also don't store shadows for DAX mappings because the
   779			 * only page cache folios found in these are zero pages
   780			 * covering holes, and because we don't want to mix DAX
   781			 * exceptional entries and shadow exceptional entries in the
   782			 * same address_space.
   783			 */
   784			if (reclaimed && folio_is_file_lru(folio) &&
   785			    !mapping_exiting(mapping) && !dax_mapping(mapping))
   786				shadow = workingset_eviction(folio, target_memcg);
   787			__filemap_remove_folio(folio, shadow);
   788			xa_unlock_irq(&mapping->i_pages);
   789			if (mapping_shrinkable(mapping))
   790				inode_lru_list_add(mapping->host);
   791			spin_unlock(&mapping->host->i_lock);
   792	
   793			if (free_folio)
   794				free_folio(folio);
   795		}
   796	
   797		return 1;
   798	
   799	cannot_free:
   800		if (folio_test_swapcache(folio)) {
   801			swap_cache_unlock_irq();
   802		} else {
   803			xa_unlock_irq(&mapping->i_pages);
   804			spin_unlock(&mapping->host->i_lock);
   805		}
   806		return 0;
   807	}
   808	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
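
The errors above come from an allnoconfig build, where CONFIG_SWAP is
disabled, so the new helpers are presumably only visible in the
CONFIG_SWAP section of mm/swap.h. A minimal sketch of the kind of
!CONFIG_SWAP stubs that would silence them (helper names taken from the
errors above; their exact placement in mm/swap.h is an assumption):

	#else /* CONFIG_SWAP */

	static inline void swap_cache_lock_irq(void)
	{
	}

	static inline void swap_cache_unlock_irq(void)
	{
	}

	#endif /* CONFIG_SWAP */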