[v3] Virtual Swap Space

[PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure

Posted by Nhat Pham 2 days, 8 hours ago

When we virtualize the swap space, we will manage swap cache at the
virtual swap layer. To prepare for this, decouple swap cache from
physical swap infrastructure.

We will also remove all the swap cache related helpers of swap table. We
will keep the rest of the swap table infrastructure, which will be
repurposed to serve as the rmap (physical -> virtual swap mapping)
later.

Note that with this patch, we will move to a single global lock to
synchronize swap cache accesses. This is temporarily, as the swap cache
will be re-partitioned in to (virtual) swap clusters once we move the
swap cache to the soon-to-be-introduced virtual swap layer.

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 Documentation/mm/swap-table.rst |  69 -----------
 mm/huge_memory.c                |  11 +-
 mm/migrate.c                    |  13 +-
 mm/shmem.c                      |   7 +-
 mm/swap.h                       |  26 ++--
 mm/swap_state.c                 | 205 +++++++++++++++++---------------
 mm/swap_table.h                 |  78 +-----------
 mm/swapfile.c                   |  43 ++-----
 mm/vmscan.c                     |   9 +-
 9 files changed, 158 insertions(+), 303 deletions(-)
 delete mode 100644 Documentation/mm/swap-table.rst

diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
deleted file mode 100644
index da10bb7a0dc37..0000000000000
--- a/Documentation/mm/swap-table.rst
+++ /dev/null
@@ -1,69 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
-
-==========
-Swap Table
-==========
-
-Swap table implements swap cache as a per-cluster swap cache value array.
-
-Swap Entry
-----------
-
-A swap entry contains the information required to serve the anonymous page
-fault.
-
-Swap entry is encoded as two parts: swap type and swap offset.
-
-The swap type indicates which swap device to use.
-The swap offset is the offset of the swap file to read the page data from.
-
-Swap Cache
-----------
-
-Swap cache is a map to look up folios using swap entry as the key. The result
-value can have three possible types depending on which stage of this swap entry
-was in.
-
-1. NULL: This swap entry is not used.
-
-2. folio: A folio has been allocated and bound to this swap entry. This is
-   the transient state of swap out or swap in. The folio data can be in
-   the folio or swap file, or both.
-
-3. shadow: The shadow contains the working set information of the swapped
-   out folio. This is the normal state for a swapped out page.
-
-Swap Table Internals
---------------------
-
-The previous swap cache is implemented by XArray. The XArray is a tree
-structure. Each lookup will go through multiple nodes. Can we do better?
-
-Notice that most of the time when we look up the swap cache, we are either
-in a swap in or swap out path. We should already have the swap cluster,
-which contains the swap entry.
-
-If we have a per-cluster array to store swap cache value in the cluster.
-Swap cache lookup within the cluster can be a very simple array lookup.
-
-We give such a per-cluster swap cache value array a name: the swap table.
-
-A swap table is an array of pointers. Each pointer is the same size as a
-PTE. The size of a swap table for one swap cluster typically matches a PTE
-page table, which is one page on modern 64-bit systems.
-
-With swap table, swap cache lookup can achieve great locality, simpler,
-and faster.
-
-Locking
--------
-
-Swap table modification requires taking the cluster lock. If a folio
-is being added to or removed from the swap table, the folio must be
-locked prior to the cluster lock. After adding or removing is done, the
-folio shall be unlocked.
-
-Swap table lookup is protected by RCU and atomic read. If the lookup
-returns a folio, the user must lock the folio before use.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40cf59301c21a..21215ac870144 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3783,7 +3783,6 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	ds_queue = folio_split_queue_lock(folio);
 	if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {
-		struct swap_cluster_info *ci = NULL;
 		struct lruvec *lruvec;
 
 		if (old_order > 1) {
@@ -3826,7 +3825,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 				return -EINVAL;
 			}
 
-			ci = swap_cluster_get_and_lock(folio);
+			swap_cache_lock();
 		}
 
 		/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
@@ -3862,8 +3861,8 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 			 * Anonymous folio with swap cache.
 			 * NOTE: shmem in swap cache is not supported yet.
 			 */
-			if (ci) {
-				__swap_cache_replace_folio(ci, folio, new_folio);
+			if (folio_test_swapcache(folio)) {
+				__swap_cache_replace_folio(folio, new_folio);
 				continue;
 			}
 
@@ -3901,8 +3900,8 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 		if (do_lru)
 			unlock_page_lruvec(lruvec);
 
-		if (ci)
-			swap_cluster_unlock(ci);
+		if (folio_test_swapcache(folio))
+			swap_cache_unlock();
 	} else {
 		split_queue_unlock(ds_queue);
 		return -EAGAIN;
diff --git a/mm/migrate.c b/mm/migrate.c
index 4688b9e38cd2f..11d9b43dff5d8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -571,7 +571,6 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 		struct folio *newfolio, struct folio *folio, int expected_count)
 {
 	XA_STATE(xas, &mapping->i_pages, folio->index);
-	struct swap_cluster_info *ci = NULL;
 	struct zone *oldzone, *newzone;
 	int dirty;
 	long nr = folio_nr_pages(folio);
@@ -601,13 +600,13 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	newzone = folio_zone(newfolio);
 
 	if (folio_test_swapcache(folio))
-		ci = swap_cluster_get_and_lock_irq(folio);
+		swap_cache_lock_irq();
 	else
 		xas_lock_irq(&xas);
 
 	if (!folio_ref_freeze(folio, expected_count)) {
-		if (ci)
-			swap_cluster_unlock_irq(ci);
+		if (folio_test_swapcache(folio))
+			swap_cache_unlock_irq();
 		else
 			xas_unlock_irq(&xas);
 		return -EAGAIN;
@@ -640,7 +639,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	}
 
 	if (folio_test_swapcache(folio))
-		__swap_cache_replace_folio(ci, folio, newfolio);
+		__swap_cache_replace_folio(folio, newfolio);
 	else
 		xas_store(&xas, newfolio);
 
@@ -652,8 +651,8 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	folio_ref_unfreeze(folio, expected_count - nr);
 
 	/* Leave irq disabled to prevent preemption while updating stats */
-	if (ci)
-		swap_cluster_unlock(ci);
+	if (folio_test_swapcache(folio))
+		swap_cache_unlock();
 	else
 		xas_unlock(&xas);
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 79af5f9f8b908..1db97ef2d14eb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2133,7 +2133,6 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 				struct shmem_inode_info *info, pgoff_t index,
 				struct vm_area_struct *vma)
 {
-	struct swap_cluster_info *ci;
 	struct folio *new, *old = *foliop;
 	swp_entry_t entry = old->swap;
 	int nr_pages = folio_nr_pages(old);
@@ -2166,12 +2165,12 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	new->swap = entry;
 	folio_set_swapcache(new);
 
-	ci = swap_cluster_get_and_lock_irq(old);
-	__swap_cache_replace_folio(ci, old, new);
+	swap_cache_lock_irq();
+	__swap_cache_replace_folio(old, new);
 	mem_cgroup_replace_folio(old, new);
 	shmem_update_stats(new, nr_pages);
 	shmem_update_stats(old, -nr_pages);
-	swap_cluster_unlock_irq(ci);
+	swap_cache_unlock_irq();
 
 	folio_add_lru(new);
 	*foliop = new;
diff --git a/mm/swap.h b/mm/swap.h
index 1bd466da30393..8726b587a5b5d 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -199,6 +199,11 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
 
 /* linux/mm/swap_state.c */
 extern struct address_space swap_space __read_mostly;
+void swap_cache_lock_irq(void);
+void swap_cache_unlock_irq(void);
+void swap_cache_lock(void);
+void swap_cache_unlock(void);
+
 static inline struct address_space *swap_address_space(swp_entry_t entry)
 {
 	return &swap_space;
@@ -247,14 +252,12 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry);
 void *swap_cache_get_shadow(swp_entry_t entry);
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadow);
 void swap_cache_del_folio(struct folio *folio);
-/* Below helpers require the caller to lock and pass in the swap cluster. */
-void __swap_cache_del_folio(struct swap_cluster_info *ci,
-			    struct folio *folio, swp_entry_t entry, void *shadow);
-void __swap_cache_replace_folio(struct swap_cluster_info *ci,
-				struct folio *old, struct folio *new);
-void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
+/* Below helpers require the caller to lock the swap cache. */
+void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow);
+void __swap_cache_replace_folio(struct folio *old, struct folio *new);
+void swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
 
 void show_swap_cache_info(void);
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
@@ -411,21 +414,20 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
 	return NULL;
 }
 
-static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow)
+static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadow)
 {
+	return 0;
 }
 
 static inline void swap_cache_del_folio(struct folio *folio)
 {
 }
 
-static inline void __swap_cache_del_folio(struct swap_cluster_info *ci,
-		struct folio *folio, swp_entry_t entry, void *shadow)
+static inline void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow)
 {
 }
 
-static inline void __swap_cache_replace_folio(struct swap_cluster_info *ci,
-		struct folio *old, struct folio *new)
+static inline void __swap_cache_replace_folio(struct folio *old, struct folio *new)
 {
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 44d228982521e..34c9d9b243a74 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -22,8 +22,8 @@
 #include <linux/vmalloc.h>
 #include <linux/huge_mm.h>
 #include <linux/shmem_fs.h>
+#include <linux/xarray.h>
 #include "internal.h"
-#include "swap_table.h"
 #include "swap.h"
 
 /*
@@ -41,6 +41,28 @@ struct address_space swap_space __read_mostly = {
 	.a_ops = &swap_aops,
 };
 
+static DEFINE_XARRAY(swap_cache);
+
+void swap_cache_lock_irq(void)
+{
+	xa_lock_irq(&swap_cache);
+}
+
+void swap_cache_unlock_irq(void)
+{
+	xa_unlock_irq(&swap_cache);
+}
+
+void swap_cache_lock(void)
+{
+	xa_lock(&swap_cache);
+}
+
+void swap_cache_unlock(void)
+{
+	xa_unlock(&swap_cache);
+}
+
 static bool enable_vma_readahead __read_mostly = true;
 
 #define SWAP_RA_ORDER_CEILING	5
@@ -86,17 +108,22 @@ void show_swap_cache_info(void)
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry)
 {
-	unsigned long swp_tb;
+	void *entry_val;
 	struct folio *folio;
 
 	for (;;) {
-		swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
-					swp_cluster_offset(entry));
-		if (!swp_tb_is_folio(swp_tb))
+		rcu_read_lock();
+		entry_val = xa_load(&swap_cache, entry.val);
+		if (!entry_val || xa_is_value(entry_val)) {
+			rcu_read_unlock();
 			return NULL;
-		folio = swp_tb_to_folio(swp_tb);
-		if (likely(folio_try_get(folio)))
+		}
+		folio = entry_val;
+		if (likely(folio_try_get(folio))) {
+			rcu_read_unlock();
 			return folio;
+		}
+		rcu_read_unlock();
 	}
 
 	return NULL;
@@ -112,12 +139,14 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
  */
 void *swap_cache_get_shadow(swp_entry_t entry)
 {
-	unsigned long swp_tb;
+	void *entry_val;
+
+	rcu_read_lock();
+	entry_val = xa_load(&swap_cache, entry.val);
+	rcu_read_unlock();
 
-	swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
-				swp_cluster_offset(entry));
-	if (swp_tb_is_shadow(swp_tb))
-		return swp_tb_to_shadow(swp_tb);
+	if (xa_is_value(entry_val))
+		return entry_val;
 	return NULL;
 }
 
@@ -132,46 +161,58 @@ void *swap_cache_get_shadow(swp_entry_t entry)
  * with reference count or locks.
  * The caller also needs to update the corresponding swap_map slots with
  * SWAP_HAS_CACHE bit to avoid race or conflict.
+ *
+ * Return: 0 on success, negative error code on failure.
  */
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadowp)
 {
-	void *shadow = NULL;
-	unsigned long old_tb, new_tb;
-	struct swap_cluster_info *ci;
-	unsigned int ci_start, ci_off, ci_end;
+	XA_STATE_ORDER(xas, &swap_cache, entry.val, folio_order(folio));
 	unsigned long nr_pages = folio_nr_pages(folio);
+	unsigned long i;
+	void *old;
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
 
-	new_tb = folio_to_swp_tb(folio);
-	ci_start = swp_cluster_offset(entry);
-	ci_end = ci_start + nr_pages;
-	ci_off = ci_start;
-	ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
-	do {
-		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
-		WARN_ON_ONCE(swp_tb_is_folio(old_tb));
-		if (swp_tb_is_shadow(old_tb))
-			shadow = swp_tb_to_shadow(old_tb);
-	} while (++ci_off < ci_end);
-
 	folio_ref_add(folio, nr_pages);
 	folio_set_swapcache(folio);
 	folio->swap = entry;
-	swap_cluster_unlock(ci);
 
-	node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
-	lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
+	do {
+		xas_lock_irq(&xas);
+		xas_create_range(&xas);
+		if (xas_error(&xas))
+			goto unlock;
+		for (i = 0; i < nr_pages; i++) {
+			VM_BUG_ON_FOLIO(xas.xa_index != entry.val + i, folio);
+			old = xas_load(&xas);
+			if (old && !xa_is_value(old)) {
+				VM_WARN_ON_ONCE_FOLIO(1, folio);
+				xas_set_err(&xas, -EEXIST);
+				goto unlock;
+			}
+			if (shadowp && xa_is_value(old) && !*shadowp)
+				*shadowp = old;
+			xas_store(&xas, folio);
+			xas_next(&xas);
+		}
+		node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+		lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
+unlock:
+		xas_unlock_irq(&xas);
+	} while (xas_nomem(&xas, gfp));
 
-	if (shadowp)
-		*shadowp = shadow;
+	if (!xas_error(&xas))
+		return 0;
+
+	folio_clear_swapcache(folio);
+	folio_ref_sub(folio, nr_pages);
+	return xas_error(&xas);
 }
 
 /**
  * __swap_cache_del_folio - Removes a folio from the swap cache.
- * @ci: The locked swap cluster.
  * @folio: The folio.
  * @entry: The first swap entry that the folio corresponds to.
  * @shadow: shadow value to be filled in the swap cache.
@@ -180,30 +221,23 @@ void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp
  * This won't put the folio's refcount. The caller has to do that.
  *
  * Context: Caller must ensure the folio is locked and in the swap cache
- * using the index of @entry, and lock the cluster that holds the entries.
+ * using the index of @entry, and lock the swap cache xarray.
  */
-void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
-			    swp_entry_t entry, void *shadow)
+void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow)
 {
-	unsigned long old_tb, new_tb;
-	unsigned int ci_start, ci_off, ci_end;
-	unsigned long nr_pages = folio_nr_pages(folio);
+	long nr_pages = folio_nr_pages(folio);
+	XA_STATE(xas, &swap_cache, entry.val);
+	int i;
 
-	VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) != ci);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
 
-	new_tb = shadow_swp_to_tb(shadow);
-	ci_start = swp_cluster_offset(entry);
-	ci_end = ci_start + nr_pages;
-	ci_off = ci_start;
-	do {
-		/* If shadow is NULL, we sets an empty shadow */
-		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
-		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) ||
-			     swp_tb_to_folio(old_tb) != folio);
-	} while (++ci_off < ci_end);
+	for (i = 0; i < nr_pages; i++) {
+		void *old = xas_store(&xas, shadow);
+		VM_WARN_ON_FOLIO(old != folio, folio);
+		xas_next(&xas);
+	}
 
 	folio->swap.val = 0;
 	folio_clear_swapcache(folio);
@@ -223,12 +257,11 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
  */
 void swap_cache_del_folio(struct folio *folio)
 {
-	struct swap_cluster_info *ci;
 	swp_entry_t entry = folio->swap;
 
-	ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
-	__swap_cache_del_folio(ci, folio, entry, NULL);
-	swap_cluster_unlock(ci);
+	xa_lock_irq(&swap_cache);
+	__swap_cache_del_folio(folio, entry, NULL);
+	xa_unlock_irq(&swap_cache);
 
 	put_swap_folio(folio, entry);
 	folio_ref_sub(folio, folio_nr_pages(folio));
@@ -236,7 +269,6 @@ void swap_cache_del_folio(struct folio *folio)
 
 /**
  * __swap_cache_replace_folio - Replace a folio in the swap cache.
- * @ci: The locked swap cluster.
  * @old: The old folio to be replaced.
  * @new: The new folio.
  *
@@ -246,39 +278,23 @@ void swap_cache_del_folio(struct folio *folio)
  * the starting offset to override all slots covered by the new folio.
  *
  * Context: Caller must ensure both folios are locked, and lock the
- * cluster that holds the old folio to be replaced.
+ * swap cache xarray.
  */
-void __swap_cache_replace_folio(struct swap_cluster_info *ci,
-				struct folio *old, struct folio *new)
+void __swap_cache_replace_folio(struct folio *old, struct folio *new)
 {
 	swp_entry_t entry = new->swap;
 	unsigned long nr_pages = folio_nr_pages(new);
-	unsigned int ci_off = swp_cluster_offset(entry);
-	unsigned int ci_end = ci_off + nr_pages;
-	unsigned long old_tb, new_tb;
+	XA_STATE(xas, &swap_cache, entry.val);
+	int i;
 
 	VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
 	VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
 	VM_WARN_ON_ONCE(!entry.val);
 
-	/* Swap cache still stores N entries instead of a high-order entry */
-	new_tb = folio_to_swp_tb(new);
-	do {
-		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
-		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != old);
-	} while (++ci_off < ci_end);
-
-	/*
-	 * If the old folio is partially replaced (e.g., splitting a large
-	 * folio, the old folio is shrunk, and new split sub folios replace
-	 * the shrunk part), ensure the new folio doesn't overlap it.
-	 */
-	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
-	    folio_order(old) != folio_order(new)) {
-		ci_off = swp_cluster_offset(old->swap);
-		ci_end = ci_off + folio_nr_pages(old);
-		while (ci_off++ < ci_end)
-			WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
+	for (i = 0; i < nr_pages; i++) {
+		void *old_entry = xas_store(&xas, new);
+		WARN_ON_ONCE(!old_entry || xa_is_value(old_entry) || old_entry != old);
+		xas_next(&xas);
 	}
 }
 
@@ -287,20 +303,20 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
  * @entry: The starting index entry.
  * @nr_ents: How many slots need to be cleared.
  *
- * Context: Caller must ensure the range is valid, all in one single cluster,
- * not occupied by any folio, and lock the cluster.
+ * Context: Caller must ensure the range is valid and all in one single cluster,
+ * not occupied by any folio.
  */
-void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
+void swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
 {
-	struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
-	unsigned int ci_off = swp_cluster_offset(entry), ci_end;
-	unsigned long old;
+	XA_STATE(xas, &swap_cache, entry.val);
+	int i;
 
-	ci_end = ci_off + nr_ents;
-	do {
-		old = __swap_table_xchg(ci, ci_off, null_to_swp_tb());
-		WARN_ON_ONCE(swp_tb_is_folio(old));
-	} while (++ci_off < ci_end);
+	xas_lock(&xas);
+	for (i = 0; i < nr_ents; i++) {
+		xas_store(&xas, NULL);
+		xas_next(&xas);
+	}
+	xas_unlock(&xas);
 }
 
 /*
@@ -480,7 +496,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
 		goto fail_unlock;
 
-	swap_cache_add_folio(new_folio, entry, &shadow);
+	/* May fail (-ENOMEM) if XArray node allocation failed. */
+	if (swap_cache_add_folio(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
+		goto fail_unlock;
+
 	memcg1_swapin(entry, 1);
 
 	if (shadow)
diff --git a/mm/swap_table.h b/mm/swap_table.h
index ea244a57a5b7a..ad2cb2ef46903 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -13,71 +13,6 @@ struct swap_table {
 
 #define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
 
-/*
- * A swap table entry represents the status of a swap slot on a swap
- * (physical or virtual) device. The swap table in each cluster is a
- * 1:1 map of the swap slots in this cluster.
- *
- * Each swap table entry could be a pointer (folio), a XA_VALUE
- * (shadow), or NULL.
- */
-
-/*
- * Helpers for casting one type of info into a swap table entry.
- */
-static inline unsigned long null_to_swp_tb(void)
-{
-	BUILD_BUG_ON(sizeof(unsigned long) != sizeof(atomic_long_t));
-	return 0;
-}
-
-static inline unsigned long folio_to_swp_tb(struct folio *folio)
-{
-	BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
-	return (unsigned long)folio;
-}
-
-static inline unsigned long shadow_swp_to_tb(void *shadow)
-{
-	BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
-		     BITS_PER_BYTE * sizeof(unsigned long));
-	VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
-	return (unsigned long)shadow;
-}
-
-/*
- * Helpers for swap table entry type checking.
- */
-static inline bool swp_tb_is_null(unsigned long swp_tb)
-{
-	return !swp_tb;
-}
-
-static inline bool swp_tb_is_folio(unsigned long swp_tb)
-{
-	return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
-}
-
-static inline bool swp_tb_is_shadow(unsigned long swp_tb)
-{
-	return xa_is_value((void *)swp_tb);
-}
-
-/*
- * Helpers for retrieving info from swap table.
- */
-static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
-{
-	VM_WARN_ON(!swp_tb_is_folio(swp_tb));
-	return (void *)swp_tb;
-}
-
-static inline void *swp_tb_to_shadow(unsigned long swp_tb)
-{
-	VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
-	return (void *)swp_tb;
-}
-
 /*
  * Helpers for accessing or modifying the swap table of a cluster,
  * the swap cluster must be locked.
@@ -92,17 +27,6 @@ static inline void __swap_table_set(struct swap_cluster_info *ci,
 	atomic_long_set(&table[off], swp_tb);
 }
 
-static inline unsigned long __swap_table_xchg(struct swap_cluster_info *ci,
-					      unsigned int off, unsigned long swp_tb)
-{
-	atomic_long_t *table = rcu_dereference_protected(ci->table, true);
-
-	lockdep_assert_held(&ci->lock);
-	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
-	/* Ordering is guaranteed by cluster lock, relax */
-	return atomic_long_xchg_relaxed(&table[off], swp_tb);
-}
-
 static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
 					     unsigned int off)
 {
@@ -122,7 +46,7 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
 
 	rcu_read_lock();
 	table = rcu_dereference(ci->table);
-	swp_tb = table ? atomic_long_read(&table[off]) : null_to_swp_tb();
+	swp_tb = table ? atomic_long_read(&table[off]) : 0;
 	rcu_read_unlock();
 
 	return swp_tb;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 46d2008e4b996..cacfafa9a540d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -474,7 +474,7 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
 	lockdep_assert_held(&ci->lock);
 	VM_WARN_ON_ONCE(!cluster_is_empty(ci));
 	for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++)
-		VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off)));
+		VM_WARN_ON_ONCE(__swap_table_get(ci, ci_off));
 	table = (void *)rcu_dereference_protected(ci->table, true);
 	rcu_assign_pointer(ci->table, NULL);
 
@@ -843,26 +843,6 @@ static bool cluster_scan_range(struct swap_info_struct *si,
 	return true;
 }
 
-/*
- * Currently, the swap table is not used for count tracking, just
- * do a sanity check here to ensure nothing leaked, so the swap
- * table should be empty upon freeing.
- */
-static void swap_cluster_assert_table_empty(struct swap_cluster_info *ci,
-				unsigned int start, unsigned int nr)
-{
-	unsigned int ci_off = start % SWAPFILE_CLUSTER;
-	unsigned int ci_end = ci_off + nr;
-	unsigned long swp_tb;
-
-	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
-		do {
-			swp_tb = __swap_table_get(ci, ci_off);
-			VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
-		} while (++ci_off < ci_end);
-	}
-}
-
 static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
 				unsigned int start, unsigned char usage,
 				unsigned int order)
@@ -882,7 +862,6 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 		ci->order = order;
 
 	memset(si->swap_map + start, usage, nr_pages);
-	swap_cluster_assert_table_empty(ci, start, nr_pages);
 	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
 
@@ -1275,7 +1254,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 			swap_slot_free_notify(si->bdev, offset);
 		offset++;
 	}
-	__swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries);
+	swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries);
 
 	/*
 	 * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
@@ -1423,6 +1402,7 @@ int folio_alloc_swap(struct folio *folio)
 	unsigned int order = folio_order(folio);
 	unsigned int size = 1 << order;
 	swp_entry_t entry = {};
+	int err;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
@@ -1457,19 +1437,23 @@ int folio_alloc_swap(struct folio *folio)
 	}
 
 	/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
-	if (mem_cgroup_try_charge_swap(folio, entry))
+	if (mem_cgroup_try_charge_swap(folio, entry)) {
+		err = -ENOMEM;
 		goto out_free;
+	}
 
 	if (!entry.val)
 		return -ENOMEM;
 
-	swap_cache_add_folio(folio, entry, NULL);
+	err = swap_cache_add_folio(folio, entry, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN, NULL);
+	if (err)
+		goto out_free;
 
 	return 0;
 
 out_free:
 	put_swap_folio(folio, entry);
-	return -ENOMEM;
+	return err;
 }
 
 static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
@@ -1729,7 +1713,6 @@ static void swap_entries_free(struct swap_info_struct *si,
 
 	mem_cgroup_uncharge_swap(entry, nr_pages);
 	swap_range_free(si, offset, nr_pages);
-	swap_cluster_assert_table_empty(ci, offset, nr_pages);
 
 	if (!ci->count)
 		free_cluster(si, ci);
@@ -4057,9 +4040,9 @@ static int __init swapfile_init(void)
 	swapfile_maximum_size = arch_max_swapfile_size();
 
 	/*
-	 * Once a cluster is freed, it's swap table content is read
-	 * only, and all swap cache readers (swap_cache_*) verifies
-	 * the content before use. So it's safe to use RCU slab here.
+	 * Once a cluster is freed, it's swap table content is read only, and
+	 * all swap table readers verify the content before use. So it's safe to
+	 * use RCU slab here.
 	 */
 	if (!SWP_TABLE_USE_PAGE)
 		swap_table_cachep = kmem_cache_create("swap_table",
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 614ccf39fe3fa..558ff7f413786 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -707,13 +707,12 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 {
 	int refcount;
 	void *shadow = NULL;
-	struct swap_cluster_info *ci;
 
 	BUG_ON(!folio_test_locked(folio));
 	BUG_ON(mapping != folio_mapping(folio));
 
 	if (folio_test_swapcache(folio)) {
-		ci = swap_cluster_get_and_lock_irq(folio);
+		swap_cache_lock_irq();
 	} else {
 		spin_lock(&mapping->host->i_lock);
 		xa_lock_irq(&mapping->i_pages);
@@ -758,9 +757,9 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(folio, target_memcg);
-		__swap_cache_del_folio(ci, folio, swap, shadow);
+		__swap_cache_del_folio(folio, swap, shadow);
 		memcg1_swapout(folio, swap);
-		swap_cluster_unlock_irq(ci);
+		swap_cache_unlock_irq();
 		put_swap_folio(folio, swap);
 	} else {
 		void (*free_folio)(struct folio *);
@@ -799,7 +798,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 
 cannot_free:
 	if (folio_test_swapcache(folio)) {
-		swap_cluster_unlock_irq(ci);
+		swap_cache_unlock_irq();
 	} else {
 		xa_unlock_irq(&mapping->i_pages);
 		spin_unlock(&mapping->host->i_lock);
-- 
2.47.3

[PATCH v3 00/20] Virtual Swap Space

Posted by Nhat Pham 2 days, 8 hours ago

My sincerest apologies - it seems like the cover letter (and just the
cover letter) fails to be sent out, for some reason. I'm trying to figure
out what happened - it works when I send the entire patch series to
myself...

Anyway, resending this (in-reply-to patch 1 of the series):

Changelog:
* RFC v2 -> v3:
* Implement a cluster-based allocation algorithm for virtual swap
slots, inspired by Kairui Song and Chris Li's implementation, as
well as Johannes Weiner's suggestions. This eliminates the lock
contention issues on the virtual swap layer.
* Re-use swap table for the reverse mapping.
* Remove CONFIG_VIRTUAL_SWAP.
* Reducing the size of the swap descriptor from 48 bytes to 24
bytes, i.e another 50% reduction in memory overhead from v2.
* Remove swap cache and zswap tree and use the swap descriptor
for this.
* Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
(one for allocated slots, and one for bad slots).
* Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
* Update cover letter to include new benchmark results and discussion
on overhead in various cases.
* RFC v1 -> RFC v2:
* Use a single atomic type (swap_refs) for reference counting
purpose. This brings the size of the swap descriptor from 64 B
down to 48 B (25% reduction). Suggested by Yosry Ahmed.
* Zeromap bitmap is removed in the virtual swap implementation.
This saves one bit per phyiscal swapfile slot.
* Rearrange the patches and the code change to make things more
reviewable. Suggested by Johannes Weiner.
* Update the cover letter a bit.

This patch series implements the virtual swap space idea, based on Yosry's
proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
inputs from Johannes Weiner. The same idea (with different
implementation details) has been floated by Rik van Riel since at least
2011 (see [8]).

This patch series is based on 6.19. There are a couple more
swap-related changes in the mm-stable branch that I would need to
coordinate with, but I would like to send this out as an update, to show
that the lock contention issues that plagued earlier versions have been
resolved and performance on the kernel build benchmark is now on-par with
baseline. Furthermore, memory overhead has been substantially reduced
compared to the last RFC version.

I. Motivation

Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index to swap data structures, such
as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
backing slot in this way is performant and efficient when swap is purely
just disk space, and swapoff is rare.

However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is the arguably central shortcoming of
zswap:
* In deployments when no disk space can be afforded for swap (such as
mobile and embedded devices), users cannot adopt zswap, and are forced
to use zram. This is confusing for users, and creates extra burdens
for developers, having to develop and maintain similar features for
two separate swap backends (writeback, cgroup charging, THP support,
etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
we have swapfile in the order of tens to hundreds of GBs, which are
mostly unused and only exist to enable zswap usage and zero-filled
pages swap optimizations.
* Tying zswap (and more generally, other in-memory swap backends) to
the current physical swapfile infrastructure makes zswap implicitly
statically sized. This does not make sense, as unlike disk swap, in
which we consume a limited resource (disk space or swapfile space) to
save another resource (memory), zswap consume the same resource it is
saving (memory). The more we zswap, the more memory we have available,
not less. We are not rationing a limited resource when we limit
the size of he zswap pool, but rather we are capping the resource
(memory) saving potential of zswap. Under memory pressure, using
more zswap is almost always better than the alternative (disk IOs, or
even worse, OOMs), and dynamically sizing the zswap pool on demand
allows the system to flexibly respond to these precarious scenarios.
* Operationally, static provisioning the swapfile for zswap pose
significant challenges, because the sysadmin has to prescribe how
much swap is needed a priori, for each combination of
(memory size x disk space x workload usage). It is even more
complicated when we take into account the variance of memory
compression, which changes the reclaim dynamics (and as a result,
swap space size requirement). The problem is further exarcebated for
users who rely on swap utilization (and exhaustion) as an OOM signal.

All of these factors make it very difficult to configure the swapfile
for zswap: too small of a swapfile and we risk preventable OOMs and
limit the memory saving potentials of zswap; too big of a swapfile
and we waste disk space and memory due to swap metadata overhead.
This dilemma becomes more drastic in high memory systems, which can
have up to TBs worth of memory.

Past attempts to decouple disk and compressed swap backends, namely the
ghost swapfile approach (see [13]), as well as the alternative
compressed swap backend zram, have mainly focused on eliminating the
disk space usage of compressed backends. We want a solution that not
only tackles that same problem, but also achieve the dyamicization of
swap space to maximize the memory saving potentials while reducing
operational and static memory overhead.

Finally, any swap redesign should support efficient backend transfer,
i.e without having to perform the expensive page table walk to
update all the PTEs that refer to the swap entry:
* The main motivation for this requirement is zswap writeback. To quote
Johannes (from [14]): "Combining compression with disk swap is
extremely powerful, because it dramatically reduces the worst aspects
of both: it reduces the memory footprint of compression by shedding
the coldest data to disk; it reduces the IO latencies and flash wear
of disk swap through the writeback cache. In practice, this reduces
*average event rates of the entire reclaim/paging/IO stack*."
* Another motivation is to simplify swapoff, which is both complicated
and expensive in the current design, precisely because we are storing
an encoding of the backend positional information in the page table,
and thus requires a full page table walk to remove these references.

II. High Level Design Overview

To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
“virtualize” the swap space: swap clients will work with a dynamically
allocated virtual swap slot, storing it in page table entries, and
using it to index into various swap-related data structures. The
backing storage is decoupled from the virtual swap slot, and the newly
introduced layer will “resolve” the virtual swap slot to the actual
storage. This layer also manages other metadata of the swap entry, such
as its lifetime information (swap count), via a dynamically allocated,
per-swap-entry descriptor:

struct swp_desc {
union {
swp_slot_t slot; /* 0 8 */
struct zswap_entry * zswap_entry; /* 0 8 */
}; /* 0 8 */
union {
struct folio * swap_cache; /* 8 8 */
void * shadow; /* 8 8 */
}; /* 8 8 */
unsigned int swap_count; /* 16 4 */
unsigned short memcgid:16; /* 20: 0 2 */
bool in_swapcache:1; /* 22: 0 1 */

/* Bitfield combined with previous fields */

enum swap_type type:2; /* 20:17 4 */

/* size: 24, cachelines: 1, members: 6 */
/* bit_padding: 13 bits */
/* last cacheline: 24 bytes */
};

(output from pahole).

This design allows us to:
* Decouple zswap (and zeromapped swap entry) from backing swapfile:
simply associate the virtual swap slot with one of the supported
backends: a zswap entry, a zero-filled swap page, a slot on the
swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and
have the virtual swap slot points to the page instead of the on-disk
physical swap slot. No need to perform any page table walking.

The size of the virtual swap descriptor is 24 bytes. Note that this is
not all "new" overhead, as the swap descriptor will replace:
* the swap_cgroup arrays (one per swap type) in the old design, which
is a massive source of static memory overhead. With the new design,
it is only allocated for used clusters.
* the swap tables, which holds the swap cache and workingset shadows.
* the zeromap bitmap, which is a bitmap of physical swap slots to
indicate whether the swapped out page is zero-filled or not.
* huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
one for allocated slots, and one for bad slots, representing 3 possible
states of a slot on the swapfile: allocated, free, and bad.
* the zswap tree.

So, in terms of additional memory overhead:
* For zswap entries, the added memory overhead is rather minimal. The
new indirection pointer neatly replaces the existing zswap tree.
We really only incur less than one word of overhead for swap count
blow up (since we no longer use swap continuation) and the swap type.
* For physical swap entries, the new design will impose fewer than 3 words
memory overhead. However, as noted above this overhead is only for
actively used swap entries, whereas in the current design the overhead is
static (including the swap cgroup array for example).

The primary victim of this overhead will be zram users. However, as
zswap now no longer takes up disk space, zram users can consider
switching to zswap (which, as a bonus, has a lot of useful features
out of the box, such as cgroup tracking, dynamic zswap pool sizing,
LRU-ordering writeback, etc.).

For a more concrete example, suppose we have a 32 GB swapfile (i.e.
8,388,608 swap entries), and we use zswap.

0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 0.00 MB

25% usage, or 2,097,152 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 48.25 MB

50% usage, or 4,194,304 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 96.50 MB

75% usage, or 6,291,456 entries:
* Old design total overhead: 121.00 MB
* Vswap total overhead: 144.75 MB

100% usage, or 8,388,608 entries:
* Old design total overhead: 153.00 MB
* Vswap total overhead: 193.00 MB

So even in the worst case scenario for virtual swap, i.e when we
somehow have an oracle to correctly size the swapfile for zswap
pool to 32 GB, the added overhead is only 40 MB, which is a mere
0.12% of the total swapfile :)

In practice, the overhead will be closer to the 50-75% usage case, as
systems tend to leave swap headroom for pathological events or sudden
spikes in memory requirements. The added overhead in these cases are
practically neglible. And in deployments where swapfiles for zswap
are previously sparsely used, switching over to virtual swap will
actually reduce memory overhead.

Doing the same math for the disk swap, which is the worst case for
virtual swap in terms of swap backends:

0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 2.00 MB

25% usage, or 2,097,152 entries:
* Old design total overhead: 41.00 MB
* Vswap total overhead: 66.25 MB

50% usage, or 4,194,304 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 130.50 MB

75% usage, or 6,291,456 entries:
* Old design total overhead: 73.00 MB
* Vswap total overhead: 194.75 MB

100% usage, or 8,388,608 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 259.00 MB

The added overhead is 170MB, which is 0.5% of the total swapfile size,
again in the worst case when we have a sizing oracle.

Please see the attached patches for more implementation details.

III. Usage and Benchmarking

This patch series introduce no new syscalls or userspace API. Existing
userspace setups will work as-is, except we no longer have to create a
swapfile or set memory.swap.max if we want to use zswap, as zswap is no
longer tied to physical swap. The zswap pool will be automatically and
dynamically sized based on memory usage and reclaim dynamics.

To measure the performance of the new implementation, I have run the
following benchmarks:

1. Kernel building: 52 workers (one per processor), memory.max = 3G.

Using zswap as the backend:

Baseline:
real: mean: 185.2s, stdev: 0.93s
sys: mean: 683.7s, stdev: 33.77s

Vswap:
real: mean: 184.88s, stdev: 0.57s
sys: mean: 675.14s, stdev: 32.8s

We actually see a slight improvement in systime (by 1.5%) :) This is
likely because we no longer have to perform swap charging for zswap
entries, and virtual swap allocator is simpler than that of physical
swap.

Using SSD swap as the backend:

Baseline:
real: mean: 200.3s, stdev: 2.33s
sys: mean: 489.88s, stdev: 9.62s

Vswap:
real: mean: 201.47s, stdev: 2.98s
sys: mean: 487.36s, stdev: 5.53s

The performance is neck-to-neck.

IV. Future Use Cases

While the patch series focus on two applications (decoupling swap
backends and swapoff optimization/simplification), this new,
future-proof design also allows us to implement new swap features more
easily and efficiently:

* Multi-tier swapping (as mentioned in [5]), with transparent
transferring (promotion/demotion) of pages across tiers (see [8] and
[9]). Similar to swapoff, with the old design we would need to
perform the expensive page table walk.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
Huang in [6]).
* Mixed backing THP swapin (see [7]): Once you have pinned down the
backing store of THPs, then you can dispatch each range of subpages
to appropriate backend swapin handler.
* Swapping a folio out with discontiguous physical swap slots
(see [10]).
* Zswap writeback optimization: The current architecture pre-reserves
physical swap space for pages when they enter the zswap pool, giving
the kernel no flexibility at writeback time. With the virtual swap
implementation, the backends are decoupled, and physical swap space
is allocated on-demand at writeback time, at which point we can make
much smarter decisions: we can batch multiple zswap writeback
operations into a single IO request, allocating contiguous physical
swap slots for that request. We can even perform compressed writeback
(i.e writing these pages without decompressing them) (see [12]).

V. References

[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
[10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
[11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
[12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
[13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
[14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/

Nhat Pham (20):
mm/swap: decouple swap cache from physical swap infrastructure
swap: rearrange the swap header file
mm: swap: add an abstract API for locking out swapoff
zswap: add new helpers for zswap entry operations
mm/swap: add a new function to check if a swap entry is in swap
cached.
mm: swap: add a separate type for physical swap slots
mm: create scaffolds for the new virtual swap implementation
zswap: prepare zswap for swap virtualization
mm: swap: allocate a virtual swap slot for each swapped out page
swap: move swap cache to virtual swap descriptor
zswap: move zswap entry management to the virtual swap descriptor
swap: implement the swap_cgroup API using virtual swap
swap: manage swap entry lifecycle at the virtual swap layer
mm: swap: decouple virtual swap slot from backing store
zswap: do not start zswap shrinker if there is no physical swap slots
swap: do not unnecesarily pin readahead swap entries
swapfile: remove zeromap bitmap
memcg: swap: only charge physical swap slots
swap: simplify swapoff using virtual swap
swapfile: replace the swap map with bitmaps

base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
--
2.47.3

Re: [PATCH v3 00/20] Virtual Swap Space

Posted by Chris Li 1 day, 18 hours ago

On Sun, Feb 8, 2026 at 4:15 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> My sincerest apologies - it seems like the cover letter (and just the
> cover letter) fails to be sent out, for some reason. I'm trying to figure
> out what happened - it works when I send the entire patch series to
> myself...
>
> Anyway, resending this (in-reply-to patch 1 of the series):

For the record I did receive your original V3 cover letter from the
linux-mm mailing list.

> Changelog:
> * RFC v2 -> v3:
>     * Implement a cluster-based allocation algorithm for virtual swap
>       slots, inspired by Kairui Song and Chris Li's implementation, as
>       well as Johannes Weiner's suggestions. This eliminates the lock
>           contention issues on the virtual swap layer.
>     * Re-use swap table for the reverse mapping.
>     * Remove CONFIG_VIRTUAL_SWAP.
>     * Reducing the size of the swap descriptor from 48 bytes to 24

Is the per swap slot entry overhead 24 bytes in your implementation?
The current swap overhead is 3 static +8 dynamic, your 24 dynamic is a
big jump. You can argue that 8->24 is not a big jump . But it is an
unnecessary price compared to the alternatives, which is 8 dynamic +
4(optional redirect).

>       bytes, i.e another 50% reduction in memory overhead from v2.
>     * Remove swap cache and zswap tree and use the swap descriptor
>       for this.
>     * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
>       (one for allocated slots, and one for bad slots).
>     * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)

My git log shows 7d0a66e4bb9081d75c82ec4957c50034cb0ea449 is tag "v6.18".

>         * Update cover letter to include new benchmark results and discussion
>           on overhead in various cases.
> * RFC v1 -> RFC v2:
>     * Use a single atomic type (swap_refs) for reference counting
>       purpose. This brings the size of the swap descriptor from 64 B
>       down to 48 B (25% reduction). Suggested by Yosry Ahmed.
>     * Zeromap bitmap is removed in the virtual swap implementation.
>       This saves one bit per phyiscal swapfile slot.
>     * Rearrange the patches and the code change to make things more
>       reviewable. Suggested by Johannes Weiner.
>     * Update the cover letter a bit.
>
> This patch series implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner. The same idea (with different
> implementation details) has been floated by Rik van Riel since at least
> 2011 (see [8]).
>
> This patch series is based on 6.19. There are a couple more
> swap-related changes in the mm-stable branch that I would need to
> coordinate with, but I would like to send this out as an update, to show

Ah, you need to mention that in the first line to Andrew. Spell out
this series is not for Andrew to consume in the MM series. It can't
any way because it does not apply to mm-unstable nor mm-stable.

BTW, I have the following compile error with this series (fedora 43).
Same config compile fine on v6.19.

In file included from ./include/linux/local_lock.h:5,
                 from ./include/linux/mmzone.h:24,
                 from ./include/linux/gfp.h:7,
                 from ./include/linux/mm.h:7,
                 from mm/vswap.c:7:
mm/vswap.c: In function ‘vswap_cpu_dead’:
./include/linux/percpu-defs.h:221:45: error: initialization from
pointer to non-enclosed address space
  221 |         const void __percpu *__vpp_verify = (typeof((ptr) +
0))NULL;    \
      |                                             ^
./include/linux/local_lock_internal.h:105:40: note: in definition of
macro ‘__local_lock_acquire’
  105 |                 __l = (local_lock_t *)(lock);
         \
      |                                        ^~~~
./include/linux/local_lock.h:17:41: note: in expansion of macro
‘__local_lock’
   17 | #define local_lock(lock)                __local_lock(this_cpu_ptr(lock))
      |                                         ^~~~~~~~~~~~
./include/linux/percpu-defs.h:245:9: note: in expansion of macro
‘__verify_pcpu_ptr’
  245 |         __verify_pcpu_ptr(ptr);
         \
      |         ^~~~~~~~~~~~~~~~~
./include/linux/percpu-defs.h:256:27: note: in expansion of macro ‘raw_cpu_ptr’
  256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
      |                           ^~~~~~~~~~~
./include/linux/local_lock.h:17:54: note: in expansion of macro
‘this_cpu_ptr’
   17 | #define local_lock(lock)
__local_lock(this_cpu_ptr(lock))
      |
^~~~~~~~~~~~
mm/vswap.c:1518:9: note: in expansion of macro ‘local_lock’
 1518 |         local_lock(&percpu_cluster->lock);
      |         ^~~~~~~~~~

> that the lock contention issues that plagued earlier versions have been
> resolved and performance on the kernel build benchmark is now on-par with
> baseline. Furthermore, memory overhead has been substantially reduced
> compared to the last RFC version.
>
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> just disk space, and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is the arguably central shortcoming of
> zswap:
> * In deployments when no disk space can be afforded for swap (such as
>   mobile and embedded devices), users cannot adopt zswap, and are forced
>   to use zram. This is confusing for users, and creates extra burdens
>   for developers, having to develop and maintain similar features for
>   two separate swap backends (writeback, cgroup charging, THP support,
>   etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
>   we have swapfile in the order of tens to hundreds of GBs, which are
>   mostly unused and only exist to enable zswap usage and zero-filled
>   pages swap optimizations.
> * Tying zswap (and more generally, other in-memory swap backends) to
>   the current physical swapfile infrastructure makes zswap implicitly
>   statically sized. This does not make sense, as unlike disk swap, in
>   which we consume a limited resource (disk space or swapfile space) to
>   save another resource (memory), zswap consume the same resource it is
>   saving (memory). The more we zswap, the more memory we have available,
>   not less. We are not rationing a limited resource when we limit
>   the size of he zswap pool, but rather we are capping the resource
>   (memory) saving potential of zswap. Under memory pressure, using
>   more zswap is almost always better than the alternative (disk IOs, or
>   even worse, OOMs), and dynamically sizing the zswap pool on demand
>   allows the system to flexibly respond to these precarious scenarios.
> * Operationally, static provisioning the swapfile for zswap pose
>   significant challenges, because the sysadmin has to prescribe how
>   much swap is needed a priori, for each combination of
>   (memory size x disk space x workload usage). It is even more
>   complicated when we take into account the variance of memory
>   compression, which changes the reclaim dynamics (and as a result,
>   swap space size requirement). The problem is further exarcebated for
>   users who rely on swap utilization (and exhaustion) as an OOM signal.
>
>   All of these factors make it very difficult to configure the swapfile
>   for zswap: too small of a swapfile and we risk preventable OOMs and
>   limit the memory saving potentials of zswap; too big of a swapfile
>   and we waste disk space and memory due to swap metadata overhead.
>   This dilemma becomes more drastic in high memory systems, which can
>   have up to TBs worth of memory.
>
> Past attempts to decouple disk and compressed swap backends, namely the
> ghost swapfile approach (see [13]), as well as the alternative
> compressed swap backend zram, have mainly focused on eliminating the
> disk space usage of compressed backends. We want a solution that not
> only tackles that same problem, but also achieve the dyamicization of
> swap space to maximize the memory saving potentials while reducing
> operational and static memory overhead.
>
> Finally, any swap redesign should support efficient backend transfer,
> i.e without having to perform the expensive page table walk to
> update all the PTEs that refer to the swap entry:
> * The main motivation for this requirement is zswap writeback. To quote
>   Johannes (from [14]): "Combining compression with disk swap is
>   extremely powerful, because it dramatically reduces the worst aspects
>   of both: it reduces the memory footprint of compression by shedding
>   the coldest data to disk; it reduces the IO latencies and flash wear
>   of disk swap through the writeback cache. In practice, this reduces
>   *average event rates of the entire reclaim/paging/IO stack*."
> * Another motivation is to simplify swapoff, which is both complicated
>   and expensive in the current design, precisely because we are storing
>   an encoding of the backend positional information in the page table,
>   and thus requires a full page table walk to remove these references.
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a dynamically
> allocated virtual swap slot, storing it in page table entries, and
> using it to index into various swap-related data structures. The
> backing storage is decoupled from the virtual swap slot, and the newly
> introduced layer will “resolve” the virtual swap slot to the actual
> storage. This layer also manages other metadata of the swap entry, such
> as its lifetime information (swap count), via a dynamically allocated,
> per-swap-entry descriptor:
>
> struct swp_desc {
>         union {
>                 swp_slot_t         slot;                 /*     0     8 */
>                 struct zswap_entry * zswap_entry;        /*     0     8 */
>         };                                               /*     0     8 */
>         union {
>                 struct folio *     swap_cache;           /*     8     8 */
>                 void *             shadow;               /*     8     8 */
>         };                                               /*     8     8 */
>         unsigned int               swap_count;           /*    16     4 */
>         unsigned short             memcgid:16;           /*    20: 0  2 */
>         bool                       in_swapcache:1;       /*    22: 0  1 */
>
>         /* Bitfield combined with previous fields */
>
>         enum swap_type             type:2;               /*    20:17  4 */
>
>         /* size: 24, cachelines: 1, members: 6 */
>         /* bit_padding: 13 bits */
>         /* last cacheline: 24 bytes */
> };
>
> (output from pahole).
>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entry) from backing swapfile:
>   simply associate the virtual swap slot with one of the supported
>   backends: a zswap entry, a zero-filled swap page, a slot on the
>   swapfile, or an in-memory page.
> * Simplify and optimize swapoff: we only have to fault the page in and
>   have the virtual swap slot points to the page instead of the on-disk
>   physical swap slot. No need to perform any page table walking.
>
> The size of the virtual swap descriptor is 24 bytes. Note that this is
> not all "new" overhead, as the swap descriptor will replace:
> * the swap_cgroup arrays (one per swap type) in the old design, which
>   is a massive source of static memory overhead. With the new design,
>   it is only allocated for used clusters.
> * the swap tables, which holds the swap cache and workingset shadows.
> * the zeromap bitmap, which is a bitmap of physical swap slots to
>   indicate whether the swapped out page is zero-filled or not.
> * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
>   one for allocated slots, and one for bad slots, representing 3 possible
>   states of a slot on the swapfile: allocated, free, and bad.
> * the zswap tree.
>
> So, in terms of additional memory overhead:
> * For zswap entries, the added memory overhead is rather minimal. The
>   new indirection pointer neatly replaces the existing zswap tree.
>   We really only incur less than one word of overhead for swap count
>   blow up (since we no longer use swap continuation) and the swap type.
> * For physical swap entries, the new design will impose fewer than 3 words
>   memory overhead. However, as noted above this overhead is only for
>   actively used swap entries, whereas in the current design the overhead is
>   static (including the swap cgroup array for example).
>
>   The primary victim of this overhead will be zram users. However, as
>   zswap now no longer takes up disk space, zram users can consider
>   switching to zswap (which, as a bonus, has a lot of useful features
>   out of the box, such as cgroup tracking, dynamic zswap pool sizing,
>   LRU-ordering writeback, etc.).
>
> For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> 8,388,608 swap entries), and we use zswap.
>
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 0.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 48.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 96.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 121.00 MB
> * Vswap total overhead: 144.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 153.00 MB
> * Vswap total overhead: 193.00 MB
>
> So even in the worst case scenario for virtual swap, i.e when we
> somehow have an oracle to correctly size the swapfile for zswap
> pool to 32 GB, the added overhead is only 40 MB, which is a mere
> 0.12% of the total swapfile :)
>
> In practice, the overhead will be closer to the 50-75% usage case, as
> systems tend to leave swap headroom for pathological events or sudden
> spikes in memory requirements. The added overhead in these cases are
> practically neglible. And in deployments where swapfiles for zswap
> are previously sparsely used, switching over to virtual swap will
> actually reduce memory overhead.
>
> Doing the same math for the disk swap, which is the worst case for
> virtual swap in terms of swap backends:
>
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 2.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 41.00 MB
> * Vswap total overhead: 66.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 130.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 73.00 MB
> * Vswap total overhead: 194.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 259.00 MB
>
> The added overhead is 170MB, which is 0.5% of the total swapfile size,
> again in the worst case when we have a sizing oracle.
>
> Please see the attached patches for more implementation details.
>
>
> III. Usage and Benchmarking
>
> This patch series introduce no new syscalls or userspace API. Existing
> userspace setups will work as-is, except we no longer have to create a
> swapfile or set memory.swap.max if we want to use zswap, as zswap is no
> longer tied to physical swap. The zswap pool will be automatically and
> dynamically sized based on memory usage and reclaim dynamics.
>
> To measure the performance of the new implementation, I have run the
> following benchmarks:
>
> 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
>
> Using zswap as the backend:
>
> Baseline:
> real: mean: 185.2s, stdev: 0.93s
> sys: mean: 683.7s, stdev: 33.77s
>
> Vswap:
> real: mean: 184.88s, stdev: 0.57s
> sys: mean: 675.14s, stdev: 32.8s

Can you show your user space time as well to complete the picture?

How many runs do you have for stdev 32.8s?

>
> We actually see a slight improvement in systime (by 1.5%) :) This is
> likely because we no longer have to perform swap charging for zswap
> entries, and virtual swap allocator is simpler than that of physical
> swap.
>
> Using SSD swap as the backend:
Please include zram swap test data as well. Android heavily uses zram
for swapping.

>
> Baseline:
> real: mean: 200.3s, stdev: 2.33s
> sys: mean: 489.88s, stdev: 9.62s
>
> Vswap:
> real: mean: 201.47s, stdev: 2.98s
> sys: mean: 487.36s, stdev: 5.53s
>
> The performance is neck-to-neck.

I strongly suspect there is some performance difference that hasn't
been covered by your test yet. Need more conformation by others on the
performance measurement. The swap testing is tricky. You want to push
to stress barely within the OOM limit. Need more data.

Chris

>
>
> IV. Future Use Cases
>
> While the patch series focus on two applications (decoupling swap
> backends and swapoff optimization/simplification), this new,
> future-proof design also allows us to implement new swap features more
> easily and efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
>   transferring (promotion/demotion) of pages across tiers (see [8] and
>   [9]). Similar to swapoff, with the old design we would need to
>   perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
>   Huang in [6]).
> * Mixed backing THP swapin (see [7]): Once you have pinned down the
>   backing store of THPs, then you can dispatch each range of subpages
>   to appropriate backend swapin handler.
> * Swapping a folio out with discontiguous physical swap slots
>   (see [10]).
> * Zswap writeback optimization: The current architecture pre-reserves
>   physical swap space for pages when they enter the zswap pool, giving
>   the kernel no flexibility at writeback time. With the virtual swap
>   implementation, the backends are decoupled, and physical swap space
>   is allocated on-demand at writeback time, at which point we can make
>   much smarter decisions: we can batch multiple zswap writeback
>   operations into a single IO request, allocating contiguous physical
>   swap slots for that request. We can even perform compressed writeback
>   (i.e writing these pages without decompressing them) (see [12]).
>
>
> V. References
>
> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> [2]: https://lwn.net/Articles/932077/
> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
> [10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
> [11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
> [12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
> [13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
> [14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
>
> Nhat Pham (20):
>   mm/swap: decouple swap cache from physical swap infrastructure
>   swap: rearrange the swap header file
>   mm: swap: add an abstract API for locking out swapoff
>   zswap: add new helpers for zswap entry operations
>   mm/swap: add a new function to check if a swap entry is in swap
>     cached.
>   mm: swap: add a separate type for physical swap slots
>   mm: create scaffolds for the new virtual swap implementation
>   zswap: prepare zswap for swap virtualization
>   mm: swap: allocate a virtual swap slot for each swapped out page
>   swap: move swap cache to virtual swap descriptor
>   zswap: move zswap entry management to the virtual swap descriptor
>   swap: implement the swap_cgroup API using virtual swap
>   swap: manage swap entry lifecycle at the virtual swap layer
>   mm: swap: decouple virtual swap slot from backing store
>   zswap: do not start zswap shrinker if there is no physical swap slots
>   swap: do not unnecesarily pin readahead swap entries
>   swapfile: remove zeromap bitmap
>   memcg: swap: only charge physical swap slots
>   swap: simplify swapoff using virtual swap
>   swapfile: replace the swap map with bitmaps
>
>  Documentation/mm/swap-table.rst |   69 --
>  MAINTAINERS                     |    2 +
>  include/linux/cpuhotplug.h      |    1 +
>  include/linux/mm_types.h        |   16 +
>  include/linux/shmem_fs.h        |    7 +-
>  include/linux/swap.h            |  135 ++-
>  include/linux/swap_cgroup.h     |   13 -
>  include/linux/swapops.h         |   25 +
>  include/linux/zswap.h           |   17 +-
>  kernel/power/swap.c             |    6 +-
>  mm/Makefile                     |    5 +-
>  mm/huge_memory.c                |   11 +-
>  mm/internal.h                   |   12 +-
>  mm/memcontrol-v1.c              |    6 +
>  mm/memcontrol.c                 |  142 ++-
>  mm/memory.c                     |  101 +-
>  mm/migrate.c                    |   13 +-
>  mm/mincore.c                    |   15 +-
>  mm/page_io.c                    |   83 +-
>  mm/shmem.c                      |  215 +---
>  mm/swap.h                       |  157 +--
>  mm/swap_cgroup.c                |  172 ---
>  mm/swap_state.c                 |  306 +----
>  mm/swap_table.h                 |   78 +-
>  mm/swapfile.c                   | 1518 ++++-------------------
>  mm/userfaultfd.c                |   18 +-
>  mm/vmscan.c                     |   28 +-
>  mm/vswap.c                      | 2025 +++++++++++++++++++++++++++++++
>  mm/zswap.c                      |  142 +--
>  29 files changed, 2853 insertions(+), 2485 deletions(-)
>  delete mode 100644 Documentation/mm/swap-table.rst
>  delete mode 100644 mm/swap_cgroup.c
>  create mode 100644 mm/vswap.c
>
>
> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
> --
> 2.47.3
>

Re: [PATCH v3 00/20] Virtual Swap Space

Posted by Nhat Pham 12 hours ago

On Mon, Feb 9, 2026 at 4:20 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Sun, Feb 8, 2026 at 4:15 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > My sincerest apologies - it seems like the cover letter (and just the
> > cover letter) fails to be sent out, for some reason. I'm trying to figure
> > out what happened - it works when I send the entire patch series to
> > myself...
> >
> > Anyway, resending this (in-reply-to patch 1 of the series):
>
> For the record I did receive your original V3 cover letter from the
> linux-mm mailing list.

I have no idea what happened to be honest. It did not show up on lore
for a couple of hours, and my coworkers did not receive the cover
letter email initially. I did not receive any error message or logs
either - git send-email returns Success to me, and when I checked on
the web gmail client (since I used a gmail email account), the whole
series is there.

I tried re-sending a couple times, to no avail. Then, in a couple of
hours, all of these attempts showed up.

Anyway, this is my bad - I'll be more patient next time. If it does
not show up for a couple of hours then I'll do some more digging.

>
> > Changelog:
> > * RFC v2 -> v3:
> >     * Implement a cluster-based allocation algorithm for virtual swap
> >       slots, inspired by Kairui Song and Chris Li's implementation, as
> >       well as Johannes Weiner's suggestions. This eliminates the lock
> >           contention issues on the virtual swap layer.
> >     * Re-use swap table for the reverse mapping.
> >     * Remove CONFIG_VIRTUAL_SWAP.
> >     * Reducing the size of the swap descriptor from 48 bytes to 24
>
> Is the per swap slot entry overhead 24 bytes in your implementation?
> The current swap overhead is 3 static +8 dynamic, your 24 dynamic is a
> big jump. You can argue that 8->24 is not a big jump . But it is an
> unnecessary price compared to the alternatives, which is 8 dynamic +
> 4(optional redirect).

It depends in cases - you can check the memory overhead discussion below :)

>
> >       bytes, i.e another 50% reduction in memory overhead from v2.
> >     * Remove swap cache and zswap tree and use the swap descriptor
> >       for this.
> >     * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
> >       (one for allocated slots, and one for bad slots).
> >     * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
>
> My git log shows 7d0a66e4bb9081d75c82ec4957c50034cb0ea449 is tag "v6.18".

Oh yeah I forgot to update that. That was from an old cover letter of
an old version that never got sent out - I'll correct that in future
versions

(if you scroll down to the bottom of the cover letter you should see
the correct base, which should be 6.19).

>
> >         * Update cover letter to include new benchmark results and discussion
> >           on overhead in various cases.
> > * RFC v1 -> RFC v2:
> >     * Use a single atomic type (swap_refs) for reference counting
> >       purpose. This brings the size of the swap descriptor from 64 B
> >       down to 48 B (25% reduction). Suggested by Yosry Ahmed.
> >     * Zeromap bitmap is removed in the virtual swap implementation.
> >       This saves one bit per phyiscal swapfile slot.
> >     * Rearrange the patches and the code change to make things more
> >       reviewable. Suggested by Johannes Weiner.
> >     * Update the cover letter a bit.
> >
> > This patch series implements the virtual swap space idea, based on Yosry's
> > proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> > inputs from Johannes Weiner. The same idea (with different
> > implementation details) has been floated by Rik van Riel since at least
> > 2011 (see [8]).
> >
> > This patch series is based on 6.19. There are a couple more
> > swap-related changes in the mm-stable branch that I would need to
> > coordinate with, but I would like to send this out as an update, to show
>
> Ah, you need to mention that in the first line to Andrew. Spell out
> this series is not for Andrew to consume in the MM series. It can't
> any way because it does not apply to mm-unstable nor mm-stable.

Fair - I'll make sure to move this paragraph to above the changelog next time :)

>
> BTW, I have the following compile error with this series (fedora 43).
> Same config compile fine on v6.19.
>
> In file included from ./include/linux/local_lock.h:5,
>                  from ./include/linux/mmzone.h:24,
>                  from ./include/linux/gfp.h:7,
>                  from ./include/linux/mm.h:7,
>                  from mm/vswap.c:7:
> mm/vswap.c: In function ‘vswap_cpu_dead’:
> ./include/linux/percpu-defs.h:221:45: error: initialization from
> pointer to non-enclosed address space
>   221 |         const void __percpu *__vpp_verify = (typeof((ptr) +
> 0))NULL;    \
>       |                                             ^
> ./include/linux/local_lock_internal.h:105:40: note: in definition of
> macro ‘__local_lock_acquire’
>   105 |                 __l = (local_lock_t *)(lock);
>          \
>       |                                        ^~~~
> ./include/linux/local_lock.h:17:41: note: in expansion of macro
> ‘__local_lock’
>    17 | #define local_lock(lock)                __local_lock(this_cpu_ptr(lock))
>       |                                         ^~~~~~~~~~~~
> ./include/linux/percpu-defs.h:245:9: note: in expansion of macro
> ‘__verify_pcpu_ptr’
>   245 |         __verify_pcpu_ptr(ptr);
>          \
>       |         ^~~~~~~~~~~~~~~~~
> ./include/linux/percpu-defs.h:256:27: note: in expansion of macro ‘raw_cpu_ptr’
>   256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
>       |                           ^~~~~~~~~~~
> ./include/linux/local_lock.h:17:54: note: in expansion of macro
> ‘this_cpu_ptr’
>    17 | #define local_lock(lock)
> __local_lock(this_cpu_ptr(lock))
>       |
> ^~~~~~~~~~~~
> mm/vswap.c:1518:9: note: in expansion of macro ‘local_lock’
>  1518 |         local_lock(&percpu_cluster->lock);
>       |         ^~~~~~~~~~

Ah that's strange. It compiled on all of my setups (I tested with a couple
different ones), but I must have missed some cases. Would you mind
sharing your configs so that I can reproduce this compilation error?

(although I'm sure kernel test robot will scream at me soon, which
usually includes configs that cause the compilation issue).

>
> > that the lock contention issues that plagued earlier versions have been
> > resolved and performance on the kernel build benchmark is now on-par with
> > baseline. Furthermore, memory overhead has been substantially reduced
> > compared to the last RFC version.
> >
> >
> > I. Motivation
> >
> > Currently, when an anon page is swapped out, a slot in a backing swap
> > device is allocated and stored in the page table entries that refer to
> > the original page. This slot is also used as the "key" to find the
> > swapped out content, as well as the index to swap data structures, such
> > as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> > backing slot in this way is performant and efficient when swap is purely
> > just disk space, and swapoff is rare.
> >
> > However, the advent of many swap optimizations has exposed major
> > drawbacks of this design. The first problem is that we occupy a physical
> > slot in the swap space, even for pages that are NEVER expected to hit
> > the disk: pages compressed and stored in the zswap pool, zero-filled
> > pages, or pages rejected by both of these optimizations when zswap
> > writeback is disabled. This is the arguably central shortcoming of
> > zswap:
> > * In deployments when no disk space can be afforded for swap (such as
> >   mobile and embedded devices), users cannot adopt zswap, and are forced
> >   to use zram. This is confusing for users, and creates extra burdens
> >   for developers, having to develop and maintain similar features for
> >   two separate swap backends (writeback, cgroup charging, THP support,
> >   etc.). For instance, see the discussion in [4].
> > * Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
> >   we have swapfile in the order of tens to hundreds of GBs, which are
> >   mostly unused and only exist to enable zswap usage and zero-filled
> >   pages swap optimizations.
> > * Tying zswap (and more generally, other in-memory swap backends) to
> >   the current physical swapfile infrastructure makes zswap implicitly
> >   statically sized. This does not make sense, as unlike disk swap, in
> >   which we consume a limited resource (disk space or swapfile space) to
> >   save another resource (memory), zswap consume the same resource it is
> >   saving (memory). The more we zswap, the more memory we have available,
> >   not less. We are not rationing a limited resource when we limit
> >   the size of he zswap pool, but rather we are capping the resource
> >   (memory) saving potential of zswap. Under memory pressure, using
> >   more zswap is almost always better than the alternative (disk IOs, or
> >   even worse, OOMs), and dynamically sizing the zswap pool on demand
> >   allows the system to flexibly respond to these precarious scenarios.
> > * Operationally, static provisioning the swapfile for zswap pose
> >   significant challenges, because the sysadmin has to prescribe how
> >   much swap is needed a priori, for each combination of
> >   (memory size x disk space x workload usage). It is even more
> >   complicated when we take into account the variance of memory
> >   compression, which changes the reclaim dynamics (and as a result,
> >   swap space size requirement). The problem is further exarcebated for
> >   users who rely on swap utilization (and exhaustion) as an OOM signal.
> >
> >   All of these factors make it very difficult to configure the swapfile
> >   for zswap: too small of a swapfile and we risk preventable OOMs and
> >   limit the memory saving potentials of zswap; too big of a swapfile
> >   and we waste disk space and memory due to swap metadata overhead.
> >   This dilemma becomes more drastic in high memory systems, which can
> >   have up to TBs worth of memory.
> >
> > Past attempts to decouple disk and compressed swap backends, namely the
> > ghost swapfile approach (see [13]), as well as the alternative
> > compressed swap backend zram, have mainly focused on eliminating the
> > disk space usage of compressed backends. We want a solution that not
> > only tackles that same problem, but also achieve the dyamicization of
> > swap space to maximize the memory saving potentials while reducing
> > operational and static memory overhead.
> >
> > Finally, any swap redesign should support efficient backend transfer,
> > i.e without having to perform the expensive page table walk to
> > update all the PTEs that refer to the swap entry:
> > * The main motivation for this requirement is zswap writeback. To quote
> >   Johannes (from [14]): "Combining compression with disk swap is
> >   extremely powerful, because it dramatically reduces the worst aspects
> >   of both: it reduces the memory footprint of compression by shedding
> >   the coldest data to disk; it reduces the IO latencies and flash wear
> >   of disk swap through the writeback cache. In practice, this reduces
> >   *average event rates of the entire reclaim/paging/IO stack*."
> > * Another motivation is to simplify swapoff, which is both complicated
> >   and expensive in the current design, precisely because we are storing
> >   an encoding of the backend positional information in the page table,
> >   and thus requires a full page table walk to remove these references.
> >
> >
> > II. High Level Design Overview
> >
> > To fix the aforementioned issues, we need an abstraction that separates
> > a swap entry from its physical backing storage. IOW, we need to
> > “virtualize” the swap space: swap clients will work with a dynamically
> > allocated virtual swap slot, storing it in page table entries, and
> > using it to index into various swap-related data structures. The
> > backing storage is decoupled from the virtual swap slot, and the newly
> > introduced layer will “resolve” the virtual swap slot to the actual
> > storage. This layer also manages other metadata of the swap entry, such
> > as its lifetime information (swap count), via a dynamically allocated,
> > per-swap-entry descriptor:
> >
> > struct swp_desc {
> >         union {
> >                 swp_slot_t         slot;                 /*     0     8 */
> >                 struct zswap_entry * zswap_entry;        /*     0     8 */
> >         };                                               /*     0     8 */
> >         union {
> >                 struct folio *     swap_cache;           /*     8     8 */
> >                 void *             shadow;               /*     8     8 */
> >         };                                               /*     8     8 */
> >         unsigned int               swap_count;           /*    16     4 */
> >         unsigned short             memcgid:16;           /*    20: 0  2 */
> >         bool                       in_swapcache:1;       /*    22: 0  1 */
> >
> >         /* Bitfield combined with previous fields */
> >
> >         enum swap_type             type:2;               /*    20:17  4 */
> >
> >         /* size: 24, cachelines: 1, members: 6 */
> >         /* bit_padding: 13 bits */
> >         /* last cacheline: 24 bytes */
> > };
> >
> > (output from pahole).
> >
> > This design allows us to:
> > * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> >   simply associate the virtual swap slot with one of the supported
> >   backends: a zswap entry, a zero-filled swap page, a slot on the
> >   swapfile, or an in-memory page.
> > * Simplify and optimize swapoff: we only have to fault the page in and
> >   have the virtual swap slot points to the page instead of the on-disk
> >   physical swap slot. No need to perform any page table walking.
> >
> > The size of the virtual swap descriptor is 24 bytes. Note that this is
> > not all "new" overhead, as the swap descriptor will replace:
> > * the swap_cgroup arrays (one per swap type) in the old design, which
> >   is a massive source of static memory overhead. With the new design,
> >   it is only allocated for used clusters.
> > * the swap tables, which holds the swap cache and workingset shadows.
> > * the zeromap bitmap, which is a bitmap of physical swap slots to
> >   indicate whether the swapped out page is zero-filled or not.
> > * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> >   one for allocated slots, and one for bad slots, representing 3 possible
> >   states of a slot on the swapfile: allocated, free, and bad.
> > * the zswap tree.
> >
> > So, in terms of additional memory overhead:
> > * For zswap entries, the added memory overhead is rather minimal. The
> >   new indirection pointer neatly replaces the existing zswap tree.
> >   We really only incur less than one word of overhead for swap count
> >   blow up (since we no longer use swap continuation) and the swap type.
> > * For physical swap entries, the new design will impose fewer than 3 words
> >   memory overhead. However, as noted above this overhead is only for
> >   actively used swap entries, whereas in the current design the overhead is
> >   static (including the swap cgroup array for example).
> >
> >   The primary victim of this overhead will be zram users. However, as
> >   zswap now no longer takes up disk space, zram users can consider
> >   switching to zswap (which, as a bonus, has a lot of useful features
> >   out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> >   LRU-ordering writeback, etc.).
> >
> > For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> > 8,388,608 swap entries), and we use zswap.
> >
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 0.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 48.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 96.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 121.00 MB
> > * Vswap total overhead: 144.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 153.00 MB
> > * Vswap total overhead: 193.00 MB
> >
> > So even in the worst case scenario for virtual swap, i.e when we
> > somehow have an oracle to correctly size the swapfile for zswap
> > pool to 32 GB, the added overhead is only 40 MB, which is a mere
> > 0.12% of the total swapfile :)
> >
> > In practice, the overhead will be closer to the 50-75% usage case, as
> > systems tend to leave swap headroom for pathological events or sudden
> > spikes in memory requirements. The added overhead in these cases are
> > practically neglible. And in deployments where swapfiles for zswap
> > are previously sparsely used, switching over to virtual swap will
> > actually reduce memory overhead.
> >
> > Doing the same math for the disk swap, which is the worst case for
> > virtual swap in terms of swap backends:
> >
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 2.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 41.00 MB
> > * Vswap total overhead: 66.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 130.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 73.00 MB
> > * Vswap total overhead: 194.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 259.00 MB
> >
> > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > again in the worst case when we have a sizing oracle.
> >
> > Please see the attached patches for more implementation details.
> >
> >
> > III. Usage and Benchmarking
> >
> > This patch series introduce no new syscalls or userspace API. Existing
> > userspace setups will work as-is, except we no longer have to create a
> > swapfile or set memory.swap.max if we want to use zswap, as zswap is no
> > longer tied to physical swap. The zswap pool will be automatically and
> > dynamically sized based on memory usage and reclaim dynamics.
> >
> > To measure the performance of the new implementation, I have run the
> > following benchmarks:
> >
> > 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
> >
> > Using zswap as the backend:
> >
> > Baseline:
> > real: mean: 185.2s, stdev: 0.93s
> > sys: mean: 683.7s, stdev: 33.77s
> >
> > Vswap:
> > real: mean: 184.88s, stdev: 0.57s
> > sys: mean: 675.14s, stdev: 32.8s
>
> Can you show your user space time as well to complete the picture?

Will do next time! I used to include user time as well, but I noticed
that folks (for e.g see [1]) only include systime, not even real time,
so I figure nobody cares about user time :)

(I still include real time because some of my past work improves sys
time but regresses real time, so I figure that's relevant).

[1]: https://lore.kernel.org/linux-mm/20260128-swap-table-p3-v2-0-fe0b67ef0215@tencent.com/

But yeah no big deal. I'll dig through my logs to see if I still have
the numbers, but if not I'll include it in next version.

>
> How many runs do you have for stdev 32.8s?

5 runs! I average out the result of 5 runs.

>
> >
> > We actually see a slight improvement in systime (by 1.5%) :) This is
> > likely because we no longer have to perform swap charging for zswap
> > entries, and virtual swap allocator is simpler than that of physical
> > swap.
> >
> > Using SSD swap as the backend:
> Please include zram swap test data as well. Android heavily uses zram
> for swapping.
> >
> > Baseline:
> > real: mean: 200.3s, stdev: 2.33s
> > sys: mean: 489.88s, stdev: 9.62s
> >
> > Vswap:
> > real: mean: 201.47s, stdev: 2.98s
> > sys: mean: 487.36s, stdev: 5.53s
> >
> > The performance is neck-to-neck.
>
> I strongly suspect there is some performance difference that hasn't
> been covered by your test yet. Need more conformation by others on the
> performance measurement. The swap testing is tricky. You want to push
> to stress barely within the OOM limit. Need more data.

Very fair point :) I will say though - the kernel build test, with
memory.max limit sets, does generate a sizable amount of swapping, and
does OOM if you don't set up swap. Take my words for now, but I will
try to include average per-run (z)swap activity stats (zswpout zswpin
etc.) in future versions if you're interested :)

I've been trying to running more stress tests to trigger crashes and
performance regression. One of the big reasons why I haven't sent
anything til now is to fix obvious performance issues (the
aforementioned lock contention) and bugs. It's a complicated piece of
work.

As always, would love to receive code/design feedback from you (and
Kairui, and other swap reviewers), and I would appreciate very much if
other swap folks can play with the patch series on their setup as well
for performance testing, or let me know if there is any particular
case that they're interested in :)

Thanks for your review, Chris!



>
> Chris
>
> >
> >
> > IV. Future Use Cases
> >
> > While the patch series focus on two applications (decoupling swap
> > backends and swapoff optimization/simplification), this new,
> > future-proof design also allows us to implement new swap features more
> > easily and efficiently:
> >
> > * Multi-tier swapping (as mentioned in [5]), with transparent
> >   transferring (promotion/demotion) of pages across tiers (see [8] and
> >   [9]). Similar to swapoff, with the old design we would need to
> >   perform the expensive page table walk.
> > * Swapfile compaction to alleviate fragmentation (as proposed by Ying
> >   Huang in [6]).
> > * Mixed backing THP swapin (see [7]): Once you have pinned down the
> >   backing store of THPs, then you can dispatch each range of subpages
> >   to appropriate backend swapin handler.
> > * Swapping a folio out with discontiguous physical swap slots
> >   (see [10]).
> > * Zswap writeback optimization: The current architecture pre-reserves
> >   physical swap space for pages when they enter the zswap pool, giving
> >   the kernel no flexibility at writeback time. With the virtual swap
> >   implementation, the backends are decoupled, and physical swap space
> >   is allocated on-demand at writeback time, at which point we can make
> >   much smarter decisions: we can batch multiple zswap writeback
> >   operations into a single IO request, allocating contiguous physical
> >   swap slots for that request. We can even perform compressed writeback
> >   (i.e writing these pages without decompressing them) (see [12]).
> >
> >
> > V. References
> >
> > [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> > [2]: https://lwn.net/Articles/932077/
> > [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> > [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> > [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> > [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> > [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> > [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> > [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
> > [10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
> > [11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
> > [12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
> > [13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
> > [14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
> >
> > Nhat Pham (20):
> >   mm/swap: decouple swap cache from physical swap infrastructure
> >   swap: rearrange the swap header file
> >   mm: swap: add an abstract API for locking out swapoff
> >   zswap: add new helpers for zswap entry operations
> >   mm/swap: add a new function to check if a swap entry is in swap
> >     cached.
> >   mm: swap: add a separate type for physical swap slots
> >   mm: create scaffolds for the new virtual swap implementation
> >   zswap: prepare zswap for swap virtualization
> >   mm: swap: allocate a virtual swap slot for each swapped out page
> >   swap: move swap cache to virtual swap descriptor
> >   zswap: move zswap entry management to the virtual swap descriptor
> >   swap: implement the swap_cgroup API using virtual swap
> >   swap: manage swap entry lifecycle at the virtual swap layer
> >   mm: swap: decouple virtual swap slot from backing store
> >   zswap: do not start zswap shrinker if there is no physical swap slots
> >   swap: do not unnecesarily pin readahead swap entries
> >   swapfile: remove zeromap bitmap
> >   memcg: swap: only charge physical swap slots
> >   swap: simplify swapoff using virtual swap
> >   swapfile: replace the swap map with bitmaps
> >
> >  Documentation/mm/swap-table.rst |   69 --
> >  MAINTAINERS                     |    2 +
> >  include/linux/cpuhotplug.h      |    1 +
> >  include/linux/mm_types.h        |   16 +
> >  include/linux/shmem_fs.h        |    7 +-
> >  include/linux/swap.h            |  135 ++-
> >  include/linux/swap_cgroup.h     |   13 -
> >  include/linux/swapops.h         |   25 +
> >  include/linux/zswap.h           |   17 +-
> >  kernel/power/swap.c             |    6 +-
> >  mm/Makefile                     |    5 +-
> >  mm/huge_memory.c                |   11 +-
> >  mm/internal.h                   |   12 +-
> >  mm/memcontrol-v1.c              |    6 +
> >  mm/memcontrol.c                 |  142 ++-
> >  mm/memory.c                     |  101 +-
> >  mm/migrate.c                    |   13 +-
> >  mm/mincore.c                    |   15 +-
> >  mm/page_io.c                    |   83 +-
> >  mm/shmem.c                      |  215 +---
> >  mm/swap.h                       |  157 +--
> >  mm/swap_cgroup.c                |  172 ---
> >  mm/swap_state.c                 |  306 +----
> >  mm/swap_table.h                 |   78 +-
> >  mm/swapfile.c                   | 1518 ++++-------------------
> >  mm/userfaultfd.c                |   18 +-
> >  mm/vmscan.c                     |   28 +-
> >  mm/vswap.c                      | 2025 +++++++++++++++++++++++++++++++
> >  mm/zswap.c                      |  142 +--
> >  29 files changed, 2853 insertions(+), 2485 deletions(-)
> >  delete mode 100644 Documentation/mm/swap-table.rst
> >  delete mode 100644 mm/swap_cgroup.c
> >  create mode 100644 mm/vswap.c
> >
> >
> > base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
> > --
> > 2.47.3
> >

Re: [PATCH v3 00/20] Virtual Swap Space

Posted by Chris Li 7 hours ago

On Tue, Feb 10, 2026 at 10:00 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Feb 9, 2026 at 4:20 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Sun, Feb 8, 2026 at 4:15 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > My sincerest apologies - it seems like the cover letter (and just the
> > > cover letter) fails to be sent out, for some reason. I'm trying to figure
> > > out what happened - it works when I send the entire patch series to
> > > myself...
> > >
> > > Anyway, resending this (in-reply-to patch 1 of the series):
> >
> > For the record I did receive your original V3 cover letter from the
> > linux-mm mailing list.
>
> I have no idea what happened to be honest. It did not show up on lore
> for a couple of hours, and my coworkers did not receive the cover
> letter email initially. I did not receive any error message or logs
> either - git send-email returns Success to me, and when I checked on
> the web gmail client (since I used a gmail email account), the whole
> series is there.
>
> I tried re-sending a couple times, to no avail. Then, in a couple of
> hours, all of these attempts showed up.
>
> Anyway, this is my bad - I'll be more patient next time. If it does
> not show up for a couple of hours then I'll do some more digging.

No problem. Just want to provide more data points if that helps you
debug your email issue.

> > > Changelog:
> > > * RFC v2 -> v3:
> > >     * Implement a cluster-based allocation algorithm for virtual swap
> > >       slots, inspired by Kairui Song and Chris Li's implementation, as
> > >       well as Johannes Weiner's suggestions. This eliminates the lock
> > >           contention issues on the virtual swap layer.
> > >     * Re-use swap table for the reverse mapping.
> > >     * Remove CONFIG_VIRTUAL_SWAP.
> > >     * Reducing the size of the swap descriptor from 48 bytes to 24
> >
> > Is the per swap slot entry overhead 24 bytes in your implementation?
> > The current swap overhead is 3 static +8 dynamic, your 24 dynamic is a
> > big jump. You can argue that 8->24 is not a big jump . But it is an
> > unnecessary price compared to the alternatives, which is 8 dynamic +
> > 4(optional redirect).
>
> It depends in cases - you can check the memory overhead discussion below :)

I think the "24B dynamic" sums up the VS memory overhead pretty well
without going into the detail tables. You can drive from case
discussion from that.

> > BTW, I have the following compile error with this series (fedora 43).
> > Same config compile fine on v6.19.
> >
> > In file included from ./include/linux/local_lock.h:5,
> >                  from ./include/linux/mmzone.h:24,
> >                  from ./include/linux/gfp.h:7,
> >                  from ./include/linux/mm.h:7,
> >                  from mm/vswap.c:7:
> > mm/vswap.c: In function ‘vswap_cpu_dead’:
> > ./include/linux/percpu-defs.h:221:45: error: initialization from
> > pointer to non-enclosed address space
> >   221 |         const void __percpu *__vpp_verify = (typeof((ptr) +
> > 0))NULL;    \
> >       |                                             ^
> > ./include/linux/local_lock_internal.h:105:40: note: in definition of
> > macro ‘__local_lock_acquire’
> >   105 |                 __l = (local_lock_t *)(lock);
> >          \
> >       |                                        ^~~~
> > ./include/linux/local_lock.h:17:41: note: in expansion of macro
> > ‘__local_lock’
> >    17 | #define local_lock(lock)                __local_lock(this_cpu_ptr(lock))
> >       |                                         ^~~~~~~~~~~~
> > ./include/linux/percpu-defs.h:245:9: note: in expansion of macro
> > ‘__verify_pcpu_ptr’
> >   245 |         __verify_pcpu_ptr(ptr);
> >          \
> >       |         ^~~~~~~~~~~~~~~~~
> > ./include/linux/percpu-defs.h:256:27: note: in expansion of macro ‘raw_cpu_ptr’
> >   256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
> >       |                           ^~~~~~~~~~~
> > ./include/linux/local_lock.h:17:54: note: in expansion of macro
> > ‘this_cpu_ptr’
> >    17 | #define local_lock(lock)
> > __local_lock(this_cpu_ptr(lock))
> >       |
> > ^~~~~~~~~~~~
> > mm/vswap.c:1518:9: note: in expansion of macro ‘local_lock’
> >  1518 |         local_lock(&percpu_cluster->lock);
> >       |         ^~~~~~~~~~
>
> Ah that's strange. It compiled on all of my setups (I tested with a couple
> different ones), but I must have missed some cases. Would you mind
> sharing your configs so that I can reproduce this compilation error?

See attached config.gz. It is also possible the newer gcc version
contributes to that error. Anyway, that is preventing me from stress
testing your series on my setup.

>
> >
> > > 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
> > >
> > > Using zswap as the backend:
> > >
> > > Baseline:
> > > real: mean: 185.2s, stdev: 0.93s
> > > sys: mean: 683.7s, stdev: 33.77s
> > >
> > > Vswap:
> > > real: mean: 184.88s, stdev: 0.57s
> > > sys: mean: 675.14s, stdev: 32.8s
> >
> > Can you show your user space time as well to complete the picture?
>
> Will do next time! I used to include user time as well, but I noticed
> that folks (for e.g see [1]) only include systime, not even real time,
> so I figure nobody cares about user time :)
>
> (I still include real time because some of my past work improves sys
> time but regresses real time, so I figure that's relevant).
>
> [1]: https://lore.kernel.org/linux-mm/20260128-swap-table-p3-v2-0-fe0b67ef0215@tencent.com/
>
> But yeah no big deal. I'll dig through my logs to see if I still have
> the numbers, but if not I'll include it in next version.

Mostly I want to get an impression how hard you push our swap test cases.

>
> >
> > How many runs do you have for stdev 32.8s?
>
> 5 runs! I average out the result of 5 runs.

The stddev is 33 seconds. Measure 5 times then average result is not
enough sample to get your to 1.5% resolution (8 seconds), which fall
into the range of noise.

> > I strongly suspect there is some performance difference that hasn't
> > been covered by your test yet. Need more conformation by others on the
> > performance measurement. The swap testing is tricky. You want to push
> > to stress barely within the OOM limit. Need more data.
>
> Very fair point :) I will say though - the kernel build test, with
> memory.max limit sets, does generate a sizable amount of swapping, and
> does OOM if you don't set up swap. Take my words for now, but I will
> try to include average per-run (z)swap activity stats (zswpout zswpin
> etc.) in future versions if you're interested :)

Including the user space time will help determine the level of swap
pressure as well. I don't need the absolutely zswapout count just yet.

> I've been trying to running more stress tests to trigger crashes and
> performance regression. One of the big reasons why I haven't sent
> anything til now is to fix obvious performance issues (the
> aforementioned lock contention) and bugs. It's a complicated piece of
> work.
>
> As always, would love to receive code/design feedback from you (and
> Kairui, and other swap reviewers), and I would appreciate very much if
> other swap folks can play with the patch series on their setup as well
> for performance testing, or let me know if there is any particular
> case that they're interested in :)

I understand Kairui has some measurements that show regressions.

If you can fix the compiling error I can do some stress testing myself
to provide more data points.

Thanks

Chris

Re: [PATCH v3 00/20] Virtual Swap Space

Posted by Johannes Weiner 1 day, 4 hours ago

Hi Chris,

On Mon, Feb 09, 2026 at 04:20:21AM -0800, Chris Li wrote:
> On Sun, Feb 8, 2026 at 4:15 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > My sincerest apologies - it seems like the cover letter (and just the
> > cover letter) fails to be sent out, for some reason. I'm trying to figure
> > out what happened - it works when I send the entire patch series to
> > myself...
> >
> > Anyway, resending this (in-reply-to patch 1 of the series):
> 
> For the record I did receive your original V3 cover letter from the
> linux-mm mailing list.
> 
> > Changelog:
> > * RFC v2 -> v3:
> >     * Implement a cluster-based allocation algorithm for virtual swap
> >       slots, inspired by Kairui Song and Chris Li's implementation, as
> >       well as Johannes Weiner's suggestions. This eliminates the lock
> >           contention issues on the virtual swap layer.
> >     * Re-use swap table for the reverse mapping.
> >     * Remove CONFIG_VIRTUAL_SWAP.
> >     * Reducing the size of the swap descriptor from 48 bytes to 24
> 
> Is the per swap slot entry overhead 24 bytes in your implementation?
> The current swap overhead is 3 static +8 dynamic, your 24 dynamic is a
> big jump. You can argue that 8->24 is not a big jump . But it is an
> unnecessary price compared to the alternatives, which is 8 dynamic +
> 4(optional redirect).

No, this is not the net overhead.

The descriptor consolidates and eliminates several other data
structures.

Here is the more detailed breakdown:

> > The size of the virtual swap descriptor is 24 bytes. Note that this is
> > not all "new" overhead, as the swap descriptor will replace:
> > * the swap_cgroup arrays (one per swap type) in the old design, which
> >   is a massive source of static memory overhead. With the new design,
> >   it is only allocated for used clusters.
> > * the swap tables, which holds the swap cache and workingset shadows.
> > * the zeromap bitmap, which is a bitmap of physical swap slots to
> >   indicate whether the swapped out page is zero-filled or not.
> > * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> >   one for allocated slots, and one for bad slots, representing 3 possible
> >   states of a slot on the swapfile: allocated, free, and bad.
> > * the zswap tree.
> >
> > So, in terms of additional memory overhead:
> > * For zswap entries, the added memory overhead is rather minimal. The
> >   new indirection pointer neatly replaces the existing zswap tree.
> >   We really only incur less than one word of overhead for swap count
> >   blow up (since we no longer use swap continuation) and the swap type.
> > * For physical swap entries, the new design will impose fewer than 3 words
> >   memory overhead. However, as noted above this overhead is only for
> >   actively used swap entries, whereas in the current design the overhead is
> >   static (including the swap cgroup array for example).
> >
> >   The primary victim of this overhead will be zram users. However, as
> >   zswap now no longer takes up disk space, zram users can consider
> >   switching to zswap (which, as a bonus, has a lot of useful features
> >   out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> >   LRU-ordering writeback, etc.).
> >
> > For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> > 8,388,608 swap entries), and we use zswap.
> >
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 0.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 48.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 96.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 121.00 MB
> > * Vswap total overhead: 144.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 153.00 MB
> > * Vswap total overhead: 193.00 MB
> >
> > So even in the worst case scenario for virtual swap, i.e when we
> > somehow have an oracle to correctly size the swapfile for zswap
> > pool to 32 GB, the added overhead is only 40 MB, which is a mere
> > 0.12% of the total swapfile :)
> >
> > In practice, the overhead will be closer to the 50-75% usage case, as
> > systems tend to leave swap headroom for pathological events or sudden
> > spikes in memory requirements. The added overhead in these cases are
> > practically neglible. And in deployments where swapfiles for zswap
> > are previously sparsely used, switching over to virtual swap will
> > actually reduce memory overhead.
> >
> > Doing the same math for the disk swap, which is the worst case for
> > virtual swap in terms of swap backends:
> >
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 2.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 41.00 MB
> > * Vswap total overhead: 66.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 130.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 73.00 MB
> > * Vswap total overhead: 194.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 259.00 MB
> >
> > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > again in the worst case when we have a sizing oracle.

Re: [PATCH v3 00/20] Virtual Swap Space

Posted by Chris Li 9 hours ago

Hi Johannes,

On Mon, Feb 9, 2026 at 6:36 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hi Chris,
>
> On Mon, Feb 09, 2026 at 04:20:21AM -0800, Chris Li wrote:
> > Is the per swap slot entry overhead 24 bytes in your implementation?
> > The current swap overhead is 3 static +8 dynamic, your 24 dynamic is a
> > big jump. You can argue that 8->24 is not a big jump . But it is an
> > unnecessary price compared to the alternatives, which is 8 dynamic +
> > 4(optional redirect).
>
> No, this is not the net overhead.

I am talking about the total metadata overhead per swap entry. Not net.

> The descriptor consolidates and eliminates several other data
> structures.

Adding members previously not there and making some members bigger
along the way. For example, the swap_map from 1 byte to a 4 byte
count.

>
> Here is the more detailed breakdown:

It seems you did not finish your sentence before sending your reply.

Anyway, I saw the total per swap entry overhead bump to 24 bytes
dynamic. Let me know what is the correct number for VS if you
disagree.

Chris

> > > The size of the virtual swap descriptor is 24 bytes. Note that this is
> > > not all "new" overhead, as the swap descriptor will replace:
> > > * the swap_cgroup arrays (one per swap type) in the old design, which
> > >   is a massive source of static memory overhead. With the new design,
> > >   it is only allocated for used clusters.
> > > * the swap tables, which holds the swap cache and workingset shadows.
> > > * the zeromap bitmap, which is a bitmap of physical swap slots to
> > >   indicate whether the swapped out page is zero-filled or not.
> > > * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> > >   one for allocated slots, and one for bad slots, representing 3 possible
> > >   states of a slot on the swapfile: allocated, free, and bad.
> > > * the zswap tree.
> > >
> > > So, in terms of additional memory overhead:
> > > * For zswap entries, the added memory overhead is rather minimal. The
> > >   new indirection pointer neatly replaces the existing zswap tree.
> > >   We really only incur less than one word of overhead for swap count
> > >   blow up (since we no longer use swap continuation) and the swap type.
> > > * For physical swap entries, the new design will impose fewer than 3 words
> > >   memory overhead. However, as noted above this overhead is only for
> > >   actively used swap entries, whereas in the current design the overhead is
> > >   static (including the swap cgroup array for example).
> > >
> > >   The primary victim of this overhead will be zram users. However, as
> > >   zswap now no longer takes up disk space, zram users can consider
> > >   switching to zswap (which, as a bonus, has a lot of useful features
> > >   out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> > >   LRU-ordering writeback, etc.).
> > >
> > > For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> > > 8,388,608 swap entries), and we use zswap.
> > >
> > > 0% usage, or 0 entries: 0.00 MB
> > > * Old design total overhead: 25.00 MB
> > > * Vswap total overhead: 0.00 MB
> > >
> > > 25% usage, or 2,097,152 entries:
> > > * Old design total overhead: 57.00 MB
> > > * Vswap total overhead: 48.25 MB
> > >
> > > 50% usage, or 4,194,304 entries:
> > > * Old design total overhead: 89.00 MB
> > > * Vswap total overhead: 96.50 MB
> > >
> > > 75% usage, or 6,291,456 entries:
> > > * Old design total overhead: 121.00 MB
> > > * Vswap total overhead: 144.75 MB
> > >
> > > 100% usage, or 8,388,608 entries:
> > > * Old design total overhead: 153.00 MB
> > > * Vswap total overhead: 193.00 MB
> > >
> > > So even in the worst case scenario for virtual swap, i.e when we
> > > somehow have an oracle to correctly size the swapfile for zswap
> > > pool to 32 GB, the added overhead is only 40 MB, which is a mere
> > > 0.12% of the total swapfile :)
> > >
> > > In practice, the overhead will be closer to the 50-75% usage case, as
> > > systems tend to leave swap headroom for pathological events or sudden
> > > spikes in memory requirements. The added overhead in these cases are
> > > practically neglible. And in deployments where swapfiles for zswap
> > > are previously sparsely used, switching over to virtual swap will
> > > actually reduce memory overhead.
> > >
> > > Doing the same math for the disk swap, which is the worst case for
> > > virtual swap in terms of swap backends:
> > >
> > > 0% usage, or 0 entries: 0.00 MB
> > > * Old design total overhead: 25.00 MB
> > > * Vswap total overhead: 2.00 MB
> > >
> > > 25% usage, or 2,097,152 entries:
> > > * Old design total overhead: 41.00 MB
> > > * Vswap total overhead: 66.25 MB
> > >
> > > 50% usage, or 4,194,304 entries:
> > > * Old design total overhead: 57.00 MB
> > > * Vswap total overhead: 130.50 MB
> > >
> > > 75% usage, or 6,291,456 entries:
> > > * Old design total overhead: 73.00 MB
> > > * Vswap total overhead: 194.75 MB
> > >
> > > 100% usage, or 8,388,608 entries:
> > > * Old design total overhead: 89.00 MB
> > > * Vswap total overhead: 259.00 MB
> > >
> > > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > > again in the worst case when we have a sizing oracle.

Re: [PATCH v3 00/20] Virtual Swap Space

Posted by Johannes Weiner 7 hours ago

Hi Chris,

On Tue, Feb 10, 2026 at 01:24:03PM -0800, Chris Li wrote:
> Hi Johannes,
> On Mon, Feb 9, 2026 at 6:36 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Here is the more detailed breakdown:
> 
> It seems you did not finish your sentence before sending your reply.

I did. I trimmed the quote of Nhat's cover letter to the parts
addressing your questions. If you use gmail, click the three dots:

> > > > The size of the virtual swap descriptor is 24 bytes. Note that this is
> > > > not all "new" overhead, as the swap descriptor will replace:
> > > > * the swap_cgroup arrays (one per swap type) in the old design, which
> > > >   is a massive source of static memory overhead. With the new design,
> > > >   it is only allocated for used clusters.
> > > > * the swap tables, which holds the swap cache and workingset shadows.
> > > > * the zeromap bitmap, which is a bitmap of physical swap slots to
> > > >   indicate whether the swapped out page is zero-filled or not.
> > > > * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> > > >   one for allocated slots, and one for bad slots, representing 3 possible
> > > >   states of a slot on the swapfile: allocated, free, and bad.
> > > > * the zswap tree.
> > > >
> > > > So, in terms of additional memory overhead:
> > > > * For zswap entries, the added memory overhead is rather minimal. The
> > > >   new indirection pointer neatly replaces the existing zswap tree.
> > > >   We really only incur less than one word of overhead for swap count
> > > >   blow up (since we no longer use swap continuation) and the swap type.
> > > > * For physical swap entries, the new design will impose fewer than 3 words
> > > >   memory overhead. However, as noted above this overhead is only for
> > > >   actively used swap entries, whereas in the current design the overhead is
> > > >   static (including the swap cgroup array for example).
> > > >
> > > >   The primary victim of this overhead will be zram users. However, as
> > > >   zswap now no longer takes up disk space, zram users can consider
> > > >   switching to zswap (which, as a bonus, has a lot of useful features
> > > >   out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> > > >   LRU-ordering writeback, etc.).
> > > >
> > > > For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> > > > 8,388,608 swap entries), and we use zswap.
> > > >
> > > > 0% usage, or 0 entries: 0.00 MB
> > > > * Old design total overhead: 25.00 MB
> > > > * Vswap total overhead: 0.00 MB
> > > >
> > > > 25% usage, or 2,097,152 entries:
> > > > * Old design total overhead: 57.00 MB
> > > > * Vswap total overhead: 48.25 MB
> > > >
> > > > 50% usage, or 4,194,304 entries:
> > > > * Old design total overhead: 89.00 MB
> > > > * Vswap total overhead: 96.50 MB
> > > >
> > > > 75% usage, or 6,291,456 entries:
> > > > * Old design total overhead: 121.00 MB
> > > > * Vswap total overhead: 144.75 MB
> > > >
> > > > 100% usage, or 8,388,608 entries:
> > > > * Old design total overhead: 153.00 MB
> > > > * Vswap total overhead: 193.00 MB
> > > >
> > > > So even in the worst case scenario for virtual swap, i.e when we
> > > > somehow have an oracle to correctly size the swapfile for zswap
> > > > pool to 32 GB, the added overhead is only 40 MB, which is a mere
> > > > 0.12% of the total swapfile :)
> > > >
> > > > In practice, the overhead will be closer to the 50-75% usage case, as
> > > > systems tend to leave swap headroom for pathological events or sudden
> > > > spikes in memory requirements. The added overhead in these cases are
> > > > practically neglible. And in deployments where swapfiles for zswap
> > > > are previously sparsely used, switching over to virtual swap will
> > > > actually reduce memory overhead.
> > > >
> > > > Doing the same math for the disk swap, which is the worst case for
> > > > virtual swap in terms of swap backends:
> > > >
> > > > 0% usage, or 0 entries: 0.00 MB
> > > > * Old design total overhead: 25.00 MB
> > > > * Vswap total overhead: 2.00 MB
> > > >
> > > > 25% usage, or 2,097,152 entries:
> > > > * Old design total overhead: 41.00 MB
> > > > * Vswap total overhead: 66.25 MB
> > > >
> > > > 50% usage, or 4,194,304 entries:
> > > > * Old design total overhead: 57.00 MB
> > > > * Vswap total overhead: 130.50 MB
> > > >
> > > > 75% usage, or 6,291,456 entries:
> > > > * Old design total overhead: 73.00 MB
> > > > * Vswap total overhead: 194.75 MB
> > > >
> > > > 100% usage, or 8,388,608 entries:
> > > > * Old design total overhead: 89.00 MB
> > > > * Vswap total overhead: 259.00 MB
> > > >
> > > > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > > > again in the worst case when we have a sizing oracle.