[PATCH v2 06/12] mm, swap: implement helpers for reserving data in the swap table

Posted by Kairui Song 1 week, 5 days ago
From: Kairui Song <kasong@tencent.com>

To prepare for using the swap table as the unified swap layer, introduce
macros and helpers for storing multiple kinds of data in a swap table
entry.

From now on, the PFN is stored in the swap table instead of the folio
pointer, to make space for extra counting bits (SWAP_COUNT). Shadows are
still stored as they are, since SWAP_COUNT is not used yet.

Also, rename shadow_swp_to_tb to shadow_to_swp_tb; the old name was a
spelling error, not really worth a separate fix.

No behaviour change yet; this just prepares the API.
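
For illustration, the intended encode/decode round-trip with the new
helpers looks roughly like this (an untested sketch assuming a valid
`folio` and an XA_VALUE `shadow`; not part of the diff below):

	/* Cached slot: PFN in the middle, 0b10 mark low, count high */
	unsigned long swp_tb = folio_to_swp_tb(folio, 0);

	VM_WARN_ON(!swp_tb_is_folio(swp_tb));
	VM_WARN_ON(swp_tb_to_folio(swp_tb) != folio);
	VM_WARN_ON(swp_tb_get_count(swp_tb) != 0);

	/* Swapped out slot: shadow kept in XA_VALUE format, 0b1 mark low */
	swp_tb = shadow_to_swp_tb(shadow, 0);
	VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
	VM_WARN_ON(swp_tb_to_shadow(swp_tb) != shadow);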

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c |   6 +--
 mm/swap_table.h | 131 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 124 insertions(+), 13 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6d0eef7470be..e213ee35c1d2 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -148,7 +148,7 @@ void __swap_cache_add_folio(struct swap_cluster_info *ci,
 	VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
 
-	new_tb = folio_to_swp_tb(folio);
+	new_tb = folio_to_swp_tb(folio, 0);
 	ci_start = swp_cluster_offset(entry);
 	ci_off = ci_start;
 	ci_end = ci_start + nr_pages;
@@ -249,7 +249,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
 	VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
 
 	si = __swap_entry_to_info(entry);
-	new_tb = shadow_swp_to_tb(shadow);
+	new_tb = shadow_to_swp_tb(shadow, 0);
 	ci_start = swp_cluster_offset(entry);
 	ci_end = ci_start + nr_pages;
 	ci_off = ci_start;
@@ -331,7 +331,7 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
 	VM_WARN_ON_ONCE(!entry.val);
 
 	/* Swap cache still stores N entries instead of a high-order entry */
-	new_tb = folio_to_swp_tb(new);
+	new_tb = folio_to_swp_tb(new, 0);
 	do {
 		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
 		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != old);
diff --git a/mm/swap_table.h b/mm/swap_table.h
index 10e11d1f3b04..10762ac5f4f5 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -12,17 +12,72 @@ struct swap_table {
 };
 
 #define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
-#define SWP_TB_COUNT_BITS		4
 
 /*
  * A swap table entry represents the status of a swap slot on a swap
  * (physical or virtual) device. The swap table in each cluster is a
  * 1:1 map of the swap slots in this cluster.
  *
- * Each swap table entry could be a pointer (folio), a XA_VALUE
- * (shadow), or NULL.
+ * Swap table entry types and bit layouts:
+ *
+ * NULL:     |---------------- 0 ---------------| - Free slot
+ * Shadow:   | SWAP_COUNT |---- SHADOW_VAL ---|1| - Swapped out slot
+ * PFN:      | SWAP_COUNT |------ PFN -------|10| - Cached slot
+ * Pointer:  |----------- Pointer ----------|100| - (Unused)
+ * Bad:      |------------- 1 -------------|1000| - Bad slot
+ *
+ * SWAP_COUNT is `SWP_TB_COUNT_BITS` wide; each entry is an atomic long.
+ *
+ * Usages:
+ *
+ * - NULL: Swap slot is unused, could be allocated.
+ *
+ * - Shadow: Swap slot is used and not cached (usually swapped out). It reuses
+ *   the XA_VALUE format to be compatible with working set shadows. SHADOW_VAL
+ *   part might be all 0 if the working set shadow info is absent. In such a case,
+ *   we still want to keep the shadow format as a placeholder.
+ *
+ *   Memcg ID is embedded in SHADOW_VAL.
+ *
+ * - PFN: Swap slot is in use, and cached. Memcg info is recorded on the page
+ *   struct.
+ *
+ * - Pointer: Not used yet. `0b100` is reserved for potential pointer usage,
+ *   because only the lower three bits can be used as a marker with 8-byte
+ *   aligned pointers.
+ *
+ * - Bad: Swap slot is reserved, protects swap header or holes on swap devices.
  */
 
+#if defined(MAX_POSSIBLE_PHYSMEM_BITS)
+#define SWAP_CACHE_PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
+#elif defined(MAX_PHYSMEM_BITS)
+#define SWAP_CACHE_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
+#else
+#define SWAP_CACHE_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT)
+#endif
+
+/* NULL Entry, all 0 */
+#define SWP_TB_NULL		0UL
+
+/* Swapped out: shadow */
+#define SWP_TB_SHADOW_MARK	0b1UL
+
+/* Cached: PFN */
+#define SWP_TB_PFN_BITS		(SWAP_CACHE_PFN_BITS + SWP_TB_PFN_MARK_BITS)
+#define SWP_TB_PFN_MARK		0b10UL
+#define SWP_TB_PFN_MARK_BITS	2
+#define SWP_TB_PFN_MARK_MASK	(BIT(SWP_TB_PFN_MARK_BITS) - 1)
+
+/* SWAP_COUNT part for PFN or shadow; the width can be shrunk or extended */
+#define SWP_TB_COUNT_BITS      min(4, BITS_PER_LONG - SWP_TB_PFN_BITS)
+#define SWP_TB_COUNT_MASK      (~((~0UL) >> SWP_TB_COUNT_BITS))
+#define SWP_TB_COUNT_SHIFT     (BITS_PER_LONG - SWP_TB_COUNT_BITS)
+#define SWP_TB_COUNT_MAX       ((1 << SWP_TB_COUNT_BITS) - 1)
+
+/* Bad slot: ends with 0b1000 and the rest of the bits are all 1 */
+#define SWP_TB_BAD		((~0UL) << 3)
+
 /* Macro for shadow offset calculation */
 #define SWAP_COUNT_SHIFT	SWP_TB_COUNT_BITS
 
@@ -35,18 +90,47 @@ static inline unsigned long null_to_swp_tb(void)
 	return 0;
 }
 
-static inline unsigned long folio_to_swp_tb(struct folio *folio)
+static inline unsigned long __count_to_swp_tb(unsigned char count)
 {
+	/*
+	 * At least three values are needed to distinguish free (0),
+	 * used (count > 0 && count < SWP_TB_COUNT_MAX), and
+	 * overflow (count == SWP_TB_COUNT_MAX).
+	 */
+	BUILD_BUG_ON(SWP_TB_COUNT_MAX < 2 || SWP_TB_COUNT_BITS < 2);
+	VM_WARN_ON(count > SWP_TB_COUNT_MAX);
+	return ((unsigned long)count) << SWP_TB_COUNT_SHIFT;
+}
+
+static inline unsigned long pfn_to_swp_tb(unsigned long pfn, unsigned int count)
+{
+	unsigned long swp_tb;
+
 	BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
-	return (unsigned long)folio;
+	BUILD_BUG_ON(SWAP_CACHE_PFN_BITS >
+		     (BITS_PER_LONG - SWP_TB_PFN_MARK_BITS - SWP_TB_COUNT_BITS));
+
+	swp_tb = (pfn << SWP_TB_PFN_MARK_BITS) | SWP_TB_PFN_MARK;
+	VM_WARN_ON_ONCE(swp_tb & SWP_TB_COUNT_MASK);
+
+	return swp_tb | __count_to_swp_tb(count);
+}
+
+static inline unsigned long folio_to_swp_tb(struct folio *folio, unsigned int count)
+{
+	return pfn_to_swp_tb(folio_pfn(folio), count);
 }
 
-static inline unsigned long shadow_swp_to_tb(void *shadow)
+static inline unsigned long shadow_to_swp_tb(void *shadow, unsigned int count)
 {
 	BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
 		     BITS_PER_BYTE * sizeof(unsigned long));
+	BUILD_BUG_ON((unsigned long)xa_mk_value(0) != SWP_TB_SHADOW_MARK);
+
 	VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
-	return (unsigned long)shadow;
+	VM_WARN_ON_ONCE(shadow && ((unsigned long)shadow & SWP_TB_COUNT_MASK));
+
+	return (unsigned long)shadow | __count_to_swp_tb(count) | SWP_TB_SHADOW_MARK;
 }
 
 /*
@@ -59,7 +143,7 @@ static inline bool swp_tb_is_null(unsigned long swp_tb)
 
 static inline bool swp_tb_is_folio(unsigned long swp_tb)
 {
-	return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
+	return ((swp_tb & SWP_TB_PFN_MARK_MASK) == SWP_TB_PFN_MARK);
 }
 
 static inline bool swp_tb_is_shadow(unsigned long swp_tb)
@@ -67,19 +151,44 @@ static inline bool swp_tb_is_shadow(unsigned long swp_tb)
 	return xa_is_value((void *)swp_tb);
 }
 
+static inline bool swp_tb_is_bad(unsigned long swp_tb)
+{
+	return swp_tb == SWP_TB_BAD;
+}
+
+static inline bool swp_tb_is_countable(unsigned long swp_tb)
+{
+	return (swp_tb_is_shadow(swp_tb) || swp_tb_is_folio(swp_tb) ||
+		swp_tb_is_null(swp_tb));
+}
+
 /*
  * Helpers for retrieving info from swap table.
  */
 static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
 {
 	VM_WARN_ON(!swp_tb_is_folio(swp_tb));
-	return (void *)swp_tb;
+	return pfn_folio((swp_tb & ~SWP_TB_COUNT_MASK) >> SWP_TB_PFN_MARK_BITS);
 }
 
 static inline void *swp_tb_to_shadow(unsigned long swp_tb)
 {
 	VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
-	return (void *)swp_tb;
+	/* No shift needed, the xa_value is stored as-is in the lower bits. */
+	return (void *)(swp_tb & ~SWP_TB_COUNT_MASK);
+}
+
+static inline unsigned char __swp_tb_get_count(unsigned long swp_tb)
+{
+	VM_WARN_ON(!swp_tb_is_countable(swp_tb));
+	return ((swp_tb & SWP_TB_COUNT_MASK) >> SWP_TB_COUNT_SHIFT);
+}
+
+static inline int swp_tb_get_count(unsigned long swp_tb)
+{
+	if (swp_tb_is_countable(swp_tb))
+		return __swp_tb_get_count(swp_tb);
+	return -EINVAL;
 }
 
 /*
@@ -124,6 +233,8 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
 	atomic_long_t *table;
 	unsigned long swp_tb;
 
+	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+
 	rcu_read_lock();
 	table = rcu_dereference(ci->table);
 	swp_tb = table ? atomic_long_read(&table[off]) : null_to_swp_tb();

-- 
2.52.0
Re: [PATCH v2 06/12] mm, swap: implement helpers for reserving data in the swap table
Posted by YoungJun Park 1 week, 4 days ago
On Wed, Jan 28, 2026 at 05:28:30PM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>

> +static inline bool swp_tb_is_countable(unsigned long swp_tb)
> +{
> +	return (swp_tb_is_shadow(swp_tb) || swp_tb_is_folio(swp_tb) ||
> +		swp_tb_is_null(swp_tb));
> +}

What do you think about simplifying swp_tb_is_countable by just checking
!swp_tb_is_bad(swp_tb)?

Since this function appears to be called frequently, reducing the number of
comparisons would benefit performance. If the validation is needed for
debugging, perhaps we could introduce a separate debug-only version.
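
Something like this (untested):

	static inline bool swp_tb_is_countable(unsigned long swp_tb)
	{
		/* Only the bad slot marker is not countable */
		return !swp_tb_is_bad(swp_tb);
	}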

> +static inline int swp_tb_get_count(unsigned long swp_tb)
> +{
> +	if (swp_tb_is_countable(swp_tb))
> +		return __swp_tb_get_count(swp_tb);
> +	return -EINVAL;
>  }

Or, could we simply drop the check in swp_tb_get_count and call
__swp_tb_get_count directly?
If we define SWP_TB_BAD to have 0 in the count bits (MSB), it will
naturally yield a count of 0.
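
An untested sketch of the idea:

	/* Keep the count bits (the MSBs) clear in the bad slot marker so
	 * __swp_tb_get_count() naturally reads 0 for bad slots */
	#define SWP_TB_BAD	(((~0UL) << 3) & ~SWP_TB_COUNT_MASK)

	static inline int swp_tb_get_count(unsigned long swp_tb)
	{
		return __swp_tb_get_count(swp_tb);
	}

(The VM_WARN_ON in __swp_tb_get_count would need relaxing as well.)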

Thanks!
Youngjun Park
Re: [PATCH v2 06/12] mm, swap: implement helpers for reserving data in the swap table
Posted by Kairui Song 1 week ago
On Thu, Jan 29, 2026 at 3:28 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Wed, Jan 28, 2026 at 05:28:30PM +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
>
> > +static inline bool swp_tb_is_countable(unsigned long swp_tb)
> > +{
> > +     return (swp_tb_is_shadow(swp_tb) || swp_tb_is_folio(swp_tb) ||
> > +             swp_tb_is_null(swp_tb));
> > +}

Hi YoungJun,

Thanks for the review.

>
> What do you think about simplifying swp_tb_is_countable by just checking
> !swp_tb_is_bad(swp_tb)?
>
> Since this function appears to be called frequently, reducing the number of
> comparisons would be beneficial for performance. If validation is
> necessary for debugging perhaps we could introduce a separate version for debugging
> purposes.

There are already two variants for getting the count of a swap table
entry: swp_tb_get_count and __swp_tb_get_count. Ideally, callers that
know the swap table entry is valid can just call __swp_tb_get_count
for lower overhead; swp_tb_is_countable is less frequently used.
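
I.e., roughly (illustrative only):

	/* Fast path: the caller has already checked the entry type */
	if (swp_tb_is_folio(swp_tb))
		count = __swp_tb_get_count(swp_tb);

	/* Generic path: let the helper reject non-countable entries */
	ret = swp_tb_get_count(swp_tb);
	if (ret < 0)
		return ret;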

>
> > +static inline int swp_tb_get_count(unsigned long swp_tb)
> > +{
> > +     if (swp_tb_is_countable(swp_tb))
> > +             return __swp_tb_get_count(swp_tb);
> > +     return -EINVAL;
> >  }
>
> Or, could we simply drop the check in swp_tb_get_count and call
> __swp_tb_get_count directly?
> If we define SWP_TB_BAD to have 0 in the count bits (MSB), it will
> naturally yield a count of 0.

One reason I used `countable` and not `bad` is that I'll introduce
other types of swap table entries soon, e.g. a hibernation type for
hibernation-only usage. Calling swp_tb_get_count on them returns
-EINVAL or another error code, which I think looks good. I'm open
to suggestions on the naming and design.
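
I.e. the countable check in swp_tb_get_count is the extension point
(same code as in the patch, with a comment added for illustration):

	static inline int swp_tb_get_count(unsigned long swp_tb)
	{
		if (swp_tb_is_countable(swp_tb))
			return __swp_tb_get_count(swp_tb);
		/* e.g. a future hibernation-only entry lands here */
		return -EINVAL;
	}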