From: Kairui Song <kasong@tencent.com>
To prepare for merging the swap_cgroup_ctrl into the swap table, store
the memcg info in the swap table on swapout.

This is done by using the existing shadow format.

Note this also changes the refault counting at the nearest online memcg
level:

Unlike file folios, anon folios are mostly exclusive to one mem cgroup,
and each cgroup is likely to have different characteristics.

When commit b910718a948a ("mm: vmscan: detect file thrashing at the
reclaim root") moved the refault accounting to the reclaim root level,
anon shadows didn't even exist, and it was explicitly for file pages.
Later, commit aae466b0052e ("mm/swap: implement workingset detection
for anonymous LRU") added anon shadows following a similar design. And
in shrink_lruvec, an active LRU is shrunk unconditionally whenever the
inactive list runs low.

For MGLRU, it's a bit different, but with the PID refault control, it's
more accurate to let the nearest online memcg take the refault feedback
too.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/internal.h | 20 ++++++++++++++++++++
mm/swap.h | 7 ++++---
mm/swap_state.c | 50 +++++++++++++++++++++++++++++++++-----------------
mm/swapfile.c | 4 +++-
mm/vmscan.c | 6 +-----
mm/workingset.c | 16 +++++++++++-----
6 files changed, 72 insertions(+), 31 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index cb0af847d7d9..5bbe081c9048 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1714,6 +1714,7 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry,
#endif /* CONFIG_SHRINKER_DEBUG */
/* Only track the nodes of mappings with shadow entries */
+#define WORKINGSET_SHIFT 1
void workingset_update_node(struct xa_node *node);
extern struct list_lru shadow_nodes;
#define mapping_set_update(xas, mapping) do { \
@@ -1722,6 +1723,25 @@ extern struct list_lru shadow_nodes;
xas_set_lru(xas, &shadow_nodes); \
} \
} while (0)
+static inline unsigned short shadow_to_memcgid(void *shadow)
+{
+ unsigned long entry = xa_to_value(shadow);
+ unsigned short memcgid;
+
+ entry >>= (WORKINGSET_SHIFT + NODES_SHIFT);
+ memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
+
+ return memcgid;
+}
+static inline void *memcgid_to_shadow(unsigned short memcgid)
+{
+ unsigned long val;
+
+ val = memcgid;
+ val <<= (NODES_SHIFT + WORKINGSET_SHIFT);
+
+ return xa_mk_value(val);
+}
/* mremap.c */
unsigned long move_page_tables(struct pagetable_move_control *pmc);
diff --git a/mm/swap.h b/mm/swap.h
index da41e9cea46d..c95f5fafea42 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -265,6 +265,8 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
return folio_entry.val == round_down(entry.val, nr_pages);
}
+bool folio_maybe_swapped(struct folio *folio);
+
/*
* All swap cache helpers below require the caller to ensure the swap entries
* used are valid and stabilize the device by any of the following ways:
@@ -286,9 +288,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t target_entry, gfp_t gfp_mask,
/* Below helpers require the caller to lock and pass in the swap cluster. */
void __swap_cache_add_folio(struct swap_cluster_info *ci,
struct folio *folio, swp_entry_t entry);
-void __swap_cache_del_folio(struct swap_cluster_info *ci,
- struct folio *folio, void *shadow,
- bool charged, bool reclaim);
+void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
+ void *shadow, bool charged, bool reclaim);
void __swap_cache_replace_folio(struct swap_cluster_info *ci,
struct folio *old, struct folio *new);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 40f037576c5f..cc4bf40320ef 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -143,22 +143,11 @@ static int __swap_cache_check_batch(struct swap_cluster_info *ci,
{
unsigned int ci_end = ci_off + nr;
unsigned long old_tb;
+ unsigned int memcgid;
if (unlikely(!ci->table))
return -ENOENT;
- do {
- old_tb = __swap_table_get(ci, ci_off);
- if (unlikely(swp_tb_is_folio(old_tb)) ||
- unlikely(!__swp_tb_get_count(old_tb)))
- break;
- if (swp_tb_is_shadow(old_tb))
- *shadowp = swp_tb_to_shadow(old_tb);
- } while (++ci_off < ci_end);
-
- if (likely(ci_off == ci_end))
- return 0;
-
/*
* If the target slot is not suitable for adding swap cache, return
* -EEXIST or -ENOENT. If the batch is not suitable, could be a
@@ -169,7 +158,21 @@ static int __swap_cache_check_batch(struct swap_cluster_info *ci,
return -EEXIST;
if (!__swp_tb_get_count(old_tb))
return -ENOENT;
- return -EBUSY;
+ if (WARN_ON_ONCE(!swp_tb_is_shadow(old_tb)))
+ return -ENOENT;
+ *shadowp = swp_tb_to_shadow(old_tb);
+ memcgid = shadow_to_memcgid(*shadowp);
+
+ WARN_ON_ONCE(!mem_cgroup_disabled() && !memcgid);
+ do {
+ old_tb = __swap_table_get(ci, ci_off);
+ if (unlikely(swp_tb_is_folio(old_tb)) ||
+ unlikely(!__swp_tb_get_count(old_tb)) ||
+ memcgid != shadow_to_memcgid(swp_tb_to_shadow(old_tb)))
+ return -EBUSY;
+ } while (++ci_off < ci_end);
+
+ return 0;
}
void __swap_cache_add_folio(struct swap_cluster_info *ci,
@@ -261,8 +264,7 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
/* For memsw accounting, swap is uncharged when folio is added to swap cache */
memcg1_swapin(folio);
- if (shadow)
- workingset_refault(folio, shadow);
+ workingset_refault(folio, shadow);
/* Caller will initiate read into locked new_folio */
folio_add_lru(folio);
@@ -319,7 +321,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp_mask,
* __swap_cache_del_folio - Removes a folio from the swap cache.
* @ci: The locked swap cluster.
* @folio: The folio.
- * @shadow: shadow value to be filled in the swap cache.
+ * @shadow: Shadow to restore when the folio is not charged. Ignored when
+ * @charged is true, as the shadow is computed internally.
* @charged: If folio->swap is charged to folio->memcg.
* @reclaim: If the folio is being reclaimed. When true on cgroup v1,
* the memory charge is transferred from memory to swap.
@@ -336,6 +339,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
int count;
unsigned long old_tb;
struct swap_info_struct *si;
+ struct mem_cgroup *memcg = NULL;
swp_entry_t entry = folio->swap;
unsigned int ci_start, ci_off, ci_end;
bool folio_swapped = false, need_free = false;
@@ -353,7 +357,13 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
* charging (e.g. swapin charge failure, or swap alloc charge failure).
*/
if (charged)
- mem_cgroup_swap_free_folio(folio, reclaim);
+ memcg = mem_cgroup_swap_free_folio(folio, reclaim);
+ if (reclaim) {
+ WARN_ON(!charged);
+ shadow = workingset_eviction(folio, memcg);
+ } else if (memcg) {
+ shadow = memcgid_to_shadow(mem_cgroup_private_id(memcg));
+ }
si = __swap_entry_to_info(entry);
ci_start = swp_cluster_offset(entry);
@@ -392,6 +402,11 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
* swap_cache_del_folio - Removes a folio from the swap cache.
* @folio: The folio.
*
+ * Force delete a folio from the swap cache. This is only safe to use
+ * for folios that are not swapped out (swap count == 0), to release
+ * swap space pinned by the swap cache, or to remove a clean, charged
+ * folio that no one has modified and no one is still using.
+ *
* Same as __swap_cache_del_folio, but handles lock and refcount. The
* caller must ensure the folio is either clean or has a swap count
* equal to zero, or it may cause data loss.
@@ -404,6 +419,7 @@ void swap_cache_del_folio(struct folio *folio)
swp_entry_t entry = folio->swap;
ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
+ VM_WARN_ON_ONCE(folio_test_dirty(folio) && folio_maybe_swapped(folio));
__swap_cache_del_folio(ci, folio, NULL, true, false);
swap_cluster_unlock(ci);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c0169bce46c9..2cd3e260f1bf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1972,9 +1972,11 @@ int swp_swapcount(swp_entry_t entry)
* decrease of swap count is possible through swap_put_entries_direct, so this
* may return a false positive.
*
+ * Caller can hold the ci lock to get a stable result.
+ *
* Context: Caller must ensure the folio is locked and in the swap cache.
*/
-static bool folio_maybe_swapped(struct folio *folio)
+bool folio_maybe_swapped(struct folio *folio)
{
swp_entry_t entry = folio->swap;
struct swap_cluster_info *ci;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5112f81cf875..4565c9c3ac60 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -755,11 +755,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
}
if (folio_test_swapcache(folio)) {
- swp_entry_t swap = folio->swap;
-
- if (reclaimed && !mapping_exiting(mapping))
- shadow = workingset_eviction(folio, target_memcg);
- __swap_cache_del_folio(ci, folio, shadow, true, true);
+ __swap_cache_del_folio(ci, folio, NULL, true, true);
swap_cluster_unlock_irq(ci);
} else {
void (*free_folio)(struct folio *);
diff --git a/mm/workingset.c b/mm/workingset.c
index 37a94979900f..765a954baefa 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -202,12 +202,18 @@ static unsigned int bucket_order[ANON_AND_FILE] __read_mostly;
static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
bool workingset, bool file)
{
+ void *shadow;
+
eviction &= file ? EVICTION_MASK : EVICTION_MASK_ANON;
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
eviction = (eviction << WORKINGSET_SHIFT) | workingset;
- return xa_mk_value(eviction);
+ shadow = xa_mk_value(eviction);
+ /* Sanity check for retrieving memcgid from anon shadow. */
+ VM_WARN_ON_ONCE(shadow_to_memcgid(shadow) != memcgid);
+
+ return shadow;
}
static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
@@ -232,7 +238,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
#ifdef CONFIG_LRU_GEN
-static void *lru_gen_eviction(struct folio *folio)
+static void *lru_gen_eviction(struct folio *folio, struct mem_cgroup *memcg)
{
int hist;
unsigned long token;
@@ -244,7 +250,6 @@ static void *lru_gen_eviction(struct folio *folio)
int refs = folio_lru_refs(folio);
bool workingset = folio_test_workingset(folio);
int tier = lru_tier_from_refs(refs, workingset);
- struct mem_cgroup *memcg = folio_memcg(folio);
struct pglist_data *pgdat = folio_pgdat(folio);
BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH >
@@ -252,6 +257,7 @@ static void *lru_gen_eviction(struct folio *folio)
lruvec = mem_cgroup_lruvec(memcg, pgdat);
lrugen = &lruvec->lrugen;
+ memcg = lruvec_memcg(lruvec);
min_seq = READ_ONCE(lrugen->min_seq[type]);
token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0);
@@ -329,7 +335,7 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
#else /* !CONFIG_LRU_GEN */
-static void *lru_gen_eviction(struct folio *folio)
+static void *lru_gen_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
{
return NULL;
}
@@ -396,7 +402,7 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
if (lru_gen_enabled())
- return lru_gen_eviction(folio);
+ return lru_gen_eviction(folio, target_memcg);
lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
/* XXX: target_memcg can be NULL, go through lruvec */
--
2.53.0
On Fri, Feb 20, 2026 at 07:42:09AM +0800, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> To prepare for merging the swap_cgroup_ctrl into the swap table, store
> the memcg info in the swap table on swapout.
>
> This is done by using the existing shadow format.
>
> Note this also changes the refault counting at the nearest online memcg
> level:
>
> Unlike file folios, anon folios are mostly exclusive to one mem cgroup,
> and each cgroup is likely to have different characteristics.

This is not correct.

As much as I like the idea of storing the swap_cgroup association
inside the shadow entry, the refault evaluation needs to happen at the
level that drove eviction.

Consider a workload that is split into cgroups purely for accounting,
not for setting different limits:

    workload (limit domain)
    `- component A
    `- component B

This means the two components must compete freely, and it must behave
as if there is only one LRU. When pages get reclaimed in a round-robin
fashion, both A and B get aged at the same pace. Likewise, when pages
in A refault, they must challenge the *combined* workingset of both A
and B, not just the local pages.

Otherwise, you risk retaining stale workingset in one subgroup while
the other one is thrashing. This breaks userspace expectations.
On Tue, Feb 24, 2026 at 12:46 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Feb 20, 2026 at 07:42:09AM +0800, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > To prepare for merging the swap_cgroup_ctrl into the swap table, store
> > the memcg info in the swap table on swapout.
> >
> > This is done by using the existing shadow format.
> >
> > Note this also changes the refault counting at the nearest online memcg
> > level:
> >
> > Unlike file folios, anon folios are mostly exclusive to one mem cgroup,
> > and each cgroup is likely to have different characteristics.
>
> This is not correct.
>
> As much as I like the idea of storing the swap_cgroup association
> inside the shadow entry, the refault evaluation needs to happen at the
> level that drove eviction.
>
> Consider a workload that is split into cgroups purely for accounting,
> not for setting different limits:
>
>     workload (limit domain)
>     `- component A
>     `- component B
>
> This means the two components must compete freely, and it must behave
> as if there is only one LRU. When pages get reclaimed in a round-robin
> fashion, both A and B get aged at the same pace. Likewise, when pages
> in A refault, they must challenge the *combined* workingset of both A
> and B, not just the local pages.
>
> Otherwise, you risk retaining stale workingset in one subgroup while
> the other one is thrashing. This breaks userspace expectations.

Hi Johannes, thanks for pointing this out.

I'm just not sure how much of a real problem this is. The refault
challenge change was made in commit b910718a948a which was before anon
shadow was introduced. And shadows could get reclaimed, especially
when under pressure (and we could be doing that again by reclaiming
full_clusters with swap tables). And MGLRU simply ignores the
target_memcg here yet it performs surprisingly well with multiple
memcg setups. And I did find a comment in workingset.c saying the
kernel used to activate all pages, which is also fine. And that commit
also mentioned the active list shrinking, but anon active list gets
shrinked just fine without refault feedback in shrink_lruvec under
can_age_anon_pages.

So in this RFC I was just a bit aggressive and changed it. I can do
some tests with different memory size setups. If we are not OK with
it, we can just use a ci->memcg_table and we are fine: everything is
still dynamic, but single-slot usage could be a bit higher, 8 bytes to
10 bytes. Maybe we can later find a way to make ci->memcg_table NULL
and shrink back to 8 bytes, e.g. with MGLRU, balancing the memcg with
things like aging feedback (the latter part is just an idea but seems
doable?).
On Tue, Feb 24, 2026 at 04:34:00PM +0800, Kairui Song wrote:
> On Tue, Feb 24, 2026 at 12:46 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, Feb 20, 2026 at 07:42:09AM +0800, Kairui Song via B4 Relay wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > To prepare for merging the swap_cgroup_ctrl into the swap table, store
> > > the memcg info in the swap table on swapout.
> > >
> > > This is done by using the existing shadow format.
> > >
> > > Note this also changes the refault counting at the nearest online memcg
> > > level:
> > >
> > > Unlike file folios, anon folios are mostly exclusive to one mem cgroup,
> > > and each cgroup is likely to have different characteristics.
> >
> > This is not correct.
> >
> > As much as I like the idea of storing the swap_cgroup association
> > inside the shadow entry, the refault evaluation needs to happen at the
> > level that drove eviction.
> >
> > Consider a workload that is split into cgroups purely for accounting,
> > not for setting different limits:
> >
> > workload (limit domain)
> > `- component A
> > `- component B
> >
> > This means the two components must compete freely, and it must behave
> > as if there is only one LRU. When pages get reclaimed in a round-robin
> > fashion, both A and B get aged at the same pace. Likewise, when pages
> > in A refault, they must challenge the *combined* workingset of both A
> > and B, not just the local pages.
> >
> > Otherwise, you risk retaining stale workingset in one subgroup while
> > the other one is thrashing. This breaks userspace expectations.
> >
>
> Hi Johannes, thanks for pointing this out.
>
> I'm just not sure how much of a real problem this is. The refault
> challenge change was made in commit b910718a948a which was before anon
> shadow was introduced. And shadows could get reclaimed, especially
> when under pressure (and we could be doing that again by reclaiming
> full_clusters with swap tables). And MGLRU simply ignores the
> target_memcg here yet it performs surprisingly well with multiple
> memcg setups. And I did find a comment in workingset.c saying the
> kernel used to activate all pages, which is also fine. And that commit
> also mentioned the active list shrinking, but anon active list gets
> shrinked just fine without refault feedback in shrink_lruvec under
> can_age_anon_pages.
*if inactive anon is empty, as part of the second
chance logic
Please try to understand *why* this code is the way it is before
throwing it all out. It was driven by real production problems. The
fact that some workloads don't care is not proof that others won't be
hurt if you break this.
Anon refault detection was added for that reason: Once you have swap,
you facilitate anon workingsets that exceed memory capacity. At that
point, cache replacement strategies apply. Scan resistance matters.
With fast modern compression and flash swap, the anon set alone can be
larger than memory capacity. Everything that
6a3ed2123a78de22a9e2b2855068a8d89f8e14f4 says about file cache starts
applying to anonymous pages: you don't want to throw out the hot anon
workingset just because somebody is doing a one-off burst scan through
a larger set of cold, swapped out pages.
Like I said in the LSFMM thread, there is no difference between anon
and file. There didn't use to be historically. The LRU lists were
split mechanically because noswap systems became common (lots of RAM +
rotational drives = sad swap) and there was no point in scanning/aging
anonymous memory if there is no swap space.
But no reasonable argument has been put forth why anon should be aged
completely differently than file when you DO have swap.
There is more explanation of Why for the cgroup behavior in the cover
letter portion of 53138cea7f398d2cdd0fa22adeec7e16093e1ebd.