[PATCH v5 14/19] mm, swap: cleanup swap entry management workflow

Posted by Kairui Song 1 month, 3 weeks ago
From: Kairui Song <kasong@tencent.com>

The current swap entry allocation/freeing workflow has never had a clear
definition. This makes it hard to debug or add new optimizations.

This commit introduces a proper definition of how swap entries are
allocated and freed. Now, most operations are folio based, so they will
never exceed one swap cluster, and we have a cleaner border between
swap and the rest of mm, making it much easier to follow and debug,
especially with the newly added sanity checks. This also makes more
optimizations possible.

Swap entries will now mostly be allocated and freed with a folio bound.
The folio lock is useful for resolving many swap related races.

Now swap allocation (except hibernation) always starts with a folio in
the swap cache, and the entries get duped/freed under the folio lock
(a condensed usage sketch follows the list below):

- folio_alloc_swap() - The only allocation entry point now.
  Context: The folio must be locked.
  This allocates one or a set of continuous swap slots for a folio and
  binds them to the folio by adding the folio to the swap cache. The
  swap slots' swap count starts at zero.

- folio_dup_swap() - Increase the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This increases the ref count of swap entries allocated to a folio.
  Newly allocated swap slots' count has to be increased by this helper
  as the folio gets unmapped (and swap entries get installed).

- folio_put_swap() - Decrease the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This decreases the ref count of swap entries allocated to a folio.
  Typically, swapin will decrease the swap count as the folio gets
  installed back and the swap entry gets uninstalled.

  This won't remove the folio from the swap cache and free the
  slot. Lazy freeing of swap cache is helpful for reducing IO.
  There is already a folio_free_swap() for immediate cache reclaim.
  This part could be further optimized later.
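
For illustration, a condensed sketch of the folio-bound lifecycle
described above (pseudocode rather than a real kernel path: error
handling, the rmap walk and the PTL handling are elided, and names like
subpage, mm, addr, pte and entry are placeholders):

        /* Swap-out side (reclaim), with the folio locked: */
        if (folio_alloc_swap(folio))            /* folio added to swap cache, count == 0 */
                goto keep_folio;
        /* later, for each mapping torn down under the PTL: */
        if (folio_dup_swap(folio, subpage) < 0) /* count 0 -> 1 for the installed entry */
                goto abort_unmap;
        set_pte_at(mm, addr, pte, swp_entry_to_pte(entry));

        /* Swap-in side (e.g. do_swap_page), folio locked, PTL held: */
        set_pte_at(mm, addr, pte, mk_pte(page, vma->vm_page_prot));
        folio_put_swap(folio, page);            /* count back to 0, slot stays cached */
        if (should_try_to_free_swap(...))       /* optional eager reclaim */
                folio_free_swap(folio);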

The above locking constraints could be further relaxed when the swap
table is fully implemented. Currently, dup still needs the caller
to lock the swap entry container (e.g., the PTL), or a concurrent zap
may underflow the swap count.

Some swap users need to interact with the swap count without involving a
folio (e.g. forking/zapping the page table, or truncating a mapping
without swapin). In such cases, the caller has to ensure there is no race
condition on whatever owns the swap count and use the helpers below
(see the sketch after this list):

- swap_put_entries_direct() - Decrease the swap count directly.
  Context: The caller must lock whatever is referencing the slots to
  avoid a race.

  Typically, page table zapping or shmem mapping truncation will need
  to free swap slots directly. If a slot is cached (has a folio bound),
  this will also try to release the swap cache.

- swap_dup_entry_direct() - Increase the swap count directly.
  Context: The caller must lock whatever is referencing the entries to
  avoid a race, and the entries must already have a swap count > 1.

  Typically, forking will need to copy the page table and hence needs to
  increase the swap count of the entries in the table. The page table is
  locked while referencing the swap entries, so the entries all have a
  swap count > 1 and can't be freed.
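
As a rough illustration of how the direct helpers pair up in the page
table paths (this mirrors the copy_nonpresent_pte() and
zap_nonpresent_ptes() hunks below; the PTE batching loop and most error
handling are omitted):

        /* fork: copying one swap PTE, with the source page table locked: */
        entry = pte_to_swp_entry(orig_pte);
        if (swap_dup_entry_direct(entry) < 0)   /* may need a swap count continuation */
                return -EIO;
        set_pte_at(dst_mm, addr, dst_pte, orig_pte);

        /* zap / truncate: dropping nr contiguous swap entries, PTL held: */
        swap_put_entries_direct(entry, nr);     /* may also reclaim the swap cache */
        clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);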

The hibernation subsystem is a bit different, so two special wrappers
are provided for it:

- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above
  helper.

All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.
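
A minimal sketch of the hibernation-side pairing (the kernel/power/swap.c
hunks below are the real users; swap_type here stands for the swap device
index, e.g. as returned by swap_type_of()):

        swp_entry_t entry = swap_alloc_hibernation_slot(swap_type);
        if (!swp_offset(entry))
                return 0;                       /* allocation failed */
        /* ... write an image page to swapdev_block(swap_type, swp_offset(entry)) ... */
        swap_free_hibernation_slot(entry);      /* drop it when the snapshot is released */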

By separating the workflows, it will be possible to bind folios more
tightly to the swap cache and get rid of SWAP_HAS_CACHE as a temporary
pin.

This commit should not introduce any behavior change.

Cc: linux-pm@vger.kernel.org
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 arch/s390/mm/gmap_helpers.c |   2 +-
 arch/s390/mm/pgtable.c      |   2 +-
 include/linux/swap.h        |  58 ++++++++---------
 kernel/power/swap.c         |  10 +--
 mm/madvise.c                |   2 +-
 mm/memory.c                 |  15 +++--
 mm/rmap.c                   |   7 +-
 mm/shmem.c                  |  10 +--
 mm/swap.h                   |  37 +++++++++++
 mm/swapfile.c               | 152 +++++++++++++++++++++++++++++++-------------
 10 files changed, 197 insertions(+), 98 deletions(-)

diff --git a/arch/s390/mm/gmap_helpers.c b/arch/s390/mm/gmap_helpers.c
index d41b19925a5a..dd89fce28531 100644
--- a/arch/s390/mm/gmap_helpers.c
+++ b/arch/s390/mm/gmap_helpers.c
@@ -32,7 +32,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *mm, softleaf_t entry)
 		dec_mm_counter(mm, MM_SWAPENTS);
 	else if (softleaf_is_migration(entry))
 		dec_mm_counter(mm, mm_counter(softleaf_to_folio(entry)));
-	free_swap_and_cache(entry);
+	swap_put_entries_direct(entry, 1);
 }
 
 /**
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 666adcd681ab..b22181e1079e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -682,7 +682,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *mm, softleaf_t entry)
 
 		dec_mm_counter(mm, mm_counter(folio));
 	}
-	free_swap_and_cache(entry);
+	swap_put_entries_direct(entry, 1);
 }
 
 void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 74df3004c850..aaa868f60b9c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -452,14 +452,8 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-int folio_alloc_swap(struct folio *folio);
-bool folio_free_swap(struct folio *folio);
 void put_swap_folio(struct folio *folio, swp_entry_t entry);
-extern swp_entry_t get_swap_page_of_type(int);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern int swap_duplicate_nr(swp_entry_t entry, int nr);
-extern void swap_free_nr(swp_entry_t entry, int nr_pages);
-extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
 int swap_type_of(dev_t device, sector_t offset);
 int find_first_swap(dev_t *device);
 extern unsigned int count_swap_pages(int, int);
@@ -471,6 +465,29 @@ struct backing_dev_info;
 extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
 sector_t swap_folio_sector(struct folio *folio);
 
+/*
+ * If there is an existing swap slot reference (swap entry) and the caller
+ * guarantees that there is no race modification of it (e.g., PTL
+ * protecting the swap entry in page table; shmem's cmpxchg protects
+ * the swap entry in shmem mapping), these two helpers below can be used
+ * to put/dup the entries directly.
+ *
+ * All entries must be allocated by folio_alloc_swap(). And they must have
+ * a swap count > 1. See comments of folio_*_swap helpers for more info.
+ */
+int swap_dup_entry_direct(swp_entry_t entry);
+void swap_put_entries_direct(swp_entry_t entry, int nr);
+
+/*
+ * folio_free_swap tries to free the swap entries pinned by a swap cache
+ * folio, it has to be here to be called by other components.
+ */
+bool folio_free_swap(struct folio *folio);
+
+/* Allocate / free (hibernation) exclusive entries */
+swp_entry_t swap_alloc_hibernation_slot(int type);
+void swap_free_hibernation_slot(swp_entry_t entry);
+
 static inline void put_swap_device(struct swap_info_struct *si)
 {
 	percpu_ref_put(&si->users);
@@ -498,10 +515,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
 #define free_pages_and_swap_cache(pages, nr) \
 	release_pages((pages), (nr));
 
-static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
-{
-}
-
 static inline void free_swap_cache(struct folio *folio)
 {
 }
@@ -511,12 +524,12 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 	return 0;
 }
 
-static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
+static inline int swap_dup_entry_direct(swp_entry_t ent)
 {
 	return 0;
 }
 
-static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
+static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
 {
 }
 
@@ -539,11 +552,6 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-static inline int folio_alloc_swap(struct folio *folio)
-{
-	return -EINVAL;
-}
-
 static inline bool folio_free_swap(struct folio *folio)
 {
 	return false;
@@ -556,22 +564,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 	return -EINVAL;
 }
 #endif /* CONFIG_SWAP */
-
-static inline int swap_duplicate(swp_entry_t entry)
-{
-	return swap_duplicate_nr(entry, 1);
-}
-
-static inline void free_swap_and_cache(swp_entry_t entry)
-{
-	free_swap_and_cache_nr(entry, 1);
-}
-
-static inline void swap_free(swp_entry_t entry)
-{
-	swap_free_nr(entry, 1);
-}
-
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 33a186373bef..859476a714ac 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -174,10 +174,10 @@ sector_t alloc_swapdev_block(int swap)
 	 * Allocate a swap page and register that it has been allocated, so that
 	 * it can be freed in case of an error.
 	 */
-	offset = swp_offset(get_swap_page_of_type(swap));
+	offset = swp_offset(swap_alloc_hibernation_slot(swap));
 	if (offset) {
 		if (swsusp_extents_insert(offset))
-			swap_free(swp_entry(swap, offset));
+			swap_free_hibernation_slot(swp_entry(swap, offset));
 		else
 			return swapdev_block(swap, offset);
 	}
@@ -186,6 +186,7 @@ sector_t alloc_swapdev_block(int swap)
 
 void free_all_swap_pages(int swap)
 {
+	unsigned long offset;
 	struct rb_node *node;
 
 	/*
@@ -197,8 +198,9 @@ void free_all_swap_pages(int swap)
 
 		ext = rb_entry(node, struct swsusp_extent, node);
 		rb_erase(node, &swsusp_extents);
-		swap_free_nr(swp_entry(swap, ext->start),
-			     ext->end - ext->start + 1);
+
+		for (offset = ext->start; offset < ext->end; offset++)
+			swap_free_hibernation_slot(swp_entry(swap, offset));
 
 		kfree(ext);
 	}
diff --git a/mm/madvise.c b/mm/madvise.c
index 6bf7009fa5ce..5f79f6fabfc0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -694,7 +694,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				max_nr = (end - addr) / PAGE_SIZE;
 				nr = swap_pte_batch(pte, max_nr, ptent);
 				nr_swap -= nr;
-				free_swap_and_cache_nr(entry, nr);
+				swap_put_entries_direct(entry, nr);
 				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
 			} else if (softleaf_is_hwpoison(entry) ||
 				   softleaf_is_poison_marker(entry)) {
diff --git a/mm/memory.c b/mm/memory.c
index a4c58341c44a..a61508107f6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -934,7 +934,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	struct page *page;
 
 	if (likely(softleaf_is_swap(entry))) {
-		if (swap_duplicate(entry) < 0)
+		if (swap_dup_entry_direct(entry) < 0)
 			return -EIO;
 
 		/* make sure dst_mm is on swapoff's mmlist. */
@@ -1744,7 +1744,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 
 		nr = swap_pte_batch(pte, max_nr, ptent);
 		rss[MM_SWAPENTS] -= nr;
-		free_swap_and_cache_nr(entry, nr);
+		swap_put_entries_direct(entry, nr);
 	} else if (softleaf_is_migration(entry)) {
 		struct folio *folio = softleaf_to_folio(entry);
 
@@ -4933,7 +4933,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	/*
 	 * Some architectures may have to restore extra metadata to the page
 	 * when reading from swap. This metadata may be indexed by swap entry
-	 * so this must be called before swap_free().
+	 * so this must be called before folio_put_swap().
 	 */
 	arch_swap_restore(folio_swap(entry, folio), folio);
 
@@ -4971,6 +4971,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(folio != swapcache)) {
 		folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
+		folio_put_swap(swapcache, NULL);
 	} else if (!folio_test_anon(folio)) {
 		/*
 		 * We currently only expect !anon folios that are fully
@@ -4979,9 +4980,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
 		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
+		folio_put_swap(folio, NULL);
 	} else {
+		VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
 		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
-					rmap_flags);
+					 rmap_flags);
+		folio_put_swap(folio, nr_pages == 1 ? page : NULL);
 	}
 
 	VM_BUG_ON(!folio_test_anon(folio) ||
@@ -4995,7 +4999,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * Do it after mapping, so raced page faults will likely see the folio
 	 * in swap cache and wait on the folio lock.
 	 */
-	swap_free_nr(entry, nr_pages);
 	if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
 		folio_free_swap(folio);
 
@@ -5005,7 +5008,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		 * Hold the lock to avoid the swap entry to be reused
 		 * until we take the PT lock for the pte_same() check
 		 * (to avoid false positives from pte_same). For
-		 * further safety release the lock after the swap_free
+		 * further safety release the lock after the folio_put_swap
 		 * so that the swap count won't change under a
 		 * parallel locked swapcache.
 		 */
diff --git a/mm/rmap.c b/mm/rmap.c
index d6799afe1114..e805ddc5a27b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -82,6 +82,7 @@
 #include <trace/events/migrate.h>
 
 #include "internal.h"
+#include "swap.h"
 
 static struct kmem_cache *anon_vma_cachep;
 static struct kmem_cache *anon_vma_chain_cachep;
@@ -2147,7 +2148,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				goto discard;
 			}
 
-			if (swap_duplicate(entry) < 0) {
+			if (folio_dup_swap(folio, subpage) < 0) {
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
@@ -2158,7 +2159,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * so we'll not check/care.
 			 */
 			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-				swap_free(entry);
+				folio_put_swap(folio, subpage);
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
@@ -2166,7 +2167,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			/* See folio_try_share_anon_rmap(): clear PTE first. */
 			if (anon_exclusive &&
 			    folio_try_share_anon_rmap_pte(folio, subpage)) {
-				swap_free(entry);
+				folio_put_swap(folio, subpage);
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
diff --git a/mm/shmem.c b/mm/shmem.c
index e36330cdd066..df346f0c8ddc 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -970,7 +970,7 @@ static long shmem_free_swap(struct address_space *mapping,
 	old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
 	if (old != radswap)
 		return 0;
-	free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
+	swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order);
 
 	return 1 << order;
 }
@@ -1667,7 +1667,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 			spin_unlock(&shmem_swaplist_lock);
 		}
 
-		swap_duplicate_nr(folio->swap, nr_pages);
+		folio_dup_swap(folio, NULL);
 		shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
 
 		BUG_ON(folio_mapped(folio));
@@ -1688,7 +1688,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 		/* Swap entry might be erased by racing shmem_free_swap() */
 		if (!error) {
 			shmem_recalc_inode(inode, 0, -nr_pages);
-			swap_free_nr(folio->swap, nr_pages);
+			folio_put_swap(folio, NULL);
 		}
 
 		/*
@@ -2174,6 +2174,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 
 	nr_pages = folio_nr_pages(folio);
 	folio_wait_writeback(folio);
+	folio_put_swap(folio, NULL);
 	swap_cache_del_folio(folio);
 	/*
 	 * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
@@ -2181,7 +2182,6 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	 * in shmem_evict_inode().
 	 */
 	shmem_recalc_inode(inode, -nr_pages, -nr_pages);
-	swap_free_nr(swap, nr_pages);
 }
 
 static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
@@ -2404,9 +2404,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (sgp == SGP_WRITE)
 		folio_mark_accessed(folio);
 
+	folio_put_swap(folio, NULL);
 	swap_cache_del_folio(folio);
 	folio_mark_dirty(folio);
-	swap_free_nr(swap, nr_pages);
 	put_swap_device(si);
 
 	*foliop = folio;
diff --git a/mm/swap.h b/mm/swap.h
index 6777b2ab9d92..9ed12936b889 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -183,6 +183,28 @@ static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
 	spin_unlock_irq(&ci->lock);
 }
 
+/*
+ * Below are the core routines for doing swap for a folio.
+ * All helpers require the folio to be locked, and a locked folio
+ * in the swap cache pins the swap entries / slots allocated to the
+ * folio; swap relies heavily on the swap cache and folio lock for
+ * synchronization.
+ *
+ * folio_alloc_swap(): the entry point for a folio to be swapped
+ * out. It allocates swap slots and pins the slots with swap cache.
+ * The slots start with a swap count of zero.
+ *
+ * folio_dup_swap(): increases the swap count of a folio, usually
+ * as it gets unmapped and a swap entry is installed to replace
+ * it (e.g., swap entry in page table). A swap slot with swap
+ * count == 0 should only be increased by this helper.
+ *
+ * folio_put_swap(): does the opposite thing of folio_dup_swap().
+ */
+int folio_alloc_swap(struct folio *folio);
+int folio_dup_swap(struct folio *folio, struct page *subpage);
+void folio_put_swap(struct folio *folio, struct page *subpage);
+
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
@@ -363,9 +385,24 @@ static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
 	return NULL;
 }
 
+static inline int folio_alloc_swap(struct folio *folio)
+{
+	return -EINVAL;
+}
+
+static inline int folio_dup_swap(struct folio *folio, struct page *page)
+{
+	return -EINVAL;
+}
+
+static inline void folio_put_swap(struct folio *folio, struct page *page)
+{
+}
+
 static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 {
 }
+
 static inline void swap_write_unplug(struct swap_iocb *sio)
 {
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 38f3c369df72..f812fdea68b3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -58,6 +58,9 @@ static void swap_entries_free(struct swap_info_struct *si,
 			      swp_entry_t entry, unsigned int nr_pages);
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
+static bool swap_entries_put_map(struct swap_info_struct *si,
+				 swp_entry_t entry, int nr);
 static bool folio_swapcache_freeable(struct folio *folio);
 static void move_cluster(struct swap_info_struct *si,
 			 struct swap_cluster_info *ci, struct list_head *list,
@@ -1482,6 +1485,12 @@ int folio_alloc_swap(struct folio *folio)
 	 */
 	WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true));
 
+	/*
+	 * Allocator should always allocate aligned entries so folio based
+	 * operations never cross more than one cluster.
+	 */
+	VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio);
+
 	return 0;
 
 out_free:
@@ -1489,6 +1498,66 @@ int folio_alloc_swap(struct folio *folio)
 	return -ENOMEM;
 }
 
+/**
+ * folio_dup_swap() - Increase swap count of swap entries of a folio.
+ * @folio: folio with swap entries bound.
+ * @subpage: if not NULL, only increase the swap count of this subpage.
+ *
+ * Typically called when the folio is unmapped and has its swap entry
+ * installed to take its place.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ * NOTE: The caller also has to ensure there is no racing call to
+ * swap_put_entries_direct on its swap entry before this helper returns, or
+ * the swap map may underflow. Currently, we only accept @subpage == NULL
+ * for shmem due to the limitation of swap continuation: shmem always
+ * duplicates the swap entry only once, so there is no such issue for it.
+ */
+int folio_dup_swap(struct folio *folio, struct page *subpage)
+{
+	int err = 0;
+	swp_entry_t entry = folio->swap;
+	unsigned long nr_pages = folio_nr_pages(folio);
+
+	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+	if (subpage) {
+		entry.val += folio_page_idx(folio, subpage);
+		nr_pages = 1;
+	}
+
+	while (!err && __swap_duplicate(entry, 1, nr_pages) == -ENOMEM)
+		err = add_swap_count_continuation(entry, GFP_ATOMIC);
+
+	return err;
+}
+
+/**
+ * folio_put_swap() - Decrease swap count of swap entries of a folio.
+ * @folio: folio with swap entries bound, must be in swap cache and locked.
+ * @subpage: if not NULL, only decrease the swap count of this subpage.
+ *
+ * This won't free the swap slots even if swap count drops to zero; they are
+ * still pinned by the swap cache. User may call folio_free_swap to free them.
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ */
+void folio_put_swap(struct folio *folio, struct page *subpage)
+{
+	swp_entry_t entry = folio->swap;
+	unsigned long nr_pages = folio_nr_pages(folio);
+
+	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+	if (subpage) {
+		entry.val += folio_page_idx(folio, subpage);
+		nr_pages = 1;
+	}
+
+	swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages);
+}
+
 static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *si;
@@ -1729,28 +1798,6 @@ static void swap_entries_free(struct swap_info_struct *si,
 		partial_free_cluster(si, ci);
 }
 
-/*
- * Caller has made sure that the swap device corresponding to entry
- * is still around or has not been recycled.
- */
-void swap_free_nr(swp_entry_t entry, int nr_pages)
-{
-	int nr;
-	struct swap_info_struct *sis;
-	unsigned long offset = swp_offset(entry);
-
-	sis = _swap_info_get(entry);
-	if (!sis)
-		return;
-
-	while (nr_pages) {
-		nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
-		swap_entries_put_map(sis, swp_entry(sis->type, offset), nr);
-		offset += nr;
-		nr_pages -= nr;
-	}
-}
-
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
@@ -1940,16 +1987,19 @@ bool folio_free_swap(struct folio *folio)
 }
 
 /**
- * free_swap_and_cache_nr() - Release reference on range of swap entries and
- *                            reclaim their cache if no more references remain.
+ * swap_put_entries_direct() - Release reference on range of swap entries and
+ *                             reclaim their cache if no more references remain.
  * @entry: First entry of range.
  * @nr: Number of entries in range.
  *
  * For each swap entry in the contiguous range, release a reference. If any swap
  * entries become free, try to reclaim their underlying folios, if present. The
  * offset range is defined by [entry.offset, entry.offset + nr).
+ *
+ * Context: Caller must ensure there is no race condition on the reference
+ * owner. e.g., locking the PTL of a PTE containing the entry being released.
  */
-void free_swap_and_cache_nr(swp_entry_t entry, int nr)
+void swap_put_entries_direct(swp_entry_t entry, int nr)
 {
 	const unsigned long start_offset = swp_offset(entry);
 	const unsigned long end_offset = start_offset + nr;
@@ -1958,10 +2008,9 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
 	unsigned long offset;
 
 	si = get_swap_device(entry);
-	if (!si)
+	if (WARN_ON_ONCE(!si))
 		return;
-
-	if (WARN_ON(end_offset > si->max))
+	if (WARN_ON_ONCE(end_offset > si->max))
 		goto out;
 
 	/*
@@ -2005,8 +2054,8 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
 }
 
 #ifdef CONFIG_HIBERNATION
-
-swp_entry_t get_swap_page_of_type(int type)
+/* Allocate a slot for hibernation */
+swp_entry_t swap_alloc_hibernation_slot(int type)
 {
 	struct swap_info_struct *si = swap_type_to_info(type);
 	unsigned long offset;
@@ -2034,6 +2083,27 @@ swp_entry_t get_swap_page_of_type(int type)
 	return entry;
 }
 
+/* Free a slot allocated by swap_alloc_hibernation_slot */
+void swap_free_hibernation_slot(swp_entry_t entry)
+{
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	pgoff_t offset = swp_offset(entry);
+
+	si = get_swap_device(entry);
+	if (WARN_ON(!si))
+		return;
+
+	ci = swap_cluster_lock(si, offset);
+	swap_entry_put_locked(si, ci, entry, 1);
+	WARN_ON(swap_entry_swapped(si, entry));
+	swap_cluster_unlock(ci);
+
+	/* In theory readahead might add it to the swap cache by accident */
+	__try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
+	put_swap_device(si);
+}
+
 /*
  * Find the swap type that corresponds to given device (if any).
  *
@@ -2195,7 +2265,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	/*
 	 * Some architectures may have to restore extra metadata to the page
 	 * when reading from swap. This metadata may be indexed by swap entry
-	 * so this must be called before swap_free().
+	 * so this must be called before folio_put_swap().
 	 */
 	arch_swap_restore(folio_swap(entry, folio), folio);
 
@@ -2236,7 +2306,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		new_pte = pte_mkuffd_wp(new_pte);
 setpte:
 	set_pte_at(vma->vm_mm, addr, pte, new_pte);
-	swap_free(entry);
+	folio_put_swap(folio, page);
 out:
 	if (pte)
 		pte_unmap_unlock(pte, ptl);
@@ -3746,28 +3816,22 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	return err;
 }
 
-/**
- * swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
- *                       by 1.
- *
+/*
+ * swap_dup_entry_direct() - Increase reference count of a swap entry by one.
  * @entry: first swap entry from which we want to increase the refcount.
- * @nr: Number of entries in range.
  *
  * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
  * but could not be atomically allocated.  Returns 0, just as if it succeeded,
  * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
  * might occur if a page table entry has got corrupted.
  *
- * Note that we are currently not handling the case where nr > 1 and we need to
- * add swap count continuation. This is OK, because no such user exists - shmem
- * is the only user that can pass nr > 1, and it never re-duplicates any swap
- * entry it owns.
+ * Context: Caller must ensure there is no race condition on the reference
+ * owner. e.g., locking the PTL of a PTE containing the entry being increased.
  */
-int swap_duplicate_nr(swp_entry_t entry, int nr)
+int swap_dup_entry_direct(swp_entry_t entry)
 {
 	int err = 0;
-
-	while (!err && __swap_duplicate(entry, 1, nr) == -ENOMEM)
+	while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
 		err = add_swap_count_continuation(entry, GFP_ATOMIC);
 	return err;
 }

-- 
2.52.0
Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Posted by Chris Mason 1 week, 3 days ago
Kairui Song <ryncsn@gmail.com> wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The current swap entry allocation/freeing workflow has never had a clear
> definition. This makes it hard to debug or add new optimizations.
> 
> This commit introduces a proper definition of how swap entries would be
> allocated and freed. Now, most operations are folio based, so they will
> never exceed one swap cluster, and we now have a cleaner border between
> swap and the rest of mm, making it much easier to follow and debug,
> especially with new added sanity checks. Also making more optimization
> possible.
> 
> Swap entry will be mostly allocated and free with a folio bound.
> The folio lock will be useful for resolving many swap ralated races.
> 
> Now swap allocation (except hibernation) always starts with a folio in
> the swap cache, and gets duped/freed protected by the folio lock:

[ ... ]

Hi everyone,

Apologies for looping back around on this one, but I've got some prompt
changes that break down larger commits into sections and run them
independently through review.  I reran it on mm-new, and it found something
new:

> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 64970ee11fcf7..85bf4f7d9ae7b 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c

[ ... ]

> @@ -2195,7 +2265,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  	/*
>  	 * Some architectures may have to restore extra metadata to the page
>  	 * when reading from swap. This metadata may be indexed by swap entry
> -	 * so this must be called before swap_free().
> +	 * so this must be called before folio_put_swap().
>  	 */
>  	arch_swap_restore(folio_swap(entry, folio), folio);
>
> @@ -2236,7 +2306,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  		new_pte = pte_mkuffd_wp(new_pte);
>  setpte:
>  	set_pte_at(vma->vm_mm, addr, pte, new_pte);
> -	swap_free(entry);
> +	folio_put_swap(folio, page);

Does this corrupt the swap map when ksm_might_need_to_copy() returns a
new folio? In that case, folio != swapcache, and the new folio is NOT in
the swap cache with folio->swap = 0. This would trigger
VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio) and call
swap_entries_put_map() with entry.val = 0.

Compare with do_swap_page() which correctly uses folio_put_swap(swapcache,
NULL) when folio != swapcache. Should this use the original entry parameter
or the swapcache folio instead?
Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Posted by Kairui Song 1 week, 2 days ago
On Thu, Jan 29, 2026 at 11:32:38AM +0800, Chris Mason wrote:
> Kairui Song <ryncsn@gmail.com> wrote:
> > From: Kairui Song <kasong@tencent.com>
> > 
> > The current swap entry allocation/freeing workflow has never had a clear
> > definition. This makes it hard to debug or add new optimizations.
> > 
> > This commit introduces a proper definition of how swap entries would be
> > allocated and freed. Now, most operations are folio based, so they will
> > never exceed one swap cluster, and we now have a cleaner border between
> > swap and the rest of mm, making it much easier to follow and debug,
> > especially with new added sanity checks. Also making more optimization
> > possible.
> > 
> > Swap entry will be mostly allocated and free with a folio bound.
> > The folio lock will be useful for resolving many swap ralated races.
> > 
> > Now swap allocation (except hibernation) always starts with a folio in
> > the swap cache, and gets duped/freed protected by the folio lock:
> 
> [ ... ]
> 
> Hi everyone,
> 
> Apologies for looping back around on this one, but I've got some prompt
> changes that break down larger commits into sections and run them
> independently through review.  I reran it on mm-new, and it found something
> new:
> 
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 64970ee11fcf7..85bf4f7d9ae7b 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> 
> [ ... ]
> 
> > @@ -2195,7 +2265,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> >  	/*
> >  	 * Some architectures may have to restore extra metadata to the page
> >  	 * when reading from swap. This metadata may be indexed by swap entry
> > -	 * so this must be called before swap_free().
> > +	 * so this must be called before folio_put_swap().
> >  	 */
> >  	arch_swap_restore(folio_swap(entry, folio), folio);
> >
> > @@ -2236,7 +2306,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> >  		new_pte = pte_mkuffd_wp(new_pte);
> >  setpte:
> >  	set_pte_at(vma->vm_mm, addr, pte, new_pte);
> > -	swap_free(entry);
> > +	folio_put_swap(folio, page);
> 
> Does this corrupt the swap map when ksm_might_need_to_copy() returns a
> new folio? In that case, folio != swapcache, and the new folio is NOT in
> the swap cache with folio->swap = 0. This would trigger
> VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio) and call
> swap_entries_put_map() with entry.val = 0.
> 
> Compare with do_swap_page() which correctly uses folio_put_swap(swapcache,
> NULL) when folio != swapcache. Should this use the original entry parameter
> or the swapcache folio instead?

Thanks again for running the AI review; it's really helpful.

This is a valid case; I indeed missed the KSM-copied pages in the swapoff path.

We do need the following change squashed as you suggested.

Hi Andrew, can you help squash in the following fix? I just ran more
stress tests with KSM and racing swapoff, and everything is looking
good now.

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8c0f31363c1f..d652486898de 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2305,7 +2305,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		new_pte = pte_mkuffd_wp(new_pte);
 setpte:
 	set_pte_at(vma->vm_mm, addr, pte, new_pte);
-	folio_put_swap(folio, page);
+	folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry)));
 out:
 	if (pte)
 		pte_unmap_unlock(pte, ptl);
Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Posted by Kairui Song 3 weeks, 4 days ago
On Sat, Dec 20, 2025 at 3:45 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The current swap entry allocation/freeing workflow has never had a clear
> definition. This makes it hard to debug or add new optimizations.
>
> This commit introduces a proper definition of how swap entries would be
> allocated and freed. Now, most operations are folio based, so they will
> never exceed one swap cluster, and we now have a cleaner border between
> swap and the rest of mm, making it much easier to follow and debug,
> especially with new added sanity checks. Also making more optimization
> possible.

...

>
> Cc: linux-pm@vger.kernel.org
> Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

Hi Andrew,

Is it convenient for you to squash this attached fix into this patch?
That's the two issues from Chris Mason and Lai Yi combined in a
clean-to-apply format; it's only a 3-line change.

There might be a minor conflict from removing the WARN_ON in the two
following patches, but it should be easy to resolve. I can send a v6 if
that's troublesome.
Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Posted by Andrew Morton 3 weeks, 4 days ago
On Thu, 15 Jan 2026 00:53:41 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> Is it convenient for you to squash this attached fix into this patch?

Done, below

> That's the two issues from Chris Mason and Lai Yi combined in a clean
> to apply format, only 3 lines change.

Let's cc them!

> There might be minor conflict by removing the WARN_ON in two following
> patches, but should be easy to resolve. I can send a v6 if that's
> troublesome.

All fixed up, thanks.


From: Kairui Song <kasong@tencent.com>
Subject: mm, swap: fix locking and leaking with hibernation snapshot releasing
Date: Thu, 15 Jan 2026 00:15:27 +0800

fix leak, per Chris Mason.  Remove WARN_ON, per Lai Yi

Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Lai Yi <yi1.lai@linux.intel.com>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/power/swap.c |    2 +-
 mm/swapfile.c       |    1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

--- a/kernel/power/swap.c~mm-swap-cleanup-swap-entry-management-workflow-fix
+++ a/kernel/power/swap.c
@@ -199,7 +199,7 @@ void free_all_swap_pages(int swap)
 		ext = rb_entry(node, struct swsusp_extent, node);
 		rb_erase(node, &swsusp_extents);
 
-		for (offset = ext->start; offset < ext->end; offset++)
+		for (offset = ext->start; offset <= ext->end; offset++)
 			swap_free_hibernation_slot(swp_entry(swap, offset));
 
 		kfree(ext);
--- a/mm/swapfile.c~mm-swap-cleanup-swap-entry-management-workflow-fix
+++ a/mm/swapfile.c
@@ -2096,7 +2096,6 @@ void swap_free_hibernation_slot(swp_entr
 
 	ci = swap_cluster_lock(si, offset);
 	swap_entry_put_locked(si, ci, entry, 1);
-	WARN_ON(swap_entry_swapped(si, entry));
 	swap_cluster_unlock(ci);
 
 	/* In theory readahead might add it to the swap cache by accident */
_
Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Posted by Chris Li 3 weeks, 3 days ago
On Wed, Jan 14, 2026 at 2:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 15 Jan 2026 00:53:41 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > Is it convenient for you to squash this attached fix into this patch?
>
> Done, below
>
> > That's the two issues from Chris Mason and Lai Yi combined in a clean
> > to apply format, only 3 lines change.
>
> Let's cc them!
>
> > There might be minor conflict by removing the WARN_ON in two following
> > patches, but should be easy to resolve. I can send a v6 if that's
> > troublesome.
>
> All fixed up, thanks.
>
>
> From: Kairui Song <kasong@tencent.com>
> Subject: mm, swap: fix locking and leaking with hibernation snapshot releasing
> Date: Thu, 15 Jan 2026 00:15:27 +0800
>
> fix leak, per Chris Mason.  Remove WARN_ON, per Lai Yi

That is a great catch. Thanks.

>
> Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Cc: Lai Yi <yi1.lai@linux.intel.com>
> Cc: Chris Mason <clm@fb.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

That small fix looks good to me.

Chris
Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Posted by Lai, Yi 3 weeks, 5 days ago
Hi Kairui Song,

Greetings!

I used Syzkaller and found that there is a possible deadlock in swap_free_hibernation_slot in linux-next next-20260113.

After bisection and the first bad commit is:
"
33be6f68989d mm. swap: cleanup swap entry management workflow
"

All detailed info can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/260114_102849_swap_free_hibernation_slot/bzImage_0f853ca2a798ead9d24d39cad99b0966815c582a
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/260114_102849_swap_free_hibernation_slot/0f853ca2a798ead9d24d39cad99b0966815c582a_dmesg.log

"
[   62.477554] ============================================
[   62.477802] WARNING: possible recursive locking detected
[   62.478059] 6.19.0-rc5-next-20260113-0f853ca2a798 #1 Not tainted
[   62.478324] --------------------------------------------
[   62.478549] repro/668 is trying to acquire lock:
[   62.478759] ffff888011664018 (&cluster_info[i].lock){+.+.}-{3:3}, at: swap_free_hibernation_slot+0x13e/0x2a0
[   62.479271]
[   62.479271] but task is already holding lock:
[   62.479519] ffff888011664018 (&cluster_info[i].lock){+.+.}-{3:3}, at: swap_free_hibernation_slot+0xfa/0x2a0
[   62.479984]
[   62.479984] other info that might help us debug this:
[   62.480293]  Possible unsafe locking scenario:
[   62.480293]
[   62.480565]        CPU0
[   62.480686]        ----
[   62.480809]   lock(&cluster_info[i].lock);
[   62.481010]   lock(&cluster_info[i].lock);
[   62.481205]
[   62.481205]  *** DEADLOCK ***
[   62.481205]
[   62.481481]  May be due to missing lock nesting notation
[   62.481481]
[   62.481802] 2 locks held by repro/668:
[   62.481981]  #0: ffffffff87542e28 (system_transition_mutex){+.+.}-{4:4}, at: lock_system_sleep+0x92/0xb0
[   62.482439]  #1: ffff888011664018 (&cluster_info[i].lock){+.+.}-{3:3}, at: swap_free_hibernation_slot+0xfa/0x0
[   62.482936]
[   62.482936] stack backtrace:
[   62.483131] CPU: 0 UID: 0 PID: 668 Comm: repro Not tainted 6.19.0-rc5-next-20260113-0f853ca2a798 #1 PREEMPT(l
[   62.483143] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.q4
[   62.483151] Call Trace:
[   62.483156]  <TASK>
[   62.483160]  dump_stack_lvl+0xea/0x150
[   62.483195]  dump_stack+0x19/0x20
[   62.483206]  print_deadlock_bug+0x22e/0x300
[   62.483215]  __lock_acquire+0x1325/0x2210
[   62.483226]  lock_acquire+0x170/0x2f0
[   62.483234]  ? swap_free_hibernation_slot+0x13e/0x2a0
[   62.483249]  _raw_spin_lock+0x38/0x50
[   62.483267]  ? swap_free_hibernation_slot+0x13e/0x2a0
[   62.483279]  swap_free_hibernation_slot+0x13e/0x2a0
[   62.483291]  ? __pfx_swap_free_hibernation_slot+0x10/0x10
[   62.483303]  ? locks_remove_file+0xe2/0x7f0
[   62.483322]  ? __pfx_snapshot_release+0x10/0x10
[   62.483331]  free_all_swap_pages+0xdd/0x160
[   62.483339]  ? __pfx_snapshot_release+0x10/0x10
[   62.483346]  snapshot_release+0xac/0x200
[   62.483353]  __fput+0x41f/0xb70
[   62.483369]  ____fput+0x22/0x30
[   62.483376]  task_work_run+0x19e/0x2b0
[   62.483391]  ? __pfx_task_work_run+0x10/0x10
[   62.483398]  ? nsproxy_free+0x2da/0x5b0
[   62.483410]  ? switch_task_namespaces+0x118/0x130
[   62.483421]  do_exit+0x869/0x2810
[   62.483435]  ? do_group_exit+0x1d8/0x2c0
[   62.483445]  ? __pfx_do_exit+0x10/0x10
[   62.483451]  ? __this_cpu_preempt_check+0x21/0x30
[   62.483463]  ? _raw_spin_unlock_irq+0x2c/0x60
[   62.483474]  ? lockdep_hardirqs_on+0x85/0x110
[   62.483486]  ? _raw_spin_unlock_irq+0x2c/0x60
[   62.483498]  ? trace_hardirqs_on+0x26/0x130
[   62.483516]  do_group_exit+0xe4/0x2c0
[   62.483524]  __x64_sys_exit_group+0x4d/0x60
[   62.483531]  x64_sys_call+0x21a2/0x21b0
[   62.483544]  do_syscall_64+0x6d/0x1180
[   62.483560]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   62.483584] RIP: 0033:0x7fe84fb18a4d
[   62.483595] Code: Unable to access opcode bytes at 0x7fe84fb18a23.
[   62.483602] RSP: 002b:00007fff3e35c928 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[   62.483610] RAX: ffffffffffffffda RBX: 00007fe84fbf69e0 RCX: 00007fe84fb18a4d
[   62.483615] RDX: 00000000000000e7 RSI: ffffffffffffff80 RDI: 0000000000000001
[   62.483620] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000020
[   62.483624] R10: 00007fff3e35c7d0 R11: 0000000000000246 R12: 00007fe84fbf69e0
[   62.483629] R13: 00007fe84fbfbf00 R14: 0000000000000001 R15: 00007fe84fbfbee8
[   62.483640]  </TASK>
"

Hope this could be insightful to you.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           //x should equal or less than cpu num your pc has

Fill the bzImage file into above start3.sh to load the target kernel in vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install
Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Posted by Kairui Song 3 weeks, 4 days ago
On Wed, Jan 14, 2026 at 9:28 PM Lai, Yi <yi1.lai@linux.intel.com> wrote:
>
> Hi Kairui Song,
>
> Greetings!
>
> I used Syzkaller and found that there is possible deadlock in swap_free_hibernation_slot in linux-next next-20260113.
>
> After bisection and the first bad commit is:
> "
> 33be6f68989d mm. swap: cleanup swap entry management workflow
> "
>
> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/260114_102849_swap_free_hibernation_slot/bzImage_0f853ca2a798ead9d24d39cad99b0966815c582a
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/260114_102849_swap_free_hibernation_slot/0f853ca2a798ead9d24d39cad99b0966815c582a_dmesg.log
>
> "
> [   62.477554] ============================================
> [   62.477802] WARNING: possible recursive locking detected
> [   62.478059] 6.19.0-rc5-next-20260113-0f853ca2a798 #1 Not tainted
> [   62.478324] --------------------------------------------
> [   62.478549] repro/668 is trying to acquire lock:
> [   62.478759] ffff888011664018 (&cluster_info[i].lock){+.+.}-{3:3}, at: swap_free_hibernation_slot+0x13e/0x2a0
> [   62.479271]
> [   62.479271] but task is already holding lock:
> [   62.479519] ffff888011664018 (&cluster_info[i].lock){+.+.}-{3:3}, at: swap_free_hibernation_slot+0xfa/0x2a0
> [   62.479984]
> [   62.479984] other info that might help us debug this:
> [   62.480293]  Possible unsafe locking scenario:
> [   62.480293]
> [   62.480565]        CPU0
> [   62.480686]        ----
> [   62.480809]   lock(&cluster_info[i].lock);
> [   62.481010]   lock(&cluster_info[i].lock);
> [   62.481205]
> [   62.481205]  *** DEADLOCK ***
> [   62.481205]
> [   62.481481]  May be due to missing lock nesting notation
> [   62.481481]
> [   62.481802] 2 locks held by repro/668:
> [   62.481981]  #0: ffffffff87542e28 (system_transition_mutex){+.+.}-{4:4}, at: lock_system_sleep+0x92/0xb0
> [   62.482439]  #1: ffff888011664018 (&cluster_info[i].lock){+.+.}-{3:3}, at: swap_free_hibernation_slot+0xfa/0x0
> [   62.482936]
> [   62.482936] stack backtrace:
> [   62.483131] CPU: 0 UID: 0 PID: 668 Comm: repro Not tainted 6.19.0-rc5-next-20260113-0f853ca2a798 #1 PREEMPT(l
> [   62.483143] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.q4
> [   62.483151] Call Trace:
> [   62.483156]  <TASK>
> [   62.483160]  dump_stack_lvl+0xea/0x150
> [   62.483195]  dump_stack+0x19/0x20
> [   62.483206]  print_deadlock_bug+0x22e/0x300
> [   62.483215]  __lock_acquire+0x1325/0x2210
> [   62.483226]  lock_acquire+0x170/0x2f0
> [   62.483234]  ? swap_free_hibernation_slot+0x13e/0x2a0
> [   62.483249]  _raw_spin_lock+0x38/0x50
> [   62.483267]  ? swap_free_hibernation_slot+0x13e/0x2a0
> [   62.483279]  swap_free_hibernation_slot+0x13e/0x2a0
> [   62.483291]  ? __pfx_swap_free_hibernation_slot+0x10/0x10
> [   62.483303]  ? locks_remove_file+0xe2/0x7f0
> [   62.483322]  ? __pfx_snapshot_release+0x10/0x10
> [   62.483331]  free_all_swap_pages+0xdd/0x160
> [   62.483339]  ? __pfx_snapshot_release+0x10/0x10
> [   62.483346]  snapshot_release+0xac/0x200
> [   62.483353]  __fput+0x41f/0xb70
> [   62.483369]  ____fput+0x22/0x30
> [   62.483376]  task_work_run+0x19e/0x2b0
> [   62.483391]  ? __pfx_task_work_run+0x10/0x10
> [   62.483398]  ? nsproxy_free+0x2da/0x5b0
> [   62.483410]  ? switch_task_namespaces+0x118/0x130
> [   62.483421]  do_exit+0x869/0x2810
> [   62.483435]  ? do_group_exit+0x1d8/0x2c0
> [   62.483445]  ? __pfx_do_exit+0x10/0x10
> [   62.483451]  ? __this_cpu_preempt_check+0x21/0x30
> [   62.483463]  ? _raw_spin_unlock_irq+0x2c/0x60
> [   62.483474]  ? lockdep_hardirqs_on+0x85/0x110
> [   62.483486]  ? _raw_spin_unlock_irq+0x2c/0x60
> [   62.483498]  ? trace_hardirqs_on+0x26/0x130
> [   62.483516]  do_group_exit+0xe4/0x2c0
> [   62.483524]  __x64_sys_exit_group+0x4d/0x60
> [   62.483531]  x64_sys_call+0x21a2/0x21b0
> [   62.483544]  do_syscall_64+0x6d/0x1180
> [   62.483560]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   62.483584] RIP: 0033:0x7fe84fb18a4d
> [   62.483595] Code: Unable to access opcode bytes at 0x7fe84fb18a23.
> [   62.483602] RSP: 002b:00007fff3e35c928 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> [   62.483610] RAX: ffffffffffffffda RBX: 00007fe84fbf69e0 RCX: 00007fe84fb18a4d
> [   62.483615] RDX: 00000000000000e7 RSI: ffffffffffffff80 RDI: 0000000000000001
> [   62.483620] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000020
> [   62.483624] R10: 00007fff3e35c7d0 R11: 0000000000000246 R12: 00007fe84fbf69e0
> [   62.483629] R13: 00007fe84fbfbf00 R14: 0000000000000001 R15: 00007fe84fbfbee8
> [   62.483640]  </TASK>
> "
>
> Hope this cound be insightful to you.
>
> Regards,
> Yi Lai
>
> ---
>
> If you don't need the following environment to reproduce the problem or if you
> already have one reproduced environment, please ignore the following information.
>
> How to reproduce:
> git clone https://gitlab.com/xupengfe/repro_vm_env.git
> cd repro_vm_env
> tar -xvf repro_vm_env.tar.gz
> cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
>   // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
>   // You could change the bzImage_xxx as you want
>   // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
> You could use below command to log in, there is no password for root.
> ssh -p 10023 root@localhost
>
> After login vm(virtual machine) successfully, you could transfer reproduced
> binary to the vm by below way, and reproduce the problem in vm:
> gcc -pthread -o repro repro.c
> scp -P 10023 repro root@localhost:/root/
>
> Get the bzImage for target kernel:
> Please use target kconfig and copy it to kernel_src/.config
> make olddefconfig
> make -jx bzImage           //x should equal or less than cpu num your pc has
>
> Fill the bzImage file into above start3.sh to load the target kernel in vm.
>
>
> Tips:
> If you already have qemu-system-x86_64, please ignore below info.
> If you want to install qemu v7.1.0 version:
> git clone https://github.com/qemu/qemu.git
> cd qemu
> git checkout -f v7.1.0
> mkdir build
> cd build
> yum install -y ninja-build.x86_64
> yum -y install libslirp-devel.x86_64
> ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
> make
> make install
>

Thanks Lai!

The issue is with the WARN_ON I added... I didn't notice that the check
it wraps takes the ci lock, which is already held there, so we'd better
just remove it.

Following change should fix the issue you reported:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 85bf4f7d9ae7..8c0f31363c1f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2096,7 +2096,6 @@ void swap_free_hibernation_slot(swp_entry_t entry)

        ci = swap_cluster_lock(si, offset);
        swap_entry_put_locked(si, ci, entry, 1);
-       WARN_ON(swap_entry_swapped(si, entry));
        swap_cluster_unlock(ci);

        /* In theory readahead might add it to the swap cache by accident */

---

swap_entry_swapped() takes the CI lock itself, so calling it here with
the lock already held self-deadlocks. There wasn't any WARN_ON there
before; it was only added to ensure things worked as expected and really
isn't needed.
Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Posted by Chris Mason 3 weeks, 5 days ago
On Sat, 20 Dec 2025 03:43:43 +0800 Kairui Song <ryncsn@gmail.com> wrote:

Hi everyone,

I ran mm-new through the AI review prompts, and this one was flagged:

> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index 8050e51828351..036489d3b01a9 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c

[ ... ]

> @@ -186,6 +186,7 @@ sector_t alloc_swapdev_block(int swap)
>
>  void free_all_swap_pages(int swap)
>  {
> +	unsigned long offset;
>  	struct rb_node *node;
>
>  	/*
> @@ -197,8 +198,9 @@ void free_all_swap_pages(int swap)
>
>  		ext = rb_entry(node, struct swsusp_extent, node);
>  		rb_erase(node, &swsusp_extents);
> -		swap_free_nr(swp_entry(swap, ext->start),
> -			     ext->end - ext->start + 1);
> +
> +		for (offset = ext->start; offset < ext->end; offset++)
> +			swap_free_hibernation_slot(swp_entry(swap, offset));
                                           ^^^^^^^^^^^

Is this loop missing the final slot at ext->end?

The old code freed (ext->end - ext->start + 1) entries, which is the
inclusive range [ext->start, ext->end]. Looking at swsusp_extents_insert(),
when a single slot is inserted it sets ext->start = ext->end = offset, and
merging extends the range by incrementing ext->end or decrementing
ext->start. This means ext->end is inclusive.

The new loop uses "offset < ext->end" which iterates [ext->start, ext->end),
missing the slot at ext->end. Should this be "offset <= ext->end" instead?
Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Posted by Kairui Song 3 weeks, 4 days ago
On Wed, Jan 14, 2026 at 8:17 PM Chris Mason <clm@meta.com> wrote:
>
> On Sat, 20 Dec 2025 03:43:43 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> Hi everyone,
>
> I ran mm-new through the AI review prompts, and this one was flagged:
>
> > diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> > index 8050e51828351..036489d3b01a9 100644
> > --- a/kernel/power/swap.c
> > +++ b/kernel/power/swap.c
>
> [ ... ]
>
> > @@ -186,6 +186,7 @@ sector_t alloc_swapdev_block(int swap)
> >
> >  void free_all_swap_pages(int swap)
> >  {
> > +     unsigned long offset;
> >       struct rb_node *node;
> >
> >       /*
> > @@ -197,8 +198,9 @@ void free_all_swap_pages(int swap)
> >
> >               ext = rb_entry(node, struct swsusp_extent, node);
> >               rb_erase(node, &swsusp_extents);
> > -             swap_free_nr(swp_entry(swap, ext->start),
> > -                          ext->end - ext->start + 1);
> > +
> > +             for (offset = ext->start; offset < ext->end; offset++)
> > +                     swap_free_hibernation_slot(swp_entry(swap, offset));
>                                            ^^^^^^^^^^^
>
> Is this loop missing the final slot at ext->end?
>
> The old code freed (ext->end - ext->start + 1) entries, which is the
> inclusive range [ext->start, ext->end]. Looking at swsusp_extents_insert(),
> when a single slot is inserted it sets ext->start = ext->end = offset, and
> merging extends the range by incrementing ext->end or decrementing
> ext->start. This means ext->end is inclusive.
>
> The new loop uses "offset < ext->end" which iterates [ext->start, ext->end),
> missing the slot at ext->end. Should this be "offset <= ext->end" instead?

Wow, nice catch. Indeed, that would be one leaked swap slot for each
hibernation snapshot release, I think. I only tested normal hibernations
and didn't realize there was an issue with the "release before use"
path. `offset <= ext->end` is the right one here.
Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Posted by Baoquan He 1 month, 3 weeks ago
On 12/20/25 at 03:43am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The current swap entry allocation/freeing workflow has never had a clear
> definition. This makes it hard to debug or add new optimizations.
> 
> This commit introduces a proper definition of how swap entries would be
> allocated and freed. Now, most operations are folio based, so they will
> never exceed one swap cluster, and we now have a cleaner border between
> swap and the rest of mm, making it much easier to follow and debug,
> especially with new added sanity checks. Also making more optimization
> possible.
> 
> Swap entry will be mostly allocated and free with a folio bound.
                                          ~~~~
                                          freed, typo
> The folio lock will be useful for resolving many swap ralated races.
> 
> Now swap allocation (except hibernation) always starts with a folio in
> the swap cache, and gets duped/freed protected by the folio lock:
> 
> - folio_alloc_swap() - The only allocation entry point now.
>   Context: The folio must be locked.
>   This allocates one or a set of continuous swap slots for a folio and
>   binds them to the folio by adding the folio to the swap cache. The
>   swap slots' swap count start with zero value.
> 
> - folio_dup_swap() - Increase the swap count of one or more entries.
>   Context: The folio must be locked and in the swap cache. For now, the
>   caller still has to lock the new swap entry owner (e.g., PTL).
>   This increases the ref count of swap entries allocated to a folio.
>   Newly allocated swap slots' count has to be increased by this helper
>   as the folio got unmapped (and swap entries got installed).
> 
> - folio_put_swap() - Decrease the swap count of one or more entries.
>   Context: The folio must be locked and in the swap cache. For now, the
>   caller still has to lock the new swap entry owner (e.g., PTL).
>   This decreases the ref count of swap entries allocated to a folio.
>   Typically, swapin will decrease the swap count as the folio got
>   installed back and the swap entry got uninstalled
> 
>   This won't remove the folio from the swap cache and free the
>   slot. Lazy freeing of swap cache is helpful for reducing IO.
>   There is already a folio_free_swap() for immediate cache reclaim.
>   This part could be further optimized later.
> 
> The above locking constraints could be further relaxed when the swap
> table if fully implemented. Currently dup still needs the caller
        ~~ s/if/is/ typo

> to lock the swap entry container (e.g. PTL), or a concurrent zap
> may underflow the swap count.
......
Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Posted by Kairui Song 1 month, 2 weeks ago
On Sat, Dec 20, 2025 at 12:02 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 12/20/25 at 03:43am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > The current swap entry allocation/freeing workflow has never had a clear
> > definition. This makes it hard to debug or add new optimizations.
> >
> > This commit introduces a proper definition of how swap entries would be
> > allocated and freed. Now, most operations are folio based, so they will
> > never exceed one swap cluster, and we now have a cleaner border between
> > swap and the rest of mm, making it much easier to follow and debug,
> > especially with new added sanity checks. Also making more optimization
> > possible.
> >
> > Swap entry will be mostly allocated and free with a folio bound.
>                                           ~~~~
>                                           freed, typo

Ack, nice catch.

> > The folio lock will be useful for resolving many swap ralated races.
> >
> > Now swap allocation (except hibernation) always starts with a folio in
> > the swap cache, and gets duped/freed protected by the folio lock:
> >
> > - folio_alloc_swap() - The only allocation entry point now.
> >   Context: The folio must be locked.
> >   This allocates one or a set of continuous swap slots for a folio and
> >   binds them to the folio by adding the folio to the swap cache. The
> >   swap slots' swap count start with zero value.
> >
> > - folio_dup_swap() - Increase the swap count of one or more entries.
> >   Context: The folio must be locked and in the swap cache. For now, the
> >   caller still has to lock the new swap entry owner (e.g., PTL).
> >   This increases the ref count of swap entries allocated to a folio.
> >   Newly allocated swap slots' count has to be increased by this helper
> >   as the folio got unmapped (and swap entries got installed).
> >
> > - folio_put_swap() - Decrease the swap count of one or more entries.
> >   Context: The folio must be locked and in the swap cache. For now, the
> >   caller still has to lock the new swap entry owner (e.g., PTL).
> >   This decreases the ref count of swap entries allocated to a folio.
> >   Typically, swapin will decrease the swap count as the folio got
> >   installed back and the swap entry got uninstalled
> >
> >   This won't remove the folio from the swap cache and free the
> >   slot. Lazy freeing of swap cache is helpful for reducing IO.
> >   There is already a folio_free_swap() for immediate cache reclaim.
> >   This part could be further optimized later.
> >
> > The above locking constraints could be further relaxed when the swap
> > table if fully implemented. Currently dup still needs the caller
>         ~~ s/if/is/ typo

Ack, Thanks!

>
> > to lock the swap entry container (e.g. PTL), or a concurrent zap
> > may underflow the swap count.
> ......
>
>
Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Posted by Kairui Song 1 month ago
On Mon, Dec 22, 2025 at 10:43 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Dec 20, 2025 at 12:02 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 12/20/25 at 03:43am, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > The current swap entry allocation/freeing workflow has never had a clear
> > > definition. This makes it hard to debug or add new optimizations.
> > >
> > > This commit introduces a proper definition of how swap entries would be
> > > allocated and freed. Now, most operations are folio based, so they will
> > > never exceed one swap cluster, and we now have a cleaner border between
> > > swap and the rest of mm, making it much easier to follow and debug,
> > > especially with new added sanity checks. Also making more optimization
> > > possible.
> > >
> > > Swap entry will be mostly allocated and free with a folio bound.
> >                                           ~~~~
> >                                           freed, typo
>
> Ack, nice catch.
>
> > > The folio lock will be useful for resolving many swap ralated races.
> > >
> > > Now swap allocation (except hibernation) always starts with a folio in
> > > the swap cache, and gets duped/freed protected by the folio lock:
> > >
> > > - folio_alloc_swap() - The only allocation entry point now.
> > >   Context: The folio must be locked.
> > >   This allocates one or a set of continuous swap slots for a folio and
> > >   binds them to the folio by adding the folio to the swap cache. The
> > >   swap slots' swap count start with zero value.
> > >
> > > - folio_dup_swap() - Increase the swap count of one or more entries.
> > >   Context: The folio must be locked and in the swap cache. For now, the
> > >   caller still has to lock the new swap entry owner (e.g., PTL).
> > >   This increases the ref count of swap entries allocated to a folio.
> > >   Newly allocated swap slots' count has to be increased by this helper
> > >   as the folio got unmapped (and swap entries got installed).
> > >
> > > - folio_put_swap() - Decrease the swap count of one or more entries.
> > >   Context: The folio must be locked and in the swap cache. For now, the
> > >   caller still has to lock the new swap entry owner (e.g., PTL).
> > >   This decreases the ref count of swap entries allocated to a folio.
> > >   Typically, swapin will decrease the swap count as the folio got
> > >   installed back and the swap entry got uninstalled
> > >
> > >   This won't remove the folio from the swap cache and free the
> > >   slot. Lazy freeing of swap cache is helpful for reducing IO.
> > >   There is already a folio_free_swap() for immediate cache reclaim.
> > >   This part could be further optimized later.
> > >
> > > The above locking constraints could be further relaxed when the swap
> > > table if fully implemented. Currently dup still needs the caller
> >         ~~ s/if/is/ typo
>
> Ack, Thanks!

Hi Andrew,

There are no other problems reported with the series so far. For the
two typos here, could you help update the commit message?