[PATCH 7/9] mm, swap: remove contention workaround for swap cache

Kairui Song posted 9 patches 1 month, 1 week ago
There is a newer version of this series
From: Kairui Song <kasong@tencent.com>

Swap cluster setup will try to shuffle the clusters on initialization.
It was helpful to avoid contention for the swap cache space. The cluster
size (2M) was much smaller than each swap cache space (64M), so shuffling
the cluster means the allocator will try to allocate swap slots that are
in different swap cache spaces for each CPU, reducing the chance of two
CPUs using the same swap cache space, and hence reducing the contention.

Now that the swap cache is managed by swap clusters, this shuffle is
pointless. Just remove it, and clean up the related macros.

This should also improve HDD swap performance, as shuffled IO is a
bad idea for HDDs, and the shuffling is now gone.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h     |  4 ----
 mm/swapfile.c | 32 ++++++++------------------------
 mm/zswap.c    |  7 +++++--
 3 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 4af42bc2cd72..ce3ec62cc05e 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -153,10 +153,6 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
 void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
 
 /* linux/mm/swap_state.c */
-/* One swap address space for each 64M swap space */
-#define SWAP_ADDRESS_SPACE_SHIFT	14
-#define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
-#define SWAP_ADDRESS_SPACE_MASK		(SWAP_ADDRESS_SPACE_PAGES - 1)
 extern struct address_space swap_space __ro_after_init;
 static inline struct address_space *swap_address_space(swp_entry_t entry)
 {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index df68b5e242a6..0c8001c99f30 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3203,21 +3203,14 @@ static int setup_swap_map(struct swap_info_struct *si,
 	return 0;
 }
 
-#define SWAP_CLUSTER_INFO_COLS						\
-	DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
-#define SWAP_CLUSTER_SPACE_COLS						\
-	DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER)
-#define SWAP_CLUSTER_COLS						\
-	max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
-
 static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 						union swap_header *swap_header,
 						unsigned long maxpages)
 {
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 	struct swap_cluster_info *cluster_info;
-	unsigned long i, j, idx;
 	int err = -ENOMEM;
+	unsigned long i;
 
 	cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
 	if (!cluster_info)
@@ -3266,22 +3259,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 		INIT_LIST_HEAD(&si->frag_clusters[i]);
 	}
 
-	/*
-	 * Reduce false cache line sharing between cluster_info and
-	 * sharing same address space.
-	 */
-	for (j = 0; j < SWAP_CLUSTER_COLS; j++) {
-		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
-			struct swap_cluster_info *ci;
-			idx = i * SWAP_CLUSTER_COLS + j;
-			ci = cluster_info + idx;
-			if (idx >= nr_clusters)
-				continue;
-			if (ci->count) {
-				ci->flags = CLUSTER_FLAG_NONFULL;
-				list_add_tail(&ci->list, &si->nonfull_clusters[0]);
-				continue;
-			}
+	for (i = 0; i < nr_clusters; i++) {
+		struct swap_cluster_info *ci = &cluster_info[i];
+
+		if (ci->count) {
+			ci->flags = CLUSTER_FLAG_NONFULL;
+			list_add_tail(&ci->list, &si->nonfull_clusters[0]);
+		} else {
 			ci->flags = CLUSTER_FLAG_FREE;
 			list_add_tail(&ci->list, &si->free_clusters);
 		}
diff --git a/mm/zswap.c b/mm/zswap.c
index c869859eec77..c0a9be14a725 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -237,10 +237,13 @@ static bool zswap_has_pool;
 * helpers and fwd declarations
 **********************************/
 
+/* One swap address space for each 64M swap space */
+#define ZSWAP_ADDRESS_SPACE_SHIFT 14
+#define ZSWAP_ADDRESS_SPACE_PAGES (1 << ZSWAP_ADDRESS_SPACE_SHIFT)
 static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 {
 	return &zswap_trees[swp_type(swp)][swp_offset(swp)
-		>> SWAP_ADDRESS_SPACE_SHIFT];
+		>> ZSWAP_ADDRESS_SPACE_SHIFT];
 }
 
 #define zswap_pool_debug(msg, p)				\
@@ -1771,7 +1774,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
 	struct xarray *trees, *tree;
 	unsigned int nr, i;
 
-	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
+	nr = DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES);
 	trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
 	if (!trees) {
 		pr_err("alloc failed, zswap disabled for swap type %d\n", type);
-- 
2.51.0
Re: [PATCH 7/9] mm, swap: remove contention workaround for swap cache
Posted by Barry Song 1 month ago
On Sat, Aug 23, 2025 at 3:21 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Swap cluster setup will try to shuffle the clusters on initialization.
> It was helpful to avoid contention for the swap cache space. The cluster
> size (2M) was much smaller than each swap cache space (64M), so shuffling
> the cluster means the allocator will try to allocate swap slots that are
> in different swap cache spaces for each CPU, reducing the chance of two
> CPUs using the same swap cache space, and hence reducing the contention.
>
> Now, swap cache is managed by swap clusters, this shuffle is pointless.
> Just remove it, and clean up related macros.
>
> This should also improve the HDD swap performance as shuffling IO is a
> bad idea for HDD, and now the shuffling is gone.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

Reviewed-by: Barry Song <baohua@kernel.org>
Re: [PATCH 7/9] mm, swap: remove contention workaround for swap cache
Posted by Chris Li 1 month ago
Hi Kairui,

It feels so good to remove that 64M swap cache space. Thank you for
making it happen.

Some nitpicks follow. I am fine with it as is as well.

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Swap cluster setup will try to shuffle the clusters on initialization.
> It was helpful to avoid contention for the swap cache space. The cluster
> size (2M) was much smaller than each swap cache space (64M), so shuffling
> the cluster means the allocator will try to allocate swap slots that are
> in different swap cache spaces for each CPU, reducing the chance of two
> CPUs using the same swap cache space, and hence reducing the contention.
>
> Now, swap cache is managed by swap clusters, this shuffle is pointless.
> Just remove it, and clean up related macros.
>
> This should also improve the HDD swap performance as shuffling IO is a
> bad idea for HDD, and now the shuffling is gone.

Did you have any numbers to prove that :-)? Last time, the swap
allocator stress testing destroyed two of my SAS drives dedicated to
testing, so I am not very keen on running the HDD swap stress test.
The HDD swap stress tests are super slow to run; they take ages.

>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap.h     |  4 ----
>  mm/swapfile.c | 32 ++++++++------------------------
>  mm/zswap.c    |  7 +++++--
>  3 files changed, 13 insertions(+), 30 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index 4af42bc2cd72..ce3ec62cc05e 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -153,10 +153,6 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
>  void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
>
>  /* linux/mm/swap_state.c */
> -/* One swap address space for each 64M swap space */
> -#define SWAP_ADDRESS_SPACE_SHIFT       14
> -#define SWAP_ADDRESS_SPACE_PAGES       (1 << SWAP_ADDRESS_SPACE_SHIFT)
> -#define SWAP_ADDRESS_SPACE_MASK                (SWAP_ADDRESS_SPACE_PAGES - 1)
>  extern struct address_space swap_space __ro_after_init;
>  static inline struct address_space *swap_address_space(swp_entry_t entry)
>  {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index df68b5e242a6..0c8001c99f30 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3203,21 +3203,14 @@ static int setup_swap_map(struct swap_info_struct *si,
>         return 0;
>  }
>
> -#define SWAP_CLUSTER_INFO_COLS                                         \
> -       DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
> -#define SWAP_CLUSTER_SPACE_COLS                                                \
> -       DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER)
> -#define SWAP_CLUSTER_COLS                                              \
> -       max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
> -
>  static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>                                                 union swap_header *swap_header,
>                                                 unsigned long maxpages)
>  {
>         unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>         struct swap_cluster_info *cluster_info;
> -       unsigned long i, j, idx;
>         int err = -ENOMEM;
> +       unsigned long i;

Nitpick: This line location change is not necessary.

>
>         cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
>         if (!cluster_info)
> @@ -3266,22 +3259,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>                 INIT_LIST_HEAD(&si->frag_clusters[i]);
>         }
>
> -       /*
> -        * Reduce false cache line sharing between cluster_info and
> -        * sharing same address space.
> -        */
> -       for (j = 0; j < SWAP_CLUSTER_COLS; j++) {
> -               for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> -                       struct swap_cluster_info *ci;
> -                       idx = i * SWAP_CLUSTER_COLS + j;
> -                       ci = cluster_info + idx;
> -                       if (idx >= nr_clusters)
> -                               continue;
> -                       if (ci->count) {
> -                               ci->flags = CLUSTER_FLAG_NONFULL;
> -                               list_add_tail(&ci->list, &si->nonfull_clusters[0]);
> -                               continue;
> -                       }
> +       for (i = 0; i < nr_clusters; i++) {
> +               struct swap_cluster_info *ci = &cluster_info[i];

struct swap_cluster_info *ci = cluster_info + i;
looks simpler. Pure nitpick and personal preference; you don't have to
follow it.

> +
> +               if (ci->count) {
> +                       ci->flags = CLUSTER_FLAG_NONFULL;
> +                       list_add_tail(&ci->list, &si->nonfull_clusters[0]);
> +               } else {
>                         ci->flags = CLUSTER_FLAG_FREE;
>                         list_add_tail(&ci->list, &si->free_clusters);
>                 }
> diff --git a/mm/zswap.c b/mm/zswap.c
> index c869859eec77..c0a9be14a725 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -237,10 +237,13 @@ static bool zswap_has_pool;
>  * helpers and fwd declarations
>  **********************************/
>
> +/* One swap address space for each 64M swap space */
> +#define ZSWAP_ADDRESS_SPACE_SHIFT 14
> +#define ZSWAP_ADDRESS_SPACE_PAGES (1 << ZSWAP_ADDRESS_SPACE_SHIFT)
>  static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>  {
>         return &zswap_trees[swp_type(swp)][swp_offset(swp)
> -               >> SWAP_ADDRESS_SPACE_SHIFT];
> +               >> ZSWAP_ADDRESS_SPACE_SHIFT];
>  }
>
>  #define zswap_pool_debug(msg, p)                               \
> @@ -1771,7 +1774,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
>         struct xarray *trees, *tree;
>         unsigned int nr, i;
>
> -       nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
> +       nr = DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES);
>         trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
>         if (!trees) {
>                 pr_err("alloc failed, zswap disabled for swap type %d\n", type);
> --
> 2.51.0
>
Re: [PATCH 7/9] mm, swap: remove contention workaround for swap cache
Posted by Kairui Song 1 month ago
On Sat, Aug 30, 2025 at 1:03 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> It feels so good to remove that 64M swap cache space. Thank you for
> making it happen.
>
> Some nitpick follows. I am fine as is as well.
>
> Acked-by: Chris Li <chrisl@kernel.org>

Thanks.

>
> Chris
>
> On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Swap cluster setup will try to shuffle the clusters on initialization.
> > It was helpful to avoid contention for the swap cache space. The cluster
> > size (2M) was much smaller than each swap cache space (64M), so shuffling
> > the cluster means the allocator will try to allocate swap slots that are
> > in different swap cache spaces for each CPU, reducing the chance of two
> > CPUs using the same swap cache space, and hence reducing the contention.
> >
> > Now, swap cache is managed by swap clusters, this shuffle is pointless.
> > Just remove it, and clean up related macros.
> >
> > This should also improve the HDD swap performance as shuffling IO is a
> > bad idea for HDD, and now the shuffling is gone.
>
> Did you have any numbers to prove that :-). Last time the swap
> allocator stress testing has already destroyed two of my SAS drives
> dedicated for testing. So I am not very keen on running the HDD swap
> stress test. The HDD swap stress test are super slow to run, it takes
> ages.

I did some tests months ago, and removing the cluster shuffle did help.
I didn't test it again this time, only ran some stress tests. Doing
performance tests on HDD is really not a good experience, as my HDD
drives are too old and a long-running test kills them easily.

And I couldn't find any other factor that could be causing a serial HDD
IO regression; maybe the bot can help verify. If this doesn't help,
we'll think of something else. But I don't think HDD-based swap will
ever have practically good performance, as HDDs are terrible at random
reads...

Anyway, let me try again with an HDD today; maybe I'll get some useful data.

>
> >
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swap.h     |  4 ----
> >  mm/swapfile.c | 32 ++++++++------------------------
> >  mm/zswap.c    |  7 +++++--
> >  3 files changed, 13 insertions(+), 30 deletions(-)
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 4af42bc2cd72..ce3ec62cc05e 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -153,10 +153,6 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
> >  void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
> >
> >  /* linux/mm/swap_state.c */
> > -/* One swap address space for each 64M swap space */
> > -#define SWAP_ADDRESS_SPACE_SHIFT       14
> > -#define SWAP_ADDRESS_SPACE_PAGES       (1 << SWAP_ADDRESS_SPACE_SHIFT)
> > -#define SWAP_ADDRESS_SPACE_MASK                (SWAP_ADDRESS_SPACE_PAGES - 1)
> >  extern struct address_space swap_space __ro_after_init;
> >  static inline struct address_space *swap_address_space(swp_entry_t entry)
> >  {
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index df68b5e242a6..0c8001c99f30 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -3203,21 +3203,14 @@ static int setup_swap_map(struct swap_info_struct *si,
> >         return 0;
> >  }
> >
> > -#define SWAP_CLUSTER_INFO_COLS                                         \
> > -       DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
> > -#define SWAP_CLUSTER_SPACE_COLS                                                \
> > -       DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER)
> > -#define SWAP_CLUSTER_COLS                                              \
> > -       max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
> > -
> >  static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> >                                                 union swap_header *swap_header,
> >                                                 unsigned long maxpages)
> >  {
> >         unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
> >         struct swap_cluster_info *cluster_info;
> > -       unsigned long i, j, idx;
> >         int err = -ENOMEM;
> > +       unsigned long i;
>
> Nitpick: This line location change is not necessary.
>
> >
> >         cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
> >         if (!cluster_info)
> > @@ -3266,22 +3259,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> >                 INIT_LIST_HEAD(&si->frag_clusters[i]);
> >         }
> >
> > -       /*
> > -        * Reduce false cache line sharing between cluster_info and
> > -        * sharing same address space.
> > -        */
> > -       for (j = 0; j < SWAP_CLUSTER_COLS; j++) {
> > -               for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> > -                       struct swap_cluster_info *ci;
> > -                       idx = i * SWAP_CLUSTER_COLS + j;
> > -                       ci = cluster_info + idx;
> > -                       if (idx >= nr_clusters)
> > -                               continue;
> > -                       if (ci->count) {
> > -                               ci->flags = CLUSTER_FLAG_NONFULL;
> > -                               list_add_tail(&ci->list, &si->nonfull_clusters[0]);
> > -                               continue;
> > -                       }
> > +       for (i = 0; i < nr_clusters; i++) {
> > +               struct swap_cluster_info *ci = &cluster_info[i];
>
> struct swap_cluster_info *ci = cluster_info + i;
> looks simpler. Pure nitpick and personal preference, you don't have to
> follow it.
>
> > +
> > +               if (ci->count) {
> > +                       ci->flags = CLUSTER_FLAG_NONFULL;
> > +                       list_add_tail(&ci->list, &si->nonfull_clusters[0]);
> > +               } else {
> >                         ci->flags = CLUSTER_FLAG_FREE;
> >                         list_add_tail(&ci->list, &si->free_clusters);
> >                 }
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index c869859eec77..c0a9be14a725 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -237,10 +237,13 @@ static bool zswap_has_pool;
> >  * helpers and fwd declarations
> >  **********************************/
> >
> > +/* One swap address space for each 64M swap space */
> > +#define ZSWAP_ADDRESS_SPACE_SHIFT 14
> > +#define ZSWAP_ADDRESS_SPACE_PAGES (1 << ZSWAP_ADDRESS_SPACE_SHIFT)
> >  static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> >  {
> >         return &zswap_trees[swp_type(swp)][swp_offset(swp)
> > -               >> SWAP_ADDRESS_SPACE_SHIFT];
> > +               >> ZSWAP_ADDRESS_SPACE_SHIFT];
> >  }
> >
> >  #define zswap_pool_debug(msg, p)                               \
> > @@ -1771,7 +1774,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
> >         struct xarray *trees, *tree;
> >         unsigned int nr, i;
> >
> > -       nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
> > +       nr = DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES);
> >         trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
> >         if (!trees) {
> >                 pr_err("alloc failed, zswap disabled for swap type %d\n", type);
> > --
> > 2.51.0
> >
>
Re: [PATCH 7/9] mm, swap: remove contention workaround for swap cache
Posted by Chris Li 1 month ago
On Sat, Aug 30, 2025 at 8:25 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Aug 30, 2025 at 1:03 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Kairui,
> >
> > It feels so good to remove that 64M swap cache space. Thank you for
> > making it happen.
> >
> > Some nitpick follows. I am fine as is as well.
> >
> > Acked-by: Chris Li <chrisl@kernel.org>
>
> Thanks.
>
> >
> > Chris
> >
> > On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Swap cluster setup will try to shuffle the clusters on initialization.
> > > It was helpful to avoid contention for the swap cache space. The cluster
> > > size (2M) was much smaller than each swap cache space (64M), so shuffling
> > > the cluster means the allocator will try to allocate swap slots that are
> > > in different swap cache spaces for each CPU, reducing the chance of two
> > > CPUs using the same swap cache space, and hence reducing the contention.
> > >
> > > Now, swap cache is managed by swap clusters, this shuffle is pointless.
> > > Just remove it, and clean up related macros.
> > >
> > > This should also improve the HDD swap performance as shuffling IO is a
> > > bad idea for HDD, and now the shuffling is gone.
> >
> > Did you have any numbers to prove that :-). Last time the swap
> > allocator stress testing has already destroyed two of my SAS drives
> > dedicated for testing. So I am not very keen on running the HDD swap
> > stress test. The HDD swap stress test are super slow to run, it takes
> > ages.
>
> I did some test months before, removing the cluster shuffle did help.
> I didn't test it again this time, only did some stress test. Doing
> performance test on HDD is really not a good experience as my HDD
> drives are too old so a long running test kills them easily.
>
> And I couldn't find any other factor that is causing a serial HDD IO
> regression, maybe the bot can help verify. If this doesn't help, we'll
> think of something else. But I don't think HDD based SWAP will ever
> have a practical good performance as they are terrible at rand read...
>
> Anyway, let me try again with HDD today, maybe I'll get some useful data.

I am pulling your leg about the HDD number :-) I know that number is
hard to get, and I'm counting on you having a dead HDD on the way to
getting it. Evil Chris, that is.

I think we don't have to claim that HDD is faster. Just say HDD might
or might not swap faster, because we haven't really tested it. People
who care about HDD swap can test it themselves and report back to the
mailing list.
I think we should use an SSD to simulate an HDD: if the HDD code works
in the simulated mode with no crashes and no data corruption, that is
good enough.

HDD swap is so slow that most people don't really care. I asked at LPC
a while back, and nobody there was using HDD swap seriously. At most,
it is a just-in-case: if my SSD swap overflows, allow it to overflow
to HDD.

Chris
Re: [PATCH 7/9] mm, swap: remove contention workaround for swap cache
Posted by Kairui Song 1 month ago
On Sat, Aug 30, 2025 at 11:24 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Aug 30, 2025 at 1:03 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Kairui,
> >
> > It feels so good to remove that 64M swap cache space. Thank you for
> > making it happen.
> >
> > Some nitpick follows. I am fine as is as well.
> >
> > Acked-by: Chris Li <chrisl@kernel.org>
>
> Thanks.
>
> >
> > Chris
> >
> > On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Swap cluster setup will try to shuffle the clusters on initialization.
> > > It was helpful to avoid contention for the swap cache space. The cluster
> > > size (2M) was much smaller than each swap cache space (64M), so shuffling
> > > the cluster means the allocator will try to allocate swap slots that are
> > > in different swap cache spaces for each CPU, reducing the chance of two
> > > CPUs using the same swap cache space, and hence reducing the contention.
> > >
> > > Now, swap cache is managed by swap clusters, this shuffle is pointless.
> > > Just remove it, and clean up related macros.
> > >
> > > This should also improve the HDD swap performance as shuffling IO is a
> > > bad idea for HDD, and now the shuffling is gone.
> >
> > Did you have any numbers to prove that :-). Last time the swap
> > allocator stress testing has already destroyed two of my SAS drives
> > dedicated for testing. So I am not very keen on running the HDD swap
> > stress test. The HDD swap stress test are super slow to run, it takes
> > ages.
>
> I did some test months before, removing the cluster shuffle did help.
> I didn't test it again this time, only did some stress test. Doing
> performance test on HDD is really not a good experience as my HDD
> drives are too old so a long running test kills them easily.
>
> And I couldn't find any other factor that is causing a serial HDD IO
> regression, maybe the bot can help verify. If this doesn't help, we'll
> think of something else. But I don't think HDD based SWAP will ever
> have a practical good performance as they are terrible at rand read...
>
> Anyway, let me try again with HDD today, maybe I'll get some useful data.

So I ran some HDD tests for many rounds, basically doing the test from
the URLs below manually. The test is done with nr_task = 8, and the
HDD swap partition size is 8G.

Do the preparation following:
https://github.com/intel/lkp-tests/blob/master/setup/swapin_setup
(Make usemem hold 8G memory and push them to swap)

And do the test with:
https://github.com/intel/lkp-tests/blob/master/programs/swapin/run
(Use SIGUSR1 to make usemem to read its memory and swapin)

Before this patch:
Test run 1:
1073741824 bytes / 878662493 usecs = 1193 KB/s
33019 usecs to free memory
1073741824 bytes / 891315681 usecs = 1176 KB/s
35144 usecs to free memory
1073741824 bytes / 898801090 usecs = 1166 KB/s
36305 usecs to free memory
1073741824 bytes / 925899753 usecs = 1132 KB/s
20498 usecs to free memory
1073741824 bytes / 927522592 usecs = 1130 KB/s
34397 usecs to free memory
1073741824 bytes / 928164994 usecs = 1129 KB/s
35908 usecs to free memory
1073741824 bytes / 929890294 usecs = 1127 KB/s
35014 usecs to free memory
1073741824 bytes / 929997808 usecs = 1127 KB/s
30491 usecs to free memory
test done

Test run 2:
1073741824 bytes / 771932432 usecs = 1358 KB/s
31194 usecs to free memory
1073741824 bytes / 788739551 usecs = 1329 KB/s
25714 usecs to free memory
1073741824 bytes / 795853979 usecs = 1317 KB/s
33809 usecs to free memory
1073741824 bytes / 798019211 usecs = 1313 KB/s
32019 usecs to free memory
1073741824 bytes / 798771141 usecs = 1312 KB/s
31689 usecs to free memory
1073741824 bytes / 800384757 usecs = 1310 KB/s
32622 usecs to free memory
1073741824 bytes / 800822764 usecs = 1309 KB/s
1073741824 bytes / 800882227 usecs = 1309 KB/s
32789 usecs to free memory
30577 usecs to free memory
test done

Test run 3:
1073741824 bytes / 775202370 usecs = 1352 KB/s
31832 usecs to free memory
1073741824 bytes / 777618372 usecs = 1348 KB/s
30172 usecs to free memory
1073741824 bytes / 778180006 usecs = 1347 KB/s
32482 usecs to free memory
1073741824 bytes / 778521023 usecs = 1346 KB/s
30188 usecs to free memory
1073741824 bytes / 779207791 usecs = 1345 KB/s
29364 usecs to free memory
1073741824 bytes / 780753200 usecs = 1343 KB/s
29860 usecs to free memory
1073741824 bytes / 781078362 usecs = 1342 KB/s
30449 usecs to free memory
1073741824 bytes / 781224993 usecs = 1342 KB/s
19557 usecs to free memory
test done


After this patch:
Test run 1:
1073741824 bytes / 569803736 usecs = 1840 KB/s
29032 usecs to free memory
1073741824 bytes / 573718349 usecs = 1827 KB/s
30399 usecs to free memory
1073741824 bytes / 592070142 usecs = 1771 KB/s
31896 usecs to free memory
1073741824 bytes / 593484694 usecs = 1766 KB/s
30650 usecs to free memory
1073741824 bytes / 596693866 usecs = 1757 KB/s
31582 usecs to free memory
1073741824 bytes / 597359263 usecs = 1755 KB/s
26436 usecs to free memory
1073741824 bytes / 598339187 usecs = 1752 KB/s
30697 usecs to free memory
1073741824 bytes / 598674138 usecs = 1751 KB/s
29791 usecs to free memory
test done

Test run 2:
1073741824 bytes / 578821803 usecs = 1811 KB/s
28433 usecs to free memory
1073741824 bytes / 584262760 usecs = 1794 KB/s
28565 usecs to free memory
1073741824 bytes / 586118970 usecs = 1789 KB/s
27365 usecs to free memory
1073741824 bytes / 589159154 usecs = 1779 KB/s
42645 usecs to free memory
1073741824 bytes / 593487980 usecs = 1766 KB/s
28684 usecs to free memory
1073741824 bytes / 606025290 usecs = 1730 KB/s
28974 usecs to free memory
1073741824 bytes / 607547362 usecs = 1725 KB/s
33221 usecs to free memory
1073741824 bytes / 607882511 usecs = 1724 KB/s
31393 usecs to free memory
test done

Test run 3:
1073741824 bytes / 487637856 usecs = 2150 KB/s
28022 usecs to free memory
1073741824 bytes / 491211037 usecs = 2134 KB/s
28229 usecs to free memory
1073741824 bytes / 527698561 usecs = 1987 KB/s
30265 usecs to free memory
1073741824 bytes / 531719920 usecs = 1972 KB/s
30373 usecs to free memory
1073741824 bytes / 532555758 usecs = 1968 KB/s
30019 usecs to free memory
1073741824 bytes / 532942789 usecs = 1967 KB/s
29354 usecs to free memory
1073741824 bytes / 540793872 usecs = 1938 KB/s
32703 usecs to free memory
1073741824 bytes / 541343777 usecs = 1936 KB/s
33428 usecs to free memory
test done

This seems to match the ~33% swapin.throughput regression reported by
the bot; it's about 40% faster with this patch applied. I'll add this
test result to V2.
Re: [PATCH 7/9] mm, swap: remove contention workaround for swap cache
Posted by Chris Li 1 month ago
On Sun, Aug 31, 2025 at 8:55 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> So I tried to run some HDD test for many rounds, basically doing the
> test in the URL below manually. Test is done using nr_task = 8. The
> HDD swap partition size is 8G.
>
> Do the preparation following:
> https://github.com/intel/lkp-tests/blob/master/setup/swapin_setup
> (Make usemem hold 8G memory and push them to swap)
>
> And do the test with:
> https://github.com/intel/lkp-tests/blob/master/programs/swapin/run
> (Use SIGUSR1 to make usemem to read its memory and swapin)
>
> Before this patch:
> Test run 1:
> 1073741824 bytes / 878662493 usecs = 1193 KB/s
> 33019 usecs to free memory
> 1073741824 bytes / 891315681 usecs = 1176 KB/s
> 35144 usecs to free memory
> 1073741824 bytes / 898801090 usecs = 1166 KB/s
> 36305 usecs to free memory
> 1073741824 bytes / 925899753 usecs = 1132 KB/s
> 20498 usecs to free memory
> 1073741824 bytes / 927522592 usecs = 1130 KB/s
> 34397 usecs to free memory
> 1073741824 bytes / 928164994 usecs = 1129 KB/s
> 35908 usecs to free memory
> 1073741824 bytes / 929890294 usecs = 1127 KB/s
> 35014 usecs to free memory
> 1073741824 bytes / 929997808 usecs = 1127 KB/s
> 30491 usecs to free memory
> test done
>
> Test run 2:
> 1073741824 bytes / 771932432 usecs = 1358 KB/s
> 31194 usecs to free memory
> 1073741824 bytes / 788739551 usecs = 1329 KB/s
> 25714 usecs to free memory
> 1073741824 bytes / 795853979 usecs = 1317 KB/s
> 33809 usecs to free memory
> 1073741824 bytes / 798019211 usecs = 1313 KB/s
> 32019 usecs to free memory
> 1073741824 bytes / 798771141 usecs = 1312 KB/s
> 31689 usecs to free memory
> 1073741824 bytes / 800384757 usecs = 1310 KB/s
> 32622 usecs to free memory
> 1073741824 bytes / 800822764 usecs = 1309 KB/s
> 1073741824 bytes / 800882227 usecs = 1309 KB/s
> 32789 usecs to free memory
> 30577 usecs to free memory
> test done
>
> Test run 3:
> 1073741824 bytes / 775202370 usecs = 1352 KB/s
> 31832 usecs to free memory
> 1073741824 bytes / 777618372 usecs = 1348 KB/s
> 30172 usecs to free memory
> 1073741824 bytes / 778180006 usecs = 1347 KB/s
> 32482 usecs to free memory
> 1073741824 bytes / 778521023 usecs = 1346 KB/s
> 30188 usecs to free memory
> 1073741824 bytes / 779207791 usecs = 1345 KB/s
> 29364 usecs to free memory
> 1073741824 bytes / 780753200 usecs = 1343 KB/s
> 29860 usecs to free memory
> 1073741824 bytes / 781078362 usecs = 1342 KB/s
> 30449 usecs to free memory
> 1073741824 bytes / 781224993 usecs = 1342 KB/s
> 19557 usecs to free memory
> test done
>
>
> After this patch:
> Test run 1:
> 1073741824 bytes / 569803736 usecs = 1840 KB/s
> 29032 usecs to free memory
> 1073741824 bytes / 573718349 usecs = 1827 KB/s
> 30399 usecs to free memory
> 1073741824 bytes / 592070142 usecs = 1771 KB/s
> 31896 usecs to free memory
> 1073741824 bytes / 593484694 usecs = 1766 KB/s
> 30650 usecs to free memory
> 1073741824 bytes / 596693866 usecs = 1757 KB/s
> 31582 usecs to free memory
> 1073741824 bytes / 597359263 usecs = 1755 KB/s
> 26436 usecs to free memory
> 1073741824 bytes / 598339187 usecs = 1752 KB/s
> 30697 usecs to free memory
> 1073741824 bytes / 598674138 usecs = 1751 KB/s
> 29791 usecs to free memory
> test done
>
> Test run 2:
> 1073741824 bytes / 578821803 usecs = 1811 KB/s
> 28433 usecs to free memory
> 1073741824 bytes / 584262760 usecs = 1794 KB/s
> 28565 usecs to free memory
> 1073741824 bytes / 586118970 usecs = 1789 KB/s
> 27365 usecs to free memory
> 1073741824 bytes / 589159154 usecs = 1779 KB/s
> 42645 usecs to free memory
> 1073741824 bytes / 593487980 usecs = 1766 KB/s
> 28684 usecs to free memory
> 1073741824 bytes / 606025290 usecs = 1730 KB/s
> 28974 usecs to free memory
> 1073741824 bytes / 607547362 usecs = 1725 KB/s
> 33221 usecs to free memory
> 1073741824 bytes / 607882511 usecs = 1724 KB/s
> 31393 usecs to free memory
> test done
>
> Test run 3:
> 1073741824 bytes / 487637856 usecs = 2150 KB/s
> 28022 usecs to free memory
> 1073741824 bytes / 491211037 usecs = 2134 KB/s
> 28229 usecs to free memory
> 1073741824 bytes / 527698561 usecs = 1987 KB/s
> 30265 usecs to free memory
> 1073741824 bytes / 531719920 usecs = 1972 KB/s
> 30373 usecs to free memory
> 1073741824 bytes / 532555758 usecs = 1968 KB/s
> 30019 usecs to free memory
> 1073741824 bytes / 532942789 usecs = 1967 KB/s
> 29354 usecs to free memory
> 1073741824 bytes / 540793872 usecs = 1938 KB/s
> 32703 usecs to free memory
> 1073741824 bytes / 541343777 usecs = 1936 KB/s
> 33428 usecs to free memory
> test done
>
> It seems to match the ~33% swapin.throughput regression reported by
> the bot, it's about ~40% faster with this patch applied. I'll add this
> test result to V2.

Oh, wow, you do have the HDD numbers, congrats. Now we can make the
claim with numbers.
I hope you did not cripple an HDD to get them.

Thanks.

Chris