[v4] mm: swap: mTHP swap allocator base on swap cluster order

[PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Chris Li 1 year, 7 months ago

Track nonfull clusters on lists, in addition to the empty
(free) clusters. Each order has its own nonfull cluster list.

A cluster remembers the order it was assigned when it was
allocated as a new cluster.

When a cluster has a free entry, add it to the nonfull[order]
list.  When the free cluster list is empty, also allocate
from the nonfull list of that order.

This improves the mTHP swap allocation success rate.

There are limitations if the distribution of mTHP orders
changes a lot over time. For example, many nonfull clusters
may be assigned to order A, and later there may be many
order B allocations but very few order A allocations.
Currently a cluster used by order A will not be reused by
order B unless the cluster is 100% empty.

Signed-off-by: Chris Li <chrisl@kernel.org>
---
 include/linux/swap.h |  4 ++++
 mm/swapfile.c        | 34 +++++++++++++++++++++++++++++++---
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index e9be95468fc7..db8d6000c116 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,9 +254,11 @@ struct swap_cluster_info {
 				 */
 	u16 count;
 	u8 flags;
+	u8 order;
 	struct list_head list;
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
+#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
 
 
 /*
@@ -296,6 +298,8 @@ struct swap_info_struct {
 	unsigned long *zeromap;		/* vmalloc'ed bitmap to track zero pages */
 	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
 	struct list_head free_clusters; /* free clusters list */
+	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
+					/* list of cluster that contains at least one free slot */
 	unsigned int lowest_bit;	/* index of first free in swap_map */
 	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f70d25005d2c..e13a33664cfa 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
 
-	list_add_tail(&ci->list, &si->discard_clusters);
+	if (ci->flags)
+		list_move_tail(&ci->list, &si->discard_clusters);
+	else
+		list_add_tail(&ci->list, &si->discard_clusters);
+	ci->flags = 0;
 	schedule_work(&si->discard_work);
 }
 
 static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
+	if (ci->flags & CLUSTER_FLAG_NONFULL)
+		list_move_tail(&ci->list, &si->free_clusters);
+	else
+		list_add_tail(&ci->list, &si->free_clusters);
 	ci->flags = CLUSTER_FLAG_FREE;
-	list_add_tail(&ci->list, &si->free_clusters);
 }
 
 /*
@@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
 	ci->count--;
 
 	if (!ci->count)
-		free_cluster(p, ci);
+		return free_cluster(p, ci);
+
+	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
+		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+		ci->flags |= CLUSTER_FLAG_NONFULL;
+	}
 }
 
 /*
@@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	if (tmp == SWAP_NEXT_INVALID) {
 		if (!list_empty(&si->free_clusters)) {
 			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
+			list_del(&ci->list);
+			spin_lock(&ci->lock);
+			ci->order = order;
+			ci->flags = 0;
+			spin_unlock(&ci->lock);
+			tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
+		} else if (!list_empty(&si->nonfull_clusters[order])) {
+			ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
+			list_del(&ci->list);
+			spin_lock(&ci->lock);
+			ci->flags = 0;
+			spin_unlock(&ci->lock);
 			tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
 		} else if (!list_empty(&si->discard_clusters)) {
 			/*
@@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
 	ci = lock_cluster(si, offset);
 	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
 	ci->count = 0;
+	ci->order = 0;
 	ci->flags = 0;
 	free_cluster(si, ci);
 	unlock_cluster(ci);
@@ -2919,6 +2944,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	INIT_LIST_HEAD(&p->free_clusters);
 	INIT_LIST_HEAD(&p->discard_clusters);
 
+	for (i = 0; i < SWAP_NR_ORDERS; i++)
+		INIT_LIST_HEAD(&p->nonfull_clusters[i]);
+
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 		if (page_nr == 0 || page_nr > swap_header->info.last_page)

-- 
2.45.2.803.g4e1b14247a-goog

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Ryan Roberts 1 year, 6 months ago

On 11/07/2024 08:29, Chris Li wrote:
> Track the nonfull cluster as well as the empty cluster
> on lists. Each order has one nonfull cluster list.
> 
> The cluster will remember which order it was used during
> new cluster allocation.
> 
> When the cluster has free entry, add to the nonfull[order]
> list.  When the free cluster list is empty, also allocate
> from the nonempty list of that order.
> 
> This improves the mTHP swap allocation success rate.
> 
> There are limitations if the distribution of numbers of
> different orders of mTHP changes a lot. e.g. there are a lot
> of nonfull cluster assign to order A while later time there
> are a lot of order B allocation while very little allocation
> in order A. Currently the cluster used by order A will not
> reused by order B unless the cluster is 100% empty.
> 
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
>  include/linux/swap.h |  4 ++++
>  mm/swapfile.c        | 34 +++++++++++++++++++++++++++++++---
>  2 files changed, 35 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e9be95468fc7..db8d6000c116 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,9 +254,11 @@ struct swap_cluster_info {
>  				 */
>  	u16 count;
>  	u8 flags;
> +	u8 order;
>  	struct list_head list;
>  };
>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>  
>  
>  /*
> @@ -296,6 +298,8 @@ struct swap_info_struct {
>  	unsigned long *zeromap;		/* vmalloc'ed bitmap to track zero pages */
>  	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>  	struct list_head free_clusters; /* free clusters list */
> +	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> +					/* list of cluster that contains at least one free slot */
>  	unsigned int lowest_bit;	/* index of first free in swap_map */
>  	unsigned int highest_bit;	/* index of last free in swap_map */
>  	unsigned int pages;		/* total of usable pages of swap */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f70d25005d2c..e13a33664cfa 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>  	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>  			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>  
> -	list_add_tail(&ci->list, &si->discard_clusters);
> +	if (ci->flags)

I'm not sure this is future proof; what happens if a flag is added in future
that does not indicate that the cluster is on a list? Perhaps explicitly check
CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
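
To illustrate the concern, here is a minimal userspace sketch (not kernel code).
The two flag values are taken from the patch; CLUSTER_FLAG_HYPOTHETICAL and both
helper names are assumptions made up for this example. It only shows how a bare
`if (ci->flags)` test and an explicit bit test would diverge once a flag that
says nothing about list membership exists.

```c
#include <assert.h>
#include <stdbool.h>

#define CLUSTER_FLAG_FREE         1 /* from the patch */
#define CLUSTER_FLAG_NONFULL      2 /* from the patch */
#define CLUSTER_FLAG_HYPOTHETICAL 4 /* assumption: a future flag unrelated to lists */

/* Mirrors `if (ci->flags)`: any set bit is taken to mean "on a list". */
static bool on_list_any_flag(unsigned char flags)
{
	return flags != 0;
}

/* Mirrors `if (ci->flags & CLUSTER_FLAG_NONFULL)`: only the nonfull bit counts. */
static bool on_list_explicit(unsigned char flags)
{
	return (flags & CLUSTER_FLAG_NONFULL) != 0;
}
```

With only the hypothetical flag set, the first helper would pick
list_move_tail() for a cluster that is not on any list, while the second would
still pick list_add_tail().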

> +		list_move_tail(&ci->list, &si->discard_clusters);
> +	else
> +		list_add_tail(&ci->list, &si->discard_clusters);
> +	ci->flags = 0;

Bug: (I think?) the cluster ends up on the discard_clusters list and
swap_do_scheduled_discard() calls __free_cluster() which will then call
list_add_tail() to put it on the free_clusters list. But since it is on the
discard_list at that point, shouldn't it call list_move_tail()?

>  	schedule_work(&si->discard_work);
>  }
>  
>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>  {
> +	if (ci->flags & CLUSTER_FLAG_NONFULL)
> +		list_move_tail(&ci->list, &si->free_clusters);
> +	else
> +		list_add_tail(&ci->list, &si->free_clusters);
>  	ci->flags = CLUSTER_FLAG_FREE;
> -	list_add_tail(&ci->list, &si->free_clusters);
>  }
>  
>  /*
> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>  	ci->count--;
>  
>  	if (!ci->count)
> -		free_cluster(p, ci);
> +		return free_cluster(p, ci);

nit: I'm not sure what the kernel style guide says about this, but I'm not a
huge fan of returning void. I'd find it clearer if you just turn the below `if`
into an `else if`.

> +
> +	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> +		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);

I find the transitions when you add and remove a cluster from the
nonfull_clusters list a bit strange (if I've understood correctly): It is added
to the list whenever there is at least one free swap entry if not already on the
list. But you take it off the list when assigning it as the current cluster for
a cpu in scan_swap_map_try_ssd_cluster().

So you could have this situation:

  - cpuA allocs cluster from free list (exclusive to that cpu)
  - cpuA allocs 1 swap entry from current cluster
  - swap entry is freed; cluster added to nonfull_clusters
  - cpuB "allocs" cluster from nonfull_clusters

At this point both cpuA and cpuB share the same cluster as their current
cluster. So why not just put the cluster on the nonfull_clusters list at
allocation time (when removed from free_list) and only remove it from the
nonfull_clusters list when it is completely full (or at least definitely doesn't
have room for an `order` allocation)? Then you allow "stealing" always instead
of just sometimes. You would likely want to move the cluster to the end of the
nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
chances of multiple CPUs using the same cluster.

Another potential optimization (which was in my hacked version IIRC) is to only
add/remove from nonfull list when `total - count` crosses the (1 << order)
boundary rather than when becoming completely full. You definitely won't be able
to allocate order-2 if there are only 3 pages available, for example.
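
A minimal userspace sketch of that boundary check, under stated assumptions:
CLUSTER_SLOTS stands in for SWAPFILE_CLUSTER (512 is assumed here), and
cluster_may_fit() is a hypothetical helper, not something in the patch. The
check is only a necessary condition; the free slots would still have to be
contiguous and suitably aligned for the allocation to succeed.

```c
#include <assert.h>
#include <stdbool.h>

#define CLUSTER_SLOTS 512u /* assumption: stands in for SWAPFILE_CLUSTER */

/*
 * Hypothetical helper: true if a cluster with `count` used slots still has
 * at least (1 << order) free slots. Assumes count <= CLUSTER_SLOTS.
 */
static bool cluster_may_fit(unsigned int count, unsigned int order)
{
	unsigned int free_slots = CLUSTER_SLOTS - count;

	return free_slots >= (1u << order);
}
```

With 3 free slots (count == 509) an order-2 request is rejected, and with 4
free slots (count == 508) it passes the check.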

> +		ci->flags |= CLUSTER_FLAG_NONFULL;
> +	}
>  }
>  
>  /*
> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>  	if (tmp == SWAP_NEXT_INVALID) {
>  		if (!list_empty(&si->free_clusters)) {
>  			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> +			list_del(&ci->list);
> +			spin_lock(&ci->lock);
> +			ci->order = order;
> +			ci->flags = 0;
> +			spin_unlock(&ci->lock);
> +			tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> +		} else if (!list_empty(&si->nonfull_clusters[order])) {
> +			ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
> +			list_del(&ci->list);
> +			spin_lock(&ci->lock);
> +			ci->flags = 0;
> +			spin_unlock(&ci->lock);
>  			tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>  		} else if (!list_empty(&si->discard_clusters)) {
>  			/*
> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>  	ci = lock_cluster(si, offset);
>  	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>  	ci->count = 0;
> +	ci->order = 0;
>  	ci->flags = 0;

Wonder if it would be better to put this in __free_cluster()?

Thanks,
Ryan

>  	free_cluster(si, ci);
>  	unlock_cluster(ci);
> @@ -2919,6 +2944,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>  	INIT_LIST_HEAD(&p->free_clusters);
>  	INIT_LIST_HEAD(&p->discard_clusters);
>  
> +	for (i = 0; i < SWAP_NR_ORDERS; i++)
> +		INIT_LIST_HEAD(&p->nonfull_clusters[i]);
> +
>  	for (i = 0; i < swap_header->info.nr_badpages; i++) {
>  		unsigned int page_nr = swap_header->info.badpages[i];
>  		if (page_nr == 0 || page_nr > swap_header->info.last_page)
>

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Chris Li 1 year, 6 months ago

On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/07/2024 08:29, Chris Li wrote:
> > Track the nonfull cluster as well as the empty cluster
> > on lists. Each order has one nonfull cluster list.
> >
> > The cluster will remember which order it was used during
> > new cluster allocation.
> >
> > When the cluster has free entry, add to the nonfull[order]
> > list.  When the free cluster list is empty, also allocate
> > from the nonempty list of that order.
> >
> > This improves the mTHP swap allocation success rate.
> >
> > There are limitations if the distribution of numbers of
> > different orders of mTHP changes a lot. e.g. there are a lot
> > of nonfull cluster assign to order A while later time there
> > are a lot of order B allocation while very little allocation
> > in order A. Currently the cluster used by order A will not
> > reused by order B unless the cluster is 100% empty.
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> >  include/linux/swap.h |  4 ++++
> >  mm/swapfile.c        | 34 +++++++++++++++++++++++++++++++---
> >  2 files changed, 35 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index e9be95468fc7..db8d6000c116 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -254,9 +254,11 @@ struct swap_cluster_info {
> >                                */
> >       u16 count;
> >       u8 flags;
> > +     u8 order;
> >       struct list_head list;
> >  };
> >  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> > +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
> >
> >
> >  /*
> > @@ -296,6 +298,8 @@ struct swap_info_struct {
> >       unsigned long *zeromap;         /* vmalloc'ed bitmap to track zero pages */
> >       struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> >       struct list_head free_clusters; /* free clusters list */
> > +     struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> > +                                     /* list of cluster that contains at least one free slot */
> >       unsigned int lowest_bit;        /* index of first free in swap_map */
> >       unsigned int highest_bit;       /* index of last free in swap_map */
> >       unsigned int pages;             /* total of usable pages of swap */
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index f70d25005d2c..e13a33664cfa 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >       memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >                       SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >
> > -     list_add_tail(&ci->list, &si->discard_clusters);
> > +     if (ci->flags)
>
> I'm not sure this is future proof; what happens if a flag is added in future
> that does not indicate that the cluster is on a list. Perhaps explicitly check
> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.

Currently the flags are only used to track which list the cluster is on. BTW,
in patch 3 (the big rewrite) this line is changed to check explicitly which
list the cluster is on. I can move that change into patch 2 if you want.

>
> > +             list_move_tail(&ci->list, &si->discard_clusters);
> > +     else
> > +             list_add_tail(&ci->list, &si->discard_clusters);
> > +     ci->flags = 0;
>
> Bug: (I think?) the cluster ends up on the discard_clusters list and
> swap_do_scheduled_discard() calls __free_cluster() which will then call

swap_do_scheduled_discard() deletes the entry from the discard list.
The flag does not track the discard list state.

> list_add_tail() to put it on the free_clusters list. But since it is on the
> discard_list at that point, shouldn't it call list_move_tail()?

See above. Calling list_move_tail() would be a mistake.

>
> >       schedule_work(&si->discard_work);
> >  }
> >
> >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >  {
> > +     if (ci->flags & CLUSTER_FLAG_NONFULL)
> > +             list_move_tail(&ci->list, &si->free_clusters);
> > +     else
> > +             list_add_tail(&ci->list, &si->free_clusters);
> >       ci->flags = CLUSTER_FLAG_FREE;
> > -     list_add_tail(&ci->list, &si->free_clusters);
> >  }
> >
> >  /*
> > @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
> >       ci->count--;
> >
> >       if (!ci->count)
> > -             free_cluster(p, ci);
> > +             return free_cluster(p, ci);
>
> nit: I'm not sure what the kernel style guide says about this, but I'm not a
> huge fan of returning void. I'd find it clearer if you just turn the below `if`
> into an `else if`.

I try to avoid 'else if' where possible.
Changed to:

	if (!ci->count) {
		free_cluster(p, ci);
		return;
	}

>
> > +
> > +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> > +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>
> I find the transitions when you add and remove a cluster from the
> nonfull_clusters list a bit strange (if I've understood correctly): It is added
> to the list whenever there is at least one free swap entry if not already on the
> list. But you take it off the list when assigning it as the current cluster for
> a cpu in scan_swap_map_try_ssd_cluster().
>
> So you could have this situation:
>
>   - cpuA allocs cluster from free list (exclusive to that cpu)
>   - cpuA allocs 1 swap entry from current cluster
>   - swap entry is freed; cluster added to nonfull_clusters
>   - cpuB "allocs" cluster from nonfull_clusters
>
> At this point both cpuA and cpuB share the same cluster as their current
> cluster. So why not just put the cluster on the nonfull_clusters list at
> allocation time (when removed from free_list) and only remove it from the

The big rewrite in patch 3 does that, taking the cluster off the free list
and moving it onto the nonfull list.
I am only making the minimal change in this step so the big rewrite can land.

> nonfull_clusters list when it is completely full (or at least definitely doesn't
> have room for an `order` allocation)? Then you allow "stealing" always instead
> of just sometimes. You would likely want to move the cluster to the end of the
> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
> chances of multiple CPUs using the same cluster.

For nonfull clusters it is less important to avoid multiple CPUs
sharing the cluster, because the cluster already holds swap
entries allocated by a previous CPU. Those behaviors will be
fine-tuned after the patch 3 big rewrite. I am trying to keep this
patch simple.

> Another potential optimization (which was in my hacked version IIRC) is to only
> add/remove from nonfull list when `total - count` crosses the (1 << order)
> boundary rather than when becoming completely full. You definitely won't be able
> to allocate order-2 if there are only 3 pages available, for example.

That is in patch 3 as well. This patch is just doing the bare minimum
to introduce the nonfull list.

>
> > +             ci->flags |= CLUSTER_FLAG_NONFULL;
> > +     }
> >  }
> >
> >  /*
> > @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >       if (tmp == SWAP_NEXT_INVALID) {
> >               if (!list_empty(&si->free_clusters)) {
> >                       ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > +                     list_del(&ci->list);
> > +                     spin_lock(&ci->lock);
> > +                     ci->order = order;
> > +                     ci->flags = 0;
> > +                     spin_unlock(&ci->lock);
> > +                     tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> > +             } else if (!list_empty(&si->nonfull_clusters[order])) {
> > +                     ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
> > +                     list_del(&ci->list);
> > +                     spin_lock(&ci->lock);
> > +                     ci->flags = 0;
> > +                     spin_unlock(&ci->lock);
> >                       tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> >               } else if (!list_empty(&si->discard_clusters)) {
> >                       /*
> > @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >       ci = lock_cluster(si, offset);
> >       memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> >       ci->count = 0;
> > +     ci->order = 0;
> >       ci->flags = 0;
>
> Wonder if it would be better to put this in __free_cluster()?

Both flags and order are moved to __free_cluster() in patch 3 of this
series. The order is best assigned together with the flags when the
cluster changes lists.

Thanks for the review. The patch 3 big rewrite is the heavy lifting.

Chris

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Ryan Roberts 1 year, 6 months ago

On 16/07/2024 23:46, Chris Li wrote:
> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 11/07/2024 08:29, Chris Li wrote:
>>> Track the nonfull cluster as well as the empty cluster
>>> on lists. Each order has one nonfull cluster list.
>>>
>>> The cluster will remember which order it was used during
>>> new cluster allocation.
>>>
>>> When the cluster has free entry, add to the nonfull[order]
>>> list.  When the free cluster list is empty, also allocate
>>> from the nonempty list of that order.
>>>
>>> This improves the mTHP swap allocation success rate.
>>>
>>> There are limitations if the distribution of numbers of
>>> different orders of mTHP changes a lot. e.g. there are a lot
>>> of nonfull cluster assign to order A while later time there
>>> are a lot of order B allocation while very little allocation
>>> in order A. Currently the cluster used by order A will not
>>> reused by order B unless the cluster is 100% empty.
>>>
>>> Signed-off-by: Chris Li <chrisl@kernel.org>
>>> ---
>>>  include/linux/swap.h |  4 ++++
>>>  mm/swapfile.c        | 34 +++++++++++++++++++++++++++++++---
>>>  2 files changed, 35 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index e9be95468fc7..db8d6000c116 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -254,9 +254,11 @@ struct swap_cluster_info {
>>>                                */
>>>       u16 count;
>>>       u8 flags;
>>> +     u8 order;
>>>       struct list_head list;
>>>  };
>>>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>>>
>>>
>>>  /*
>>> @@ -296,6 +298,8 @@ struct swap_info_struct {
>>>       unsigned long *zeromap;         /* vmalloc'ed bitmap to track zero pages */
>>>       struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>>>       struct list_head free_clusters; /* free clusters list */
>>> +     struct list_head nonfull_clusters[SWAP_NR_ORDERS];
>>> +                                     /* list of cluster that contains at least one free slot */
>>>       unsigned int lowest_bit;        /* index of first free in swap_map */
>>>       unsigned int highest_bit;       /* index of last free in swap_map */
>>>       unsigned int pages;             /* total of usable pages of swap */
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index f70d25005d2c..e13a33664cfa 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>>       memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>>>                       SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>>>
>>> -     list_add_tail(&ci->list, &si->discard_clusters);
>>> +     if (ci->flags)
>>
>> I'm not sure this is future proof; what happens if a flag is added in future
>> that does not indicate that the cluster is on a list. Perhaps explicitly check
>> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
> 
> Currently flags are only used to track which list it is on.

Yes, I get that it works correctly at the moment. I just don't think it's wise
for the code to assume that any flag being set means it's on a list; that feels
fragile for the future.

> BTW, this
> line has changed to check for explicite which list in patch 3 the big
> rewrite. I can move that line change to patch 2 if you want.

That would get my vote; let's make every patch as good as it can be.

> 
>>
>>> +             list_move_tail(&ci->list, &si->discard_clusters);
>>> +     else
>>> +             list_add_tail(&ci->list, &si->discard_clusters);
>>> +     ci->flags = 0;
>>
>> Bug: (I think?) the cluster ends up on the discard_clusters list and
>> swap_do_scheduled_discard() calls __free_cluster() which will then call
> 
> swap_do_scheduled_discard() delete the entry from discard list.

Ahh yes, my bad!

> The flag does not track the discard list state.
> 
>> list_add_tail() to put it on the free_clusters list. But since it is on the
>> discard_list at that point, shouldn't it call list_move_tail()?
> 
> See above. Call list_move_tail() would be a mistake.
> 
>>
>>>       schedule_work(&si->discard_work);
>>>  }
>>>
>>>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>>>  {
>>> +     if (ci->flags & CLUSTER_FLAG_NONFULL)
>>> +             list_move_tail(&ci->list, &si->free_clusters);
>>> +     else
>>> +             list_add_tail(&ci->list, &si->free_clusters);
>>>       ci->flags = CLUSTER_FLAG_FREE;
>>> -     list_add_tail(&ci->list, &si->free_clusters);
>>>  }
>>>
>>>  /*
>>> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>>>       ci->count--;
>>>
>>>       if (!ci->count)
>>> -             free_cluster(p, ci);
>>> +             return free_cluster(p, ci);
>>
>> nit: I'm not sure what the kernel style guide says about this, but I'm not a
>> huge fan of returning void. I'd find it clearer if you just turn the below `if`
>> into an `else if`.
> 
> I try to avoid 'else if' if possible.
> Changed to
> if (!ci->count) {
>               free_cluster(p, ci);
>               return;
> }

ok

> 
>>
>>> +
>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>
>> I find the transitions when you add and remove a cluster from the
>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> to the list whenever there is at least one free swap entry if not already on the
>> list. But you take it off the list when assigning it as the current cluster for
>> a cpu in scan_swap_map_try_ssd_cluster().
>>
>> So you could have this situation:
>>
>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>>   - cpuA allocs 1 swap entry from current cluster
>>   - swap entry is freed; cluster added to nonfull_clusters
>>   - cpuB "allocs" cluster from nonfull_clusters
>>
>> At this point both cpuA and cpuB share the same cluster as their current
>> cluster. So why not just put the cluster on the nonfull_clusters list at
>> allocation time (when removed from free_list) and only remove it from the
> 
> The big rewrite on patch 3 does that, taking it off the free list and
> moving it into nonfull.

Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
scan_swap_map_slots()" I assumed that was just a refactoring of the code to
separate the SSD and HDD code paths. Personally I'd prefer to see the
refactoring separated from behavioural changes.

Since the patch was titled RFC and I thought it was just refactoring, I was
deferring review. But sounds like it is actually required to realize the test
results quoted on the cover letter?

> I am only making the minimal change in this step so the big rewrite can land.
> 
>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> have room for an `order` allocation)? Then you allow "stealing" always instead
>> of just sometimes. You would likely want to move the cluster to the end of the
>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> chances of multiple CPUs using the same cluster.
> 
> For nonfull clusters it is less important to avoid multiple CPU
> sharing the cluster. Because the cluster already has previous swap
> entries allocated from the previous CPU. 

But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
could be slightly ahead of cpuB so that cpuA allocates all the free pages and
cpuB just ends up scanning and finding nothing to allocate? I think you do want
to share the cluster when you really need to, but try to avoid it if there are
other options, and I think moving the cluster to the end of the list might be a
way to help that?
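
A minimal userspace sketch of the "move to the end of the list" idea, assuming
a hand-rolled circular list modelled on the kernel's list_head. struct node,
its id field, and pick_and_rotate() are hypothetical names for illustration
only; this is not the code from the series.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct node {
	struct node *prev, *next;
	int id; /* hypothetical cluster index */
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Pick the first cluster but keep it on the list, rotated to the tail. */
static struct node *pick_and_rotate(struct node *head)
{
	struct node *first = head->next;

	if (first == head)
		return NULL; /* empty list */
	list_del(first);
	list_add_tail(first, head);
	return first;
}
```

With two clusters on the list, two consecutive picks return different
clusters, so two CPUs only end up sharing a cluster when no other nonfull
cluster is left.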

> Those behaviors will be fine
> tuned after the patch 3 big rewrite. Try to make this patch simple.
> 
>> Another potential optimization (which was in my hacked version IIRC) is to only
>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>> boundary rather than when becoming completely full. You definitely won't be able
>> to allocate order-2 if there are only 3 pages available, for example.
> 
> That is in patch 3 as well. This patch is just doing the bare minimum
> to introduce the nonfull list.
> 
>>
>>> +             ci->flags |= CLUSTER_FLAG_NONFULL;
>>> +     }
>>>  }
>>>
>>>  /*
>>> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>       if (tmp == SWAP_NEXT_INVALID) {
>>>               if (!list_empty(&si->free_clusters)) {
>>>                       ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>>> +                     list_del(&ci->list);
>>> +                     spin_lock(&ci->lock);
>>> +                     ci->order = order;
>>> +                     ci->flags = 0;
>>> +                     spin_unlock(&ci->lock);
>>> +                     tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>>> +             } else if (!list_empty(&si->nonfull_clusters[order])) {
>>> +                     ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
>>> +                     list_del(&ci->list);
>>> +                     spin_lock(&ci->lock);
>>> +                     ci->flags = 0;
>>> +                     spin_unlock(&ci->lock);
>>>                       tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>>>               } else if (!list_empty(&si->discard_clusters)) {
>>>                       /*
>>> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>       ci = lock_cluster(si, offset);
>>>       memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>>>       ci->count = 0;
>>> +     ci->order = 0;
>>>       ci->flags = 0;
>>
>> Wonder if it would be better to put this in __free_cluster()?
> 
> Both flags and order were moved to __free_cluster() in patch 3 of this
> series. The order is best assigned together with flags when the
> cluster changes the list.
> 
> Thanks for the review. The patch 3 big rewrite is the heavy lifting.

OK, but sounds like patch 3 isn't really RFC after all, but a crucial part of
the series? I'll try to take a look at it today.

> 
> Chris

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Chris Li 1 year, 6 months ago

On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 16/07/2024 23:46, Chris Li wrote:
> > On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 11/07/2024 08:29, Chris Li wrote:
> >>> Track the nonfull cluster as well as the empty cluster
> >>> on lists. Each order has one nonfull cluster list.
> >>>
> >>> The cluster will remember which order it was used during
> >>> new cluster allocation.
> >>>
> >>> When the cluster has free entry, add to the nonfull[order]
> >>> list.  When the free cluster list is empty, also allocate
> >>> from the nonempty list of that order.
> >>>
> >>> This improves the mTHP swap allocation success rate.
> >>>
> >>> There are limitations if the distribution of numbers of
> >>> different orders of mTHP changes a lot. e.g. there are a lot
> >>> of nonfull cluster assign to order A while later time there
> >>> are a lot of order B allocation while very little allocation
> >>> in order A. Currently the cluster used by order A will not
> >>> reused by order B unless the cluster is 100% empty.
> >>>
> >>> Signed-off-by: Chris Li <chrisl@kernel.org>
> >>> ---
> >>>  include/linux/swap.h |  4 ++++
> >>>  mm/swapfile.c        | 34 +++++++++++++++++++++++++++++++---
> >>>  2 files changed, 35 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>> index e9be95468fc7..db8d6000c116 100644
> >>> --- a/include/linux/swap.h
> >>> +++ b/include/linux/swap.h
> >>> @@ -254,9 +254,11 @@ struct swap_cluster_info {
> >>>                                */
> >>>       u16 count;
> >>>       u8 flags;
> >>> +     u8 order;
> >>>       struct list_head list;
> >>>  };
> >>>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> >>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
> >>>
> >>>
> >>>  /*
> >>> @@ -296,6 +298,8 @@ struct swap_info_struct {
> >>>       unsigned long *zeromap;         /* vmalloc'ed bitmap to track zero pages */
> >>>       struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> >>>       struct list_head free_clusters; /* free clusters list */
> >>> +     struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> >>> +                                     /* list of cluster that contains at least one free slot */
> >>>       unsigned int lowest_bit;        /* index of first free in swap_map */
> >>>       unsigned int highest_bit;       /* index of last free in swap_map */
> >>>       unsigned int pages;             /* total of usable pages of swap */
> >>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>> index f70d25005d2c..e13a33664cfa 100644
> >>> --- a/mm/swapfile.c
> >>> +++ b/mm/swapfile.c
> >>> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >>>       memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >>>                       SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >>>
> >>> -     list_add_tail(&ci->list, &si->discard_clusters);
> >>> +     if (ci->flags)
> >>
> >> I'm not sure this is future proof; what happens if a flag is added in future
> >> that does not indicate that the cluster is on a list. Perhaps explicitly check
> >> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
> >
> > Currently flags are only used to track which list it is on.
>
> Yes, I get that it works correctly at the moment. I just don't think it's wise
> for the code to assume that any flag being set means its on a list; that feels
> fragile for future.

ACK.
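
To make the concern concrete, here is a minimal user-space sketch (not kernel code). The two flag values are the ones defined in this patch; CLUSTER_FLAG_HYPOTHETICAL and both helper functions are invented purely for illustration. It shows how a bare `if (ci->flags)` test would misreport list membership once a flag unrelated to lists exists, while an explicit mask test would not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CLUSTER_FLAG_FREE    1 /* cluster is on the free list (from the patch) */
#define CLUSTER_FLAG_NONFULL 2 /* cluster is on a nonfull list (from the patch) */
/* Hypothetical future flag that says nothing about list membership. */
#define CLUSTER_FLAG_HYPOTHETICAL 4

/* Fragile: treats any set bit as "the cluster is on some list". */
static bool on_list_any_flag(uint8_t flags)
{
	return flags != 0;
}

/* Explicit: only the bits that actually encode list membership count. */
static bool on_list_explicit(uint8_t flags)
{
	return (flags & (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL)) != 0;
}
```

With only the hypothetical flag set, the first helper reports the cluster as being on a list and the second does not, which is the divergence the review comment is worried about.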

>
> > BTW, this
> > line has been changed to check explicitly which list it is on in patch 3, the big
> > rewrite. I can move that line change to patch 2 if you want.
>
> That would get my vote; let's make every patch as good as it can be.

Done.

>
> >
> >>
> >>> +             list_move_tail(&ci->list, &si->discard_clusters);
> >>> +     else
> >>> +             list_add_tail(&ci->list, &si->discard_clusters);
> >>> +     ci->flags = 0;
> >>
> >> Bug: (I think?) the cluster ends up on the discard_clusters list and
> >> swap_do_scheduled_discard() calls __free_cluster() which will then call
> >
> > swap_do_scheduled_discard() delete the entry from discard list.
>
> Ahh yes, my bad!
>
> > The flag does not track the discard list state.
> >
> >> list_add_tail() to put it on the free_clusters list. But since it is on the
> >> discard_list at that point, shouldn't it call list_move_tail()?
> >
> > See above. Call list_move_tail() would be a mistake.
> >
> >>
> >>>       schedule_work(&si->discard_work);
> >>>  }
> >>>
> >>>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >>>  {
> >>> +     if (ci->flags & CLUSTER_FLAG_NONFULL)
> >>> +             list_move_tail(&ci->list, &si->free_clusters);
> >>> +     else
> >>> +             list_add_tail(&ci->list, &si->free_clusters);
> >>>       ci->flags = CLUSTER_FLAG_FREE;
> >>> -     list_add_tail(&ci->list, &si->free_clusters);
> >>>  }
> >>>
> >>>  /*
> >>> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
> >>>       ci->count--;
> >>>
> >>>       if (!ci->count)
> >>> -             free_cluster(p, ci);
> >>> +             return free_cluster(p, ci);
> >>
> >> nit: I'm not sure what the kernel style guide says about this, but I'm not a
> >> huge fan of returning void. I'd find it clearer if you just turn the below `if`
> >> into an `else if`.
> >
> > I try to avoid 'else if' if possible.
> > Changed to
> > if (!ci->count) {
> >               free_cluster(p, ci);
> >               return;
> > }
>
> ok
>
> >
> >>
> >>> +
> >>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
> >>
> >> I find the transitions when you add and remove a cluster from the
> >> nonfull_clusters list a bit strange (if I've understood correctly): It is added
> >> to the list whenever there is at least one free swap entry if not already on the
> >> list. But you take it off the list when assigning it as the current cluster for
> >> a cpu in scan_swap_map_try_ssd_cluster().
> >>
> >> So you could have this situation:
> >>
> >>   - cpuA allocs cluster from free list (exclusive to that cpu)
> >>   - cpuA allocs 1 swap entry from current cluster
> >>   - swap entry is freed; cluster added to nonfull_clusters
> >>   - cpuB "allocs" cluster from nonfull_clusters
> >>
> >> At this point both cpuA and cpuB share the same cluster as their current
> >> cluster. So why not just put the cluster on the nonfull_clusters list at
> >> allocation time (when removed from free_list) and only remove it from the
> >
> > The big rewrite on patch 3 does that, taking it off the free list and
> > moving it into nonfull.
>
> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
> separate the SSD and HDD code paths. Personally I'd prefer to see the
> refactoring separated from behavioural changes.

It is not a refactoring. It is a big rewrite of the swap allocator
using the cluster. Behavior change is expected. The goal is to completely
remove the brute-force scanning of the swap_map[] array for cluster swap
allocation.

>
> Since the patch was titled RFC and I thought it was just refactoring, I was
> deferring review. But sounds like it is actually required to realize the test
> results quoted on the cover letter?

Yes, it is required, because it handles the previous fallout case where
try_ssd() failed. This big rewrite has gone through a lot of testing and
bug fixing. It is pretty stable now. The only reason I keep it as RFC is
that it is not feature complete: currently it does not do swap
cache reclaim. The next version will have swap cache reclaim and
drop the RFC tag.

>
> > I am only making the minimal change in this step so the big rewrite can land.
> >
> >> nonfull_clusters list when it is completely full (or at least definitely doesn't
> >> have room for an `order` allocation)? Then you allow "stealing" always instead
> >> of just sometimes. You would likely want to move the cluster to the end of the
> >> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
> >> chances of multiple CPUs using the same cluster.
> >
> > For nonfull clusters it is less important to avoid multiple CPU
> > sharing the cluster. Because the cluster already has previous swap
> > entries allocated from the previous CPU.
>
> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
> could be slightly ahead of cpuB so that cpuA allocates all the free pages and

That already happens with the existing per-CPU next pointer. When one CPU
advances its next cluster pointer, it can cross with another
CPU's next cluster pointer.

> cpuB just ends up scanning and finding nothing to allocate. I think we do want to
> share the cluster when you really need to, but try to avoid it if there are
> other options, and I think moving the cluster to the end of the list might be a
> way to help that?

Simply moving the cluster to the end of the list can create an infinite
loop when all clusters have been scanned and no available swap range
has been found.

We have tried many different approaches, including moving the cluster to
the end of the list. It can cause more fragmentation because each CPU
allocates its swap slot cache (64 entries) from a different cluster.
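
The looping problem described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the allocator's code: the array stands in for the nonfull list, `free_slots[i]` is assumed to hold the largest contiguous free run in cluster i, and `pick_cluster_bounded()` is a hypothetical name. Rejecting a cluster "moves it to the tail" by advancing the head, and the visit counter is what guarantees termination when no cluster can satisfy the request.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: the array is treated as a circular list whose
 * current head is *head. Returns the index of the first cluster that can
 * hold `need` contiguous slots, or -1 if none can.
 */
static int pick_cluster_bounded(const unsigned int *free_slots, size_t n,
				size_t *head, unsigned int need)
{
	/* Visit each cluster at most once; without this bound the scan could spin forever. */
	for (size_t visited = 0; visited < n; visited++) {
		size_t idx = *head;

		*head = (*head + 1) % n; /* rotate: the head moves to the tail */
		if (free_slots[idx] >= need)
			return (int)idx;
	}
	return -1; /* every cluster was scanned and none fits */
}
```

An unbounded `while` loop with the same rotation would never return in the "none fits" case, which is the situation the message warns about.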

> > Those behaviors will be fine
> > tuned after the patch 3 big rewrite. Try to make this patch simple.

Again, I want to keep it simple here so patch 3 can land.

> >> Another potential optimization (which was in my hacked version IIRC) is to only
> >> add/remove from nonfull list when `total - count` crosses the (1 << order)
> >> boundary rather than when becoming completely full. You definitely won't be able
> >> to allocate order-2 if there are only 3 pages available, for example.
> >
> > That is in patch 3 as well. This patch is just doing the bare minimum
> > to introduce the nonfull list.
> >
> >>
> >>> +             ci->flags |= CLUSTER_FLAG_NONFULL;
> >>> +     }
> >>>  }
> >>>
> >>>  /*
> >>> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >>>       if (tmp == SWAP_NEXT_INVALID) {
> >>>               if (!list_empty(&si->free_clusters)) {
> >>>                       ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> >>> +                     list_del(&ci->list);
> >>> +                     spin_lock(&ci->lock);
> >>> +                     ci->order = order;
> >>> +                     ci->flags = 0;
> >>> +                     spin_unlock(&ci->lock);
> >>> +                     tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> >>> +             } else if (!list_empty(&si->nonfull_clusters[order])) {
> >>> +                     ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
> >>> +                     list_del(&ci->list);
> >>> +                     spin_lock(&ci->lock);
> >>> +                     ci->flags = 0;
> >>> +                     spin_unlock(&ci->lock);
> >>>                       tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> >>>               } else if (!list_empty(&si->discard_clusters)) {
> >>>                       /*
> >>> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >>>       ci = lock_cluster(si, offset);
> >>>       memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> >>>       ci->count = 0;
> >>> +     ci->order = 0;
> >>>       ci->flags = 0;
> >>
> >> Wonder if it would be better to put this in __free_cluster()?
> >
> > Both flags and order were moved to __free_cluster() in patch 3 of this
> > series. The order is best assigned together with flags when the
> > cluster changes the list.
> >
> > Thanks for the review. The patch 3 big rewrite is the heavy lifting.
>
> OK, but sounds like patch 3 isn't really RFC after all, but a crucial part of
> the series? I'll try to take a look at it today.

Yes, it is the cluster swap allocator big rewrite.

Thank you for taking a look.

Chris

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Chris Li <chrisl@kernel.org> writes:

> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 16/07/2024 23:46, Chris Li wrote:
>> > On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>
>> >> On 11/07/2024 08:29, Chris Li wrote:
>> >>> Track the nonfull cluster as well as the empty cluster
>> >>> on lists. Each order has one nonfull cluster list.
>> >>>
>> >>> The cluster will remember which order it was used during
>> >>> new cluster allocation.
>> >>>
>> >>> When the cluster has free entry, add to the nonfull[order]
>> >>> list.  When the free cluster list is empty, also allocate
>> >>> from the nonempty list of that order.
>> >>>
>> >>> This improves the mTHP swap allocation success rate.
>> >>>
>> >>> There are limitations if the distribution of numbers of
>> >>> different orders of mTHP changes a lot. e.g. there are a lot
>> >>> of nonfull cluster assign to order A while later time there
>> >>> are a lot of order B allocation while very little allocation
>> >>> in order A. Currently the cluster used by order A will not
>> >>> reused by order B unless the cluster is 100% empty.
>> >>>
>> >>> Signed-off-by: Chris Li <chrisl@kernel.org>
>> >>> ---
>> >>>  include/linux/swap.h |  4 ++++
>> >>>  mm/swapfile.c        | 34 +++++++++++++++++++++++++++++++---
>> >>>  2 files changed, 35 insertions(+), 3 deletions(-)
>> >>>
>> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >>> index e9be95468fc7..db8d6000c116 100644
>> >>> --- a/include/linux/swap.h
>> >>> +++ b/include/linux/swap.h
>> >>> @@ -254,9 +254,11 @@ struct swap_cluster_info {
>> >>>                                */
>> >>>       u16 count;
>> >>>       u8 flags;
>> >>> +     u8 order;
>> >>>       struct list_head list;
>> >>>  };
>> >>>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>> >>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>> >>>
>> >>>
>> >>>  /*
>> >>> @@ -296,6 +298,8 @@ struct swap_info_struct {
>> >>>       unsigned long *zeromap;         /* vmalloc'ed bitmap to track zero pages */
>> >>>       struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>> >>>       struct list_head free_clusters; /* free clusters list */
>> >>> +     struct list_head nonfull_clusters[SWAP_NR_ORDERS];
>> >>> +                                     /* list of cluster that contains at least one free slot */
>> >>>       unsigned int lowest_bit;        /* index of first free in swap_map */
>> >>>       unsigned int highest_bit;       /* index of last free in swap_map */
>> >>>       unsigned int pages;             /* total of usable pages of swap */
>> >>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >>> index f70d25005d2c..e13a33664cfa 100644
>> >>> --- a/mm/swapfile.c
>> >>> +++ b/mm/swapfile.c
>> >>> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>> >>>       memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>> >>>                       SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>> >>>
>> >>> -     list_add_tail(&ci->list, &si->discard_clusters);
>> >>> +     if (ci->flags)
>> >>
>> >> I'm not sure this is future proof; what happens if a flag is added in future
>> >> that does not indicate that the cluster is on a list. Perhaps explicitly check
>> >> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
>> >
>> > Currently flags are only used to track which list it is on.
>>
>> Yes, I get that it works correctly at the moment. I just don't think it's wise
>> for the code to assume that any flag being set means its on a list; that feels
>> fragile for future.
>
> ACK.
>
>>
>> > BTW, this
>> > line has been changed to check explicitly which list it is on in patch 3, the big
>> > rewrite. I can move that line change to patch 2 if you want.
>>
>> That would get my vote; let's make every patch as good as it can be.
>
> Done.
>
>>
>> >
>> >>
>> >>> +             list_move_tail(&ci->list, &si->discard_clusters);
>> >>> +     else
>> >>> +             list_add_tail(&ci->list, &si->discard_clusters);
>> >>> +     ci->flags = 0;
>> >>
>> >> Bug: (I think?) the cluster ends up on the discard_clusters list and
>> >> swap_do_scheduled_discard() calls __free_cluster() which will then call
>> >
>> > swap_do_scheduled_discard() delete the entry from discard list.
>>
>> Ahh yes, my bad!
>>
>> > The flag does not track the discard list state.
>> >
>> >> list_add_tail() to put it on the free_clusters list. But since it is on the
>> >> discard_list at that point, shouldn't it call list_move_tail()?
>> >
>> > See above. Call list_move_tail() would be a mistake.
>> >
>> >>
>> >>>       schedule_work(&si->discard_work);
>> >>>  }
>> >>>
>> >>>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>> >>>  {
>> >>> +     if (ci->flags & CLUSTER_FLAG_NONFULL)
>> >>> +             list_move_tail(&ci->list, &si->free_clusters);
>> >>> +     else
>> >>> +             list_add_tail(&ci->list, &si->free_clusters);
>> >>>       ci->flags = CLUSTER_FLAG_FREE;
>> >>> -     list_add_tail(&ci->list, &si->free_clusters);
>> >>>  }
>> >>>
>> >>>  /*
>> >>> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>> >>>       ci->count--;
>> >>>
>> >>>       if (!ci->count)
>> >>> -             free_cluster(p, ci);
>> >>> +             return free_cluster(p, ci);
>> >>
>> >> nit: I'm not sure what the kernel style guide says about this, but I'm not a
>> >> huge fan of returning void. I'd find it clearer if you just turn the below `if`
>> >> into an `else if`.
>> >
>> > I try to avoid 'else if' if possible.
>> > Changed to
>> > if (!ci->count) {
>> >               free_cluster(p, ci);
>> >               return;
>> > }
>>
>> ok
>>
>> >
>> >>
>> >>> +
>> >>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> >>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>> >>
>> >> I find the transitions when you add and remove a cluster from the
>> >> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> >> to the list whenever there is at least one free swap entry if not already on the
>> >> list. But you take it off the list when assigning it as the current cluster for
>> >> a cpu in scan_swap_map_try_ssd_cluster().
>> >>
>> >> So you could have this situation:
>> >>
>> >>   - cpuA allocs cluster from free list (exclusive to that cpu)
>> >>   - cpuA allocs 1 swap entry from current cluster
>> >>   - swap entry is freed; cluster added to nonfull_clusters
>> >>   - cpuB "allocs" cluster from nonfull_clusters
>> >>
>> >> At this point both cpuA and cpuB share the same cluster as their current
>> >> cluster. So why not just put the cluster on the nonfull_clusters list at
>> >> allocation time (when removed from free_list) and only remove it from the
>> >
>> > The big rewrite on patch 3 does that, taking it off the free list and
>> > moving it into nonfull.
>>
>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>> refactoring separated from behavioural changes.
>
> It is not a refactoring. It is a big rewrite of the swap allocator
> using the cluster. Behavior change is expected. The goal is to completely
> remove the brute-force scanning of the swap_map[] array for cluster swap
> allocation.
>
>>
>> Since the patch was titled RFC and I thought it was just refactoring, I was
>> deferring review. But sounds like it is actually required to realize the test
>> results quoted on the cover letter?
>
> Yes, it is required, because it handles the previous fallout case where
> try_ssd() failed. This big rewrite has gone through a lot of testing and
> bug fixing. It is pretty stable now. The only reason I keep it as RFC is
> that it is not feature complete: currently it does not do swap
> cache reclaim. The next version will have swap cache reclaim and
> drop the RFC tag.
>
>>
>> > I am only making the minimal change in this step so the big rewrite can land.
>> >
>> >> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> >> have room for an `order` allocation)? Then you allow "stealing" always instead
>> >> of just sometimes. You would likely want to move the cluster to the end of the
>> >> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> >> chances of multiple CPUs using the same cluster.
>> >
>> > For nonfull clusters it is less important to avoid multiple CPU
>> > sharing the cluster. Because the cluster already has previous swap
>> > entries allocated from the previous CPU.
>>
>> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>
> That already happens with the existing per-CPU next pointer. When one CPU
> advances its next cluster pointer, it can cross with another
> CPU's next cluster pointer.

No.  si->percpu_cluster[cpu].next will stay within the current per-CPU
cluster only.  If it doesn't do that, we should fix it.

I agree with Ryan that we should make the per-CPU cluster correct.  A
cluster that is in use as a per-CPU cluster shouldn't be put on the
nonfull list.  When we scan to the end of a per-CPU cluster, we can put
the cluster on the nonfull list if necessary.  And we should make it
correct in this patch instead of later in the series.  I understand that
you want to keep the patch itself simple, but it's important to make the
code simple to understand too.  A consistent design choice will do that.
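
The rule proposed above can be sketched as a tiny state machine. This is a user-space illustration only: the enum, the struct, the three helpers and the CLUSTER_SLOTS value are all assumptions made for the sketch and do not mirror the kernel's data structures. The point it shows is that a cluster owned by a CPU stays off every list, and only rejoins the nonfull (or free) list when the owning CPU gives it up.

```c
#include <assert.h>
#include <stdbool.h>

#define CLUSTER_SLOTS 512 /* assumed cluster size for this sketch */

enum cluster_state { ON_FREE_LIST, PERCPU_OWNED, ON_NONFULL_LIST, FULL };

struct cluster {
	enum cluster_state state;
	unsigned int count; /* slots currently in use */
};

/* A CPU takes a cluster from the free or nonfull list: it leaves that list. */
static void cluster_acquire(struct cluster *c)
{
	assert(c->state == ON_FREE_LIST || c->state == ON_NONFULL_LIST);
	c->state = PERCPU_OWNED;
}

/* The owning CPU has scanned to the end of the cluster and gives it up. */
static void cluster_release(struct cluster *c)
{
	assert(c->state == PERCPU_OWNED);
	if (c->count == 0)
		c->state = ON_FREE_LIST;
	else if (c->count < CLUSTER_SLOTS)
		c->state = ON_NONFULL_LIST;
	else
		c->state = FULL;
}

/* Freeing a slot never moves a per-CPU-owned cluster onto a list. */
static void cluster_free_slot(struct cluster *c)
{
	assert(c->count > 0);
	c->count--;
	if (c->state == PERCPU_OWNED)
		return;
	c->state = c->count ? ON_NONFULL_LIST : ON_FREE_LIST;
}
```

Under this model the cpuA/cpuB sharing scenario from earlier in the thread cannot arise, because a free in a CPU-owned cluster does not make the cluster visible to other CPUs.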

>> cpuB just ends up scanning and finding nothing to allocate. I think we do want to
>> share the cluster when you really need to, but try to avoid it if there are
>> other options, and I think moving the cluster to the end of the list might be a
>> way to help that?
>
> Simply moving the cluster to the end of the list can create an infinite
> loop when all clusters have been scanned and no available swap range
> has been found.

This is another reason that we should put the cluster in
nonfull_clusters[order--] if there is no free swap entry of "order"
in the cluster.  Keeping it in nonfull_clusters[order] makes the design
complex.
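
One possible reading of the suggestion above, sketched in user space: when a cluster no longer has an aligned free run of (1 << order) slots, find the highest lower order it can still serve and file it on that list instead. The `used[]` array, `has_free_run()` and `demote_order()` are hypothetical names for this illustration; the thread does not specify how the demotion would be computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if used[] contains an aligned run of (1 << order) free slots. */
static bool has_free_run(const bool *used, size_t nslots, unsigned int order)
{
	size_t run = (size_t)1 << order;

	for (size_t start = 0; start + run <= nslots; start += run) {
		size_t i = 0;

		while (i < run && !used[start + i])
			i++;
		if (i == run)
			return true;
	}
	return false;
}

/*
 * Returns the highest order <= `order` that the cluster can still serve,
 * or -1 if the cluster has no free slot at all.
 */
static int demote_order(const bool *used, size_t nslots, unsigned int order)
{
	for (int o = (int)order; o >= 0; o--)
		if (has_free_run(used, nslots, (unsigned int)o))
			return o;
	return -1;
}
```

A cluster for which this returns -1 would belong on no nonfull list, which also avoids rescanning clusters that cannot satisfy the requested order.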

> We have tried many different approaches, including moving the cluster to
> the end of the list. It can cause more fragmentation because each CPU
> allocates its swap slot cache (64 entries) from a different cluster.
>
>> > Those behaviors will be fine
>> > tuned after the patch 3 big rewrite. Try to make this patch simple.
>
> Again, I want to keep it simple here so patch 3 can land.
>
>> >> Another potential optimization (which was in my hacked version IIRC) is to only
>> >> add/remove from nonfull list when `total - count` crosses the (1 << order)
>> >> boundary rather than when becoming completely full. You definitely won't be able
>> >> to allocate order-2 if there are only 3 pages available, for example.
>> >
>> > That is in patch 3 as well. This patch is just doing the bare minimum
>> > to introduce the nonfull list.
>> >
>> >>
>> >>> +             ci->flags |= CLUSTER_FLAG_NONFULL;
>> >>> +     }
>> >>>  }
>> >>>
>> >>>  /*
>> >>> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> >>>       if (tmp == SWAP_NEXT_INVALID) {
>> >>>               if (!list_empty(&si->free_clusters)) {
>> >>>                       ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>> >>> +                     list_del(&ci->list);
>> >>> +                     spin_lock(&ci->lock);
>> >>> +                     ci->order = order;
>> >>> +                     ci->flags = 0;
>> >>> +                     spin_unlock(&ci->lock);
>> >>> +                     tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>> >>> +             } else if (!list_empty(&si->nonfull_clusters[order])) {
>> >>> +                     ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
>> >>> +                     list_del(&ci->list);
>> >>> +                     spin_lock(&ci->lock);
>> >>> +                     ci->flags = 0;
>> >>> +                     spin_unlock(&ci->lock);
>> >>>                       tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>> >>>               } else if (!list_empty(&si->discard_clusters)) {
>> >>>                       /*
>> >>> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>> >>>       ci = lock_cluster(si, offset);
>> >>>       memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>> >>>       ci->count = 0;
>> >>> +     ci->order = 0;
>> >>>       ci->flags = 0;
>> >>
>> >> Wonder if it would be better to put this in __free_cluster()?
>> >
>> > Both flags and order were moved to __free_cluster() in patch 3 of this
>> > series. The order is best assigned together with flags when the
>> > cluster changes the list.
>> >
>> > Thanks for the review. The patch 3 big rewrite is the heavy lifting.
>>
>> OK, but sounds like patch 3 isn't really RFC after all, but a crucial part of
>> the series? I'll try to take a look at it today.
>
> Yes, it is the cluster swap allocator big rewrite.
>
> Thank you for taking a look.

--
Best Regards,
Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Ryan Roberts 1 year, 6 months ago

On 18/07/2024 08:53, Huang, Ying wrote:
> Chris Li <chrisl@kernel.org> writes:
> 
>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 16/07/2024 23:46, Chris Li wrote:
>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>>>>> Track the nonfull cluster as well as the empty cluster
>>>>>> on lists. Each order has one nonfull cluster list.
>>>>>>
>>>>>> The cluster will remember which order it was used during
>>>>>> new cluster allocation.
>>>>>>
>>>>>> When the cluster has free entry, add to the nonfull[order]
>>>>>> list.  When the free cluster list is empty, also allocate
>>>>>> from the nonempty list of that order.
>>>>>>
>>>>>> This improves the mTHP swap allocation success rate.
>>>>>>
>>>>>> There are limitations if the distribution of numbers of
>>>>>> different orders of mTHP changes a lot. e.g. there are a lot
>>>>>> of nonfull cluster assign to order A while later time there
>>>>>> are a lot of order B allocation while very little allocation
>>>>>> in order A. Currently the cluster used by order A will not
>>>>>> reused by order B unless the cluster is 100% empty.
>>>>>>
>>>>>> Signed-off-by: Chris Li <chrisl@kernel.org>
>>>>>> ---
>>>>>>  include/linux/swap.h |  4 ++++
>>>>>>  mm/swapfile.c        | 34 +++++++++++++++++++++++++++++++---
>>>>>>  2 files changed, 35 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>> index e9be95468fc7..db8d6000c116 100644
>>>>>> --- a/include/linux/swap.h
>>>>>> +++ b/include/linux/swap.h
>>>>>> @@ -254,9 +254,11 @@ struct swap_cluster_info {
>>>>>>                                */
>>>>>>       u16 count;
>>>>>>       u8 flags;
>>>>>> +     u8 order;
>>>>>>       struct list_head list;
>>>>>>  };
>>>>>>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>>>>>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>>>>>>
>>>>>>
>>>>>>  /*
>>>>>> @@ -296,6 +298,8 @@ struct swap_info_struct {
>>>>>>       unsigned long *zeromap;         /* vmalloc'ed bitmap to track zero pages */
>>>>>>       struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>>>>>>       struct list_head free_clusters; /* free clusters list */
>>>>>> +     struct list_head nonfull_clusters[SWAP_NR_ORDERS];
>>>>>> +                                     /* list of cluster that contains at least one free slot */
>>>>>>       unsigned int lowest_bit;        /* index of first free in swap_map */
>>>>>>       unsigned int highest_bit;       /* index of last free in swap_map */
>>>>>>       unsigned int pages;             /* total of usable pages of swap */
>>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>>> index f70d25005d2c..e13a33664cfa 100644
>>>>>> --- a/mm/swapfile.c
>>>>>> +++ b/mm/swapfile.c
>>>>>> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>>>>>       memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>>>>>>                       SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>>>>>>
>>>>>> -     list_add_tail(&ci->list, &si->discard_clusters);
>>>>>> +     if (ci->flags)
>>>>>
>>>>> I'm not sure this is future proof; what happens if a flag is added in future
>>>>> that does not indicate that the cluster is on a list. Perhaps explicitly check
>>>>> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
>>>>
>>>> Currently flags are only used to track which list it is on.
>>>
>>> Yes, I get that it works correctly at the moment. I just don't think it's wise
>>> for the code to assume that any flag being set means its on a list; that feels
>>> fragile for future.
>>
>> ACK.
>>
>>>
>>>> BTW, this
>>>> line has been changed to check explicitly which list it is on in patch 3, the big
>>>> rewrite. I can move that line change to patch 2 if you want.
>>>
>>> That would get my vote; let's make every patch as good as it can be.
>>
>> Done.
>>
>>>
>>>>
>>>>>
>>>>>> +             list_move_tail(&ci->list, &si->discard_clusters);
>>>>>> +     else
>>>>>> +             list_add_tail(&ci->list, &si->discard_clusters);
>>>>>> +     ci->flags = 0;
>>>>>
>>>>> Bug: (I think?) the cluster ends up on the discard_clusters list and
>>>>> swap_do_scheduled_discard() calls __free_cluster() which will then call
>>>>
>>>> swap_do_scheduled_discard() delete the entry from discard list.
>>>
>>> Ahh yes, my bad!
>>>
>>>> The flag does not track the discard list state.
>>>>
>>>>> list_add_tail() to put it on the free_clusters list. But since it is on the
>>>>> discard_list at that point, shouldn't it call list_move_tail()?
>>>>
>>>> See above. Call list_move_tail() would be a mistake.
>>>>
>>>>>
>>>>>>       schedule_work(&si->discard_work);
>>>>>>  }
>>>>>>
>>>>>>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>>>>>>  {
>>>>>> +     if (ci->flags & CLUSTER_FLAG_NONFULL)
>>>>>> +             list_move_tail(&ci->list, &si->free_clusters);
>>>>>> +     else
>>>>>> +             list_add_tail(&ci->list, &si->free_clusters);
>>>>>>       ci->flags = CLUSTER_FLAG_FREE;
>>>>>> -     list_add_tail(&ci->list, &si->free_clusters);
>>>>>>  }
>>>>>>
>>>>>>  /*
>>>>>> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>>>>>>       ci->count--;
>>>>>>
>>>>>>       if (!ci->count)
>>>>>> -             free_cluster(p, ci);
>>>>>> +             return free_cluster(p, ci);
>>>>>
>>>>> nit: I'm not sure what the kernel style guide says about this, but I'm not a
>>>>> huge fan of returning void. I'd find it clearer if you just turn the below `if`
>>>>> into an `else if`.
>>>>
>>>> I try to avoid 'else if' if possible.
>>>> Changed to
>>>> if (!ci->count) {
>>>>               free_cluster(p, ci);
>>>>               return;
>>>> }
>>>
>>> ok
>>>
>>>>
>>>>>
>>>>>> +
>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>
>>>>> I find the transitions when you add and remove a cluster from the
>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>
>>>>> So you could have this situation:
>>>>>
>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>   - cpuA allocs 1 swap entry from current cluster
>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>>>>>
>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>
>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>> moving it into nonfull.
>>>
>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>> refactoring separated from behavioural changes.
>>
>> It is not a refactoring. It is a big rewrite of the swap allocator
>> using the cluster. Behavior change is expected. The goal is completely
>> removing the brute force scanning of swap_map[] array for cluster swap
>> allocation.
>>
>>>
>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>> deferring review. But sounds like it is actually required to realize the test
>>> results quoted on the cover letter?
>>
>> Yes, required because it handles the previous fall out case try_ssd()
>> failed. This big rewrite has gone through a lot of testing and bug
>> fix. It is pretty stable now. The only reason I keep it as RFC is
>> because it is not feature complete. Currently it does not do swap
>> cache reclaim. The next version will have swap cache reclaim and
>> remove the RFC.
>>
>>>
>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>
>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>> chances of multiple CPUs using the same cluster.
>>>>
>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>> sharing the cluster. Because the cluster already has previous swap
>>>> entries allocated from the previous CPU.
>>>
>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>
>> That happens to exist per cpu next pointer already. When the other CPU
>> advances to the next cluster pointer, it can cross with the other
>> CPU's next cluster pointer.
> 
> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
> cluster only.  If it doesn't do that, we should fix it.
> 
> I agree with Ryan that we should make per cpu cluster correct.  A
> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
> scan to the end of a per cpu cluster, we can put the cluster in nonfull
> list if necessary.  And, we should make it correct in this patch instead
> of later in series.  I understand that you want to make the patch itself
> simple, but it's important to make code simple to be understood too.
> Consistent design choice will do that.

I think I'm actually arguing for the opposite of what you suggest here.

As I see it, there are 2 possible approaches: either a cluster is always
considered exclusive to a single cpu when it's set as a per-cpu cluster, so it
does not appear on the nonfull list, or a cluster is considered sharable in this
case, in which case it should be added to the nonfull list.

The code at the moment sort of does both; when a cpu decides to use a cluster in
the nonfull list, it removes it from that list to make it exclusive. But as soon
as a single swap entry is freed from that cluster it is put back on the list.
This neither-one-policy-nor-the-other seems odd to me.

I think Huang, Ying is arguing to keep it always exclusive while installed as a
per-cpu cluster. I was arguing to make it always shared. Perhaps the best
approach is to implement the exclusive policy in this patch (you'd need a flag
to note if any pages were freed while in exclusive use, then when exclusive use
completes, put it back on the nonfull list if the flag was set). Then migrate to
the shared approach as part of the "big rewrite"?
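
A minimal user-space sketch of the exclusive policy described above, just to
make the flag transitions concrete. Only CLUSTER_FLAG_NONFULL comes from the
patch; CLUSTER_FLAG_EXCLUSIVE, CLUSTER_FLAG_FREED, struct model_cluster and the
model_*() helpers are hypothetical names, and list membership is modelled by
the NONFULL flag alone. The case where count drops to zero (the cluster going
back to the free list) is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

#define CLUSTER_FLAG_NONFULL    2  /* from the patch: cluster is on nonfull list */
#define CLUSTER_FLAG_EXCLUSIVE  4  /* assumed: installed as a per-cpu cluster */
#define CLUSTER_FLAG_FREED      8  /* assumed: an entry was freed while exclusive */

struct model_cluster {
	unsigned int count;  /* allocated swap entries in this cluster */
	unsigned int flags;
};

/* A cpu installs the cluster as its current cluster: take it off the list. */
static void model_take_exclusive(struct model_cluster *ci)
{
	ci->flags &= ~CLUSTER_FLAG_NONFULL;
	ci->flags |= CLUSTER_FLAG_EXCLUSIVE;
}

/* Free one entry. While exclusive, only record that a free happened. */
static void model_free_entry(struct model_cluster *ci)
{
	assert(ci->count > 0);
	ci->count--;
	if (ci->flags & CLUSTER_FLAG_EXCLUSIVE)
		ci->flags |= CLUSTER_FLAG_FREED;
	else
		ci->flags |= CLUSTER_FLAG_NONFULL;
}

/* Exclusive use completes: requeue on nonfull only if something was freed. */
static void model_release_exclusive(struct model_cluster *ci)
{
	bool freed = (ci->flags & CLUSTER_FLAG_FREED) != 0;

	ci->flags &= ~(CLUSTER_FLAG_EXCLUSIVE | CLUSTER_FLAG_FREED);
	if (freed)
		ci->flags |= CLUSTER_FLAG_NONFULL;
}
```

With this shape a free that happens while cpuA owns the cluster never makes it
visible to cpuB; the cluster only reappears on the nonfull list once cpuA has
let go of it.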

> 
>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>> share the cluster when you really need to, but try to avoid it if there are
>>> other options, and I think moving the cluster to the end of the list might be a
>>> way to help that?
>>
>> Simply moving to the end of the list can create a possible deadloop
>> when all clusters have been scanned and not available swap range
>> found.
> 
> This is another reason that we should put the cluster in
> nonfull_clusters[order--] if there are no free swap entry with "order"
> in the cluster.  It makes design complex to keep it in
> nonfull_clusters[order].
> 
>> We have tried many different approaches including moving to the end of
>> the list. It can cause more fragmentation because each CPU allocates
>> their swap slot cache (64 entries) from a different cluster.
>>
>>>> Those behaviors will be fine
>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>
>> Again, I want to keep it simple here so patch 3 can land.
>>
>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>
>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>> to introduce the nonfull list.
>>>>
>>>>>
>>>>>> +             ci->flags |= CLUSTER_FLAG_NONFULL;
>>>>>> +     }
>>>>>>  }
>>>>>>
>>>>>>  /*
>>>>>> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>>>>       if (tmp == SWAP_NEXT_INVALID) {
>>>>>>               if (!list_empty(&si->free_clusters)) {
>>>>>>                       ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>>>>>> +                     list_del(&ci->list);
>>>>>> +                     spin_lock(&ci->lock);
>>>>>> +                     ci->order = order;
>>>>>> +                     ci->flags = 0;
>>>>>> +                     spin_unlock(&ci->lock);
>>>>>> +                     tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>>>>>> +             } else if (!list_empty(&si->nonfull_clusters[order])) {
>>>>>> +                     ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
>>>>>> +                     list_del(&ci->list);
>>>>>> +                     spin_lock(&ci->lock);
>>>>>> +                     ci->flags = 0;
>>>>>> +                     spin_unlock(&ci->lock);
>>>>>>                       tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>>>>>>               } else if (!list_empty(&si->discard_clusters)) {
>>>>>>                       /*
>>>>>> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>>>>       ci = lock_cluster(si, offset);
>>>>>>       memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>>>>>>       ci->count = 0;
>>>>>> +     ci->order = 0;
>>>>>>       ci->flags = 0;
>>>>>
>>>>> Wonder if it would be better to put this in __free_cluster()?
>>>>
>>>> Both flags and order were moved to __free_cluster() in patch 3 of this
>>>> series. The order is best assigned together with flags when the
>>>> cluster changes the list.
>>>>
>>>> Thanks for the review. The patch 3 big rewrite is the heavy lifting.
>>>
>>> OK, but sounds like patch 3 isn't really RFC after all, but a crucial part of
>>> the series? I'll try to take a look at it today.
>>
>> Yes, it is the cluster swap allocator big rewrite.
>>
>> Thank you for taking a look.
> 
> --
> Best Regards,
> Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 18/07/2024 08:53, Huang, Ying wrote:
>> Chris Li <chrisl@kernel.org> writes:
>> 
>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 11/07/2024 08:29, Chris Li wrote:

[snip]

>>>>>>> +
>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>
>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>
>>>>>> So you could have this situation:
>>>>>>
>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>   - cpuA allocs 1 swap entry from current cluster
>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>>>>>>
>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>
>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>> moving it into nonfull.
>>>>
>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>> refactoring separated from behavioural changes.
>>>
>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>> using the cluster. Behavior change is expected. The goal is completely
>>> removing the brute force scanning of swap_map[] array for cluster swap
>>> allocation.
>>>
>>>>
>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>> deferring review. But sounds like it is actually required to realize the test
>>>> results quoted on the cover letter?
>>>
>>> Yes, required because it handles the previous fall out case try_ssd()
>>> failed. This big rewrite has gone through a lot of testing and bug
>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>> because it is not feature complete. Currently it does not do swap
>>> cache reclaim. The next version will have swap cache reclaim and
>>> remove the RFC.
>>>
>>>>
>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>
>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>> chances of multiple CPUs using the same cluster.
>>>>>
>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>> entries allocated from the previous CPU.
>>>>
>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>
>>> That happens to exist per cpu next pointer already. When the other CPU
>>> advances to the next cluster pointer, it can cross with the other
>>> CPU's next cluster pointer.
>> 
>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
>> cluster only.  If it doesn't do that, we should fix it.
>> 
>> I agree with Ryan that we should make per cpu cluster correct.  A
>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>> list if necessary.  And, we should make it correct in this patch instead
>> of later in series.  I understand that you want to make the patch itself
>> simple, but it's important to make code simple to be understood too.
>> Consistent design choice will do that.
>
> I think I'm actually arguing for the opposite of what you suggest here.

Sorry, I misunderstood your words.

> As I see it, there are 2 possible approaches; either a cluster is always
> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
> does not appear on the nonfull list. Or a cluster is considered sharable in this
> case, in which case it should be added to the nonfull list.
>
> The code at the moment sort of does both; when a cpu decides to use a cluster in
> the nonfull list, it removes it from that list to make it exclusive. But as soon
> as a single swap entry is freed from that cluster it is put back on the list.
> This neither-one-policy-nor-the-other seems odd to me.
>
> I think Huang, Ying is arguing to keep it always exclusive while installed as a
> per-cpu cluster.

Yes.

> I was arguing to make it always shared. Perhaps the best
> approach is to implement the exclusive policy in this patch (you'd need a flag
> to note if any pages were freed while in exclusive use, then when exclusive use
> completes, put it back on the nonfull list if the flag was set). Then migrate to
> the shared approach as part of the "big rewrite"?
>> 
>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>> share the cluster when you really need to, but try to avoid it if there are
>>>> other options, and I think moving the cluster to the end of the list might be a
>>>> way to help that?
>>>
>>> Simply moving to the end of the list can create a possible deadloop
>>> when all clusters have been scanned and not available swap range
>>> found.

I also think that the shared approach has a dead loop issue.

>> This is another reason that we should put the cluster in
>> nonfull_clusters[order--] if there are no free swap entry with "order"
>> in the cluster.  It makes design complex to keep it in
>> nonfull_clusters[order].
>> 
>>> We have tried many different approaches including moving to the end of
>>> the list. It can cause more fragmentation because each CPU allocates
>>> their swap slot cache (64 entries) from a different cluster.
>>>
>>>>> Those behaviors will be fine
>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>
>>> Again, I want to keep it simple here so patch 3 can land.
>>>
>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>
>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>> to introduce the nonfull list.
>>>>>

[snip]

--
Best Regards,
Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Ryan Roberts 1 year, 6 months ago

On 22/07/2024 03:14, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 18/07/2024 08:53, Huang, Ying wrote:
>>> Chris Li <chrisl@kernel.org> writes:
>>>
>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
> 
> [snip]
> 
>>>>>>>> +
>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>
>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>
>>>>>>> So you could have this situation:
>>>>>>>
>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>   - cpuA allocs 1 swap entry from current cluster
>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>
>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>
>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>> moving it into nonfull.
>>>>>
>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>> refactoring separated from behavioural changes.
>>>>
>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>> using the cluster. Behavior change is expected. The goal is completely
>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>> allocation.
>>>>
>>>>>
>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>> results quoted on the cover letter?
>>>>
>>>> Yes, required because it handles the previous fall out case try_ssd()
>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>>> because it is not feature complete. Currently it does not do swap
>>>> cache reclaim. The next version will have swap cache reclaim and
>>>> remove the RFC.
>>>>
>>>>>
>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>
>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>
>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>>> entries allocated from the previous CPU.
>>>>>
>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>
>>>> That happens to exist per cpu next pointer already. When the other CPU
>>>> advances to the next cluster pointer, it can cross with the other
>>>> CPU's next cluster pointer.
>>>
>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
>>> cluster only.  If it doesn't do that, we should fix it.
>>>
>>> I agree with Ryan that we should make per cpu cluster correct.  A
>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>> list if necessary.  And, we should make it correct in this patch instead
>>> of later in series.  I understand that you want to make the patch itself
>>> simple, but it's important to make code simple to be understood too.
>>> Consistent design choice will do that.
>>
>> I think I'm actually arguing for the opposite of what you suggest here.
> 
> Sorry, I misunderstood your words.
> 
>> As I see it, there are 2 possible approaches; either a cluster is always
>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>> case, in which case it should be added to the nonfull list.
>>
>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>> as a single swap entry is freed from that cluster it is put back on the list.
>> This neither-one-policy-nor-the-other seems odd to me.
>>
>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>> per-cpu cluster.
> 
> Yes.
> 
>> I was arguing to make it always shared. Perhaps the best
>> approach is to implement the exclusive policy in this patch (you'd need a flag
>> to note if any pages were freed while in exclusive use, then when exclusive use
>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>> the shared approach as part of the "big rewrite"?
>>>
>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>> way to help that?
>>>>
>>>> Simply moving to the end of the list can create a possible deadloop
>>>> when all clusters have been scanned and not available swap range
>>>> found.
> 
> I also think that the shared approach has dead loop issue.

What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
won't know when to stop dequeuing/requeuing clusters on the nonfull list and will
go forever? That's surely just an implementation issue to solve? It's not a
reason to avoid the design principle; if we agree that maintaining sharability
of the cluster is preferred then the code must be written to guard against the
dead loop problem. It could be done by remembering the first cluster you
dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
to it. (I think holding the si lock will protect against concurrently freeing
the cluster so it should definitely remain in the list?).
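
A rough user-space sketch of that guard, assuming the list cannot change while
we walk it (i.e. the si lock is held). struct node is a minimal stand-in for
list_head, and struct model_cluster, free_slots and model_pick_nonfull() are
hypothetical names, not anything in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct node { struct node *prev, *next; };

struct model_cluster {
	struct node list;         /* must stay the first member (see cast below) */
	unsigned int free_slots;  /* assumed: free entries usable for the order */
};

static void list_init(struct node *h) { h->prev = h->next = h; }

static void list_del_node(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static void list_add_tail_node(struct node *n, struct node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/*
 * Rotate through the nonfull list, requeueing each cluster at the tail.
 * Remember the first cluster requeued and stop when it comes round again,
 * so a list with no usable cluster terminates after one full pass.
 */
static struct model_cluster *model_pick_nonfull(struct node *head, unsigned int nr)
{
	struct model_cluster *first = NULL;

	assert(nr > 0);
	while (head->next != head) {
		struct model_cluster *ci = (struct model_cluster *)head->next;

		if (ci == first)
			return NULL;  /* full pass done, nothing usable */
		if (!first)
			first = ci;
		list_del_node(&ci->list);
		list_add_tail_node(&ci->list, head);
		if (ci->free_slots >= nr)
			return ci;
	}
	return NULL;
}
```

The walk is bounded by the list length, so it cannot spin forever, but it is
still linear in the number of nonfull clusters when nothing fits.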

Which actually makes me wonder; what is the mechanism that prevents the current
per-cpu cluster from being freed? Is that just handled by the conflict detection
thingy? Perhaps that would be better handled with a flag to mark it in use, or
by raising count while it's current. (If Chris has implemented that in the "big
rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )

> 
>>> This is another reason that we should put the cluster in
>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>>> in the cluster.  It makes design complex to keep it in
>>> nonfull_clusters[order].
>>>
>>>> We have tried many different approaches including moving to the end of
>>>> the list. It can cause more fragmentation because each CPU allocates
>>>> their swap slot cache (64 entries) from a different cluster.
>>>>
>>>>>> Those behaviors will be fine
>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>
>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>
>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>
>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>> to introduce the nonfull list.
>>>>>>
> 
> [snip]
> 
> --
> Best Regards,
> Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 22/07/2024 03:14, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 18/07/2024 08:53, Huang, Ying wrote:
>>>> Chris Li <chrisl@kernel.org> writes:
>>>>
>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>
>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>> 
>> [snip]
>> 
>>>>>>>>> +
>>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>>
>>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>>
>>>>>>>> So you could have this situation:
>>>>>>>>
>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>>
>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>>
>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>>> moving it into nonfull.
>>>>>>
>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>>> refactoring separated from behavioural changes.
>>>>>
>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>>> using the cluster. Behavior change is expected. The goal is completely
>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>>> allocation.
>>>>>
>>>>>>
>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>>> results quoted on the cover letter?
>>>>>
>>>>> Yes, required because it handles the previous fall out case try_ssd()
>>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>>>> because it is not feature complete. Currently it does not do swap
>>>>> cache reclaim. The next version will have swap cache reclaim and
>>>>> remove the RFC.
>>>>>
>>>>>>
>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>>
>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>>
>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>>>> entries allocated from the previous CPU.
>>>>>>
>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>>
>>>>> That happens to exist per cpu next pointer already. When the other CPU
>>>>> advances to the next cluster pointer, it can cross with the other
>>>>> CPU's next cluster pointer.
>>>>
>>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
>>>> cluster only.  If it doesn't do that, we should fix it.
>>>>
>>>> I agree with Ryan that we should make per cpu cluster correct.  A
>>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>>> list if necessary.  And, we should make it correct in this patch instead
>>>> of later in series.  I understand that you want to make the patch itself
>>>> simple, but it's important to make code simple to be understood too.
>>>> Consistent design choice will do that.
>>>
>>> I think I'm actually arguing for the opposite of what you suggest here.
>> 
>> Sorry, I misunderstood your words.
>> 
>>> As I see it, there are 2 possible approaches; either a cluster is always
>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>>> case, in which case it should be added to the nonfull list.
>>>
>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>>> as a single swap entry is freed from that cluster it is put back on the list.
>>> This neither-one-policy-nor-the-other seems odd to me.
>>>
>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>>> per-cpu cluster.
>> 
>> Yes.
>> 
>>> I was arguing to make it always shared. Perhaps the best
>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>>> to note if any pages were freed while in exclusive use, then when exclusive use
>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>>> the shared approach as part of the "big rewrite"?
>>>>
>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>>> way to help that?
>>>>>
>>>>> Simply moving to the end of the list can create a possible deadloop
>>>>> when all clusters have been scanned and not available swap range
>>>>> found.
>> 
>> I also think that the shared approach has dead loop issue.
>
> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
> go forever? That's surely just an implementation issue to solve? It's not a
> reason to avoid the design principle; if we agree that maintaining sharability
> of the cluster is preferred then the code must be written to guard against the
> dead loop problem. It could be done by remembering the first cluster you
> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
> to it. (I think holding the si lock will protect against concurrently freeing
> the cluster so it should definitely remain in the list?).

I believe that you can find some way to avoid the dead loop issue,
although your suggestion may kill performance by looping over a long list
of nonfull clusters.  And, I understand that in some situations it may
be better to share clusters among CPUs.  So my suggestion is,

- Make swap_cluster_info->order more accurate; don't pretend that we
  have free swap entries of that order after we are sure that we
  don't.
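
One reading of "more accurate", as a user-space sketch: after a failed scan,
recompute the largest order the cluster can still satisfy and file it under
that order. The map is a swap_map-like array where nonzero means in use.
MODEL_CLUSTER_SLOTS, model_run_free() and model_best_order() are hypothetical,
and the brute-force scan is for illustration only; real code would want to
track this incrementally.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_CLUSTER_SLOTS 16  /* assumed tiny cluster, for illustration only */

/* True if the naturally aligned run of (1 << order) slots at 'start' is free. */
static bool model_run_free(const unsigned char *map, unsigned int start,
			   unsigned int order)
{
	for (unsigned int i = 0; i < (1u << order); i++)
		if (map[start + i])
			return false;
	return true;
}

/*
 * Largest order <= max_order with at least one aligned free run,
 * or -1 if the cluster has no free slot at all.
 */
static int model_best_order(const unsigned char *map, unsigned int max_order)
{
	assert((1u << max_order) <= MODEL_CLUSTER_SLOTS);
	for (int order = (int)max_order; order >= 0; order--)
		for (unsigned int s = 0; s < MODEL_CLUSTER_SLOTS; s += 1u << order)
			if (model_run_free(map, s, (unsigned int)order))
				return order;
	return -1;
}
```

For example, with only slots 1, 2 and 3 free, model_best_order(map, 2) returns
1: there are 3 free pages but no aligned order-2 run, so the cluster would move
to the order-1 nonfull list rather than stay on the order-2 one.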

My question is whether it's so important to share the per-cpu cluster
among CPUs.  I suggest starting with a simple design, that is, a per-CPU
cluster will not be shared among CPUs in most cases.

Another choice for sharing is that when we run short of free swap space,
we disable the per-CPU cluster and allocate from the shared non-full
cluster list directly.
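
As a sketch of that ordering (user-space model, all names hypothetical):
"short of free swap space" is taken here to mean that the current per-CPU
cluster has no room and the free list is empty, which is an assumption rather
than a threshold anyone has agreed on.

```c
#include <assert.h>
#include <stdbool.h>

enum model_source {
	MODEL_SRC_PERCPU,          /* keep using the cpu's exclusive cluster */
	MODEL_SRC_FREE,            /* take a new exclusive cluster from the free list */
	MODEL_SRC_SHARED_NONFULL,  /* per-CPU disabled: use the shared nonfull list */
	MODEL_SRC_NONE,            /* nothing available for this order */
};

struct model_state {
	bool percpu_has_room;      /* current per-CPU cluster can satisfy the order */
	unsigned int nr_free;      /* clusters on the free list */
	unsigned int nr_nonfull;   /* clusters on nonfull_clusters[order] */
};

/* Prefer exclusive clusters; share only once no exclusive option is left. */
static enum model_source model_choose_source(const struct model_state *st)
{
	assert(st);
	if (st->percpu_has_room)
		return MODEL_SRC_PERCPU;
	if (st->nr_free)
		return MODEL_SRC_FREE;
	if (st->nr_nonfull)
		return MODEL_SRC_SHARED_NONFULL;
	return MODEL_SRC_NONE;
}
```

In the shared case the cluster stays on the nonfull list while entries are
allocated from it, so no cluster is ever both exclusive and listed.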

> Which actually makes me wonder; what is the mechanism that prevents the current
> per-cpu cluster from being freed? Is that just handled by the conflict detection
> thingy? Perhaps that would be better handled with a flag to mark it in use, or
> raise count when its current. (If Chris has implemented that in the "big
> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )

Yes.  We may need a flag for that.

>> 
>>>> This is another reason that we should put the cluster in
>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>>>> in the cluster.  It makes design complex to keep it in
>>>> nonfull_clusters[order].
>>>>
>>>>> We have tried many different approaches including moving to the end of
>>>>> the list. It can cause more fragmentation because each CPU allocates
>>>>> their swap slot cache (64 entries) from a different cluster.
>>>>>
>>>>>>> Those behaviors will be fine
>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>>
>>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>>
>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>>
>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>>> to introduce the nonfull list.
>>>>>>>
>> 
>> [snip]

--
Best Regards,
Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Ryan Roberts 1 year, 6 months ago

On 22/07/2024 09:49, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 22/07/2024 03:14, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>>>>> Chris Li <chrisl@kernel.org> writes:
>>>>>
>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>>
>>> [snip]
>>>
>>>>>>>>>> +
>>>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>>>
>>>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>>>
>>>>>>>>> So you could have this situation:
>>>>>>>>>
>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>>>
>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>>>
>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>>>> moving it into nonfull.
>>>>>>>
>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>>>> refactoring separated from behavioural changes.
>>>>>>
>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>>>> using the cluster. Behavior change is expected. The goal is completely
>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>>>> allocation.
>>>>>>
>>>>>>>
>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>>>> results quoted on the cover letter?
>>>>>>
>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>>>>> because it is not feature complete. Currently it does not do swap
>>>>>> cache reclaim. The next version will have swap cache reclaim and
>>>>>> remove the RFC.
>>>>>>
>>>>>>>
>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>>>
>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>>>
>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>>>>> entries allocated from the previous CPU.
>>>>>>>
>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>>>
>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>>>>>> advances to the next cluster pointer, it can cross with the other
>>>>>> CPU's next cluster pointer.
>>>>>
>>>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
>>>>> cluster only.  If it doesn't do that, we should fix it.
>>>>>
>>>>> I agree with Ryan that we should make per cpu cluster correct.  A
>>>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>>>> list if necessary.  And, we should make it correct in this patch instead
>>>>> of later in series.  I understand that you want to make the patch itself
>>>>> simple, but it's important to make code simple to be understood too.
>>>>> Consistent design choice will do that.
>>>>
>>>> I think I'm actually arguing for the opposite of what you suggest here.
>>>
>>> Sorry, I misunderstood your words.
>>>
>>>> As I see it, there are 2 possible approaches; either a cluster is always
>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>>>> case, in which case it should be added to the nonfull list.
>>>>
>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>>>> as a single swap entry is freed from that cluster it is put back on the list.
>>>> This neither-one-policy-nor-the-other seems odd to me.
>>>>
>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>>>> per-cpu cluster.
>>>
>>> Yes.
>>>
>>>> I was arguing to make it always shared. Perhaps the best
>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>>>> the shared approach as part of the "big rewrite"?
>>>>>
>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>>>> way to help that?
>>>>>>
>>>>>> Simply moving to the end of the list can create a possible deadloop
>>>>>> when all clusters have been scanned and not available swap range
>>>>>> found.
>>>
>>> I also think that the shared approach has dead loop issue.
>>
>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
>> go forever? That's surely just an implementation issue to solve? It's not a
>> reason to avoid the design principle; if we agree that maintaining sharability
>> of the cluster is preferred then the code must be written to guard against the
>> dead loop problem. It could be done by remembering the first cluster you
>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>> to it. (I think holding the si lock will protect against concurrently freeing
>> the cluster so it should definitely remain in the list?).
> 
> I believe that you can find some way to avoid the dead loop issue,
> although your suggestion may kill the performance via looping a long list
> of nonfull clusters.  

I don't agree; if the clusters are considered exclusive (i.e. removed from the
list when made current for a cpu), that only reduces the size of the list by a
maximum of the number of CPUs in the system, which I suspect is pretty small
compared to the number of nonfull clusters.
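
To make the "remember the first cluster and stop when you get back to it"
guard concrete, here is a minimal userspace sketch. It is not the kernel
code: struct cluster, struct nonfull_list and scan_nonfull() are
hypothetical stand-ins, a plain free count replaces the real per-order
scan, and locking is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct cluster {
	struct cluster *next;	/* singly linked nonfull list */
	unsigned int free;	/* free swap entries in this cluster */
};

struct nonfull_list {
	struct cluster *head;
	struct cluster *tail;
};

static void list_push_tail(struct nonfull_list *l, struct cluster *c)
{
	c->next = NULL;
	if (l->tail)
		l->tail->next = c;
	else
		l->head = c;
	l->tail = c;
}

static struct cluster *list_pop_head(struct nonfull_list *l)
{
	struct cluster *c = l->head;

	if (!c)
		return NULL;
	l->head = c->next;
	if (!l->head)
		l->tail = NULL;
	c->next = NULL;
	return c;
}

/*
 * Look for a cluster with at least `need` free entries. Every visited
 * cluster is requeued at the tail, so it stays on the list (sharable).
 * The loop stops as soon as the first visited cluster is back at the
 * head, so it makes at most one pass and cannot spin forever.
 * *visited reports how many clusters were examined.
 */
static struct cluster *scan_nonfull(struct nonfull_list *l, unsigned int need,
				    unsigned int *visited)
{
	struct cluster *first = NULL;
	struct cluster *c;

	*visited = 0;
	while ((c = l->head) != NULL && c != first) {
		if (!first)
			first = c;
		list_pop_head(l);
		list_push_tail(l, c);
		(*visited)++;
		if (c->free >= need)
			return c;
	}
	return NULL;
}
```

In this sketch a failed scan costs one walk of the list and leaves the list
order unchanged; whether that walk is cheap enough in practice is exactly
the open question in this thread.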

> And, I understand that in some situations it may
> be better to share clusters among CPUs.  So my suggestion is,
> 
> - Make swap_cluster_info->order more accurate, don't pretend that we
>   have free swap entries with that order even after we are sure that we
>   haven't.

Is this patch pretending that today? I don't think so? But I agree that a
cluster should only be on the per-order nonfull list if we know there are at
least enough free swap entries in that cluster to cover the order. Of course
that doesn't tell us for sure because they may not be contiguous.
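
As an illustration of why the free count is necessary but not sufficient,
here is a small userspace sketch. CLUSTER_SLOTS, has_enough_free() and
has_free_run() are made-up names for this example, and the 512-slot cluster
size is an assumption, not something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define CLUSTER_SLOTS 512	/* assumed cluster size, illustration only */

/* Necessary condition: enough free slots in total (count = used slots). */
static bool has_enough_free(unsigned int count, unsigned int order)
{
	assert(count <= CLUSTER_SLOTS);
	return CLUSTER_SLOTS - count >= (1u << order);
}

/*
 * Sufficient condition: at least one naturally aligned run of
 * (1 << order) free slots. used[i] != 0 means slot i is in use.
 */
static bool has_free_run(const unsigned char *used, unsigned int order)
{
	unsigned int nr = 1u << order;
	unsigned int base, i;

	for (base = 0; base + nr <= CLUSTER_SLOTS; base += nr) {
		for (i = 0; i < nr; i++)
			if (used[base + i])
				break;
		if (i == nr)
			return true;
	}
	return false;
}
```

With every other slot in use, has_enough_free() accepts order-2 (256 slots
are free) while has_free_run() rejects it, because no four aligned slots
are free together.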

> 
> My question is whether it's so important to share the per-cpu cluster
> among CPUs? 

My rationale for sharing is that the preference previously has been to favour
efficient use of swap space; we don't want to fail a request for allocation of a
given order if there are actually slots available just because they have been
reserved by another CPU. And I'm still asserting that it should be ~zero cost to
do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
actually help improve allocation success, then I'm happy to take the exclusive
approach.

> I suggest to start with simple design, that is, per-CPU
> cluster will not be shared among CPUs in most cases.

I'm all for starting simple; I think that's what I already proposed (exclusive
in this patch, then shared in the "big rewrite"). I'm just objecting to the
current half-and-half policy in this patch.
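
For reference, a minimal sketch of the exclusive policy with a
"freed while exclusive" flag, as I understand the proposal. All field and
function names below are hypothetical; this is a userspace model, not the
patch, and it ignores locking and the actual list manipulation.

```c
#include <assert.h>
#include <stdbool.h>

struct cluster {
	unsigned int count;	/* used slots */
	bool on_nonfull;	/* hypothetical: linked on the nonfull list */
	bool exclusive;		/* hypothetical: current cluster of some CPU */
	bool freed_while_excl;	/* hypothetical: a slot was freed meanwhile */
};

/* A CPU installs the cluster as its current cluster. */
static void cluster_take_exclusive(struct cluster *ci)
{
	ci->on_nonfull = false;
	ci->exclusive = true;
	ci->freed_while_excl = false;
}

/* Free one slot; never requeue a cluster that a CPU currently owns. */
static void cluster_free_slot(struct cluster *ci)
{
	assert(ci->count > 0);
	ci->count--;
	if (ci->exclusive)
		ci->freed_while_excl = true;	/* defer the requeue */
	else
		ci->on_nonfull = true;
}

/* The CPU moves on to another cluster. */
static void cluster_release_exclusive(struct cluster *ci)
{
	ci->exclusive = false;
	if (ci->freed_while_excl)
		ci->on_nonfull = true;
	ci->freed_while_excl = false;
}
```

The point of the flag is that the free path never puts an owned cluster
back on the nonfull list, which is what removes the half-and-half behaviour.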

> 
> Another choice for sharing is when we run short of free swap space, we
> disable per-CPU cluster and allocate from the shared non-full cluster
> list directly.
> 
>> Which actually makes me wonder; what is the mechanism that prevents the current
>> per-cpu cluster from being freed? Is that just handled by the conflict detection
>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
>> raise count when its current. (If Chris has implemented that in the "big
>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
> 
> Yes.  We may need a flag for that.
> 
>>>
>>>>> This is another reason that we should put the cluster in
>>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>>>>> in the cluster.  It makes design complex to keep it in
>>>>> nonfull_clusters[order].
>>>>>
>>>>>> We have tried many different approaches including moving to the end of
>>>>>> the list. It can cause more fragmentation because each CPU allocates
>>>>>> their swap slot cache (64 entries) from a different cluster.
>>>>>>
>>>>>>>> Those behaviors will be fine
>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>>>
>>>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>>>
>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>>>
>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>>>> to introduce the nonfull list.
>>>>>>>>
>>>
>>> [snip]
> 
> --
> Best Regards,
> Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 22/07/2024 09:49, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 22/07/2024 03:14, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>>>>>> Chris Li <chrisl@kernel.org> writes:
>>>>>>
>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>
>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>>>
>>>> [snip]
>>>>
>>>>>>>>>>> +
>>>>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>>>>
>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>>>>
>>>>>>>>>> So you could have this situation:
>>>>>>>>>>
>>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
>>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>>>>
>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>>>>
>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>>>>> moving it into nonfull.
>>>>>>>>
>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>>>>> refactoring separated from behavioural changes.
>>>>>>>
>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>>>>> allocation.
>>>>>>>
>>>>>>>>
>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>>>>> results quoted on the cover letter?
>>>>>>>
>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>>>>>> because it is not feature complete. Currently it does not do swap
>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>>>>>>> remove the RFC.
>>>>>>>
>>>>>>>>
>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>>>>
>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>>>>
>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>>>>>> entries allocated from the previous CPU.
>>>>>>>>
>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>>>>
>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>>>>>>> advances to the next cluster pointer, it can cross with the other
>>>>>>> CPU's next cluster pointer.
>>>>>>
>>>>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
>>>>>> cluster only.  If it doesn't do that, we should fix it.
>>>>>>
>>>>>> I agree with Ryan that we should make per cpu cluster correct.  A
>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>>>>> list if necessary.  And, we should make it correct in this patch instead
>>>>>> of later in series.  I understand that you want to make the patch itself
>>>>>> simple, but it's important to make code simple to be understood too.
>>>>>> Consistent design choice will do that.
>>>>>
>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>>>>
>>>> Sorry, I misunderstood your words.
>>>>
>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>>>>> case, in which case it should be added to the nonfull list.
>>>>>
>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>>>>> This neither-one-policy-nor-the-other seems odd to me.
>>>>>
>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>>>>> per-cpu cluster.
>>>>
>>>> Yes.
>>>>
>>>>> I was arguing to make it always shared. Perhaps the best
>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>>>>> the shared approach as part of the "big rewrite"?
>>>>>>
>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>>>>> way to help that?
>>>>>>>
>>>>>>> Simply moving to the end of the list can create a possible deadloop
>>>>>>> when all clusters have been scanned and not available swap range
>>>>>>> found.
>>>>
>>>> I also think that the shared approach has dead loop issue.
>>>
>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
>>> go forever? That's surely just an implementation issue to solve? It's not a
>>> reason to avoid the design principle; if we agree that maintaining sharability
>>> of the cluster is preferred then the code must be written to guard against the
>>> dead loop problem. It could be done by remembering the first cluster you
>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>>> to it. (I think holding the si lock will protect against concurrently freeing
>>> the cluster so it should definitely remain in the list?).
>> 
>> I believe that you can find some way to avoid the dead loop issue,
>> although your suggestion may kill the performance via looping a long list
>> of nonfull clusters.  
>
> I don't agree; If the clusters are considered exclusive (i.e. removed from the
> list when made current for a cpu), that only reduces the size of the list by a
> maximum of the number of CPUs in the system, which I suspect is pretty small
> compared to the number of nonfull clusters.

Anyway, this depends on the details.  If we cannot allocate an order-N swap
entry from the cluster, we should remove it from the nonfull list for
order-N (this is the behavior of this patch too).  Your original
suggestion appears to be to always keep every order-N cluster on the
nonfull list for order-N unless the number of free swap entries is
less than 1<<N.

>> And, I understand that in some situations it may
>> be better to share clusters among CPUs.  So my suggestion is,
>> 
>> - Make swap_cluster_info->order more accurate, don't pretend that we
>>   have free swap entries with that order even after we are sure that we
>>   haven't.
>
> Is this patch pretending that today? I don't think so?

IIUC, in this patch swap_cluster_info->order is still "N" even if we are
sure that there is no free order-N swap entry in the cluster.

> But I agree that a
> cluster should only be on the per-order nonfull list if we know there are at
> least enough free swap entries in that cluster to cover the order. Of course
> that doesn't tell us for sure because they may not be contiguous.

We can check that when freeing a swap entry, by checking the adjacent swap
entries.  IMHO, the performance should be acceptable.
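
A rough userspace sketch of such a check follows. free_completes_block()
is a hypothetical helper, the byte array only imitates swap_map[] (nonzero
means in use), and the caller is assumed to pass an array that covers the
whole aligned block.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Free slot `idx`, then report whether the naturally aligned block of
 * (1 << order) slots containing it is now entirely free. Only the
 * neighbours inside that block are examined, so the cost is bounded
 * by 1 << order.
 */
static bool free_completes_block(unsigned char *map, unsigned int idx,
				 unsigned int order)
{
	unsigned int nr = 1u << order;
	unsigned int base = idx & ~(nr - 1);
	unsigned int i;

	assert(order < 16);	/* arbitrary sanity bound for this sketch */
	map[idx] = 0;
	for (i = 0; i < nr; i++)
		if (map[base + i])
			return false;
	return true;
}
```

When this returns true the cluster could be put back on the nonfull list
for that order; when it returns false the cluster may still hold free
entries, just not an order-N one at that position.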

>> 
>> My question is whether it's so important to share the per-cpu cluster
>> among CPUs? 
>
> My rationale for sharing is that the preference previously has been to favour
> efficient use of swap space; we don't want to fail a request for allocation of a
> given order if there are actually slots available just because they have been
> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> actually help improve allocation success, then I'm happy to take the exclusive
> approach.
>
>> I suggest to start with simple design, that is, per-CPU
>> cluster will not be shared among CPUs in most cases.
>
> I'm all for starting simple; I think that's what I already proposed (exclusive
> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> current half-and-half policy in this patch.

Sounds good to me.  We can start with the exclusive solution and then
evaluate whether the shared solution is good.

>> 
>> Another choice for sharing is when we run short of free swap space, we
>> disable per-CPU cluster and allocate from the shared non-full cluster
>> list directly.
>> 
>>> Which actually makes me wonder; what is the mechanism that prevents the current
>>> per-cpu cluster from being freed? Is that just handled by the conflict detection
>>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
>>> raise count when its current. (If Chris has implemented that in the "big
>>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
>> 
>> Yes.  We may need a flag for that.
>> 
>>>>
>>>>>> This is another reason that we should put the cluster in
>>>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>>>>>> in the cluster.  It makes design complex to keep it in
>>>>>> nonfull_clusters[order].
>>>>>>
>>>>>>> We have tried many different approaches including moving to the end of
>>>>>>> the list. It can cause more fragmentation because each CPU allocates
>>>>>>> their swap slot cache (64 entries) from a different cluster.
>>>>>>>
>>>>>>>>> Those behaviors will be fine
>>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>>>>
>>>>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>>>>
>>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>>>>
>>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>>>>> to introduce the nonfull list.
>>>>>>>>>
>>>>
>>>> [snip]

--
Best Regards,
Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Ryan Roberts 1 year, 6 months ago

On 23/07/2024 07:27, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 22/07/2024 09:49, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> On 22/07/2024 03:14, Huang, Ying wrote:
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>
>>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>>>>>>> Chris Li <chrisl@kernel.org> writes:
>>>>>>>
>>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>>>>>>>> +
>>>>>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>>>>>
>>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>>>>>
>>>>>>>>>>> So you could have this situation:
>>>>>>>>>>>
>>>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
>>>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>>>>>
>>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>>>>>
>>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>>>>>> moving it into nonfull.
>>>>>>>>>
>>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>>>>>> refactoring separated from behavioural changes.
>>>>>>>>
>>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>>>>>> allocation.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>>>>>> results quoted on the cover letter?
>>>>>>>>
>>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>>>>>>> because it is not feature complete. Currently it does not do swap
>>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>>>>>>>> remove the RFC.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>>>>>
>>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>>>>>
>>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>>>>>>> entries allocated from the previous CPU.
>>>>>>>>>
>>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>>>>>
>>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>>>>>>>> advances to the next cluster pointer, it can cross with the other
>>>>>>>> CPU's next cluster pointer.
>>>>>>>
>>>>>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
>>>>>>> cluster only.  If it doesn't do that, we should fix it.
>>>>>>>
>>>>>>> I agree with Ryan that we should make per cpu cluster correct.  A
>>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
>>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>>>>>> list if necessary.  And, we should make it correct in this patch instead
>>>>>>> of later in series.  I understand that you want to make the patch itself
>>>>>>> simple, but it's important to make code simple to be understood too.
>>>>>>> Consistent design choice will do that.
>>>>>>
>>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>>>>>
>>>>> Sorry, I misunderstood your words.
>>>>>
>>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>>>>>> case, in which case it should be added to the nonfull list.
>>>>>>
>>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>>>>>> This neither-one-policy-nor-the-other seems odd to me.
>>>>>>
>>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>>>>>> per-cpu cluster.
>>>>>
>>>>> Yes.
>>>>>
>>>>>> I was arguing to make it always shared. Perhaps the best
>>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>>>>>> the shared approach as part of the "big rewrite"?
>>>>>>>
>>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>>>>>> way to help that?
>>>>>>>>
>>>>>>>> Simply moving to the end of the list can create a possible deadloop
>>>>>>>> when all clusters have been scanned and not available swap range
>>>>>>>> found.
>>>>>
>>>>> I also think that the shared approach has dead loop issue.
>>>>
>>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
>>>> go forever? That's surely just an implementation issue to solve? It's not a
>>>> reason to avoid the design principle; if we agree that maintaining sharability
>>>> of the cluster is preferred then the code must be written to guard against the
>>>> dead loop problem. It could be done by remembering the first cluster you
>>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>>>> to it. (I think holding the si lock will protect against concurrently freeing
>>>> the cluster so it should definitely remain in the list?).
>>>
>>> I believe that you can find some way to avoid the dead loop issue,
>>> although your suggestion may kill the performance via looping a long list
>>> of nonfull clusters.  
>>
>> I don't agree; If the clusters are considered exclusive (i.e. removed from the
>> list when made current for a cpu), that only reduces the size of the list by a
>> maximum of the number of CPUs in the system, which I suspect is pretty small
>> compared to the number of nonfull clusters.
> 
> Anyway, this depends on details.  If we cannot allocate a order-N swap
> entry from the cluster, we should remove it from the nonfull list for
> order-N (This is the behavior of this patch too). 

Yes that's a good point, and I concede it is more difficult to detect that
condition if the cluster is shared. I suspect that with a bit of thinking, we
could find a way though.

> Your original
> suggestion appears like that you want to keep all cluster with order-N
> on the nonfull list for order-N always unless the number of free swap
> entry is less than 1<<N.

Well I think that's certainly one of the conditions for removing it. But agree
that if a full scan of the cluster has been performed and no swap entries have
been freed since the scan started then it should also be removed from the list.
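
To make those two removal conditions concrete, here is a minimal userspace
sketch. Everything in it is an assumption for illustration only: the struct, the
`free_gen` counter and the `sketch_` names are hypothetical and are not taken
from the patch or from mm/swapfile.c.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_CLUSTER_SLOTS 512u   /* hypothetical number of slots per cluster */
#define SKETCH_NR_ORDERS     10u    /* hypothetical number of supported orders */

/* Hypothetical, simplified stand-in for struct swap_cluster_info. */
struct sketch_cluster_state {
	unsigned int count;      /* slots currently in use */
	unsigned int order;      /* order assigned when the cluster was allocated */
	unsigned int free_gen;   /* hypothetical: bumped each time a slot is freed */
};

/* Number of free slots left in the cluster (`total - count`). */
static unsigned int sketch_free_slots(const struct sketch_cluster_state *ci)
{
	assert(ci->count <= SKETCH_CLUSTER_SLOTS);
	return SKETCH_CLUSTER_SLOTS - ci->count;
}

/*
 * Decide whether the cluster should stay on nonfull_clusters[order].
 *
 * scan_failed:       a full scan of the cluster found no usable range.
 * gen_at_scan_start: value of ci->free_gen sampled when that scan began.
 */
static bool sketch_keep_on_nonfull(const struct sketch_cluster_state *ci,
				   bool scan_failed,
				   unsigned int gen_at_scan_start)
{
	assert(ci->order < SKETCH_NR_ORDERS);

	/* Condition 1: fewer than 1 << order free slots can never satisfy order. */
	if (sketch_free_slots(ci) < (1u << ci->order))
		return false;

	/* Condition 2: the scan failed and nothing was freed while it ran. */
	if (scan_failed && ci->free_gen == gen_at_scan_start)
		return false;

	return true;
}
```

The generation counter is just one possible way to notice a free that raced with
the scan when the cluster is shared; a flag cleared at scan start would work
equally well in this sketch.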

> 
>>> And, I understand that in some situations it may
>>> be better to share clusters among CPUs.  So my suggestion is,
>>>
>>> - Make swap_cluster_info->order more accurate, don't pretend that we
>>>   have free swap entries with that order even after we are sure that we
>>>   haven't.
>>
>> Is this patch pretending that today? I don't think so?
> 
> IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> sure that there are no order-N free swap entry in the cluster.

Oh I see what you mean. I think you and Chris already discussed this? IIRC
Chris's point was that if you move that cluster to N-1, eventually all clusters
are for order-0 and you have no means of allocating high orders until a whole
cluster becomes free. That logic certainly makes sense to me, so I think it's
better for swap_cluster_info->order to remain static while the cluster is
allocated. (I only skimmed that conversation so apologies if I got the
conclusion wrong!).

> 
>> But I agree that a
>> cluster should only be on the per-order nonfull list if we know there are at
>> least enough free swap entries in that cluster to cover the order. Of course
>> that doesn't tell us for sure because they may not be contiguous.
> 
> We can check that when free swap entry via checking adjacent swap
> entries.  IMHO, the performance should be acceptable.

Would you then use the result of that scanning to "promote" a cluster's order?
e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
a separate change on top of what Chris is doing here. For high orders there
could be quite a bit of scanning required in the worst case for every page that
gets freed.
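
To illustrate the cost I have in mind, here is a rough userspace sketch of
"check the adjacent entries on free". It assumes a plain per-slot usage map where
0 means free; the map layout and all `sketch_` names are hypothetical and not
the kernel's swap_map handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if the naturally aligned block of (1 << order) slots that
 * contains slot `freed` is entirely free. A map value of 0 means "free".
 */
static bool sketch_aligned_block_free(const unsigned char *map, size_t nr_slots,
				      size_t freed, unsigned int order)
{
	size_t len = (size_t)1 << order;
	size_t start = freed & ~(len - 1);   /* round down to block alignment */

	assert(freed < nr_slots);
	if (start + len > nr_slots)
		return false;

	for (size_t i = start; i < start + len; i++)
		if (map[i] != 0)
			return false;
	return true;
}

/*
 * After freeing slot `freed` (so map[freed] is already 0), find the highest
 * order, capped at max_order, whose aligned block around `freed` is free.
 * Each order tried reads up to (1 << order) slots.
 */
static unsigned int sketch_highest_free_order(const unsigned char *map,
					      size_t nr_slots, size_t freed,
					      unsigned int max_order)
{
	unsigned int order = 0;

	while (order < max_order &&
	       sketch_aligned_block_free(map, nr_slots, freed, order + 1))
		order++;
	return order;
}
```

The inner loop is where the worst case comes from: every free may re-read the
whole aligned block for each order it tries, unless the result is cached
somewhere.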

> 
>>>
>>> My question is whether it's so important to share the per-cpu cluster
>>> among CPUs? 
>>
>> My rationale for sharing is that the preference previously has been to favour
>> efficient use of swap space; we don't want to fail a request for allocation of a
>> given order if there are actually slots available just because they have been
>> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> actually help improve allocation success, then I'm happy to take the exclusive
>> approach.
>>
>>> I suggest to start with simple design, that is, per-CPU
>>> cluster will not be shared among CPUs in most cases.
>>
>> I'm all for starting simple; I think that's what I already proposed (exclusive
>> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> current half-and-half policy in this patch.
> 
> Sounds good to me.  We can start with exclusive solution and evaluate
> whether shared solution is good.

Yep. And also evaluate the dynamic order inc/dec idea too...

> 
>>>
>>> Another choice for sharing is when we run short of free swap space, we
>>> disable per-CPU cluster and allocate from the shared non-full cluster
>>> list directly.
>>>
>>>> Which actually makes me wonder; what is the mechanism that prevents the current
>>>> per-cpu cluster from being freed? Is that just handled by the conflict detection
>>>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
>>>> raise count when its current. (If Chris has implemented that in the "big
>>>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
>>>
>>> Yes.  We may need a flag for that.
>>>
>>>>>
>>>>>>> This is another reason that we should put the cluster in
>>>>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>>>>>>> in the cluster.  It makes design complex to keep it in
>>>>>>> nonfull_clusters[order].
>>>>>>>
>>>>>>>> We have tried many different approaches including moving to the end of
>>>>>>>> the list. It can cause more fragmentation because each CPU allocates
>>>>>>>> their swap slot cache (64 entries) from a different cluster.
>>>>>>>>
>>>>>>>>>> Those behaviors will be fine
>>>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>>>>>
>>>>>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>>>>>
>>>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>>>>>
>>>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>>>>>> to introduce the nonfull list.
>>>>>>>>>>
>>>>>
>>>>> [snip]
> 
> --
> Best Regards,
> Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 23/07/2024 07:27, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 22/07/2024 09:49, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> On 22/07/2024 03:14, Huang, Ying wrote:
>>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>>
>>>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>>>>>>>> Chris Li <chrisl@kernel.org> writes:
>>>>>>>>
>>>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>>>>>
>>>>>> [snip]
>>>>>>
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>>>>>>
>>>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>>>>>>
>>>>>>>>>>>> So you could have this situation:
>>>>>>>>>>>>
>>>>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
>>>>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>>>>>>
>>>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>>>>>>
>>>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>>>>>>> moving it into nonfull.
>>>>>>>>>>
>>>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>>>>>>> refactoring separated from behavioural changes.
>>>>>>>>>
>>>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>>>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>>>>>>> allocation.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>>>>>>> results quoted on the cover letter?
>>>>>>>>>
>>>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>>>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>>>>>>>> because it is not feature complete. Currently it does not do swap
>>>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>>>>>>>>> remove the RFC.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>>>>>>
>>>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>>>>>>
>>>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>>>>>>>> entries allocated from the previous CPU.
>>>>>>>>>>
>>>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
>>>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>>>>>>
>>>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>>>>>>>>> advances to the next cluster pointer, it can cross with the other
>>>>>>>>> CPU's next cluster pointer.
>>>>>>>>
>>>>>>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
>>>>>>>> cluster only.  If it doesn't do that, we should fix it.
>>>>>>>>
>>>>>>>> I agree with Ryan that we should make per cpu cluster correct.  A
>>>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
>>>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>>>>>>> list if necessary.  And, we should make it correct in this patch instead
>>>>>>>> of later in series.  I understand that you want to make the patch itself
>>>>>>>> simple, but it's important to make code simple to be understood too.
>>>>>>>> Consistent design choice will do that.
>>>>>>>
>>>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>>>>>>
>>>>>> Sorry, I misunderstood your words.
>>>>>>
>>>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>>>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>>>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>>>>>>> case, in which case it should be added to the nonfull list.
>>>>>>>
>>>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>>>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>>>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>>>>>>> This neither-one-policy-nor-the-other seems odd to me.
>>>>>>>
>>>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>>>>>>> per-cpu cluster.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> I was arguing to make it always shared. Perhaps the best
>>>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>>>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>>>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>>>>>>> the shared approach as part of the "big rewrite"?
>>>>>>>>
>>>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>>>>>>> way to help that?
>>>>>>>>>
>>>>>>>>> Simply moving to the end of the list can create a possible deadloop
>>>>>>>>> when all clusters have been scanned and not available swap range
>>>>>>>>> found.
>>>>>>
>>>>>> I also think that the shared approach has dead loop issue.
>>>>>
>>>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>>>>> won't know when to stop dequeuing/requeuing clusters on the nonfull list and will
>>>>> go forever? That's surely just an implementation issue to solve? It's not a
>>>>> reason to avoid the design principle; if we agree that maintaining sharability
>>>>> of the cluster is preferred then the code must be written to guard against the
>>>>> dead loop problem. It could be done by remembering the first cluster you
>>>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>>>>> to it. (I think holding the si lock will protect against concurrently freeing
>>>>> the cluster so it should definitely remain in the list?).
>>>>
>>>> I believe that you can find some way to avoid the dead loop issue,
>>>> although your suggestion may kill the performance via looping a long list
>>>> of nonfull clusters.  
>>>
>>> I don't agree; If the clusters are considered exclusive (i.e. removed from the
>>> list when made current for a cpu), that only reduces the size of the list by a
>>> maximum of the number of CPUs in the system, which I suspect is pretty small
>>> compared to the number of nonfull clusters.
>> 
>> Anyway, this depends on details.  If we cannot allocate a order-N swap
>> entry from the cluster, we should remove it from the nonfull list for
>> order-N (This is the behavior of this patch too). 
>
> Yes that's a good point, and I concede it is more difficult to detect that
> condition if the cluster is shared. I suspect that with a bit of thinking, we
> could find a way though.
>
>> Your original
>> suggestion appears like that you want to keep all cluster with order-N
>> on the nonfull list for order-N always unless the number of free swap
>> entry is less than 1<<N.
>
> Well I think that's certainly one of the conditions for removing it. But agree
> that if a full scan of the cluster has been performed and no swap entries have
> been freed since the scan started then it should also be removed from the list.
>
>> 
>>>> And, I understand that in some situations it may
>>>> be better to share clusters among CPUs.  So my suggestion is,
>>>>
>>>> - Make swap_cluster_info->order more accurate, don't pretend that we
>>>>   have free swap entries with that order even after we are sure that we
>>>>   haven't.
>>>
>>> Is this patch pretending that today? I don't think so?
>> 
>> IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> sure that there are no order-N free swap entry in the cluster.
>
> Oh I see what you mean. I think you and Chris already discussed this? IIRC
> Chris's point was that if you move that cluster to N-1, eventually all clusters
> are for order-0 and you have no means of allocating high orders until a whole
> cluster becomes free. That logic certainly makes sense to me, so I think it's
> better for swap_cluster_info->order to remain static while the cluster is
> allocated. (I only skimmed that conversation so apologies if I got the
> conclusion wrong!).
>
>> 
>>> But I agree that a
>>> cluster should only be on the per-order nonfull list if we know there are at
>>> least enough free swap entries in that cluster to cover the order. Of course
>>> that doesn't tell us for sure because they may not be contiguous.
>> 
>> We can check that when free swap entry via checking adjacent swap
>> entries.  IMHO, the performance should be acceptable.
>
> Would you then use the result of that scanning to "promote" a cluster's order?
> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> a separate change on top of what Chris is doing here. For high orders there
> could be quite a bit of scanning required in the worst case for every page that
> gets freed.

We can try to optimize it to control overhead if necessary.

>> 
>>>>
>>>> My question is whether it's so important to share the per-cpu cluster
>>>> among CPUs? 
>>>
>>> My rationale for sharing is that the preference previously has been to favour
>>> efficient use of swap space; we don't want to fail a request for allocation of a
>>> given order if there are actually slots available just because they have been
>>> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>>> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>>> actually help improve allocation success, then I'm happy to take the exclusive
>>> approach.
>>>
>>>> I suggest to start with simple design, that is, per-CPU
>>>> cluster will not be shared among CPUs in most cases.
>>>
>>> I'm all for starting simple; I think that's what I already proposed (exclusive
>>> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>>> current half-and-half policy in this patch.
>> 
>> Sounds good to me.  We can start with exclusive solution and evaluate
>> whether shared solution is good.
>
> Yep. And also evaluate the dynamic order inc/dec idea too...

Dynamic order inc/dec tries to solve a more fundamental problem.  For
example,

- Initially, almost only order-0 pages are swapped out, so most non-full
  clusters are order-0.

- Later, quite a few order-0 swap entries are freed, so that there are
  quite a few order-4 swap entries available.

- Order-4 pages need to be swapped out, but there are not enough order-4
  non-full clusters available.

So, we need a way to migrate non-full clusters among orders to adjust to
the various situations automatically.

But yes, data is needed for any performance related change.
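
As a rough illustration of the scenario above (not a proposal for the actual
implementation), the following userspace sketch re-tags a cluster from another
order when it contains a suitable free range. The fixed 32-slot cluster, the
`used[]` map and all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SLOTS 32u   /* hypothetical, tiny cluster for illustration */

/* Hypothetical cluster model: a per-slot usage map plus its current order. */
struct sketch_cluster_map {
	unsigned char used[SKETCH_SLOTS];  /* nonzero = slot in use */
	unsigned int order;                /* order list the cluster sits on */
};

/* True if the cluster has an aligned, fully free run of (1 << order) slots. */
static bool sketch_has_free_range(const struct sketch_cluster_map *ci,
				  unsigned int order)
{
	size_t len = (size_t)1 << order;

	assert(len <= SKETCH_SLOTS);
	for (size_t start = 0; start + len <= SKETCH_SLOTS; start += len) {
		size_t i = 0;

		while (i < len && !ci->used[start + i])
			i++;
		if (i == len)
			return true;
	}
	return false;
}

/*
 * When no non-full cluster of order `want` is available, look for a cluster
 * currently assigned to a different order that could serve `want`, and move
 * it over. Returns the index of the migrated cluster, or -1 if none fits.
 */
static int sketch_migrate_cluster(struct sketch_cluster_map *clusters,
				  size_t n, unsigned int want)
{
	for (size_t i = 0; i < n; i++) {
		if (clusters[i].order != want &&
		    sketch_has_free_range(&clusters[i], want)) {
			clusters[i].order = want;
			return (int)i;
		}
	}
	return -1;
}
```

In this model an order-0 cluster whose upper half has been freed would be moved
to order-4, while a cluster with free slots scattered between used ones stays
where it is. Whether the scan is cheap enough in practice is exactly the kind of
thing that needs data.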

--
Best Regards,
Huang, Ying

>> 
>>>>
>>>> Another choice for sharing is when we run short of free swap space, we
>>>> disable per-CPU cluster and allocate from the shared non-full cluster
>>>> list directly.
>>>>
>>>>> Which actually makes me wonder; what is the mechanism that prevents the current
>>>>> per-cpu cluster from being freed? Is that just handled by the conflict detection
>>>>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
>>>>> raise count when its current. (If Chris has implemented that in the "big
>>>>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
>>>>
>>>> Yes.  We may need a flag for that.
>>>>
>>>>>>
>>>>>>>> This is another reason that we should put the cluster in
>>>>>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>>>>>>>> in the cluster.  It makes design complex to keep it in
>>>>>>>> nonfull_clusters[order].
>>>>>>>>
>>>>>>>>> We have tried many different approaches including moving to the end of
>>>>>>>>> the list. It can cause more fragmentation because each CPU allocates
>>>>>>>>> their swap slot cache (64 entries) from a different cluster.
>>>>>>>>>
>>>>>>>>>>> Those behaviors will be fine
>>>>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>>>>>>
>>>>>>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>>>>>>
>>>>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>>>>>>
>>>>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>>>>>>> to introduce the nonfull list.
>>>>>>>>>>>
>>>>>>
>>>>>> [snip]
>> 
>> --
>> Best Regards,
>> Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Chris Li 1 year, 6 months ago

On Wed, Jul 24, 2024 at 11:57 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
> > On 23/07/2024 07:27, Huang, Ying wrote:
> >> Ryan Roberts <ryan.roberts@arm.com> writes:
> >>
> >>> On 22/07/2024 09:49, Huang, Ying wrote:
> >>>> Ryan Roberts <ryan.roberts@arm.com> writes:
> >>>>
> >>>>> On 22/07/2024 03:14, Huang, Ying wrote:
> >>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
> >>>>>>
> >>>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
> >>>>>>>> Chris Li <chrisl@kernel.org> writes:
> >>>>>>>>
> >>>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
> >>>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
> >>>>>>
> >>>>>> [snip]
> >>>>>>
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >>>>>>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
> >>>>>>>>>>>>
> >>>>>>>>>>>> I find the transitions when you add and remove a cluster from the
> >>>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
> >>>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
> >>>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
> >>>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
> >>>>>>>>>>>>
> >>>>>>>>>>>> So you could have this situation:
> >>>>>>>>>>>>
> >>>>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
> >>>>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
> >>>>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
> >>>>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
> >>>>>>>>>>>>
> >>>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
> >>>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
> >>>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
> >>>>>>>>>>>
> >>>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
> >>>>>>>>>>> moving it into nonfull.
> >>>>>>>>>>
> >>>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
> >>>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
> >>>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
> >>>>>>>>>> refactoring separated from behavioural changes.
> >>>>>>>>>
> >>>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
> >>>>>>>>> using the cluster. Behavior change is expected. The goal is completely
> >>>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
> >>>>>>>>> allocation.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
> >>>>>>>>>> deferring review. But sounds like it is actually required to realize the test
> >>>>>>>>>> results quoted on the cover letter?
> >>>>>>>>>
> >>>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
> >>>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
> >>>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
> >>>>>>>>> because it is not feature complete. Currently it does not do swap
> >>>>>>>>> cache reclaim. The next version will have swap cache reclaim and
> >>>>>>>>> remove the RFC.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
> >>>>>>>>>>>
> >>>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
> >>>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
> >>>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
> >>>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
> >>>>>>>>>>>> chances of multiple CPUs using the same cluster.
> >>>>>>>>>>>
> >>>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
> >>>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
> >>>>>>>>>>> entries allocated from the previous CPU.
> >>>>>>>>>>
> >>>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
> >>>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
> >>>>>>>>>
> >>>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
> >>>>>>>>> advances to the next cluster pointer, it can cross with the other
> >>>>>>>>> CPU's next cluster pointer.
> >>>>>>>>
> >>>>>>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
> >>>>>>>> cluster only.  If it doesn't do that, we should fix it.
> >>>>>>>>
> >>>>>>>> I agree with Ryan that we should make per cpu cluster correct.  A
> >>>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
> >>>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
> >>>>>>>> list if necessary.  And, we should make it correct in this patch instead
> >>>>>>>> of later in series.  I understand that you want to make the patch itself
> >>>>>>>> simple, but it's important to make code simple to be understood too.
> >>>>>>>> Consistent design choice will do that.
> >>>>>>>
> >>>>>>> I think I'm actually arguing for the opposite of what you suggest here.
> >>>>>>
> >>>>>> Sorry, I misunderstood your words.
> >>>>>>
> >>>>>>> As I see it, there are 2 possible approaches; either a cluster is always
> >>>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
> >>>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
> >>>>>>> case, in which case it should be added to the nonfull list.
> >>>>>>>
> >>>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
> >>>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
> >>>>>>> as a single swap entry is freed from that cluster it is put back on the list.
> >>>>>>> This neither-one-policy-nor-the-other seems odd to me.
> >>>>>>>
> >>>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
> >>>>>>> per-cpu cluster.
> >>>>>>
> >>>>>> Yes.
> >>>>>>
> >>>>>>> I was arguing to make it always shared. Perhaps the best
> >>>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
> >>>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
> >>>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
> >>>>>>> the shared approach as part of the "big rewrite"?
> >>>>>>>>
> >>>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
> >>>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
> >>>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
> >>>>>>>>>> way to help that?
> >>>>>>>>>
> >>>>>>>>> Simply moving to the end of the list can create a possible deadloop
> >>>>>>>>> when all clusters have been scanned and not available swap range
> >>>>>>>>> found.
> >>>>>>
> >>>>>> I also think that the shared approach has dead loop issue.
> >>>>>
> >>>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
> >>>>> won't know when to stop dequeuing/requeuing clusters on the nonfull list and will
> >>>>> go forever? That's surely just an implementation issue to solve? It's not a
> >>>>> reason to avoid the design principle; if we agree that maintaining sharability
> >>>>> of the cluster is preferred then the code must be written to guard against the
> >>>>> dead loop problem. It could be done by remembering the first cluster you
> >>>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
> >>>>> to it. (I think holding the si lock will protect against concurrently freeing
> >>>>> the cluster so it should definitely remain in the list?).
> >>>>
> >>>> I believe that you can find some way to avoid the dead loop issue,
> >>>> although your suggestion may kill the performance via looping a long list
> >>>> of nonfull clusters.
> >>>
> >>> I don't agree; If the clusters are considered exclusive (i.e. removed from the
> >>> list when made current for a cpu), that only reduces the size of the list by a
> >>> maximum of the number of CPUs in the system, which I suspect is pretty small
> >>> compared to the number of nonfull clusters.
> >>
> >> Anyway, this depends on details.  If we cannot allocate a order-N swap
> >> entry from the cluster, we should remove it from the nonfull list for
> >> order-N (This is the behavior of this patch too).
> >
> > Yes that's a good point, and I concede it is more difficult to detect that
> > condition if the cluster is shared. I suspect that with a bit of thinking, we
> > could find a way though.
> >
> >> Your original
> >> suggestion appears like that you want to keep all cluster with order-N
> >> on the nonfull list for order-N always unless the number of free swap
> >> entry is less than 1<<N.
> >
> > Well I think that's certainly one of the conditions for removing it. But agree
> > that if a full scan of the cluster has been performed and no swap entries have
> > been freed since the scan started then it should also be removed from the list.
> >
> >>
> >>>> And, I understand that in some situations it may
> >>>> be better to share clusters among CPUs.  So my suggestion is,
> >>>>
> >>>> - Make swap_cluster_info->order more accurate, don't pretend that we
> >>>>   have free swap entries with that order even after we are sure that we
> >>>>   haven't.
> >>>
> >>> Is this patch pretending that today? I don't think so?
> >>
> >> IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> >> sure that there are no order-N free swap entry in the cluster.
> >
> > Oh I see what you mean. I think you and Chris already discussed this? IIRC
> > Chris's point was that if you move that cluster to N-1, eventually all clusters
> > are for order-0 and you have no means of allocating high orders until a whole
> > cluster becomes free. That logic certainly makes sense to me, so I think it's
> > better for swap_cluster_info->order to remain static while the cluster is
> > allocated. (I only skimmed that conversation so apologies if I got the
> > conclusion wrong!).
> >
> >>
> >>> But I agree that a
> >>> cluster should only be on the per-order nonfull list if we know there are at
> >>> least enough free swap entries in that cluster to cover the order. Of course
> >>> that doesn't tell us for sure because they may not be contiguous.
> >>
> >> We can check that when free swap entry via checking adjacent swap
> >> entries.  IMHO, the performance should be acceptable.
> >
> > Would you then use the result of that scanning to "promote" a cluster's order?
> > e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> > a separate change on top of what Chris is doing here. For high orders there
> > could be quite a bit of scanning required in the worst case for every page that
> > gets freed.
>
> We can try to optimize it to control overhead if necessary.
>
> >>
> >>>>
> >>>> My question is whether it's so important to share the per-cpu cluster
> >>>> among CPUs?
> >>>
> >>> My rationale for sharing is that the preference previously has been to favour
> >>> efficient use of swap space; we don't want to fail a request for allocation of a
> >>> given order if there are actually slots available just because they have been
> >>> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> >>> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> >>> actually help improve allocation success, then I'm happy to take the exclusive
> >>> approach.
> >>>
> >>>> I suggest to start with simple design, that is, per-CPU
> >>>> cluster will not be shared among CPUs in most cases.
> >>>
> >>> I'm all for starting simple; I think that's what I already proposed (exclusive
> >>> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> >>> current half-and-half policy in this patch.
> >>
> >> Sounds good to me.  We can start with exclusive solution and evaluate
> >> whether shared solution is good.
> >
> > Yep. And also evaluate the dynamic order inc/dec idea too...
>
> Dynamic order inc/dec tries solving a more fundamental problem.  For
> example,
>
> - Initially, almost only order-0 pages are swapped out, most non-full
>   clusters are order-0.
>
> - Later, quite some order-0 swap entries are freed so that there are
>   quite some order-4 swap entries available.

If swap entries are freed at random, an order-4 allocation needs 16
contiguous swap entries free at the same time, starting at a 16-entry
aligned base location. The total order-4 allocatable swap space is
therefore much lower than the order-0 allocatable swap space.
If each entry is free with 50% probability (swapfile half full), then
the probability that 16 contiguous entries are all free is
0.5^16 = 1.5e-5. If the swapfile is 80% full, that number drops to
0.2^16 = 6.5e-12.
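
The arithmetic above can be reproduced with a minimal userspace sketch.
It assumes every slot is free independently with the same probability,
which is the simplifying assumption of the estimate and not a property
of real workloads. The function name aligned_run_free_prob() is made up
for illustration and is not part of the patch.

```c
#include <assert.h>

/*
 * Probability that one naturally aligned run of (1 << order) slots is
 * entirely free, assuming each slot is free independently with
 * probability p_free (illustrative model only).
 */
static double aligned_run_free_prob(double p_free, unsigned int order)
{
	unsigned int i, nr = 1u << order;
	double p = 1.0;

	assert(p_free >= 0.0 && p_free <= 1.0);

	/* All nr slots must be free at once, so multiply nr times. */
	for (i = 0; i < nr; i++)
		p *= p_free;
	return p;
}
```

With p_free = 0.5 and order = 4 this gives roughly 1.5e-5, and with
p_free = 0.2 (swapfile 80% full) roughly 6.5e-12, matching the figures
quoted above.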

> - Order-4 pages need to be swapped out, but no enough order-4 non-full
>   clusters available.

Exactly.

>
> So, we need a way to migrate non-full clusters among orders to adjust to
> the various situations automatically.

There is no easy way to migrate swap entries to different locations.
That is why I would like to have discontiguous swap entry allocation
for mTHP.

>
> But yes, data is needed for any performance related change.

Chris

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Chris Li <chrisl@kernel.org> writes:

> On Wed, Jul 24, 2024 at 11:57 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>> > On 23/07/2024 07:27, Huang, Ying wrote:
>> >> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >>
>> >>> On 22/07/2024 09:49, Huang, Ying wrote:
>> >>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >>>>
>> >>>>> On 22/07/2024 03:14, Huang, Ying wrote:
>> >>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >>>>>>
>> >>>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>> >>>>>>>> Chris Li <chrisl@kernel.org> writes:
>> >>>>>>>>
>> >>>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>> >>>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>> >>>>>>
>> >>>>>> [snip]
>> >>>>>>
>> >>>>>>>>>>>>> +
>> >>>>>>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> >>>>>>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>> >>>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> >>>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>> >>>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>> >>>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> So you could have this situation:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>> >>>>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
>> >>>>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>> >>>>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>> >>>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>> >>>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>> >>>>>>>>>>>
>> >>>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>> >>>>>>>>>>> moving it into nonfull.
>> >>>>>>>>>>
>> >>>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>> >>>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>> >>>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>> >>>>>>>>>> refactoring separated from behavioural changes.
>> >>>>>>>>>
>> >>>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>> >>>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>> >>>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>> >>>>>>>>> allocation.
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>> >>>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>> >>>>>>>>>> results quoted on the cover letter?
>> >>>>>>>>>
>> >>>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>> >>>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>> >>>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>> >>>>>>>>> because it is not feature complete. Currently it does not do swap
>> >>>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>> >>>>>>>>> remove the RFC.
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> >>>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>> >>>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>> >>>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> >>>>>>>>>>>> chances of multiple CPUs using the same cluster.
>> >>>>>>>>>>>
>> >>>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>> >>>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>> >>>>>>>>>>> entries allocated from the previous CPU.
>> >>>>>>>>>>
>> >>>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
>> >>>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>> >>>>>>>>>
>> >>>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>> >>>>>>>>> advances to the next cluster pointer, it can cross with the other
>> >>>>>>>>> CPU's next cluster pointer.
>> >>>>>>>>
>> >>>>>>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
>> >>>>>>>> cluster only.  If it doesn't do that, we should fix it.
>> >>>>>>>>
>> >>>>>>>> I agree with Ryan that we should make per cpu cluster correct.  A
>> >>>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
>> >>>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>> >>>>>>>> list if necessary.  And, we should make it correct in this patch instead
>> >>>>>>>> of later in series.  I understand that you want to make the patch itself
>> >>>>>>>> simple, but it's important to make code simple to be understood too.
>> >>>>>>>> Consistent design choice will do that.
>> >>>>>>>
>> >>>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>> >>>>>>
>> >>>>>> Sorry, I misunderstood your words.
>> >>>>>>
>> >>>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>> >>>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>> >>>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>> >>>>>>> case, in which case it should be added to the nonfull list.
>> >>>>>>>
>> >>>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>> >>>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>> >>>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>> >>>>>>> This neither-one-policy-nor-the-other seems odd to me.
>> >>>>>>>
>> >>>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>> >>>>>>> per-cpu cluster.
>> >>>>>>
>> >>>>>> Yes.
>> >>>>>>
>> >>>>>>> I was arguing to make it always shared. Perhaps the best
>> >>>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>> >>>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>> >>>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>> >>>>>>> the shared approach as part of the "big rewrite"?
>> >>>>>>>>
>> >>>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>> >>>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>> >>>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>> >>>>>>>>>> way to help that?
>> >>>>>>>>>
>> >>>>>>>>> Simply moving to the end of the list can create a possible deadloop
>> >>>>>>>>> when all clusters have been scanned and not available swap range
>> >>>>>>>>> found.
>> >>>>>>
>> >>>>>> I also think that the shared approach has dead loop issue.
>> >>>>>
>> >>>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>> >>>>> won't know when to stop dequeuing/requeuing clusters on the nonfull list and will
>> >>>>> go forever? That's surely just an implementation issue to solve? It's not a
>> >>>>> reason to avoid the design principle; if we agree that maintaining sharability
>> >>>>> of the cluster is preferred then the code must be written to guard against the
>> >>>>> dead loop problem. It could be done by remembering the first cluster you
>> >>>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>> >>>>> to it. (I think holding the si lock will protect against concurrently freeing
>> >>>>> the cluster so it should definitely remain in the list?).
>> >>>>
>> >>>> I believe that you can find some way to avoid the dead loop issue,
>> >>>> although your suggestion may kill the performance via looping a long list
>> >>>> of nonfull clusters.
>> >>>
>> >>> I don't agree; If the clusters are considered exclusive (i.e. removed from the
>> >>> list when made current for a cpu), that only reduces the size of the list by a
>> >>> maximum of the number of CPUs in the system, which I suspect is pretty small
>> >>> compared to the number of nonfull clusters.
>> >>
>> >> Anyway, this depends on details.  If we cannot allocate a order-N swap
>> >> entry from the cluster, we should remove it from the nonfull list for
>> >> order-N (This is the behavior of this patch too).
>> >
>> > Yes that's a good point, and I concede it is more difficult to detect that
>> > condition if the cluster is shared. I suspect that with a bit of thinking, we
>> > could find a way though.
>> >
>> >> Your original
>> >> suggestion appears like that you want to keep all cluster with order-N
>> >> on the nonfull list for order-N always unless the number of free swap
>> >> entry is less than 1<<N.
>> >
>> > Well I think that's certainly one of the conditions for removing it. But agree
>> > that if a full scan of the cluster has been performed and no swap entries have
>> > been freed since the scan started then it should also be removed from the list.
>> >
>> >>
>> >>>> And, I understand that in some situations it may
>> >>>> be better to share clusters among CPUs.  So my suggestion is,
>> >>>>
>> >>>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >>>>   have free swap entries with that order even after we are sure that we
>> >>>>   haven't.
>> >>>
>> >>> Is this patch pretending that today? I don't think so?
>> >>
>> >> IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> sure that there are no order-N free swap entry in the cluster.
>> >
>> > Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> > Chris's point was that if you move that cluster to N-1, eventually all clusters
>> > are for order-0 and you have no means of allocating high orders until a whole
>> > cluster becomes free. That logic certainly makes sense to me, so I think it's
>> > better for swap_cluster_info->order to remain static while the cluster is
>> > allocated. (I only skimmed that conversation so apologies if I got the
>> > conclusion wrong!).
>> >
>> >>
>> >>> But I agree that a
>> >>> cluster should only be on the per-order nonfull list if we know there are at
>> >>> least enough free swap entries in that cluster to cover the order. Of course
>> >>> that doesn't tell us for sure because they may not be contiguous.
>> >>
>> >> We can check that when free swap entry via checking adjacent swap
>> >> entries.  IMHO, the performance should be acceptable.
>> >
>> > Would you then use the result of that scanning to "promote" a cluster's order?
>> > e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> > a separate change on top of what Chris is doing here. For high orders there
>> > could be quite a bit of scanning required in the worst case for every page that
>> > gets freed.
>>
>> We can try to optimize it to control overhead if necessary.
>>
>> >>
>> >>>>
>> >>>> My question is whether it's so important to share the per-cpu cluster
>> >>>> among CPUs?
>> >>>
>> >>> My rationale for sharing is that the preference previously has been to favour
>> >>> efficient use of swap space; we don't want to fail a request for allocation of a
>> >>> given order if there are actually slots available just because they have been
>> >>> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >>> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >>> actually help improve allocation success, then I'm happy to take the exclusive
>> >>> approach.
>> >>>
>> >>>> I suggest to start with simple design, that is, per-CPU
>> >>>> cluster will not be shared among CPUs in most cases.
>> >>>
>> >>> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >>> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >>> current half-and-half policy in this patch.
>> >>
>> >> Sounds good to me.  We can start with exclusive solution and evaluate
>> >> whether shared solution is good.
>> >
>> > Yep. And also evaluate the dynamic order inc/dec idea too...
>>
>> Dynamic order inc/dec tries solving a more fundamental problem.  For
>> example,
>>
>> - Initially, almost only order-0 pages are swapped out, most non-full
>>   clusters are order-0.
>>
>> - Later, quite some order-0 swap entries are freed so that there are
>>   quite some order-4 swap entries available.
>
> If the freeing of swap entry is random distribution. You need 16
> continuous swap entries free at the same time at aligned 16 base
> locations. The total number of order 4 free swap space add up together
> is much lower than the order 0 allocatable swap space.
> If having one entry free is 50% probability(swapfile half full), then
> having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
> If the swapfile is 80% full, that number drops to 6.5 E -12.

This depends on the workload.  Quite a few workloads show some degree
of spatial locality.  For a workload with no spatial locality at all, as
above, mTHP may not be a good choice in the first place.

>> - Order-4 pages need to be swapped out, but no enough order-4 non-full
>>   clusters available.
>
> Exactly.
>
>>
>> So, we need a way to migrate non-full clusters among orders to adjust to
>> the various situations automatically.
>
> There is no easy way to migrate swap entries to different locations.
> That is why I like to have discontiguous swap entries allocation for
> mTHP.

We suggest migrating non-full swap clusters among different lists, not
swap entries.
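
To make the distinction concrete, here is a userspace sketch of moving a
cluster between per-order lists while its swap entries stay where they
are.  It mirrors the order field and nonfull_clusters[] array added by
this patch, but the types, NR_ORDERS, the list helpers and
cluster_move_to_order() are all simplified stand-ins made up for
illustration; locking and the policy for choosing the new order are
left out.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ORDERS 5	/* illustrative stand-in for SWAP_NR_ORDERS */

struct list_node {
	struct list_node *prev, *next;
};

struct cluster {
	unsigned char order;	/* order this cluster currently serves */
	struct list_node list;	/* linkage on nonfull[order] */
};

struct swap_dev {
	struct list_node nonfull[NR_ORDERS];	/* one list per order */
};

static void list_init(struct list_node *h)
{
	h->prev = h->next = h;
}

static int list_is_empty(const struct list_node *h)
{
	return h->next == h;
}

static void list_del_node(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

static void list_add_tail_node(struct list_node *n, struct list_node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void swap_dev_init(struct swap_dev *si)
{
	int i;

	for (i = 0; i < NR_ORDERS; i++)
		list_init(&si->nonfull[i]);
}

/*
 * Hypothetical helper: re-home a cluster onto another order's nonfull
 * list.  Only the list linkage and the order tag change; no swap entry
 * is relocated.  The cluster must already be on a list or self-linked.
 */
static void cluster_move_to_order(struct swap_dev *si, struct cluster *ci,
				  unsigned int new_order)
{
	assert(new_order < NR_ORDERS);
	list_del_node(&ci->list);
	ci->order = (unsigned char)new_order;
	list_add_tail_node(&ci->list, &si->nonfull[new_order]);
}
```

The move itself is O(1); the open question in this thread is when such
a move is justified, not how expensive it is.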

>>
>> But yes, data is needed for any performance related change.

BTW: I think non-full cluster isn't a good name.  Partial cluster is
much better and follows the same convention as partial slab.

--
Best Regards,
Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Chris Li 1 year, 6 months ago

On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
> > If the freeing of swap entry is random distribution. You need 16
> > continuous swap entries free at the same time at aligned 16 base
> > locations. The total number of order 4 free swap space add up together
> > is much lower than the order 0 allocatable swap space.
> > If having one entry free is 50% probability(swapfile half full), then
> > having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
> > If the swapfile is 80% full, that number drops to 6.5 E -12.
>
> This depends on workloads.  Quite some workloads will show some degree
> of spatial locality.  For a workload with no spatial locality at all as
> above, mTHP may be not a good choice at the first place.

The fragmentation comes from the order-0 entries, not from mTHP. mTHP
has its own valid use cases, which should be kept separate from how
order-0 entries are used. That is why I think this kind of strategy
only works in the lucky case. I would much prefer a strategy that is
guaranteed to work and does not depend on luck.

> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full
> >>   clusters available.
> >
> > Exactly.
> >
> >>
> >> So, we need a way to migrate non-full clusters among orders to adjust to
> >> the various situations automatically.
> >
> > There is no easy way to migrate swap entries to different locations.
> > That is why I like to have discontiguous swap entries allocation for
> > mTHP.
>
> We suggest to migrate non-full swap clsuters among different lists, not
> swap entries.

Then you have the downside of reducing the total number of high-order
clusters. By chance alone, it is much easier to fragment a cluster than
to defragment one.  The orders of clusters have a natural tendency to
move down rather than up, given a long enough period of random access.
We will likely run out of high-order clusters in the long run if we
don't have any separation of orders.
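
A small userspace sketch of why the order tends to move down: it
computes the highest order a cluster can still serve from a bitmap of
in-use slots.  The 64-slot cluster size, the bitmap representation and
the name cluster_max_free_order() are assumptions made for illustration
and are not how swap_map[] is laid out.

```c
#include <assert.h>
#include <stdint.h>

#define CLUSTER_SHIFT		6u	/* illustrative: 64 slots per cluster */
#define SLOTS_PER_CLUSTER	(1u << CLUSTER_SHIFT)

/*
 * Return the largest order for which the cluster still has at least one
 * naturally aligned, fully free run of (1 << order) slots, or -1 if the
 * cluster is full.  Bit i of used_map set means slot i is in use.
 */
static int cluster_max_free_order(uint64_t used_map)
{
	int order;

	for (order = (int)CLUSTER_SHIFT; order >= 0; order--) {
		unsigned int nr = 1u << order;
		/* nr == 64 needs special casing to avoid a 64-bit shift. */
		uint64_t mask = (nr == 64) ? ~UINT64_C(0)
					   : ((UINT64_C(1) << nr) - 1);
		unsigned int base;

		for (base = 0; base < SLOTS_PER_CLUSTER; base += nr)
			if (!(used_map & (mask << base)))
				return order;
	}
	return -1;
}
```

In this model a single in-use slot drops an empty cluster from order 6
to order 5, and four in-use slots spaced 16 apart drop it to order 3.
Getting back to order 6 requires all of them to be freed, which is the
asymmetry described above.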

> >> But yes, data is needed for any performance related change.
>
> BTW: I think non-full cluster isn't a good name.  Partial cluster is
> much better and follows the same convention as partial slab.

I am not opposed to it. The only reason I am holding off on the rename
is that patches from Kairui that I am testing depend on it.
Let's finish up the V5 patch with the swap cache reclaim code path,
then do the renaming as one batch job. We actually have more than one
list that holds partially full clusters. That helps reduce repeated
scanning of a cluster that is not full but also cannot allocate
swap entries for this order.  Naming just one of them "partial" is
not precise either, because the clusters on the other lists are also
partially full. We had better give them precise names systematically.

Chris

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Chris Li <chrisl@kernel.org> writes:

> On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > If the freeing of swap entry is random distribution. You need 16
>> > continuous swap entries free at the same time at aligned 16 base
>> > locations. The total number of order 4 free swap space add up together
>> > is much lower than the order 0 allocatable swap space.
>> > If having one entry free is 50% probability(swapfile half full), then
>> > having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
>> > If the swapfile is 80% full, that number drops to 6.5 E -12.
>>
>> This depends on workloads.  Quite some workloads will show some degree
>> of spatial locality.  For a workload with no spatial locality at all as
>> above, mTHP may be not a good choice at the first place.
>
> The fragmentation comes from the order 0 entry not from the mTHP. mTHP
> have their own valid usage case, and should be separate from how you
> use the order 0 entry. That is why I consider this kind of strategy
> only works on the lucky case. I would much prefer the strategy that
> can guarantee work not depend on luck.

It seems that you have some perfect solution.  Will learn it when you
post it.

>> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full
>> >>   clusters available.
>> >
>> > Exactly.
>> >
>> >>
>> >> So, we need a way to migrate non-full clusters among orders to adjust to
>> >> the various situations automatically.
>> >
>> > There is no easy way to migrate swap entries to different locations.
>> > That is why I like to have discontiguous swap entries allocation for
>> > mTHP.
>>
>> We suggest to migrate non-full swap clsuters among different lists, not
>> swap entries.
>
> Then you have the down side of reducing the number of total high order
> clusters. By chance it is much easier to fragment the cluster than
> anti-fragment a cluster.  The orders of clusters have a natural
> tendency to move down rather than move up, given long enough time of
> random access. It will likely run out of high order clusters in the
> long run if we don't have any separation of orders.

As in my example above, you may have almost 0 high-order clusters
forever.  So, your solution only works for very specific use cases.
It's not a general solution.

>> >> But yes, data is needed for any performance related change.
>>
>> BTW: I think non-full cluster isn't a good name.  Partial cluster is
>> much better and follows the same convention as partial slab.
>
> I am not opposed to it. The only reason I hold off on the rename is
> because there are patches from Kairui I am testing depending on it.
> Let's finish up the V5 patch with the swap cache reclaim code path
> then do the renaming as one batch job. We actually have more than one
> list that has the clusters partially full. It helps reduce the repeat
> scan of the cluster that is not full but also not able to allocate
> swap entries for this order.  Just the name of one of them as
> "partial" is not precise either. Because the other lists are also
> partially full. We'd better give them precise meaning systematically.

I don't think that it's hard to do a search/replace before the next
version.

--
Best Regards,
Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Chris Li 1 year, 6 months ago

On Thu, Jul 25, 2024 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > If the freeing of swap entry is random distribution. You need 16
> >> > continuous swap entries free at the same time at aligned 16 base
> >> > locations. The total number of order 4 free swap space add up together
> >> > is much lower than the order 0 allocatable swap space.
> >> > If having one entry free is 50% probability(swapfile half full), then
> >> > having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
> >> > If the swapfile is 80% full, that number drops to 6.5 E -12.
> >>
> >> This depends on workloads.  Quite some workloads will show some degree
> >> of spatial locality.  For a workload with no spatial locality at all as
> >> above, mTHP may be not a good choice at the first place.
> >
> > The fragmentation comes from the order 0 entry not from the mTHP. mTHP
> > have their own valid usage case, and should be separate from how you
> > use the order 0 entry. That is why I consider this kind of strategy
> > only works on the lucky case. I would much prefer the strategy that
> > can guarantee work not depend on luck.
>
> It seems that you have some perfect solution.  Will learn it when you
> post it.

No, I don't have a perfect solution. I see putting a limit on order-0
swap usage and writing out a folio to discontiguous swap entries as
more deterministic approaches that do not depend on luck. Both have
their price to pay as well.
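
As a rough illustration of the first option, a cap on order-0 usage
could be a per-device check made before an empty cluster is handed to
order 0.  This thread does not specify how such a limit would be
expressed, so struct swap_quota, the order0_max_pct tunable and
order0_may_take_cluster() below are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct swap_quota {
	unsigned long total_clusters;	/* clusters on the device */
	unsigned long order0_clusters;	/* clusters currently serving order 0 */
	unsigned int order0_max_pct;	/* hypothetical tunable, 0..100 */
};

/*
 * Return true if order 0 may claim one more cluster without exceeding
 * its share of the device.  Integer math only, no rounding surprises.
 */
static bool order0_may_take_cluster(const struct swap_quota *q)
{
	assert(q->order0_max_pct <= 100);
	return (q->order0_clusters + 1) * 100 <=
	       q->total_clusters * q->order0_max_pct;
}
```

The price in this sketch is that an order-0 request can be refused
while free swap space remains, because that space is held back for
higher orders.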

>
> >> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full
> >> >>   clusters available.
> >> >
> >> > Exactly.
> >> >
> >> >>
> >> >> So, we need a way to migrate non-full clusters among orders to adjust to
> >> >> the various situations automatically.
> >> >
> >> > There is no easy way to migrate swap entries to different locations.
> >> > That is why I like to have discontiguous swap entries allocation for
> >> > mTHP.
> >>
> >> We suggest to migrate non-full swap clsuters among different lists, not
> >> swap entries.
> >
> > Then you have the down side of reducing the number of total high order
> > clusters. By chance it is much easier to fragment the cluster than
> > anti-fragment a cluster.  The orders of clusters have a natural
> > tendency to move down rather than move up, given long enough time of
> > random access. It will likely run out of high order clusters in the
> > long run if we don't have any separation of orders.
>
> As my example above, you may have almost 0 high-order clusters forever.
> So, your solution only works for very specific use cases.  It's not a
> general solution.

One simple solution is an optional limit on order-0 swap usage.
I understand you don't like that option, but so far there is no other
easy solution that achieves the same effectiveness. If there is, I
would like to hear it.

>
> >> >> But yes, data is needed for any performance related change.
> >>
> >> BTW: I think non-full cluster isn't a good name.  Partial cluster is
> >> much better and follows the same convention as partial slab.
> >
> > I am not opposed to it. The only reason I hold off on the rename is
> > because there are patches from Kairui I am testing depending on it.
> > Let's finish up the V5 patch with the swap cache reclaim code path
> > then do the renaming as one batch job. We actually have more than one
> > list that has the clusters partially full. It helps reduce the repeat
> > scan of the cluster that is not full but also not able to allocate
> > swap entries for this order.  Just the name of one of them as
> > "partial" is not precise either. Because the other lists are also
> > partially full. We'd better give them precise meaning systematically.
>
> I don't think that it's hard to do a search/replace before the next
> version.

The overhead is on the other internal experimental patches. Again,
I am not opposed to renaming it. I just want to do it in one batch,
including the other list names, rather than many times.

Chris

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Chris Li <chrisl@kernel.org> writes:

> On Thu, Jul 25, 2024 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> > If the freeing of swap entry is random distribution. You need 16
>> >> > continuous swap entries free at the same time at aligned 16 base
>> >> > locations. The total number of order 4 free swap space add up together
>> >> > is much lower than the order 0 allocatable swap space.
>> >> > If having one entry free is 50% probability(swapfile half full), then
>> >> > having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
>> >> > If the swapfile is 80% full, that number drops to 6.5 E -12.
>> >>
>> >> This depends on workloads.  Quite some workloads will show some degree
>> >> of spatial locality.  For a workload with no spatial locality at all as
>> >> above, mTHP may be not a good choice at the first place.
>> >
>> > The fragmentation comes from the order 0 entry not from the mTHP. mTHP
>> > have their own valid usage case, and should be separate from how you
>> > use the order 0 entry. That is why I consider this kind of strategy
>> > only works on the lucky case. I would much prefer the strategy that
>> > can guarantee work not depend on luck.
>>
>> It seems that you have some perfect solution.  Will learn it when you
>> post it.
>
> No, I don't have perfect solutions. I see puting limit on order 0 swap
> usage and writing out discontinuous swap entries from a folio are more
> deterministic and not depend on luck. Both have their price to pay as
> well.
>
>>
>> >> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full
>> >> >>   clusters available.
>> >> >
>> >> > Exactly.
>> >> >
>> >> >>
>> >> >> So, we need a way to migrate non-full clusters among orders to adjust to
>> >> >> the various situations automatically.
>> >> >
>> >> > There is no easy way to migrate swap entries to different locations.
>> >> > That is why I like to have discontiguous swap entries allocation for
>> >> > mTHP.
>> >>
>> >> We suggest to migrate non-full swap clsuters among different lists, not
>> >> swap entries.
>> >
>> > Then you have the down side of reducing the number of total high order
>> > clusters. By chance it is much easier to fragment the cluster than
>> > anti-fragment a cluster.  The orders of clusters have a natural
>> > tendency to move down rather than move up, given long enough time of
>> > random access. It will likely run out of high order clusters in the
>> > long run if we don't have any separation of orders.
>>
>> As my example above, you may have almost 0 high-order clusters forever.
>> So, your solution only works for very specific use cases.  It's not a
>> general solution.
>
> One simple solution is having an optional limitation of 0 order swap.
> I understand you don't like that option, but there is no other easy
> solution to achieve the same effectiveness, so far. If there is, I
> like to hear it.

Just as you said, it's optional, so it's not a general solution.  If it
were applied in general, it may trigger OOM.

>>
>> >> >> But yes, data is needed for any performance related change.
>> >>
>> >> BTW: I think non-full cluster isn't a good name.  Partial cluster is
>> >> much better and follows the same convention as partial slab.
>> >
>> > I am not opposed to it. The only reason I hold off on the rename is
>> > because there are patches from Kairui I am testing depending on it.
>> > Let's finish up the V5 patch with the swap cache reclaim code path
>> > then do the renaming as one batch job. We actually have more than one
>> > list that has the clusters partially full. It helps reduce the repeat
>> > scan of the cluster that is not full but also not able to allocate
>> > swap entries for this order.  Just the name of one of them as
>> > "partial" is not precise either. Because the other lists are also
>> > partially full. We'd better give them precise meaning systematically.
>>
>> I don't think that it's hard to do a search/replace before the next
>> version.
>
> The overhead is on the other internal experimental patches. Again,
> I am not opposed to renaming it. Just want to do it at one batch not
> many times, including other list names.

--
Best Regards,
Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Chris Li 1 year, 6 months ago

On Fri, Jul 26, 2024 at 12:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Thu, Jul 25, 2024 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> > If the freeing of swap entry is random distribution. You need 16
> >> >> > continuous swap entries free at the same time at aligned 16 base
> >> >> > locations. The total number of order 4 free swap space add up together
> >> >> > is much lower than the order 0 allocatable swap space.
> >> >> > If having one entry free is 50% probability(swapfile half full), then
> >> >> > having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
> >> >> > If the swapfile is 80% full, that number drops to 6.5 E -12.
> >> >>
> >> >> This depends on workloads.  Quite some workloads will show some degree
> >> >> of spatial locality.  For a workload with no spatial locality at all as
> >> >> above, mTHP may be not a good choice at the first place.
> >> >
> >> > The fragmentation comes from the order 0 entry not from the mTHP. mTHP
> >> > have their own valid usage case, and should be separate from how you
> >> > use the order 0 entry. That is why I consider this kind of strategy
> >> > only works on the lucky case. I would much prefer the strategy that
> >> > can guarantee work not depend on luck.
> >>
> >> It seems that you have some perfect solution.  Will learn it when you
> >> post it.
> >
> > No, I don't have perfect solutions. I see puting limit on order 0 swap
> > usage and writing out discontinuous swap entries from a folio are more
> > deterministic and not depend on luck. Both have their price to pay as
> > well.
> >
> >>
> >> >> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full
> >> >> >>   clusters available.
> >> >> >
> >> >> > Exactly.
> >> >> >
> >> >> >>
> >> >> >> So, we need a way to migrate non-full clusters among orders to adjust to
> >> >> >> the various situations automatically.
> >> >> >
> >> >> > There is no easy way to migrate swap entries to different locations.
> >> >> > That is why I like to have discontiguous swap entries allocation for
> >> >> > mTHP.
> >> >>
> >> >> We suggest to migrate non-full swap clsuters among different lists, not
> >> >> swap entries.
> >> >
> >> > Then you have the down side of reducing the number of total high order
> >> > clusters. By chance it is much easier to fragment the cluster than
> >> > anti-fragment a cluster.  The orders of clusters have a natural
> >> > tendency to move down rather than move up, given long enough time of
> >> > random access. It will likely run out of high order clusters in the
> >> > long run if we don't have any separation of orders.
> >>
> >> As my example above, you may have almost 0 high-order clusters forever.
> >> So, your solution only works for very specific use cases.  It's not a
> >> general solution.
> >
> > One simple solution is having an optional limitation of 0 order swap.
> > I understand you don't like that option, but there is no other easy
> > solution to achieve the same effectiveness, so far. If there is, I
> > like to hear it.
>
> Just as you said, it's optional, so it's not general solution.  This may
> trigger OOM in general solution.

Agreed, it is not a general solution. This option is simple and useful.
The more general solution is to just write out discontiguous swap entries.

Chris

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Chris Li <chrisl@kernel.org> writes:

> On Fri, Jul 26, 2024 at 12:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Thu, Jul 25, 2024 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Chris Li <chrisl@kernel.org> writes:
>> >>
>> >> > On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> > If the freeing of swap entry is random distribution. You need 16
>> >> >> > continuous swap entries free at the same time at aligned 16 base
>> >> >> > locations. The total number of order 4 free swap space add up together
>> >> >> > is much lower than the order 0 allocatable swap space.
>> >> >> > If having one entry free is 50% probability(swapfile half full), then
>> >> >> > having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
>> >> >> > If the swapfile is 80% full, that number drops to 6.5 E -12.
>> >> >>
>> >> >> This depends on workloads.  Quite some workloads will show some degree
>> >> >> of spatial locality.  For a workload with no spatial locality at all as
>> >> >> above, mTHP may be not a good choice at the first place.
>> >> >
>> >> > The fragmentation comes from the order 0 entry not from the mTHP. mTHP
>> >> > have their own valid usage case, and should be separate from how you
>> >> > use the order 0 entry. That is why I consider this kind of strategy
>> >> > only works on the lucky case. I would much prefer the strategy that
>> >> > can guarantee work not depend on luck.
>> >>
>> >> It seems that you have some perfect solution.  Will learn it when you
>> >> post it.
>> >
>> > No, I don't have perfect solutions. I see puting limit on order 0 swap
>> > usage and writing out discontinuous swap entries from a folio are more
>> > deterministic and not depend on luck. Both have their price to pay as
>> > well.
>> >
>> >>
>> >> >> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full
>> >> >> >>   clusters available.
>> >> >> >
>> >> >> > Exactly.
>> >> >> >
>> >> >> >>
>> >> >> >> So, we need a way to migrate non-full clusters among orders to adjust to
>> >> >> >> the various situations automatically.
>> >> >> >
>> >> >> > There is no easy way to migrate swap entries to different locations.
>> >> >> > That is why I like to have discontiguous swap entries allocation for
>> >> >> > mTHP.
>> >> >>
>> >> >> We suggest to migrate non-full swap clsuters among different lists, not
>> >> >> swap entries.
>> >> >
>> >> > Then you have the down side of reducing the number of total high order
>> >> > clusters. By chance it is much easier to fragment the cluster than
>> >> > anti-fragment a cluster.  The orders of clusters have a natural
>> >> > tendency to move down rather than move up, given long enough time of
>> >> > random access. It will likely run out of high order clusters in the
>> >> > long run if we don't have any separation of orders.
>> >>
>> >> As my example above, you may have almost 0 high-order clusters forever.
>> >> So, your solution only works for very specific use cases.  It's not a
>> >> general solution.
>> >
>> > One simple solution is having an optional limitation of 0 order swap.
>> > I understand you don't like that option, but there is no other easy
>> > solution to achieve the same effectiveness, so far. If there is, I
>> > like to hear it.
>>
>> Just as you said, it's optional, so it's not general solution.  This may
>> trigger OOM in general solution.
>
> Agree it is not a general solution. This option is simple and useful.
> The more general solution is just write out discontiguous swap entries.

I just don't know how to do that.  For example, how would the folio be put
in the swap cache?  I will wait for you to show the implementation.

--
Best Regards,
Huang, Ying

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Chris Li 1 year, 6 months ago

Hi Ryan and Ying,

Sorry I was busy. I am catching up on the email now.

On Wed, Jul 24, 2024 at 1:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 23/07/2024 07:27, Huang, Ying wrote:
> > Ryan Roberts <ryan.roberts@arm.com> writes:
> >
> >> On 22/07/2024 09:49, Huang, Ying wrote:
> >>> Ryan Roberts <ryan.roberts@arm.com> writes:
> >>>
> >>>> On 22/07/2024 03:14, Huang, Ying wrote:
> >>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
> >>>>>
> >>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
> >>>>>>> Chris Li <chrisl@kernel.org> writes:
> >>>>>>>
> >>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
> >>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
> >>>>>
> >>>>> [snip]
> >>>>>
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >>>>>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
> >>>>>>>>>>>
> >>>>>>>>>>> I find the transitions when you add and remove a cluster from the
> >>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
> >>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
> >>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
> >>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
> >>>>>>>>>>>
> >>>>>>>>>>> So you could have this situation:
> >>>>>>>>>>>
> >>>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
> >>>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
> >>>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
> >>>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
> >>>>>>>>>>>
> >>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
> >>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
> >>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
> >>>>>>>>>>
> >>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
> >>>>>>>>>> moving it into nonfull.
> >>>>>>>>>
> >>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
> >>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
> >>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
> >>>>>>>>> refactoring separated from behavioural changes.
> >>>>>>>>
> >>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
> >>>>>>>> using the cluster. Behavior change is expected. The goal is completely
> >>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
> >>>>>>>> allocation.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
> >>>>>>>>> deferring review. But sounds like it is actually required to realize the test
> >>>>>>>>> results quoted on the cover letter?
> >>>>>>>>
> >>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
> >>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
> >>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
> >>>>>>>> because it is not feature complete. Currently it does not do swap
> >>>>>>>> cache reclaim. The next version will have swap cache reclaim and
> >>>>>>>> remove the RFC.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
> >>>>>>>>>>
> >>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
> >>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
> >>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
> >>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
> >>>>>>>>>>> chances of multiple CPUs using the same cluster.
> >>>>>>>>>>
> >>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
> >>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
> >>>>>>>>>> entries allocated from the previous CPU.
> >>>>>>>>>
> >>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
> >>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
> >>>>>>>>
> >>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
> >>>>>>>> advances to the next cluster pointer, it can cross with the other
> >>>>>>>> CPU's next cluster pointer.
> >>>>>>>
> >>>>>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
> >>>>>>> cluster only.  If it doesn't do that, we should fix it.
> >>>>>>>
> >>>>>>> I agree with Ryan that we should make per cpu cluster correct.  A
> >>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
> >>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
> >>>>>>> list if necessary.  And, we should make it correct in this patch instead
> >>>>>>> of later in series.  I understand that you want to make the patch itself
> >>>>>>> simple, but it's important to make code simple to be understood too.
> >>>>>>> Consistent design choice will do that.
> >>>>>>
> >>>>>> I think I'm actually arguing for the opposite of what you suggest here.
> >>>>>
> >>>>> Sorry, I misunderstood your words.
> >>>>>
> >>>>>> As I see it, there are 2 possible approaches; either a cluster is always
> >>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
> >>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
> >>>>>> case, in which case it should be added to the nonfull list.
> >>>>>>
> >>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
> >>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
> >>>>>> as a single swap entry is freed from that cluster it is put back on the list.
> >>>>>> This neither-one-policy-nor-the-other seems odd to me.
> >>>>>>
> >>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
> >>>>>> per-cpu cluster.
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>>> I was arguing to make it always shared. Perhaps the best
> >>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
> >>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
> >>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
> >>>>>> the shared approach as part of the "big rewrite"?
> >>>>>>>
> >>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
> >>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
> >>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
> >>>>>>>>> way to help that?
> >>>>>>>>
> >>>>>>>> Simply moving to the end of the list can create a possible deadloop
> >>>>>>>> when all clusters have been scanned and not available swap range
> >>>>>>>> found.
> >>>>>
> >>>>> I also think that the shared approach has dead loop issue.
> >>>>
> >>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
> >>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
> >>>> go forever? That's surely just an implementation issue to solve? It's not a
> >>>> reason to avoid the design principle; if we agree that maintaining sharability
> >>>> of the cluster is preferred then the code must be written to guard against the
> >>>> dead loop problem. It could be done by remembering the first cluster you
> >>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
> >>>> to it. (I think holding the si lock will protect against concurrently freeing
> >>>> the cluster so it should definitely remain in the list?).
> >>>
> >>> I believe that you can find some way to avoid the dead loop issue,
> >>> although your suggestion may kill the performance via looping a long list
> >>> of nonfull clusters.
> >>
> >> I don't agree; If the clusters are considered exclusive (i.e. removed from the
> >> list when made current for a cpu), that only reduces the size of the list by a
> >> maximum of the number of CPUs in the system, which I suspect is pretty small
> >> compared to the number of nonfull clusters.
> >
> > Anyway, this depends on details.  If we cannot allocate a order-N swap
> > entry from the cluster, we should remove it from the nonfull list for
> > order-N (This is the behavior of this patch too).

Yes, Kairui implements something like that in the reclaim part of the
patch series. It comes after patch 3. We are heavily testing the
performance and the stability of the reclaim patches. May I post the
reclaim patches together with patch 3 for discussion? If you want, we can
discuss reordering the patches in a later iteration.

>
> Yes that's a good point, and I conceed it is more difficult to detect that
> condition if the cluster is shared. I suspect that with a bit of thinking, we
> could find a way though.

Kairui has a patch series that shows good performance numbers, beating
the current swap cache reclaim.

I want to make a point regarding the patch ordering before vs. after
patch 3 (aka the big rewrite).
Previously, scan_swap_map_try_ssd_cluster() only did partial
allocation. It does not successfully allocate a swap entry 100% of the
time.  Patch 3 makes the cluster allocation function return a
swap entry 100% of the time. There are no more fallback retry loops
outside of the cluster allocation function. Also, the try_ssd function
does not do swap cache reclaim, while the cluster allocation function
will need to. These two have very different constraints.

Therefore, adding a different cluster header to
scan_swap_map_try_ssd_cluster() would waste a lot of development time:
that function will need to be rewritten anyway, and the end result is
very different.

That is why I want to make this change after patch 3. There is
also a long test cycle after each modification to make sure the swap
code path is stable. I am not resisting a change of patch order; it
is just that the patch can't be moved directly before patch 3, ahead of
the big rewrite.


>
> > Your original
> > suggestion appears like that you want to keep all cluster with order-N
> > on the nonfull list for order-N always unless the number of free swap
> > entry is less than 1<<N.
>
> Well I think that's certainly one of the conditions for removing it. But agree
> that if a full scan of the cluster has been performed and no swap entries have
> been freed since the scan started then it should also be removed from the list.

Yes, in the later patches of the series, beyond patch 3, we have the
almost-full cluster, for a cluster that has been scanned and was not able
to allocate an order-N entry.

>
> >
> >>> And, I understand that in some situations it may
> >>> be better to share clusters among CPUs.  So my suggestion is,
> >>>
> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
> >>>   have free swap entries with that order even after we are sure that we
> >>>   haven't.
> >>
> >> Is this patch pretending that today? I don't think so?
> >
> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> > sure that there are no order-N free swap entry in the cluster.
>
> Oh I see what you mean. I think you and Chris already discussed this? IIRC
> Chris's point was that if you move that cluster to N-1, eventually all clusters
> are for order-0 and you have no means of allocating high orders until a whole
> cluster becomes free. That logic certainly makes sense to me, so think its
> better for swap_cluster_info->order to remain static while the cluster is
> allocated. (I only skimmed that conversation so appologies if I got the
> conclusion wrong!).

Yes, that is the original intent: keep the cluster order as much as possible.

>
> >
> >> But I agree that a
> >> cluster should only be on the per-order nonfull list if we know there are at
> >> least enough free swap entries in that cluster to cover the order. Of course
> >> that doesn't tell us for sure because they may not be contiguous.
> >
> > We can check that when free swap entry via checking adjacent swap
> > entries.  IMHO, the performance should be acceptable.
>
> Would you then use the result of that scanning to "promote" a cluster's order?
> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> a separate change on top of what Chris is doing here. For high orders there
> could be quite a bit of scanning required in the worst case for every page that
> gets freed.

Right, I feel that is a different set of patches. Even this series is
hard enough to review. Order promotion and demotion are heading
towards a buddy system design. I want to point out that even the buddy
system is not able to handle the case where the swapfile is almost full and
the recently freed swap entries are not contiguous.

We can invest in the buddy system, which doesn't handle all the
fragmentation issues. Or, as I prefer, we can go directly to
discontiguous swap entries. We pay a price for the indirect mapping of
swap entries, but it will solve the fragmentation issue 100%.


>
> >
> >>>
> >>> My question is whether it's so important to share the per-cpu cluster
> >>> among CPUs?
> >>
> >> My rationale for sharing is that the preference previously has been to favour
> >> efficient use of swap space; we don't want to fail a request for allocation of a
> >> given order if there are actually slots available just because they have been
> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> >> actually help improve allocation success, then I'm happy to take the exclusive
> >> approach.
> >>
> >>> I suggest to start with simple design, that is, per-CPU
> >>> cluster will not be shared among CPUs in most cases.
> >>
> >> I'm all for starting simple; I think that's what I already proposed (exclusive
> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> >> current half-and-half policy in this patch.
> >
> > Sounds good to me.  We can start with exclusive solution and evaluate
> > whether shared solution is good.
>
> Yep. And also evaluate the dynamic order inc/dec idea too...

It is not able to avoid fragmentation 100% of the time. I prefer
discontiguous swap entries as the next step, which guarantees forward
progress: we will not be stuck in a situation where we are not able to
allocate swap entries due to fragmentation.

Chris

>
> >
> >>>
> >>> Another choice for sharing is when we run short of free swap space, we
> >>> disable per-CPU cluster and allocate from the shared non-full cluster
> >>> list directly.
> >>>
> >>>> Which actually makes me wonder; what is the mechanism that prevents the current
> >>>> per-cpu cluster from being freed? Is that just handled by the conflict detection
> >>>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
> >>>> raise count when its current. (If Chris has implemented that in the "big
> >>>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
> >>>
> >>> Yes.  We may need a flag for that.
> >>>
> >>>>>
> >>>>>>> This is another reason that we should put the cluster in
> >>>>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
> >>>>>>> in the cluster.  It makes design complex to keep it in
> >>>>>>> nonfull_clusters[order].
> >>>>>>>
> >>>>>>>> We have tried many different approaches including moving to the end of
> >>>>>>>> the list. It can cause more fragmentation because each CPU allocates
> >>>>>>>> their swap slot cache (64 entries) from a different cluster.
> >>>>>>>>
> >>>>>>>>>> Those behaviors will be fine
> >>>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
> >>>>>>>>
> >>>>>>>> Again, I want to keep it simple here so patch 3 can land.
> >>>>>>>>
> >>>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
> >>>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
> >>>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
> >>>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
> >>>>>>>>>>
> >>>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
> >>>>>>>>>> to introduce the nonfull list.
> >>>>>>>>>>
> >>>>>
> >>>>> [snip]
> >
> > --
> > Best Regards,
> > Huang, Ying
>

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Chris Li <chrisl@kernel.org> writes:

> Hi Ryan and Ying,
>
> Sorry I was busy. I am catching up on the email now.
>
> On Wed, Jul 24, 2024 at 1:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 23/07/2024 07:27, Huang, Ying wrote:
>> > Ryan Roberts <ryan.roberts@arm.com> writes:
>> >
>> >> On 22/07/2024 09:49, Huang, Ying wrote:
>> >>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >>>
>> >>>> On 22/07/2024 03:14, Huang, Ying wrote:
>> >>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >>>>>
>> >>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>> >>>>>>> Chris Li <chrisl@kernel.org> writes:
>> >>>>>>>
>> >>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>> >>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>> >>>>>
>> >>>>> [snip]
>> >>>>>
>> >>>>>>>>>>>> +
>> >>>>>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> >>>>>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>> >>>>>>>>>>>
>> >>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>> >>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> >>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>> >>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>> >>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>> >>>>>>>>>>>
>> >>>>>>>>>>> So you could have this situation:
>> >>>>>>>>>>>
>> >>>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>> >>>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
>> >>>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>> >>>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>> >>>>>>>>>>>
>> >>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>> >>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>> >>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>> >>>>>>>>>>
>> >>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>> >>>>>>>>>> moving it into nonfull.
>> >>>>>>>>>
>> >>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>> >>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>> >>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>> >>>>>>>>> refactoring separated from behavioural changes.
>> >>>>>>>>
>> >>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>> >>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>> >>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>> >>>>>>>> allocation.
>> >>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>> >>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>> >>>>>>>>> results quoted on the cover letter?
>> >>>>>>>>
>> >>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>> >>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>> >>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>> >>>>>>>> because it is not feature complete. Currently it does not do swap
>> >>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>> >>>>>>>> remove the RFC.
>> >>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>> >>>>>>>>>>
>> >>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> >>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>> >>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>> >>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> >>>>>>>>>>> chances of multiple CPUs using the same cluster.
>> >>>>>>>>>>
>> >>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>> >>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>> >>>>>>>>>> entries allocated from the previous CPU.
>> >>>>>>>>>
>> >>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>> >>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>> >>>>>>>>
>> >>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>> >>>>>>>> advances to the next cluster pointer, it can cross with the other
>> >>>>>>>> CPU's next cluster pointer.
>> >>>>>>>
>> >>>>>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
>> >>>>>>> cluster only.  If it doesn't do that, we should fix it.
>> >>>>>>>
>> >>>>>>> I agree with Ryan that we should make per cpu cluster correct.  A
>> >>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
>> >>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>> >>>>>>> list if necessary.  And, we should make it correct in this patch instead
>> >>>>>>> of later in series.  I understand that you want to make the patch itself
>> >>>>>>> simple, but it's important to make code simple to be understood too.
>> >>>>>>> Consistent design choice will do that.
>> >>>>>>
>> >>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>> >>>>>
>> >>>>> Sorry, I misunderstood your words.
>> >>>>>
>> >>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>> >>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>> >>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>> >>>>>> case, in which case it should be added to the nonfull list.
>> >>>>>>
>> >>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>> >>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>> >>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>> >>>>>> This neither-one-policy-nor-the-other seems odd to me.
>> >>>>>>
>> >>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>> >>>>>> per-cpu cluster.
>> >>>>>
>> >>>>> Yes.
>> >>>>>
>> >>>>>> I was arguing to make it always shared. Perhaps the best
>> >>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>> >>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>> >>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>> >>>>>> the shared approach as part of the "big rewrite"?
>> >>>>>>>
>> >>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>> >>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>> >>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>> >>>>>>>>> way to help that?
>> >>>>>>>>
>> >>>>>>>> Simply moving to the end of the list can create a possible deadloop
>> >>>>>>>> when all clusters have been scanned and not available swap range
>> >>>>>>>> found.
>> >>>>>
>> >>>>> I also think that the shared approach has dead loop issue.
>> >>>>
>> >>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>> >>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
>> >>>> go forever? That's surely just an implementation issue to solve? It's not a
>> >>>> reason to avoid the design principle; if we agree that maintaining sharability
>> >>>> of the cluster is preferred then the code must be written to guard against the
>> >>>> dead loop problem. It could be done by remembering the first cluster you
>> >>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>> >>>> to it. (I think holding the si lock will protect against concurrently freeing
>> >>>> the cluster so it should definitely remain in the list?).
>> >>>
>> >>> I believe that you can find some way to avoid the dead loop issue,
>> >>> although your suggestion may kill the performance via looping a long list
>> >>> of nonfull clusters.
>> >>
>> >> I don't agree; If the clusters are considered exclusive (i.e. removed from the
>> >> list when made current for a cpu), that only reduces the size of the list by a
>> >> maximum of the number of CPUs in the system, which I suspect is pretty small
>> >> compared to the number of nonfull clusters.
>> >
>> > Anyway, this depends on details.  If we cannot allocate a order-N swap
>> > entry from the cluster, we should remove it from the nonfull list for
>> > order-N (This is the behavior of this patch too).
>
> Yes, Kairui implements something like that in the reclaim part of the
> patch series. It is after patch 3. We are heavily testing the
> performance and the stability of the reclaim patches. May I post the
> reclaim together with patch 3 for discussion. If you want we can
> discuss the re-order the patch in a later iteration.
>
>>
>> Yes that's a good point, and I concede it is more difficult to detect that
>> condition if the cluster is shared. I suspect that with a bit of thinking, we
>> could find a way though.
>
> Kairui has a patch series that shows good performance numbers, beating
> the current swap cache reclaim.
>
> I want to make a point regarding the patch ordering before vs after
> patch 3 (aka the big rewrite).
> Previously, scan_swap_map_try_ssd_cluster() only did partial
> allocation. It does not successfully allocate a swap entry 100% of the
> time.  Patch 3 makes the cluster allocation function return the
> swap entry 100% of the time. There are no more fallback retry loops
> outside of the cluster allocation function. Also the try_ssd function
> does not do swap cache reclaims while the cluster allocation function
> will need to. These two have very different constraints.
>
> Therefore, adding a different cluster header into
> scan_swap_map_try_ssd_cluster() would be a lot of wasted development
> time, in the sense that the function will need to be rewritten
> anyway, and the end result is very different.

I am not a big fan of implementing the final solution directly.
Personally, I prefer to improve step by step.

> That is why I want to make this change patch after patch 3. There is
> also the long test cycle after the modification to make sure the swap
> code path is stable. I am not resisting a change of patch orders, it
> is that patch can't directly be removed before patch 3 before the big
> rewrite.
>
>
>>
>> > Your original
>> > suggestion appears like that you want to keep all cluster with order-N
>> > on the nonfull list for order-N always unless the number of free swap
>> > entry is less than 1<<N.
>>
>> Well I think that's certainly one of the conditions for removing it. But agree
>> that if a full scan of the cluster has been performed and no swap entries have
>> been freed since the scan started then it should also be removed from the list.
>
> Yes, in the later patches of the series, beyond patch 3, we have the
> almost-full cluster, for a cluster that has been scanned and was not
> able to allocate an order-N entry.
>
>>
>> >
>> >>> And, I understand that in some situations it may
>> >>> be better to share clusters among CPUs.  So my suggestion is,
>> >>>
>> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >>>   have free swap entries with that order even after we are sure that we
>> >>>   haven't.
>> >>
>> >> Is this patch pretending that today? I don't think so?
>> >
>> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> > sure that there are no order-N free swap entry in the cluster.
>>
>> Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> Chris's point was that if you move that cluster to N-1, eventually all clusters
>> are for order-0 and you have no means of allocating high orders until a whole
>> cluster becomes free. That logic certainly makes sense to me, so I think it's
>> better for swap_cluster_info->order to remain static while the cluster is
>> allocated. (I only skimmed that conversation so apologies if I got the
>> conclusion wrong!).
>
> Yes, that is the original intent, keep the cluster order as much as possible.
>
>>
>> >
>> >> But I agree that a
>> >> cluster should only be on the per-order nonfull list if we know there are at
>> >> least enough free swap entries in that cluster to cover the order. Of course
>> >> that doesn't tell us for sure because they may not be contiguous.
>> >
>> > We can check that when free swap entry via checking adjacent swap
>> > entries.  IMHO, the performance should be acceptable.
>>
>> Would you then use the result of that scanning to "promote" a cluster's order?
>> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> a separate change on top of what Chris is doing here. For high orders there
>> could be quite a bit of scanning required in the worst case for every page that
>> gets freed.
>
> Right, I feel that is a different set of patches. Even this series is
> hard enough for review. Those order promotion and demotion is heading
> towards a buddy system design. I want to point out that even the buddy
> system is not able to handle the case that swapfile is almost full and
> the recently freed swap entries are not contiguous.
>
> We can invest in the buddy system, which doesn't handle all the
> fragmentation issues. Or I prefer to go directly to the discontiguous
> swap entry. We pay a price for the indirect mapping of swap entries.
> But it will solve the fragmentation issue 100%.

It's good if we can solve the fragmentation issue 100%.  Just need to
pay attention to the cost.

>>
>> >
>> >>>
>> >>> My question is whether it's so important to share the per-cpu cluster
>> >>> among CPUs?
>> >>
>> >> My rationale for sharing is that the preference previously has been to favour
>> >> efficient use of swap space; we don't want to fail a request for allocation of a
>> >> given order if there are actually slots available just because they have been
>> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >> actually help improve allocation success, then I'm happy to take the exclusive
>> >> approach.
>> >>
>> >>> I suggest to start with simple design, that is, per-CPU
>> >>> cluster will not be shared among CPUs in most cases.
>> >>
>> >> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >> current half-and-half policy in this patch.
>> >
>> > Sounds good to me.  We can start with exclusive solution and evaluate
>> > whether shared solution is good.
>>
>> Yep. And also evaluate the dynamic order inc/dec idea too...
>
> It is not able to avoid fragmentation 100% of the time. I prefer the
> discontiguous swap entry as the next step, which guarantees forward
> progress: we will not be stuck in a situation where we are not able to
> allocate swap entries due to fragmentation.

If my understanding is correct, the implementation complexity of
order promotion/demotion isn't at the same level as that of the
discontiguous swap entry.

--
Best Regards,
Huang, Ying
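
As a point of reference for the cost discussion, the cheapest check
mentioned in this thread is the free-count threshold: a cluster cannot
serve order N once it has fewer than 1 << N free slots. A minimal
userspace sketch in C follows. The names struct cluster, CLUSTER_SLOTS
and cluster_may_fit_order() are hypothetical and exist only for
illustration, and the 512-slot cluster size is an assumption; this is
not the kernel code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed number of swap slots per cluster (illustrative value only). */
#define CLUSTER_SLOTS 512u

/* Hypothetical, pared-down stand-in for struct swap_cluster_info. */
struct cluster {
	unsigned int count;	/* slots currently in use */
	unsigned int order;	/* order this cluster was assigned to */
};

/*
 * Necessary (but not sufficient) condition for the cluster to satisfy an
 * allocation of its own order: at least 1 << order slots must be free.
 * The free slots may still be non-contiguous.
 */
static bool cluster_may_fit_order(const struct cluster *ci)
{
	assert(ci->count <= CLUSTER_SLOTS);
	return CLUSTER_SLOTS - ci->count >= (1u << ci->order);
}
```

With only 3 free slots an order-2 request (4 slots) is rejected, which
matches the example given earlier in the thread. Passing this check does
not guarantee success, since it says nothing about contiguity.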

>
>>
>> >
>> >>>
>> >>> Another choice for sharing is when we run short of free swap space, we
>> >>> disable per-CPU cluster and allocate from the shared non-full cluster
>> >>> list directly.
>> >>>
>> >>>> Which actually makes me wonder; what is the mechanism that prevents the current
>> >>>> per-cpu cluster from being freed? Is that just handled by the conflict detection
>> >>>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
>> >>>> raise count when its current. (If Chris has implemented that in the "big
>> >>>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
>> >>>
>> >>> Yes.  We may need a flag for that.
>> >>>
>> >>>>>
>> >>>>>>> This is another reason that we should put the cluster in
>> >>>>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>> >>>>>>> in the cluster.  It makes design complex to keep it in
>> >>>>>>> nonfull_clusters[order].
>> >>>>>>>
>> >>>>>>>> We have tried many different approaches including moving to the end of
>> >>>>>>>> the list. It can cause more fragmentation because each CPU allocates
>> >>>>>>>> their swap slot cache (64 entries) from a different cluster.
>> >>>>>>>>
>> >>>>>>>>>> Those behaviors will be fine
>> >>>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>> >>>>>>>>
>> >>>>>>>> Again, I want to keep it simple here so patch 3 can land.
>> >>>>>>>>
>> >>>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>> >>>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>> >>>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>> >>>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>> >>>>>>>>>>
>> >>>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>> >>>>>>>>>> to introduce the nonfull list.
>> >>>>>>>>>>
>> >>>>>
>> >>>>> [snip]
>> >
>> > --
>> > Best Regards,
>> > Huang, Ying
>>
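
The dead loop guard proposed in the quoted discussion (remember the
first cluster that was requeued and stop when it comes back around) can
be sketched in userspace C as below. This is only an illustration under
stated assumptions: struct cluster, struct queue, the free_run field and
scan_nonfull() are hypothetical, the list is assumed not to change
during the scan (in the kernel that would mean holding the si lock), and
a real implementation would have to scan the cluster rather than read a
precomputed free run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cluster sitting on one nonfull list. */
struct cluster {
	struct cluster *next;
	unsigned int free_run;	/* longest contiguous free run, assumed known */
};

/* Minimal FIFO standing in for nonfull_clusters[order]. */
struct queue {
	struct cluster *head, *tail;
};

static void enqueue(struct queue *q, struct cluster *c)
{
	c->next = NULL;
	if (q->tail)
		q->tail->next = c;
	else
		q->head = c;
	q->tail = c;
}

static struct cluster *dequeue(struct queue *q)
{
	struct cluster *c = q->head;

	if (c) {
		q->head = c->next;
		if (!q->head)
			q->tail = NULL;
		c->next = NULL;
	}
	return c;
}

/*
 * Look for a cluster with at least nr contiguous free slots. Every
 * visited cluster is moved to the tail so it stays on the shared list.
 * The first requeued cluster is remembered; seeing it at the head again
 * means the whole list was visited once, so the scan stops.
 */
static struct cluster *scan_nonfull(struct queue *q, unsigned int nr)
{
	struct cluster *first = NULL;

	while (q->head && q->head != first) {
		struct cluster *c = dequeue(q);

		enqueue(q, c);
		if (c->free_run >= nr)
			return c;
		if (!first)
			first = c;
	}
	return NULL;
}
```

A failed scan visits each cluster exactly once and leaves the list in
its original order, so the worst-case cost is linear in the list length,
which is the performance concern raised in the reply.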

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Chris Li 1 year, 6 months ago

On Wed, Jul 24, 2024 at 11:46 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > Hi Ryan and Ying,
> >
> > Sorry I was busy. I am catching up on the email now.
> >
> > On Wed, Jul 24, 2024 at 1:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 23/07/2024 07:27, Huang, Ying wrote:
> >> > Ryan Roberts <ryan.roberts@arm.com> writes:
> >> >
> >> >> On 22/07/2024 09:49, Huang, Ying wrote:
> >> >>> Ryan Roberts <ryan.roberts@arm.com> writes:
> >> >>>
> >> >>>> On 22/07/2024 03:14, Huang, Ying wrote:
> >> >>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
> >> >>>>>
> >> >>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
> >> >>>>>>> Chris Li <chrisl@kernel.org> writes:
> >> >>>>>>>
> >> >>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
> >> >>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
> >> >>>>>
> >> >>>>> [snip]
> >> >>>>>
> >> >>>>>>>>>>>> +
> >> >>>>>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >> >>>>>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I find the transitions when you add and remove a cluster from the
> >> >>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
> >> >>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
> >> >>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
> >> >>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> So you could have this situation:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
> >> >>>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
> >> >>>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
> >> >>>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
> >> >>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
> >> >>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
> >> >>>>>>>>>>
> >> >>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
> >> >>>>>>>>>> moving it into nonfull.
> >> >>>>>>>>>
> >> >>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
> >> >>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
> >> >>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
> >> >>>>>>>>> refactoring separated from behavioural changes.
> >> >>>>>>>>
> >> >>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
> >> >>>>>>>> using the cluster. Behavior change is expected. The goal is completely
> >> >>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
> >> >>>>>>>> allocation.
> >> >>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
> >> >>>>>>>>> deferring review. But sounds like it is actually required to realize the test
> >> >>>>>>>>> results quoted on the cover letter?
> >> >>>>>>>>
> >> >>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
> >> >>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
> >> >>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
> >> >>>>>>>> because it is not feature complete. Currently it does not do swap
> >> >>>>>>>> cache reclaim. The next version will have swap cache reclaim and
> >> >>>>>>>> remove the RFC.
> >> >>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
> >> >>>>>>>>>>
> >> >>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
> >> >>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
> >> >>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
> >> >>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
> >> >>>>>>>>>>> chances of multiple CPUs using the same cluster.
> >> >>>>>>>>>>
> >> >>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
> >> >>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
> >> >>>>>>>>>> entries allocated from the previous CPU.
> >> >>>>>>>>>
> >> >>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
> >> >>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
> >> >>>>>>>>
> >> >>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
> >> >>>>>>>> advances to the next cluster pointer, it can cross with the other
> >> >>>>>>>> CPU's next cluster pointer.
> >> >>>>>>>
> >> >>>>>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
> >> >>>>>>> cluster only.  If it doesn't do that, we should fix it.
> >> >>>>>>>
> >> >>>>>>> I agree with Ryan that we should make per cpu cluster correct.  A
> >> >>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
> >> >>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
> >> >>>>>>> list if necessary.  And, we should make it correct in this patch instead
> >> >>>>>>> of later in series.  I understand that you want to make the patch itself
> >> >>>>>>> simple, but it's important to make code simple to be understood too.
> >> >>>>>>> Consistent design choice will do that.
> >> >>>>>>
> >> >>>>>> I think I'm actually arguing for the opposite of what you suggest here.
> >> >>>>>
> >> >>>>> Sorry, I misunderstood your words.
> >> >>>>>
> >> >>>>>> As I see it, there are 2 possible approaches; either a cluster is always
> >> >>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
> >> >>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
> >> >>>>>> case, in which case it should be added to the nonfull list.
> >> >>>>>>
> >> >>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
> >> >>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
> >> >>>>>> as a single swap entry is freed from that cluster it is put back on the list.
> >> >>>>>> This neither-one-policy-nor-the-other seems odd to me.
> >> >>>>>>
> >> >>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
> >> >>>>>> per-cpu cluster.
> >> >>>>>
> >> >>>>> Yes.
> >> >>>>>
> >> >>>>>> I was arguing to make it always shared. Perhaps the best
> >> >>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
> >> >>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
> >> >>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
> >> >>>>>> the shared approach as part of the "big rewrite"?
> >> >>>>>>>
> >> >>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
> >> >>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
> >> >>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
> >> >>>>>>>>> way to help that?
> >> >>>>>>>>
> >> >>>>>>>> Simply moving to the end of the list can create a possible deadloop
> >> >>>>>>>> when all clusters have been scanned and not available swap range
> >> >>>>>>>> found.
> >> >>>>>
> >> >>>>> I also think that the shared approach has dead loop issue.
> >> >>>>
> >> >>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
> >> >>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
> >> >>>> go forever? That's surely just an implementation issue to solve? It's not a
> >> >>>> reason to avoid the design principle; if we agree that maintaining sharability
> >> >>>> of the cluster is preferred then the code must be written to guard against the
> >> >>>> dead loop problem. It could be done by remembering the first cluster you
> >> >>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
> >> >>>> to it. (I think holding the si lock will protect against concurrently freeing
> >> >>>> the cluster so it should definitely remain in the list?).
> >> >>>
> >> >>> I believe that you can find some way to avoid the dead loop issue,
> >> >>> although your suggestion may kill the performance via looping a long list
> >> >>> of nonfull clusters.
> >> >>
> >> >> I don't agree; If the clusters are considered exclusive (i.e. removed from the
> >> >> list when made current for a cpu), that only reduces the size of the list by a
> >> >> maximum of the number of CPUs in the system, which I suspect is pretty small
> >> >> compared to the number of nonfull clusters.
> >> >
> >> > Anyway, this depends on details.  If we cannot allocate a order-N swap
> >> > entry from the cluster, we should remove it from the nonfull list for
> >> > order-N (This is the behavior of this patch too).
> >
> > Yes, Kairui implements something like that in the reclaim part of the
> > patch series. It is after patch 3. We are heavily testing the
> > performance and the stability of the reclaim patches. May I post the
> > reclaim together with patch 3 for discussion. If you want we can
> > discuss the re-order the patch in a later iteration.
> >
> >>
> >> Yes that's a good point, and I concede it is more difficult to detect that
> >> condition if the cluster is shared. I suspect that with a bit of thinking, we
> >> could find a way though.
> >
> > Kairui has a patch series that shows good performance numbers, beating
> > the current swap cache reclaim.
> >
> > I want to make a point regarding the patch ordering before vs after
> > patch 3 (aka the big rewrite).
> > Previously, scan_swap_map_try_ssd_cluster() only did partial
> > allocation. It does not successfully allocate a swap entry 100% of the
> > time.  Patch 3 makes the cluster allocation function return the
> > swap entry 100% of the time. There are no more fallback retry loops
> > outside of the cluster allocation function. Also the try_ssd function
> > does not do swap cache reclaims while the cluster allocation function
> > will need to. These two have very different constraints.
> >
> > Therefore, adding a different cluster header into
> > scan_swap_map_try_ssd_cluster() would be a lot of wasted development
> > time, in the sense that the function will need to be rewritten
> > anyway, and the end result is very different.
>
> I am not a big fan of implementing the final solution directly.
> Personally, I prefer to improve step by step.

The currently proposed order also improves things step by step. The only
disagreement here is at which point in the patch order we introduce yet
another list in addition to the nonfull one. I just feel that it does not
make sense to invest in new code if that new code is going to be
completely rewritten anyway in the next two patches.

Unless you mean that we should not do the patch 3 big rewrite and should
continue the scan_swap_map_try_ssd_cluster() way of only doing half of
the allocation job, letting scan_swap_map_slots() do the complex retry
on top of try_ssd(). I feel the overall code would be more complex and
less maintainable that way.

> > That is why I want to make this change patch after patch 3. There is
> > also the long test cycle after the modification to make sure the swap
> > code path is stable. I am not resisting a change of patch orders, it
> > is that patch can't directly be removed before patch 3 before the big
> > rewrite.
> >
> >
> >>
> >> > Your original
> >> > suggestion appears like that you want to keep all cluster with order-N
> >> > on the nonfull list for order-N always unless the number of free swap
> >> > entry is less than 1<<N.
> >>
> >> Well I think that's certainly one of the conditions for removing it. But agree
> >> that if a full scan of the cluster has been performed and no swap entries have
> >> been freed since the scan started then it should also be removed from the list.
> >
> > Yes, in the later patches of the series, beyond patch 3, we have the
> > almost-full cluster, for a cluster that has been scanned and was not
> > able to allocate an order-N entry.
> >
> >>
> >> >
> >> >>> And, I understand that in some situations it may
> >> >>> be better to share clusters among CPUs.  So my suggestion is,
> >> >>>
> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
> >> >>>   have free swap entries with that order even after we are sure that we
> >> >>>   haven't.
> >> >>
> >> >> Is this patch pretending that today? I don't think so?
> >> >
> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> >> > sure that there are no order-N free swap entry in the cluster.
> >>
> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
> >> are for order-0 and you have no means of allocating high orders until a whole
> >> cluster becomes free. That logic certainly makes sense to me, so I think it's
> >> better for swap_cluster_info->order to remain static while the cluster is
> >> allocated. (I only skimmed that conversation so apologies if I got the
> >> conclusion wrong!).
> >
> > Yes, that is the original intent, keep the cluster order as much as possible.
> >
> >>
> >> >
> >> >> But I agree that a
> >> >> cluster should only be on the per-order nonfull list if we know there are at
> >> >> least enough free swap entries in that cluster to cover the order. Of course
> >> >> that doesn't tell us for sure because they may not be contiguous.
> >> >
> >> > We can check that when free swap entry via checking adjacent swap
> >> > entries.  IMHO, the performance should be acceptable.
> >>
> >> Would you then use the result of that scanning to "promote" a cluster's order?
> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> >> a separate change on top of what Chris is doing here. For high orders there
> >> could be quite a bit of scanning required in the worst case for every page that
> >> gets freed.
> >
> > Right, I feel that is a different set of patches. Even this series is
> > hard enough for review. Those order promotion and demotion is heading
> > towards a buddy system design. I want to point out that even the buddy
> > system is not able to handle the case that swapfile is almost full and
> > the recently freed swap entries are not contiguous.
> >
> > We can invest in the buddy system, which doesn't handle all the
> > fragmentation issues. Or I prefer to go directly to the discontiguous
> > swap entry. We pay a price for the indirect mapping of swap entries.
> > But it will solve the fragmentation issue 100%.
>
> It's good if we can solve the fragmentation issue 100%.  Just need to
> pay attention to the cost.

By cost, do you mean the development cost or the run time cost (memory and CPU)?

>
> >>
> >> >
> >> >>>
> >> >>> My question is whether it's so important to share the per-cpu cluster
> >> >>> among CPUs?
> >> >>
> >> >> My rationale for sharing is that the preference previously has been to favour
> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
> >> >> given order if there are actually slots available just because they have been
> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> >> >> actually help improve allocation success, then I'm happy to take the exclusive
> >> >> approach.
> >> >>
> >> >>> I suggest to start with simple design, that is, per-CPU
> >> >>> cluster will not be shared among CPUs in most cases.
> >> >>
> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> >> >> current half-and-half policy in this patch.
> >> >
> >> > Sounds good to me.  We can start with exclusive solution and evaluate
> >> > whether shared solution is good.
> >>
> >> Yep. And also evaluate the dynamic order inc/dec idea too...
> >
> > It is not able to avoid fragmentation 100% of the time. I prefer the
> > discontiguous swap entry as the next step, which guarantees forward
> > progress: we will not be stuck in a situation where we are not able to
> > allocate swap entries due to fragmentation.
>
> If my understanding is correct, the implementation complexity of
> order promotion/demotion isn't at the same level as that of the
> discontiguous swap entry.

Discontiguous swap entry has higher complexity but a higher payoff as
well. It can get us to a place that cluster promotion/demotion
can't.

I also feel that if we implement something towards a buddy system
allocator for swap, we should do a proper buddy allocator
implementation of the data structures.

Chris
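
For readers following the promotion/demotion argument, the "check
adjacent swap entries when freeing" idea from the quoted text amounts to
a buddy-style test: after a slot is freed, is the naturally aligned
order-N block around it entirely free? A rough userspace sketch in C is
below. It is an assumption-laden illustration, not code from this
series: the byte map (0 meaning free, loosely mirroring a zero
swap_map[] count), aligned_block_free() and highest_free_order() are all
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if the naturally aligned block of 1 << order slots that
 * contains offset is entirely free (map[i] == 0 means slot i is free).
 */
static bool aligned_block_free(const unsigned char *map, size_t nslots,
			       size_t offset, unsigned int order)
{
	size_t nr = (size_t)1 << order;
	size_t start = offset & ~(nr - 1);	/* round down to block start */

	assert(offset < nslots);
	if (start + nr > nslots)
		return false;
	for (size_t i = start; i < start + nr; i++)
		if (map[i])
			return false;
	return true;
}

/*
 * Highest order (up to max_order) whose aligned block around offset is
 * free, or -1 if the slot itself is in use. Aligned blocks nest, so the
 * search can stop at the first order that fails.
 */
static int highest_free_order(const unsigned char *map, size_t nslots,
			      size_t offset, unsigned int max_order)
{
	int best = -1;

	for (unsigned int order = 0; order <= max_order; order++) {
		if (!aligned_block_free(map, nslots, offset, order))
			break;
		best = (int)order;
	}
	return best;
}
```

Each call rescans up to 1 << max_order slots, which is the worst-case
per-free scanning cost raised earlier in the thread. A proper buddy
allocator would track free blocks in per-order structures instead of
rescanning the map.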

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Chris Li <chrisl@kernel.org> writes:

> On Wed, Jul 24, 2024 at 11:46 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > Hi Ryan and Ying,
>> >
>> > Sorry I was busy. I am catching up on the email now.
>> >
>> > On Wed, Jul 24, 2024 at 1:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>
>> >> On 23/07/2024 07:27, Huang, Ying wrote:
>> >> > Ryan Roberts <ryan.roberts@arm.com> writes:
>> >> >
>> >> >> On 22/07/2024 09:49, Huang, Ying wrote:
>> >> >>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >> >>>
>> >> >>>> On 22/07/2024 03:14, Huang, Ying wrote:
>> >> >>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >> >>>>>
>> >> >>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>> >> >>>>>>> Chris Li <chrisl@kernel.org> writes:
>> >> >>>>>>>
>> >> >>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >> >>>>>>>>>
>> >> >>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>> >> >>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>> >> >>>>>
>> >> >>>>> [snip]
>> >> >>>>>
>> >> >>>>>>>>>>>> +
>> >> >>>>>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> >> >>>>>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>> >> >>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> >> >>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>> >> >>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>> >> >>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> So you could have this situation:
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>> >> >>>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
>> >> >>>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>> >> >>>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>> >> >>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>> >> >>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>> >> >>>>>>>>>> moving it into nonfull.
>> >> >>>>>>>>>
>> >> >>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>> >> >>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>> >> >>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>> >> >>>>>>>>> refactoring separated from behavioural changes.
>> >> >>>>>>>>
>> >> >>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>> >> >>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>> >> >>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>> >> >>>>>>>> allocation.
>> >> >>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>> >> >>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>> >> >>>>>>>>> results quoted on the cover letter?
>> >> >>>>>>>>
>> >> >>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>> >> >>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>> >> >>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>> >> >>>>>>>> because it is not feature complete. Currently it does not do swap
>> >> >>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>> >> >>>>>>>> remove the RFC.
>> >> >>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> >> >>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>> >> >>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>> >> >>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> >> >>>>>>>>>>> chances of multiple CPUs using the same cluster.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>> >> >>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>> >> >>>>>>>>>> entries allocated from the previous CPU.
>> >> >>>>>>>>>
>> >> >>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
>> >> >>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>> >> >>>>>>>>
>> >> >>>>>>>> That already happens with the existing per cpu next pointer. When one CPU
>> >> >>>>>>>> advances to the next cluster pointer, it can cross with the other
>> >> >>>>>>>> CPU's next cluster pointer.
>> >> >>>>>>>
>> >> >>>>>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
>> >> >>>>>>> cluster only.  If it doesn't do that, we should fix it.
>> >> >>>>>>>
>> >> >>>>>>> I agree with Ryan that we should make per cpu cluster correct.  A
>> >> >>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
>> >> >>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>> >> >>>>>>> list if necessary.  And, we should make it correct in this patch instead
>> >> >>>>>>> of later in series.  I understand that you want to make the patch itself
>> >> >>>>>>> simple, but it's important to make the code simple to understand too.
>> >> >>>>>>> Consistent design choice will do that.
>> >> >>>>>>
>> >> >>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>> >> >>>>>
>> >> >>>>> Sorry, I misunderstood your words.
>> >> >>>>>
>> >> >>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>> >> >>>>>> considered exclusive to a single cpu when it's set as a per-cpu cluster, so it
>> >> >>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>> >> >>>>>> case, in which case it should be added to the nonfull list.
>> >> >>>>>>
>> >> >>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>> >> >>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>> >> >>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>> >> >>>>>> This neither-one-policy-nor-the-other seems odd to me.
>> >> >>>>>>
>> >> >>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>> >> >>>>>> per-cpu cluster.
>> >> >>>>>
>> >> >>>>> Yes.
>> >> >>>>>
>> >> >>>>>> I was arguing to make it always shared. Perhaps the best
>> >> >>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>> >> >>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>> >> >>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>> >> >>>>>> the shared approach as part of the "big rewrite"?
>> >> >>>>>>>
>> >> >>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think you do want to
>> >> >>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>> >> >>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>> >> >>>>>>>>> way to help that?
>> >> >>>>>>>>
>> >> >>>>>>>> Simply moving to the end of the list can create a possible dead loop
>> >> >>>>>>>> when all clusters have been scanned and no available swap range
>> >> >>>>>>>> is found.
>> >> >>>>>
>> >> >>>>> I also think that the shared approach has a dead loop issue.
>> >> >>>>
>> >> >>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>> >> >>>> won't know when to stop dequeuing/requeuing clusters on the nonfull list and will
>> >> >>>> go forever? That's surely just an implementation issue to solve? It's not a
>> >> >>>> reason to avoid the design principle; if we agree that maintaining sharability
>> >> >>>> of the cluster is preferred then the code must be written to guard against the
>> >> >>>> dead loop problem. It could be done by remembering the first cluster you
>> >> >>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>> >> >>>> to it. (I think holding the si lock will protect against concurrently freeing
>> >> >>>> the cluster so it should definitely remain in the list?).
>> >> >>>
>> >> >>> I believe that you can find some way to avoid the dead loop issue,
>> >> >>> although your suggestion may kill the performance via looping a long list
>> >> >>> of nonfull clusters.
>> >> >>
>> >> >> I don't agree; If the clusters are considered exclusive (i.e. removed from the
>> >> >> list when made current for a cpu), that only reduces the size of the list by a
>> >> >> maximum of the number of CPUs in the system, which I suspect is pretty small
>> >> >> compared to the number of nonfull clusters.
>> >> >
>> >> > Anyway, this depends on details.  If we cannot allocate an order-N swap
>> >> > entry from the cluster, we should remove it from the nonfull list for
>> >> > order-N (This is the behavior of this patch too).
>> >
>> > Yes, Kairui implements something like that in the reclaim part of the
>> > patch series. It is after patch 3. We are heavily testing the
>> > performance and the stability of the reclaim patches. Maybe I can post the
>> > reclaim together with patch 3 for discussion. If you want, we can
>> > discuss reordering the patches in a later iteration.
>> >
>> >>
>> >> Yes that's a good point, and I concede it is more difficult to detect that
>> >> condition if the cluster is shared. I suspect that with a bit of thinking, we
>> >> could find a way though.
>> >
>> > Kairui has a patch series that shows good performance numbers, beating
>> > the current swap cache reclaim.
>> >
>> > I want to make a point regarding the patch ordering before vs after
>> > patch 3 (aka the big rewrite).
>> > Previously, "scan_swap_map_try_ssd_cluster" only did partial
>> > allocation. It does not successfully allocate a swap entry 100% of the
>> > time.  Patch 3 makes the cluster allocation function return a
>> > swap entry 100% of the time. There are no more fallback retry loops
>> > outside of the cluster allocation function. Also the try_ssd function
>> > does not do swap cache reclaims while the cluster allocation function
>> > will need to. These two have very different constraints.
>> >
>> > Therefore, adding a different cluster header into
>> > scan_swap_map_try_ssd_cluster would waste a lot of development
>> > time, in the sense that the function will need to be
>> > rewritten anyway and the end result is very different.
>>
>> I am not a big fan of implementing the final solution directly.
>> Personally, I prefer to improve step by step.
>
> The current proposed order also improves things step by step. The only
> disagreement here is which patch order we introduce yet another list
> in addition to the nonfull one. I just feel that it does not make
> sense to invest into new code if that new code is going to be
> completely rewritten anyway in the next two patches.
>
> Unless you mean that we should not do the patch 3 big rewrite and should
> continue the scan_swap_map_try_ssd_cluster() way of only doing half of
> the allocation job and let scan_swap_map_slots() do the complex retry
> on top of try_ssd(). I feel the overall code is more complex and less
> maintainable.

I haven't looked at [3/3] yet; I will wait for your next version for that.  So,
I cannot say which order is better.  Please consider reviewers' effort
too.  Small-step patches are easier to understand and review.

>> > That is why I want to make this change patch after patch 3. There is
>> > also the long test cycle after the modification to make sure the swap
>> > code path is stable. I am not resisting a change of patch orders, it
>> > is that patch can't directly be removed before patch 3 before the big
>> > rewrite.
>> >
>> >
>> >>
>> >> > Your original
>> >> > suggestion appears like that you want to keep all cluster with order-N
>> >> > on the nonfull list for order-N always unless the number of free swap
>> >> > entry is less than 1<<N.
>> >>
>> >> Well I think that's certainly one of the conditions for removing it. But agree
>> >> that if a full scan of the cluster has been performed and no swap entries have
>> >> been freed since the scan started then it should also be removed from the list.
>> >
>> > Yes, in a later patch of the series, beyond patch 3, we have the almost
>> > full cluster, for a cluster that has been scanned and was not able to
>> > allocate an order N entry.
>> >
>> >>
>> >> >
>> >> >>> And, I understand that in some situations it may
>> >> >>> be better to share clusters among CPUs.  So my suggestion is,
>> >> >>>
>> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >> >>>   have free swap entries with that order even after we are sure that we
>> >> >>>   haven't.
>> >> >>
>> >> >> Is this patch pretending that today? I don't think so?
>> >> >
>> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> > sure that there are no order-N free swap entry in the cluster.
>> >>
>> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
>> >> are for order-0 and you have no means of allocating high orders until a whole
>> >> cluster becomes free. That logic certainly makes sense to me, so I think it's
>> >> better for swap_cluster_info->order to remain static while the cluster is
>> >> allocated. (I only skimmed that conversation so apologies if I got the
>> >> conclusion wrong!).
>> >
>> > Yes, that is the original intent, keep the cluster order as much as possible.
>> >
>> >>
>> >> >
>> >> >> But I agree that a
>> >> >> cluster should only be on the per-order nonfull list if we know there are at
>> >> >> least enough free swap entries in that cluster to cover the order. Of course
>> >> >> that doesn't tell us for sure because they may not be contiguous.
>> >> >
>> >> > We can check that when freeing a swap entry by checking the adjacent swap
>> >> > entries.  IMHO, the performance should be acceptable.
>> >>
>> >> Would you then use the result of that scanning to "promote" a cluster's order?
>> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> >> a separate change on top of what Chris is doing here. For high orders there
>> >> could be quite a bit of scanning required in the worst case for every page that
>> >> gets freed.
>> >
>> > Right, I feel that is a different set of patches. Even this series is
>> > hard enough for review. Order promotion and demotion are heading
>> > towards a buddy system design. I want to point out that even the buddy
>> > system is not able to handle the case that swapfile is almost full and
>> > the recently freed swap entries are not contiguous.
>> >
>> > We can invest in the buddy system, which doesn't handle all the
>> > fragmentation issues. Or I prefer to go directly to the discontiguous
>> > swap entry. We pay a price for the indirect mapping of swap entries.
>> > But it will solve the fragmentation issue 100%.
>>
>> It's good if we can solve the fragmentation issue 100%.  Just need to
>> pay attention to the cost.
>
> By cost, do you mean the development cost or the run time cost (memory and cpu)?

I mean runtime cost.

>>
>> >>
>> >> >
>> >> >>>
>> >> >>> My question is whether it's so important to share the per-cpu cluster
>> >> >>> among CPUs?
>> >> >>
>> >> >> My rationale for sharing is that the preference previously has been to favour
>> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
>> >> >> given order if there are actually slots available just because they have been
>> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >> >> actually help improve allocation success, then I'm happy to take the exclusive
>> >> >> approach.
>> >> >>
>> >> >>> I suggest to start with simple design, that is, per-CPU
>> >> >>> cluster will not be shared among CPUs in most cases.
>> >> >>
>> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >> >> current half-and-half policy in this patch.
>> >> >
>> >> > Sounds good to me.  We can start with exclusive solution and evaluate
>> >> > whether shared solution is good.
>> >>
>> >> Yep. And also evaluate the dynamic order inc/dec idea too...
>> >
>> > It is not able to avoid fragmentation 100% of the time. I prefer the
>> > discontiguous swap entry as the next step, which guarantees forward
>> > progress, we will not be stuck in a situation where we are not able to
>> > allocate swap entries due to fragmentation.
>>
>> If my understanding is correct, the implementation complexity of the
>> order promotion/demotion isn't at the same level as that of the discontiguous
>> swap entry.
>
> Discontiguous swap entry has higher complexity but higher payoff as
> well. It can get us to the place where cluster promotion/demotion
> can't.
>
> I also feel that if we implement something towards a buddy system
> allocator for swap, we should do a proper buddy allocator
> implementation of data structures.

I don't think that it's easy to implement a real buddy allocator for
swap entries.  So I avoid using the word "buddy".

--
Best Regards,
Huang, Ying
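
The dead loop guard discussed in the quoted text above (remember the first
cluster that was dequeued/requeued from the nonfull list and stop once it
comes back around) can be sketched in userspace C roughly as follows. This is
only an illustration under stated assumptions: struct list_node, struct
cluster and pick_cluster() are hypothetical stand-ins, not the kernel's
list_head, swap_cluster_info or scan_swap_map_try_ssd_cluster(), and the si
lock is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

static void list_del_node(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

static void list_add_tail_node(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Hypothetical, simplified cluster: only tracks a count of free slots. */
struct cluster {
	struct list_node list;		/* first member, so a node casts to a cluster */
	unsigned int free_slots;
};

/*
 * Walk the nonfull list looking for a cluster with at least `need` free
 * slots. Every visited cluster is moved to the tail so that the next caller
 * starts with a different one. The walk stops after one full lap: when the
 * first cluster that was requeued shows up at the head again.
 */
static struct cluster *pick_cluster(struct list_node *nonfull, unsigned int need)
{
	struct cluster *first = NULL;

	while (nonfull->next != nonfull) {
		struct cluster *ci = (struct cluster *)nonfull->next;

		if (ci == first)
			break;		/* full lap done, nothing suitable */
		if (!first)
			first = ci;

		list_del_node(&ci->list);
		list_add_tail_node(&ci->list, nonfull);

		if (ci->free_slots >= need)
			return ci;
	}
	return NULL;
}
```

In this model a failed search visits each cluster once and leaves the list in
its original order, which is the cost Huang, Ying is worried about when the
nonfull list is long.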

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Chris Li 1 year, 6 months ago

On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > The current proposed order also improves things step by step. The only
> > disagreement here is which patch order we introduce yet another list
> > in addition to the nonfull one. I just feel that it does not make
> > sense to invest into new code if that new code is going to be
> > completely rewritten anyway in the next two patches.
> >
> > Unless you mean that we should not do the patch 3 big rewrite and should
> > continue the scan_swap_map_try_ssd_cluster() way of only doing half of
> > the allocation job and let scan_swap_map_slots() do the complex retry
> > on top of try_ssd(). I feel the overall code is more complex and less
> > maintainable.
>
> I haven't looked at [3/3] yet; I will wait for your next version for that.  So,
> I cannot say which order is better.  Please consider reviewers' effort
> too.  Small-step patches are easier to understand and review.

That is exactly the reason I don't want to introduce too much new code
depending on the scan_swap_map_slots() behavior, which will be
abandoned in the big rewrite. Their constraints are very different. I
want to make the big rewrite patch 3 as small as possible, using
incremental follow-up patches to improve it.

>
> >> > That is why I want to make this change patch after patch 3. There is
> >> > also the long test cycle after the modification to make sure the swap
> >> > code path is stable. I am not resisting a change of patch orders, it
> >> > is that patch can't directly be removed before patch 3 before the big
> >> > rewrite.
> >> >
> >> >
> >> >>
> >> >> > Your original
> >> >> > suggestion appears like that you want to keep all cluster with order-N
> >> >> > on the nonfull list for order-N always unless the number of free swap
> >> >> > entry is less than 1<<N.
> >> >>
> >> >> Well I think that's certainly one of the conditions for removing it. But agree
> >> >> that if a full scan of the cluster has been performed and no swap entries have
> >> >> been freed since the scan started then it should also be removed from the list.
> >> >
> >> > Yes, in a later patch of the series, beyond patch 3, we have the almost
> >> > full cluster, for a cluster that has been scanned and was not able to
> >> > allocate an order N entry.
> >> >
> >> >>
> >> >> >
> >> >> >>> And, I understand that in some situations it may
> >> >> >>> be better to share clusters among CPUs.  So my suggestion is,
> >> >> >>>
> >> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
> >> >> >>>   have free swap entries with that order even after we are sure that we
> >> >> >>>   haven't.
> >> >> >>
> >> >> >> Is this patch pretending that today? I don't think so?
> >> >> >
> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> >> >> > sure that there are no order-N free swap entry in the cluster.
> >> >>
> >> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
> >> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
> >> >> are for order-0 and you have no means of allocating high orders until a whole
> >> >> cluster becomes free. That logic certainly makes sense to me, so I think it's
> >> >> better for swap_cluster_info->order to remain static while the cluster is
> >> >> allocated. (I only skimmed that conversation so apologies if I got the
> >> >> conclusion wrong!).
> >> >
> >> > Yes, that is the original intent, keep the cluster order as much as possible.
> >> >
> >> >>
> >> >> >
> >> >> >> But I agree that a
> >> >> >> cluster should only be on the per-order nonfull list if we know there are at
> >> >> >> least enough free swap entries in that cluster to cover the order. Of course
> >> >> >> that doesn't tell us for sure because they may not be contiguous.
> >> >> >
> >> >> > We can check that when freeing a swap entry by checking the adjacent swap
> >> >> > entries.  IMHO, the performance should be acceptable.
> >> >>
> >> >> Would you then use the result of that scanning to "promote" a cluster's order?
> >> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> >> >> a separate change on top of what Chris is doing here. For high orders there
> >> >> could be quite a bit of scanning required in the worst case for every page that
> >> >> gets freed.
> >> >
> >> > Right, I feel that is a different set of patches. Even this series is
> >> > hard enough for review. Order promotion and demotion are heading
> >> > towards a buddy system design. I want to point out that even the buddy
> >> > system is not able to handle the case that swapfile is almost full and
> >> > the recently freed swap entries are not contiguous.
> >> >
> >> > We can invest in the buddy system, which doesn't handle all the
> >> > fragmentation issues. Or I prefer to go directly to the discontiguous
> >> > swap entry. We pay a price for the indirect mapping of swap entries.
> >> > But it will solve the fragmentation issue 100%.
> >>
> >> It's good if we can solve the fragmentation issue 100%.  Just need to
> >> pay attention to the cost.
> >
> > By cost, do you mean the development cost or the run time cost (memory and cpu)?
>
> I mean runtime cost.

Thanks for the clarification. Agreed that we need to pay attention to
the run time cost. That is a given.

> >> >> >>> My question is whether it's so important to share the per-cpu cluster
> >> >> >>> among CPUs?
> >> >> >>
> >> >> >> My rationale for sharing is that the preference previously has been to favour
> >> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
> >> >> >> given order if there are actually slots available just because they have been
> >> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> >> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> >> >> >> actually help improve allocation success, then I'm happy to take the exclusive
> >> >> >> approach.
> >> >> >>
> >> >> >>> I suggest to start with simple design, that is, per-CPU
> >> >> >>> cluster will not be shared among CPUs in most cases.
> >> >> >>
> >> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
> >> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> >> >> >> current half-and-half policy in this patch.
> >> >> >
> >> >> > Sounds good to me.  We can start with exclusive solution and evaluate
> >> >> > whether shared solution is good.
> >> >>
> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
> >> >
> >> > It is not able to avoid fragmentation 100% of the time. I prefer the
> >> > discontiguous swap entry as the next step, which guarantees forward
> >> > progress, we will not be stuck in a situation where we are not able to
> >> > allocate swap entries due to fragmentation.
> >>
> >> If my understanding is correct, the implementation complexity of the
> >> order promotion/demotion isn't at the same level as that of the discontiguous
> >> swap entry.
> >
> > Discontiguous swap entry has higher complexity but higher payoff as
> > well. It can get us to the place where cluster promotion/demotion
> > can't.
> >
> > I also feel that if we implement something towards a buddy system
> > allocator for swap, we should do a proper buddy allocator
> > implementation of data structures.
>
> I don't think that it's easy to implement a real buddy allocator for
> swap entries.  So I avoid using the word "buddy".

Then such a mix of cluster order promotion/demotion loses some of the
benefit of the buddy system, because it lacks the proper data structure to
support buddy allocation. The buddy allocator provides more general
migration between orders, whereas only the limited use case of cluster
promotion/demotion is supported (by luck). We need to evaluate whether
it is worth the additional complexity.

Chris
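
For reference, the "check adjacent swap entries when freeing" idea quoted
above could look roughly like the userspace sketch below. It assumes a
simplified per-cluster usage map (one byte per slot, 0 meaning free) and a
cluster of 512 slots; CLUSTER_SLOTS, aligned_range_free() and free_slot() are
hypothetical names for illustration, not code from this series.

```c
#include <assert.h>
#include <stdbool.h>

#define CLUSTER_SLOTS 512	/* assumption: slots per cluster in this model */

/*
 * Report whether the naturally aligned block of (1 << order) slots that
 * contains `offset` is entirely free. map[i] == 0 means slot i is free.
 * The caller must pass an offset below CLUSTER_SLOTS and an order whose
 * block fits inside the cluster.
 */
static bool aligned_range_free(const unsigned char *map, unsigned int offset,
			       unsigned int order)
{
	unsigned int nr = 1u << order;
	unsigned int start = offset & ~(nr - 1);
	unsigned int i;

	for (i = start; i < start + nr; i++)
		if (map[i])
			return false;
	return true;
}

/*
 * Free one slot, then scan its neighbours. Returns true when the free made
 * an order-`order` allocation possible at that aligned position.
 */
static bool free_slot(unsigned char *map, unsigned int offset,
		      unsigned int order)
{
	map[offset] = 0;
	return aligned_range_free(map, offset, order);
}
```

Each free scans up to 1 << order slots here, which is the worst-case scanning
cost for high orders raised in the quoted text.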

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Chris Li <chrisl@kernel.org> writes:

> On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >
>> > The current proposed order also improves things step by step. The only
>> > disagreement here is which patch order we introduce yet another list
>> > in addition to the nonfull one. I just feel that it does not make
>> > sense to invest into new code if that new code is going to be
>> > completely rewritten anyway in the next two patches.
>> >
>> > Unless you mean that we should not do the patch 3 big rewrite and should
>> > continue the scan_swap_map_try_ssd_cluster() way of only doing half of
>> > the allocation job and let scan_swap_map_slots() do the complex retry
>> > on top of try_ssd(). I feel the overall code is more complex and less
>> > maintainable.
>>
>> I haven't looked at [3/3] yet; I will wait for your next version for that.  So,
>> I cannot say which order is better.  Please consider reviewers' effort
>> too.  Small-step patches are easier to understand and review.
>
> That is exactly the reason I don't want to introduce too much new code
> depending on the scan_swap_map_slots() behavior, which will be
> abandoned in the big rewrite. Their constraints are very different. I
> want to make the big rewrite patch 3 as small as possible, using
> incremental follow-up patches to improve it.
>
>>
>> >> > That is why I want to make this change patch after patch 3. There is
>> >> > also the long test cycle after the modification to make sure the swap
>> >> > code path is stable. I am not resisting a change of patch orders, it
>> >> > is that patch can't directly be removed before patch 3 before the big
>> >> > rewrite.
>> >> >
>> >> >
>> >> >>
>> >> >> > Your original
>> >> >> > suggestion appears like that you want to keep all cluster with order-N
>> >> >> > on the nonfull list for order-N always unless the number of free swap
>> >> >> > entry is less than 1<<N.
>> >> >>
>> >> >> Well I think that's certainly one of the conditions for removing it. But agree
>> >> >> that if a full scan of the cluster has been performed and no swap entries have
>> >> >> been freed since the scan started then it should also be removed from the list.
>> >> >
>> >> > Yes, in a later patch of the series, beyond patch 3, we have the almost
>> >> > full cluster, for a cluster that has been scanned and was not able to
>> >> > allocate an order N entry.
>> >> >
>> >> >>
>> >> >> >
>> >> >> >>> And, I understand that in some situations it may
>> >> >> >>> be better to share clusters among CPUs.  So my suggestion is,
>> >> >> >>>
>> >> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >> >> >>>   have free swap entries with that order even after we are sure that we
>> >> >> >>>   haven't.
>> >> >> >>
>> >> >> >> Is this patch pretending that today? I don't think so?
>> >> >> >
>> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> >> > sure that there are no order-N free swap entry in the cluster.
>> >> >>
>> >> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> >> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
>> >> >> are for order-0 and you have no means of allocating high orders until a whole
>> >> >> cluster becomes free. That logic certainly makes sense to me, so I think it's
>> >> >> better for swap_cluster_info->order to remain static while the cluster is
>> >> >> allocated. (I only skimmed that conversation so apologies if I got the
>> >> >> conclusion wrong!).
>> >> >
>> >> > Yes, that is the original intent, keep the cluster order as much as possible.
>> >> >
>> >> >>
>> >> >> >
>> >> >> >> But I agree that a
>> >> >> >> cluster should only be on the per-order nonfull list if we know there are at
>> >> >> >> least enough free swap entries in that cluster to cover the order. Of course
>> >> >> >> that doesn't tell us for sure because they may not be contiguous.
>> >> >> >
>> >> >> > We can check that when freeing a swap entry by checking the adjacent swap
>> >> >> > entries.  IMHO, the performance should be acceptable.
>> >> >>
>> >> >> Would you then use the result of that scanning to "promote" a cluster's order?
>> >> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> >> >> a separate change on top of what Chris is doing here. For high orders there
>> >> >> could be quite a bit of scanning required in the worst case for every page that
>> >> >> gets freed.
>> >> >
>> >> > Right, I feel that is a different set of patches. Even this series is
>> >> > hard enough for review. Order promotion and demotion are heading
>> >> > towards a buddy system design. I want to point out that even the buddy
>> >> > system is not able to handle the case that swapfile is almost full and
>> >> > the recently freed swap entries are not contiguous.
>> >> >
>> >> > We can invest in the buddy system, which doesn't handle all the
>> >> > fragmentation issues. Or I prefer to go directly to the discontiguous
>> >> > swap entry. We pay a price for the indirect mapping of swap entries.
>> >> > But it will solve the fragmentation issue 100%.
>> >>
>> >> It's good if we can solve the fragmentation issue 100%.  Just need to
>> >> pay attention to the cost.
>> >
>> > By cost, do you mean the development cost or the run time cost (memory and cpu)?
>>
>> I mean runtime cost.
>
> Thanks for the clarification. Agreed that we need to pay attention to
> the run time cost. That is a given.
>
>> >> >> >>> My question is whether it's so important to share the per-cpu cluster
>> >> >> >>> among CPUs?
>> >> >> >>
>> >> >> >> My rationale for sharing is that the preference previously has been to favour
>> >> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
>> >> >> >> given order if there are actually slots available just because they have been
>> >> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >> >> >> actually help improve allocation success, then I'm happy to take the exclusive
>> >> >> >> approach.
>> >> >> >>
>> >> >> >>> I suggest to start with simple design, that is, per-CPU
>> >> >> >>> cluster will not be shared among CPUs in most cases.
>> >> >> >>
>> >> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >> >> >> current half-and-half policy in this patch.
>> >> >> >
>> >> >> > Sounds good to me.  We can start with exclusive solution and evaluate
>> >> >> > whether shared solution is good.
>> >> >>
>> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
>> >> >
>> >> > It is not able to avoid fragmentation 100% of the time. I prefer the
>> >> > discontiguous swap entry as the next step, which guarantees forward
>> >> > progress, we will not be stuck in a situation where we are not able to
>> >> > allocate swap entries due to fragmentation.
>> >>
>> >> If my understanding is correct, the implementation complexity of the
>> >> order promotion/demotion isn't at the same level as that of the discontiguous
>> >> swap entry.
>> >
>> > Discontiguous swap entry has higher complexity but higher payoff as
>> > well. It can get us to the place where cluster promotion/demotion
>> > can't.
>> >
>> > I also feel that if we implement something towards a buddy system
>> > allocator for swap, we should do a proper buddy allocator
>> > implementation of data structures.
>>
>> I don't think that it's easy to implement a real buddy allocator for
>> swap entries.  So I avoid using the word "buddy".
>
> Then such a mix of cluster order promotion/demotion loses some of the
> benefit of the buddy system, because it lacks the proper data structure to
> support buddy allocation. The buddy allocator provides more general
> migration between orders, whereas only the limited use case of cluster
> promotion/demotion is supported (by luck). We need to evaluate whether
> it is worth the additional complexity.

TBH, I believe that the complexity of order promotion/demotion is quite low,
both for development and at runtime.  A real buddy allocator may need to
increase the per-swap-entry memory footprint considerably.

--
Best Regards,
Huang, Ying
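
The "exclusive" policy that both reviewers agreed to start with (a flag notes
whether any entry was freed while the cluster was a CPU's current cluster, and
the cluster goes back on the nonfull list only when exclusive use ends) can be
modelled as a small state machine. This is a sketch under assumptions: the
struct fields and function names below are hypothetical, list membership is
reduced to a boolean, and the case where the cluster becomes completely free
is not handled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of one swap cluster. */
struct cluster {
	unsigned int count;		/* slots currently in use */
	bool on_nonfull;		/* models membership of nonfull_clusters[order] */
	bool cpu_owned;			/* installed as some CPU's current cluster */
	bool freed_while_owned;		/* hypothetical flag from the discussion */
};

/* A CPU takes the cluster as its current cluster: it leaves the nonfull list. */
static void cluster_acquire(struct cluster *ci)
{
	ci->on_nonfull = false;
	ci->cpu_owned = true;
	ci->freed_while_owned = false;
}

/* One swap entry in the cluster is freed. */
static void cluster_free_slot(struct cluster *ci)
{
	ci->count--;
	if (ci->cpu_owned)
		ci->freed_while_owned = true;	/* defer the requeue */
	else
		ci->on_nonfull = true;		/* unowned: list it right away */
}

/* The owning CPU has finished scanning the cluster and moves on. */
static void cluster_release(struct cluster *ci)
{
	ci->cpu_owned = false;
	if (ci->freed_while_owned)
		ci->on_nonfull = true;
	ci->freed_while_owned = false;
}
```

With this model a cluster is never both owned by a CPU and on the nonfull
list, which removes the half-and-half behaviour objected to in the quoted
text.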

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Chris Li 1 year, 6 months ago

On Thu, Jul 25, 2024 at 11:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >
> >> > The current proposed order also improves things step by step. The only
> >> > disagreement here is which patch order we introduce yet another list
> >> > in addition to the nonfull one. I just feel that it does not make
> >> > sense to invest into new code if that new code is going to be
> >> > completely rewritten anyway in the next two patches.
> >> >
> >> > Unless you mean that we should not do the patch 3 big rewrite and should
> >> > continue the scan_swap_map_try_ssd_cluster() way of only doing half of
> >> > the allocation job and let scan_swap_map_slots() do the complex retry
> >> > on top of try_ssd(). I feel the overall code is more complex and less
> >> > maintainable.
> >>
> >> I haven't look at [3/3], will wait for your next version for that.  So,
> >> I cannot say which order is better.  Please consider reviewers' effort
> >> too.  Small step patch is easier to be understood and reviewed.
> >
> > That is exactly the reason I don't want to introduce too much new code
> > depending on the scan_swap_map_slots() behavior, which will be
> > abandoned in the big rewrite. Their constraints are very different. I
> > want to make the big rewrite patch 3 as small as possible. Using
> > incremental follow up patches to improve it.
> >
> >>
> >> >> > That is why I want to make this change patch after patch 3. There is
> >> >> > also the long test cycle after the modification to make sure the swap
> >> >> > code path is stable. I am not resisting a change of patch orders, it
> >> >> > is that patch can't directly be removed before patch 3 before the big
> >> >> > rewrite.
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> > Your original
> >> >> >> > suggestion appears like that you want to keep all cluster with order-N
> >> >> >> > on the nonfull list for order-N always unless the number of free swap
> >> >> >> > entry is less than 1<<N.
> >> >> >>
> >> >> >> Well I think that's certainly one of the conditions for removing it. But agree
> >> >> >> that if a full scan of the cluster has been performed and no swap entries have
> >> >> >> been freed since the scan started then it should also be removed from the list.
> >> >> >
> >> >> > Yes, in the later patch of patch, beyond patch 3, we have the almost
> >> >> > full cluster that for the cluster has been scan and not able to
> >> >> > allocate order N entry.
> >> >> >
> >> >> >>
> >> >> >> >
> >> >> >> >>> And, I understand that in some situations it may
> >> >> >> >>> be better to share clusters among CPUs.  So my suggestion is,
> >> >> >> >>>
> >> >> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
> >> >> >> >>>   have free swap entries with that order even after we are sure that we
> >> >> >> >>>   haven't.
> >> >> >> >>
> >> >> >> >> Is this patch pretending that today? I don't think so?
> >> >> >> >
> >> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> >> >> >> > sure that there are no order-N free swap entry in the cluster.
> >> >> >>
> >> >> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
> >> >> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
> >> >> >> are for order-0 and you have no means of allocating high orders until a whole
> >> >> >> cluster becomes free. That logic certainly makes sense to me, so think its
> >> >> >> better for swap_cluster_info->order to remain static while the cluster is
> >> >> >> allocated. (I only skimmed that conversation so appologies if I got the
> >> >> >> conclusion wrong!).
> >> >> >
> >> >> > Yes, that is the original intent, keep the cluster order as much as possible.
> >> >> >
> >> >> >>
> >> >> >> >
> >> >> >> >> But I agree that a
> >> >> >> >> cluster should only be on the per-order nonfull list if we know there are at
> >> >> >> >> least enough free swap entries in that cluster to cover the order. Of course
> >> >> >> >> that doesn't tell us for sure because they may not be contiguous.
> >> >> >> >
> >> >> >> > We can check that when free swap entry via checking adjacent swap
> >> >> >> > entries.  IMHO, the performance should be acceptable.
> >> >> >>
> >> >> >> Would you then use the result of that scanning to "promote" a cluster's order?
> >> >> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> >> >> >> a separate change on top of what Chris is doing here. For high orders there
> >> >> >> could be quite a bit of scanning required in the worst case for every page that
> >> >> >> gets freed.
> >> >> >
> >> >> > Right, I feel that is a different set of patches. Even this series is
> >> >> > hard enough for review. Those order promotion and demotion is heading
> >> >> > towards a buddy system design. I want to point out that even the buddy
> >> >> > system is not able to handle the case that swapfile is almost full and
> >> >> > the recently freed swap entries are not contiguous.
> >> >> >
> >> >> > We can invest in the buddy system, which doesn't handle all the
> >> >> > fragmentation issues. Or I prefer to go directly to the discontiguous
> >> >> > swap entry. We pay a price for the indirect mapping of swap entries.
> >> >> > But it will solve the fragmentation issue 100%.
> >> >>
> >> >> It's good if we can solve the fragmentation issue 100%.  Just need to
> >> >> pay attention to the cost.
> >> >
> >> > The cost you mean the development cost or the run time cost (memory and cpu)?
> >>
> >> I mean runtime cost.
> >
> > Thanks for the clarification. Agree that we need to pay attention to
> > the run time cost. That is given.
> >
> >> >> >> >>> My question is whether it's so important to share the per-cpu cluster
> >> >> >> >>> among CPUs?
> >> >> >> >>
> >> >> >> >> My rationale for sharing is that the preference previously has been to favour
> >> >> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
> >> >> >> >> given order if there are actually slots available just because they have been
> >> >> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> >> >> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> >> >> >> >> actually help improve allocation success, then I'm happy to take the exclusive
> >> >> >> >> approach.
> >> >> >> >>
> >> >> >> >>> I suggest to start with simple design, that is, per-CPU
> >> >> >> >>> cluster will not be shared among CPUs in most cases.
> >> >> >> >>
> >> >> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
> >> >> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> >> >> >> >> current half-and-half policy in this patch.
> >> >> >> >
> >> >> >> > Sounds good to me.  We can start with exclusive solution and evaluate
> >> >> >> > whether shared solution is good.
> >> >> >>
> >> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
> >> >> >
> >> >> > It is not able to avoid fragementation 100% of the time. I prefer the
> >> >> > discontinued swap entry as the next step, which guarantees forward
> >> >> > progress, we will not be stuck in a situation where we are not able to
> >> >> > allocate swap entries due to fragmentation.
> >> >>
> >> >> If my understanding were correct, the implementation complexity of the
> >> >> order promotion/demotion isn't at the same level of that of discontinued
> >> >> swap entry.
> >> >
> >> > Discontinued swap entry has higher complexity but higher payout as
> >> > well. It can get us to the place where cluster promotion/demotion
> >> > can't.
> >> >
> >> > I also feel that if we implement something towards a buddy system
> >> > allocator for swap, we should do a proper buddy allocator
> >> > implementation of data structures.
> >>
> >> I don't think that it's easy to implement a real buddy allocator for
> >> swap entries.  So, I avoid to use buddy in my words.
> >
> > Then such a mix of cluster order promote/demote lose some benefit of
> > the buddy system. Because it lacks the proper data structure to
> > support buddy allocation. The buddy allocator provides more general
> > migration between orders. For the limited usage case of cluster
> > promotion/demotion is supported (by luck). We need to evaluate whether
> > it is worth the additional complexity.
>
> TBH, I believe that the complexity of order promote/demote is quite low,
> both for development and runtime.  A real buddy allocator may need to
> increase per-swap-entry memory footprint much.

I am mostly concerned about its effectiveness. Anyway, the series is
already complex enough with the big rewrite and reclaim on swap cache.

Let me know if you think it needs to be done before the big rewrite.

Chris

Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Posted by Huang, Ying 1 year, 6 months ago

Chris Li <chrisl@kernel.org> writes:

> On Thu, Jul 25, 2024 at 11:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >
>> >> > The current proposed order also improves things step by step. The only
>> >> > disagreement here is which patch order we introduce yet another list
>> >> > in addition to the nonfull one. I just feel that it does not make
>> >> > sense to invest into new code if that new code is going to be
>> >> > completely rewrite anyway in the next two patches.
>> >> >
>> >> > Unless you mean is we should not do the patch 3 big rewrite and should
>> >> > continue the scan_swap_map_try_ssd_cluster() way of only doing half of
>> >> > the allocation job and let scan_swap_map_slots() do the complex retry
>> >> > on top of try_ssd(). I feel the overall code is more complex and less
>> >> > maintainable.
>> >>
>> >> I haven't look at [3/3], will wait for your next version for that.  So,
>> >> I cannot say which order is better.  Please consider reviewers' effort
>> >> too.  Small step patch is easier to be understood and reviewed.
>> >
>> > That is exactly the reason I don't want to introduce too much new code
>> > depending on the scan_swap_map_slots() behavior, which will be
>> > abandoned in the big rewrite. Their constraints are very different. I
>> > want to make the big rewrite patch 3 as small as possible. Using
>> > incremental follow up patches to improve it.
>> >
>> >>
>> >> >> > That is why I want to make this change patch after patch 3. There is
>> >> >> > also the long test cycle after the modification to make sure the swap
>> >> >> > code path is stable. I am not resisting a change of patch orders, it
>> >> >> > is that patch can't directly be removed before patch 3 before the big
>> >> >> > rewrite.
>> >> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> > Your original
>> >> >> >> > suggestion appears like that you want to keep all cluster with order-N
>> >> >> >> > on the nonfull list for order-N always unless the number of free swap
>> >> >> >> > entry is less than 1<<N.
>> >> >> >>
>> >> >> >> Well I think that's certainly one of the conditions for removing it. But agree
>> >> >> >> that if a full scan of the cluster has been performed and no swap entries have
>> >> >> >> been freed since the scan started then it should also be removed from the list.
>> >> >> >
>> >> >> > Yes, in the later patch of patch, beyond patch 3, we have the almost
>> >> >> > full cluster that for the cluster has been scan and not able to
>> >> >> > allocate order N entry.
>> >> >> >
>> >> >> >>
>> >> >> >> >
>> >> >> >> >>> And, I understand that in some situations it may
>> >> >> >> >>> be better to share clusters among CPUs.  So my suggestion is,
>> >> >> >> >>>
>> >> >> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >> >> >> >>>   have free swap entries with that order even after we are sure that we
>> >> >> >> >>>   haven't.
>> >> >> >> >>
>> >> >> >> >> Is this patch pretending that today? I don't think so?
>> >> >> >> >
>> >> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> >> >> > sure that there are no order-N free swap entry in the cluster.
>> >> >> >>
>> >> >> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> >> >> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
>> >> >> >> are for order-0 and you have no means of allocating high orders until a whole
>> >> >> >> cluster becomes free. That logic certainly makes sense to me, so think its
>> >> >> >> better for swap_cluster_info->order to remain static while the cluster is
>> >> >> >> allocated. (I only skimmed that conversation so appologies if I got the
>> >> >> >> conclusion wrong!).
>> >> >> >
>> >> >> > Yes, that is the original intent, keep the cluster order as much as possible.
>> >> >> >
>> >> >> >>
>> >> >> >> >
>> >> >> >> >> But I agree that a
>> >> >> >> >> cluster should only be on the per-order nonfull list if we know there are at
>> >> >> >> >> least enough free swap entries in that cluster to cover the order. Of course
>> >> >> >> >> that doesn't tell us for sure because they may not be contiguous.
>> >> >> >> >
>> >> >> >> > We can check that when free swap entry via checking adjacent swap
>> >> >> >> > entries.  IMHO, the performance should be acceptable.
>> >> >> >>
>> >> >> >> Would you then use the result of that scanning to "promote" a cluster's order?
>> >> >> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> >> >> >> a separate change on top of what Chris is doing here. For high orders there
>> >> >> >> could be quite a bit of scanning required in the worst case for every page that
>> >> >> >> gets freed.
>> >> >> >
>> >> >> > Right, I feel that is a different set of patches. Even this series is
>> >> >> > hard enough for review. Those order promotion and demotion is heading
>> >> >> > towards a buddy system design. I want to point out that even the buddy
>> >> >> > system is not able to handle the case that swapfile is almost full and
>> >> >> > the recently freed swap entries are not contiguous.
>> >> >> >
>> >> >> > We can invest in the buddy system, which doesn't handle all the
>> >> >> > fragmentation issues. Or I prefer to go directly to the discontiguous
>> >> >> > swap entry. We pay a price for the indirect mapping of swap entries.
>> >> >> > But it will solve the fragmentation issue 100%.
>> >> >>
>> >> >> It's good if we can solve the fragmentation issue 100%.  Just need to
>> >> >> pay attention to the cost.
>> >> >
>> >> > The cost you mean the development cost or the run time cost (memory and cpu)?
>> >>
>> >> I mean runtime cost.
>> >
>> > Thanks for the clarification. Agree that we need to pay attention to
>> > the run time cost. That is given.
>> >
>> >> >> >> >>> My question is whether it's so important to share the per-cpu cluster
>> >> >> >> >>> among CPUs?
>> >> >> >> >>
>> >> >> >> >> My rationale for sharing is that the preference previously has been to favour
>> >> >> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
>> >> >> >> >> given order if there are actually slots available just because they have been
>> >> >> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >> >> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >> >> >> >> actually help improve allocation success, then I'm happy to take the exclusive
>> >> >> >> >> approach.
>> >> >> >> >>
>> >> >> >> >>> I suggest to start with simple design, that is, per-CPU
>> >> >> >> >>> cluster will not be shared among CPUs in most cases.
>> >> >> >> >>
>> >> >> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >> >> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >> >> >> >> current half-and-half policy in this patch.
>> >> >> >> >
>> >> >> >> > Sounds good to me.  We can start with exclusive solution and evaluate
>> >> >> >> > whether shared solution is good.
>> >> >> >>
>> >> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
>> >> >> >
>> >> >> > It is not able to avoid fragementation 100% of the time. I prefer the
>> >> >> > discontinued swap entry as the next step, which guarantees forward
>> >> >> > progress, we will not be stuck in a situation where we are not able to
>> >> >> > allocate swap entries due to fragmentation.
>> >> >>
>> >> >> If my understanding were correct, the implementation complexity of the
>> >> >> order promotion/demotion isn't at the same level of that of discontinued
>> >> >> swap entry.
>> >> >
>> >> > Discontinued swap entry has higher complexity but higher payout as
>> >> > well. It can get us to the place where cluster promotion/demotion
>> >> > can't.
>> >> >
>> >> > I also feel that if we implement something towards a buddy system
>> >> > allocator for swap, we should do a proper buddy allocator
>> >> > implementation of data structures.
>> >>
>> >> I don't think that it's easy to implement a real buddy allocator for
>> >> swap entries.  So, I avoid to use buddy in my words.
>> >
>> > Then such a mix of cluster order promote/demote lose some benefit of
>> > the buddy system. Because it lacks the proper data structure to
>> > support buddy allocation. The buddy allocator provides more general
>> > migration between orders. For the limited usage case of cluster
>> > promotion/demotion is supported (by luck). We need to evaluate whether
>> > it is worth the additional complexity.
>>
>> TBH, I believe that the complexity of order promote/demote is quite low,
>> both for development and runtime.  A real buddy allocator may need to
>> increase per-swap-entry memory footprint much.
>
> I mostly concern its effectiveness. Anyway, the series is already
> complex enough with the big rewrite and reclaim on swap cache.
>
> Let me know if you think it needs to be done before the big rewrite.

I hope so.  But I will not force you to do that if you don't buy into it.

--
Best Regards,
Huang, Ying

[PATCH v4 1/3] mm: swap: swap cluster switch to double link list
[PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
[PATCH v4 3/3] RFC: mm: swap: seperate SSD allocation from scan_swap_map_slots()