[PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order

Chris Li posted 2 patches 1 year, 8 months ago
There is a newer version of this series

[PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order
Posted by Chris Li 1 year, 8 months ago
This is the short term solution "swap cluster order" listed
on slide 8 of my "Swap Abstraction" discussion at the recent
LSF/MM conference.

Commit 845982eb264bc "mm: swap: allow storage of all mTHP
orders" only allocates the mTHP swap entries from the new
empty cluster list. That works well for PMD size THP, but it
has a serious fragmentation issue reported by Barry.

https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/

The mTHP allocation failure rate rises to almost 100% after a few
hours in Barry's test run.

The reason is that all the empty clusters have been exhausted while
there are plenty of free swap entries in clusters that are not
100% free.

Address this by remembering the swap allocation order in the cluster
and keeping track of a per-order nonfull cluster list for later
allocation.
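
The rough shape of the data structures (a simplified sketch only; the
field and list names follow the patch descriptions, see patch 1 and 2
for the real definitions):

struct swap_cluster_info {
        ...
        unsigned int order;     /* allocation order this cluster serves */
        struct list_head list;  /* links the free / per-order nonfull lists */
};

struct swap_info_struct {
        ...
        struct list_head free_clusters; /* completely empty clusters */
        /* partially used clusters, one list per allocation order */
        struct list_head nonfull_clusters[SWAP_NR_ORDERS];
};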

This greatly improves the success rate of mTHP swap allocation.
While I am still waiting for Barry's test results, I paste Kairui's
test results here:

I'm able to reproduce such an issue with a simple script (enabling all orders of mTHP):

modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0

rmdir /sys/fs/cgroup/benchmark
mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark
echo 8G > memory.max
echo $$ > cgroup.procs

memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &

/usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
        -P memcache_binary -n allkeys --key-minimum=1 \
        --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
        --ratio 1:0 --pipeline 8 -d 1024

Before and after (memtier totals; the columns are Ops/sec, Hits/sec,
Misses/sec, Avg. Latency, p50 Latency, p99 Latency, p99.9 Latency and
KB/sec):

Before:
Totals      48805.63         0.00         0.00         5.26045         1.19100        38.91100        59.64700     51063.98
After:
Totals      71098.84         0.00         0.00         3.60585         0.71100        26.36700        39.16700     74388.74

And the fallback ratio dropped by a lot:
Before:
hugepages-32kB/stats/anon_swpout_fallback:15997
hugepages-32kB/stats/anon_swpout:18712
hugepages-512kB/stats/anon_swpout_fallback:192
hugepages-512kB/stats/anon_swpout:0
hugepages-2048kB/stats/anon_swpout_fallback:2
hugepages-2048kB/stats/anon_swpout:0
hugepages-1024kB/stats/anon_swpout_fallback:0
hugepages-1024kB/stats/anon_swpout:0
hugepages-64kB/stats/anon_swpout_fallback:18246
hugepages-64kB/stats/anon_swpout:17644
hugepages-16kB/stats/anon_swpout_fallback:13701
hugepages-16kB/stats/anon_swpout:18234
hugepages-256kB/stats/anon_swpout_fallback:8642
hugepages-256kB/stats/anon_swpout:93
hugepages-128kB/stats/anon_swpout_fallback:21497
hugepages-128kB/stats/anon_swpout:7596

(Still collecting more data; the successful swpouts mostly happened early, then fallbacks began to increase, approaching a 100% failure rate.)

After:
hugepages-32kB/stats/swpout:34445
hugepages-32kB/stats/swpout_fallback:0
hugepages-512kB/stats/swpout:1
hugepages-512kB/stats/swpout_fallback:134
hugepages-2048kB/stats/swpout:1
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:6
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35495
hugepages-64kB/stats/swpout_fallback:0
hugepages-16kB/stats/swpout:32441
hugepages-16kB/stats/swpout_fallback:0
hugepages-256kB/stats/swpout:2223
hugepages-256kB/stats/swpout_fallback:6278
hugepages-128kB/stats/swpout:29136
hugepages-128kB/stats/swpout_fallback:52

Reported-by: Barry Song <21cnbao@gmail.com>
Tested-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Chris Li <chrisl@kernel.org>
---
Chris Li (2):
      mm: swap: swap cluster switch to double link list
      mm: swap: mTHP allocate swap entries from nonfull list

 include/linux/swap.h |  18 ++--
 mm/swapfile.c        | 252 +++++++++++++++++----------------------------------
 2 files changed, 93 insertions(+), 177 deletions(-)
---
base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
change-id: 20240523-swap-allocator-1534c480ece4

Best regards,
-- 
Chris Li <chrisl@kernel.org>
Re: [PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order
Posted by Ryan Roberts 1 year, 8 months ago
Sorry I'm late to the discussion - I've been out for the last 3.5 weeks and just
getting through my mail now...


On 24/05/2024 18:17, Chris Li wrote:
> This is the short term solution "swap cluster order" listed
> on slide 8 of my "Swap Abstraction" discussion at the recent
> LSF/MM conference.

I've read the article on LWN and look forward to watching the video once
available. The longer term plans look interesting.

> 
> Commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" only allocates the mTHP swap entries from the new
> empty cluster list. That works well for PMD size THP, but it
> has a serious fragmentation issue reported by Barry.

Yes, that was a deliberate initial approach to be conservative, just like the
original PMD-size THP support. I'm glad to see work to improve the situation!

> 
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> 
> The mTHP allocation failure rate rises to almost 100% after a few
> hours in Barry's test run.
>
> The reason is that all the empty clusters have been exhausted while
> there are plenty of free swap entries in clusters that are not
> 100% free.
> 
> Address this by remembering the swap allocation order in the cluster
> and keeping track of a per-order nonfull cluster list for later
> allocation.

I don't immediately see how this helps because memory is swapped back in
per-page (currently), so just because a given cluster was initially filled with
entries of a given order, doesn't mean that those entries are freed in atomic
units; only specific pages could have been swapped back in, meaning the holes
are not of the required order. Additionally, scanning could lead to order-0
pages being populated in random places.

My naive assumption was that the obvious way to solve this problem in the short
term would be to extend the scanning logic to be able to scan for an arbitrary
order. That way you could find an allocation of the required order in any of the
clusters, even a cluster that was not originally allocated for the required order.
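
For concreteness, something like this is what I have in mind (a rough,
untested sketch, not taken from the patches; it assumes si->lock is
held and glosses over SWAP_HAS_CACHE and bad-slot details):

/* Look for a free, naturally aligned run of 2^order slots. */
static unsigned long scan_swap_map_for_order(struct swap_info_struct *si,
					     int order)
{
	unsigned long nr = 1UL << order;
	unsigned long offset, i;

	for (offset = ALIGN(si->lowest_bit, nr);
	     offset + nr - 1 <= si->highest_bit; offset += nr) {
		for (i = 0; i < nr; i++)
			if (si->swap_map[offset + i])
				break;
		if (i == nr)
			return offset;	/* free block of the required order */
	}
	return 0;	/* nothing suitable; caller falls back */
}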

I guess I should read your patches to understand exactly what you are doing
rather than making assumptions...

Thanks,
Ryan

> [snip]
Re: [PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order
Posted by Chris Li 1 year, 8 months ago
On Fri, Jun 7, 2024 at 2:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Sorry I'm late to the discussion - I've been out for the last 3.5 weeks and just
> getting through my mail now...

No problem at all, please take it easy.

>
>
> On 24/05/2024 18:17, Chris Li wrote:
> > This is the short term solution "swap cluster order" listed
> > on slide 8 of my "Swap Abstraction" discussion at the recent
> > LSF/MM conference.
>
> I've read the article on LWN and look forward to watching the video once
> available. The longer term plans look interesting.
>
> >
> > Commit 845982eb264bc "mm: swap: allow storage of all mTHP
> > orders" only allocates the mTHP swap entries from the new
> > empty cluster list. That works well for PMD size THP, but it
> > has a serious fragmentation issue reported by Barry.
>
> Yes, that was a deliberate initial approach to be conservative, just like the
> original PMD-size THP support. I'm glad to see work to improve the situation!
>
> >
> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >
> > The mTHP allocation failure rate rises to almost 100% after a few
> > hours in Barry's test run.
> >
> > The reason is that all the empty clusters have been exhausted while
> > there are plenty of free swap entries in clusters that are not
> > 100% free.
> >
> > Address this by remembering the swap allocation order in the cluster
> > and keeping track of a per-order nonfull cluster list for later
> > allocation.
>
> I don't immediately see how this helps because memory is swapped back in
> per-page (currently), so just because a given cluster was initially filled with

That is not the case for Barry's setup; he has another patch series
to swap in mTHP as a whole, especially for mTHP stored in zsmalloc
as larger-than-4K pages.
https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/

> entries of a given order, doesn't mean that those entries are freed in atomic
> units; only specific pages could have been swapped back in, meaning the holes
> are not of the required order. Additionally, scanning could lead to order-0
> pages being populated in random places.

Yes, that is a problem we need to address. The proposed short term
solution is to have an isolation scheme preventing high order swap
entries from mixing with lower order ones inside one cluster. That is
easy to do, and there are some test results confirming the
reservation/isolation effect.

>
> My naive assumption was that the obvious way to solve this problem in the short
> term would be to extend the scanning logic to be able to scan for an arbitrary
> order. That way you could find an allocation of the required order in any of the
> clusters, even a cluster that was not originally allocated for the required order.
>
> I guess I should read your patches to understand exactly what you are doing
> rather than making assumptions...

Scanning is not enough. We need to have some way to prevent the
fragmentation from happening in the first place.
Once the fragmentation has happened, it can't be easily reversed.
Scanning does not help with the fragmentation aspect.

Chris

> [snip]
Re: [PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order
Posted by Chris Li 1 year, 8 months ago
I am spinning a new version for this series to address two issues
found in this series:

1) Oppo discovered a bug in the following line:
+               ci = si->cluster_info + tmp;
Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
That is a serious bug but trivial to fix.
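
In other words, the corrected line is:

+               ci = si->cluster_info + tmp / SWAPFILE_CLUSTER;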

2) order 0 allocation currently blindly scans swap_map disregarding
the cluster->order. Given enough order 0 swap allocations (close to
the swap file size), the order 0 allocation head will eventually sweep
across the whole swapfile and destroy other clusters' order
allocations.

The short term fix is just skipping clusters that are already assigned
to higher orders.
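
Roughly, the scan loop would get a check along these lines (an
illustrative sketch of the idea, not the actual V2 code):

		/* While looking for order-0 slots, don't step into a
		 * cluster that is serving a higher order. */
		ci = si->cluster_info + offset / SWAPFILE_CLUSTER;
		if (ci->order > 0) {
			/* jump ahead to the next cluster boundary */
			offset = roundup(offset + 1, SWAPFILE_CLUSTER);
			continue;
		}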

In the long term, I want to unify the non-SSD case to use clusters
for locking and allocation as well, and just try to follow the last
allocation position (less seeking) as much as possible.

Chris



On Fri, May 24, 2024 at 10:17 AM Chris Li <chrisl@kernel.org> wrote:
> [snip]
Re: [PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order
Posted by Barry Song 1 year, 8 months ago
On Wed, May 29, 2024 at 9:04 AM Chris Li <chrisl@kernel.org> wrote:
>
> I am spinning a new version for this series to address two issues
> found in this series:
>
> 1) Oppo discovered a bug in the following line:
> +               ci = si->cluster_info + tmp;
> Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> That is a serious bug but trivial to fix.
>
> 2) order 0 allocation currently blindly scans swap_map disregarding
> the cluster->order. Given enough order 0 swap allocations (close to
> the swap file size), the order 0 allocation head will eventually sweep
> across the whole swapfile and destroy other clusters' order
> allocations.
>
> The short term fix is just skipping clusters that are already assigned
> to higher orders.
>
> In the long term, I want to unify the non-SSD case to use clusters
> for locking and allocation as well, and just try to follow the last
> allocation position (less seeking) as much as possible.

Hi Chris,

I am sharing some new test results with you. This time, we used two
zRAM devices by modifying get_swap_pages().

zram0 -> dedicated for order-0 swpout
zram1 -> dedicated for order-4 swpout
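
(Roughly, the hack in get_swap_pages() looks like this -- a
hypothetical sketch only, assuming the allocation order is available
there; zram0_si/zram1_si stand for the two devices' swap_info_struct
and are not real kernel symbols:)

	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
				  avail_lists[node]) {
		/* Route order-0 to zram0 and order-4 to zram1. */
		if ((order == 0 && si != zram0_si) ||
		    (order == 4 && si != zram1_si))
			continue;
		/* ... allocate slots from si as usual ... */
	}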

We allocate a generous amount of space for zRAM1 to ensure it never gets full
and always has ample free space. However, we found that Ryan's approach
does not perform well even in this straightforward scenario. Despite zRAM1
having 80% of its space remaining, we still experience issues obtaining
contiguous swap slots and encounter a high swpout_fallback ratio.

Sorry for the report, Ryan :-)

In contrast, with your patch, we consistently see the thp_swpout_fallback ratio
at 0%, indicating a significant improvement in the situation.

Although your patch still has issues supporting the mixing of order-0 and
order-4 pages in a swap device, it represents a significant improvement.

I would be delighted to witness your approach advancing with Ying
Huang’s assistance. However, due to my current commitments, I
regret that I am unable to allocate time for debugging.

> [snip]

Thanks
Barry
Re: [PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order
Posted by Ryan Roberts 1 year, 8 months ago
On 30/05/2024 08:49, Barry Song wrote:
> On Wed, May 29, 2024 at 9:04 AM Chris Li <chrisl@kernel.org> wrote:
>>
>> I am spinning a new version for this series to address two issues
>> found in this series:
>>
>> 1) Oppo discovered a bug in the following line:
>> +               ci = si->cluster_info + tmp;
>> Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
>> That is a serious bug but trivial to fix.
>>
>> 2) order 0 allocation currently blindly scans swap_map disregarding
>> the cluster->order. Given enough order 0 swap allocations (close to
>> the swap file size), the order 0 allocation head will eventually sweep
>> across the whole swapfile and destroy other clusters' order
>> allocations.
>>
>> The short term fix is just skipping clusters that are already assigned
>> to higher orders.
>>
>> In the long term, I want to unify the non-SSD case to use clusters
>> for locking and allocation as well, and just try to follow the last
>> allocation position (less seeking) as much as possible.
> 
> Hi Chris,
> 
> I am sharing some new test results with you. This time, we used two
> zRAM devices by modifying get_swap_pages().
> 
> zram0 -> dedicated for order-0 swpout
> zram1 -> dedicated for order-4 swpout
> 
> We allocate a generous amount of space for zRAM1 to ensure it never gets full
> and always has ample free space. However, we found that Ryan's approach
> does not perform well even in this straightforward scenario. Despite zRAM1
> having 80% of its space remaining, we still experience issues obtaining
> contiguous swap slots and encounter a high swpout_fallback ratio.
> 
> Sorry for the report, Ryan :-)

No problem; clearly it needs to be fixed, and I'll help where I can. I'm pretty
sure that this is due to fragmentation preventing clusters from being freed back
to the free list.

> 
> In contrast, with your patch, we consistently see the thp_swpout_fallback ratio
> at 0%, indicating a significant improvement in the situation.

Unless I've misunderstood something critical, Chris's change is just allowing a
cpu to steal a block from another cpu's current cluster for that order. So it
just takes longer (approx by a factor of the number of cpus in the system) to
get to the state where fragmentation is causing fallbacks? As I said in the
other thread, I think the more robust solution is to implement scanning for high
order blocks.

> [snip]

Re: [PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order
Posted by Chris Li 1 year, 8 months ago
On Fri, Jun 7, 2024 at 3:49 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 30/05/2024 08:49, Barry Song wrote:
> > On Wed, May 29, 2024 at 9:04 AM Chris Li <chrisl@kernel.org> wrote:
> >>
> >> I am spinning a new version for this series to address two issues
> >> found in this series:
> >>
> >> 1) Oppo discovered a bug in the following line:
> >> +               ci = si->cluster_info + tmp;
> >> Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> That is a serious bug but trivial to fix.
> >>
> >> 2) order 0 allocation currently blindly scans swap_map disregarding
> >> the cluster->order. Given enough order 0 swap allocations (close to
> >> the swap file size), the order 0 allocation head will eventually sweep
> >> across the whole swapfile and destroy other clusters' order
> >> allocations.
> >>
> >> The short term fix is just skipping clusters that are already assigned
> >> to higher orders.
> >>
> >> In the long term, I want to unify the non-SSD case to use clusters
> >> for locking and allocation as well, and just try to follow the last
> >> allocation position (less seeking) as much as possible.
> >
> > Hi Chris,
> >
> > I am sharing some new test results with you. This time, we used two
> > zRAM devices by modifying get_swap_pages().
> >
> > zram0 -> dedicated for order-0 swpout
> > zram1 -> dedicated for order-4 swpout
> >
> > We allocate a generous amount of space for zRAM1 to ensure it never gets full
> > and always has ample free space. However, we found that Ryan's approach
> > does not perform well even in this straightforward scenario. Despite zRAM1
> > having 80% of its space remaining, we still experience issues obtaining
> > contiguous swap slots and encounter a high swpout_fallback ratio.
> >
> > Sorry for the report, Ryan :-)
>
> No problem; clearly it needs to be fixed, and I'll help where I can. I'm pretty
> sure that this is due to fragmentation preventing clusters from being freed back
> to the free list.
>
> >
> > In contrast, with your patch, we consistently see the thp_swpout_fallback ratio
> > at 0%, indicating a significant improvement in the situation.
>
> Unless I've misunderstood something critical, Chris's change is just allowing a
> cpu to steal a block from another cpu's current cluster for that order. So it

No, that is not the main change. The main change is to allow a CPU to
allocate from the nonfull, non-empty clusters, which are neither any
CPU's current cluster nor on the empty list. The current patch does
not prevent a CPU from stealing from another CPU's current cluster of
that order. That will get addressed in V2.

> just takes longer (approx by a factor of the number of cpus in the system) to
> get to the state where fragmentation is causing fallbacks? As I said in the
> other thread, I think the more robust solution is to implement scanning for high
> order blocks.

That will introduce more fragmentation in the high order clusters and
will make it harder to allocate high order swap entries later.

Please see my previous email for the use case and the goal of the change.
https://lore.kernel.org/linux-mm/CANeU7QnVzqGKXp9pKDLWiuhqTvBxXupgFCRXejYhshAjw6uDyQ@mail.gmail.com/T/#mf431a743e458896c2ab4a4077b103341313c9cf4

Let's discuss whether the use case and the goal make sense or not.

Chris
Re: [PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order
Posted by Huang, Ying 1 year, 8 months ago
Chris Li <chrisl@kernel.org> writes:

> I am spinning a new version for this series to address two issues
> found in this series:
>
> 1) Oppo discovered a bug in the following line:
> +               ci = si->cluster_info + tmp;
> Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> That is a serious bug but trivial to fix.
>
> 2) order 0 allocation currently blindly scans swap_map disregarding
> the cluster->order.

IIUC, now, we scan swap_map[] only if
list_empty(&si->free_clusters) && list_empty(&si->nonfull_clusters[order]).
That is, if you don't run low on free swap space, you will not do that.

> Given enough order 0 swap allocations (close to the
> swap file size), the order 0 allocation head will eventually sweep
> across the whole swapfile and destroy other clusters' order
> allocations.
>
> The short term fix is just skipping clusters that are already assigned
> to higher orders.

Better to do any further optimization on top of the simpler one.  Need
to evaluate whether it's necessary to add more complexity.

> In the long term, I want to unify the non-SSD case to use clusters
> for locking and allocation as well, and just try to follow the last
> allocation position (less seeking) as much as possible.

I have thought about that too.  Personally, I think that it's good to
remove swap_map[] scanning.  The implementation can be simplified too.
I don't know whether we need to consider the performance of HDD swap
now.

--
Best Regards,
Huang, Ying

> [snip]
Re: [PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order
Posted by Chris Li 1 year, 8 months ago
Hi Ying,

On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > I am spinning a new version for this series to address two issues
> > found in this series:
> >
> > 1) Oppo discovered a bug in the following line:
> > +               ci = si->cluster_info + tmp;
> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> > That is a serious bug but trivial to fix.
> >
> > 2) order 0 allocation currently blindly scans swap_map disregarding
> > the cluster->order.
>
> IIUC, now, we scan swap_map[] only if
> list_empty(&si->free_clusters) && list_empty(&si->nonfull_clusters[order]).
> That is, if you don't run low on free swap space, you will not do that.

You can still allocate swap space in order 0 clusters while order 4
runs out of free_clusters or nonfull_clusters[order]. For Android that
is a common case.

>
> > Given enough order 0 swap allocations (close to the
> > swap file size), the order 0 allocation head will eventually sweep
> > across the whole swapfile and destroy other clusters' order
> > allocations.
> >
> > The short term fix is just skipping clusters that are already assigned
> > to higher orders.
>
> Better to do any further optimization on top of the simpler one.  Need
> to evaluate whether it's necessary to add more complexity.

I agree this needs more careful planning and discussion. In Android's
use case, the swapfile is almost always full. It will run into this
situation after a long enough swap time.
Once order 0 swap entries start to pollute a higher order cluster,
there is no going back (until the whole cluster is 100% free).

> > In the long term, I want to unify the non-SSD case to use clusters
> > for locking and allocation as well, and just try to follow the last
> > allocation position (less seeking) as much as possible.
>
> I have thought about that too.  Personally, I think that it's good to
> remove swap_map[] scanning.  The implementation can be simplified too.

Agree. I looked at the commit that introduced the SSD cluster. The
commit message indicates a lot of CPU time was spent in swap_map
scanning, especially when the swapfile is almost full. The main
motivation to introduce clusters on HDD is to simplify and unify the
code.

> I don't know whether we need to consider the performance of HDD swap
> now.

I am not sure about that either. We can make a best effort to reduce seeking.

Chris

> [snip]
Re: [PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order
Posted by Huang, Ying 1 year, 8 months ago
Chris Li <chrisl@kernel.org> writes:

> Hi Ying,
>
> On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > I am spinning a new version for this series to address two issues
>> > found in this series:
>> >
>> > 1) Oppo discovered a bug in the following line:
>> > +               ci = si->cluster_info + tmp;
>> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
>> > That is a serious bug but trivial to fix.
>> >
>> > 2) order 0 allocation currently blindly scans swap_map disregarding
>> > the cluster->order.
>>
>> IIUC, now, we scan swap_map[] only if
>> list_empty(&si->free_clusters) && list_empty(&si->nonfull_clusters[order]).
>> That is, if you don't run low on free swap space, you will not do that.
>
> You can still allocate swap space in order 0 clusters while order 4
> runs out of free_clusters or nonfull_clusters[order]. For Android that
> is a common case.

When we fail to allocate order 4, we will fall back to order 0 and
still don't need to scan swap_map[].  But after looking at your reply
below, I realized that the swap space is almost full most of the time
in your cases.  Then, it's possible that we run into scanning
swap_map[].  list_empty(&si->free_clusters) &&
list_empty(&si->nonfull_clusters[order]) will become true if we put too
many clusters in si->percpu_cluster.  So, if we want to avoid scanning
swap_map[], we can stop adding clusters to si->percpu_cluster when swap
space runs low.  And maybe take clusters out of si->percpu_cluster
sometimes.

Another issue is that nonfull_clusters[order1] cannot be used for
nonfull_clusters[order2].  By definition, we should not fail order 0
allocation; we need to steal from nonfull_clusters[order>0] for order 0
allocation.  This can avoid scanning swap_map[] too.  This may not be
perfect, but it is the simplest first-step implementation.  You can
optimize based on it further.
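
Sketching the order 0 fallback I mean (illustrative only; assumes
clusters are linked through a "list" member as in your patch):

	/* Order-0 fallback: free clusters first, then order-0 nonfull
	 * clusters, then steal a nonfull cluster from a higher order;
	 * scan swap_map[] only as the last resort. */
	if (!list_empty(&si->free_clusters)) {
		ci = list_first_entry(&si->free_clusters,
				      struct swap_cluster_info, list);
	} else if (!list_empty(&si->nonfull_clusters[0])) {
		ci = list_first_entry(&si->nonfull_clusters[0],
				      struct swap_cluster_info, list);
	} else {
		for (o = 1; o < SWAP_NR_ORDERS; o++) {
			if (list_empty(&si->nonfull_clusters[o]))
				continue;
			ci = list_first_entry(&si->nonfull_clusters[o],
					      struct swap_cluster_info, list);
			ci->order = 0;	/* repurpose for order 0 */
			break;
		}
	}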

And, I checked your code again.  It appears that si->percpu_cluster may
be put on si->nonfull_clusters[], then be used by another CPU.  Please
check it.

--
Best Regards,
Huang, Ying

[snip]
Re: [PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order
Posted by Chris Li 1 year, 8 months ago
On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > Hi Ying,
> >
> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > I am spinning a new version for this series to address two issues
> >> > found in this series:
> >> >
> >> > 1) Oppo discovered a bug in the following line:
> >> > +               ci = si->cluster_info + tmp;
> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> > That is a serious bug but trivial to fix.
> >> >
> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> >> > the cluster->order.
> >>
> >> IIUC, now, we scan swap_map[] only if
> >> list_empty(&si->free_clusters) && list_empty(&si->nonfull_clusters[order]).
> >> That is, if you don't run low on free swap space, you will not do that.
> >
> > You can still allocate swap space in order 0 clusters while order 4
> > runs out of free_clusters or nonfull_clusters[order]. For Android that
> > is a common case.
>
> When we fail to allocate order 4, we will fall back to order 0 and
> still don't need to scan swap_map[].  But after looking at your reply
> below, I realized that the swap space is almost full most of the time
> in your cases.  Then, it's possible that we run into scanning
> swap_map[].  list_empty(&si->free_clusters) &&
> list_empty(&si->nonfull_clusters[order]) will become true if we put too
> many clusters in si->percpu_cluster.  So, if we want to avoid scanning
> swap_map[], we can stop adding clusters to si->percpu_cluster when swap
> space runs low.  And maybe take clusters out of si->percpu_cluster
> sometimes.

One idea after reading your reply: if we run out of
nonfull_clusters[order], we should be able to use other CPUs'
si->percpu_cluster as well. That is a very small win for Android,
because Android does not have many CPUs. We are talking about a
handful of clusters, which might not justify the code complexity. It
does not change the behavior that order 0 can pollute higher orders.

>
> Another issue is that nonfull_clusters[order1] cannot be used for
> nonfull_clusters[order2].  By definition, we should not fail order 0
> allocation; we need to steal from nonfull_clusters[order>0] for order 0
> allocation.  This can avoid scanning swap_map[] too.  This may not be
> perfect, but it is the simplest first-step implementation.  You can
> optimize based on it further.

Yes, that is listed as a limitation of this cluster order approach.
Initially we need to support one order well first. We might choose
which order that is, a 16K or 64K folio. 4K pages are too small, 2M
pages are too big. The sweet spot might be somewhere in between.  If we
can support one order well, we can demonstrate the value of mTHP. We
can worry about other mixed orders later.

Do you have any suggestions for how to prevent order 0 from polluting
the higher-order clusters? If we allow that to happen, then it defeats
the goal of being able to allocate higher-order swap entries. The
tricky question is that we don't know how much swap space we should
reserve for each order. We can always break higher-order clusters into
lower orders, but we can't do the reverse. The current patch series
lets the actual usage determine the percentage of the clusters for each
order. However, that seems not to be enough for the test case Barry
has. When an app gets OOM-killed, a large swing of order-0 swap shows
up with not enough higher-order usage for that brief moment. The
order-0 swap entries then pollute the high-order clusters. We are
currently debating a "knob" to reserve a certain % of swap space for a
certain order. Those reservations would be guaranteed, and order-0 swap
entries couldn't pollute them even when swap space runs out. That can
make mTHP at least usable for the Android case.
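
For illustration only, here is a sketch of what such a knob could look
like.  reserve_pct and the cluster counters are invented names for this
sketch, not anything in this series:

#include <stdbool.h>
#include <stdio.h>

#define TOTAL_CLUSTERS 1000

static int reserve_pct = 10;       /* hypothetical knob: % kept for mTHP */
static int nr_free_clusters = 95;  /* empty clusters left in the pool */

/* May an order-0 allocation take one of the remaining empty clusters? */
static bool order0_may_take_free_cluster(void)
{
	int reserved = TOTAL_CLUSTERS * reserve_pct / 100;

	/* hold `reserved` empty clusters back for high-order use only */
	return nr_free_clusters > reserved;
}

int main(void)
{
	printf("order 0 may take a free cluster: %s\n",
	       order0_may_take_free_cluster() ? "yes" : "no");
	return 0;
}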

Do you see another way to protect the high-order clusters from being
polluted by lower-order ones?

>
> And, I checked your code again.  It appears that si->percpu_cluster may
> be put in si->nonfull_cluster[], then be used by another CPU.  Please
> check it.

Ah, good point. I think it does. Let me take a closer look.

Chris


Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Huang, Ying 1 year, 8 months ago
Chris Li <chrisl@kernel.org> writes:

> On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > Hi Ying,
>> >
>> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Chris Li <chrisl@kernel.org> writes:
>> >>
>> >> > I am spinning a new version for this series to address two issues
>> >> > found in this series:
>> >> >
>> >> > 1) Oppo discovered a bug in the following line:
>> >> > +               ci = si->cluster_info + tmp;
>> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
>> >> > That is a serious bug but trivial to fix.
>> >> >
>> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
>> >> > the cluster->order.
>> >>
>> >> IIUC, now, we only scan swap_map[] only if
>> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
>> >> That is, if you doesn't run low swap free space, you will not do that.
>> >
>> > You can still swap space in order 0 clusters while order 4 runs out of
>> > free_cluster
>> > or nonfull_clusters[order]. For Android that is a common case.
>>
>> When we fail to allocate order 4, we will fallback to order 0.  Still
>> don't need to scan swap_map[].  But after looking at your below reply, I
>> realized that the swap space is almost full at most times in your cases.
>> Then, it's possible that we run into scanning swap_map[].
>> list_empty(&si->free_clusters) &&
>> list_empty(&si->nonfull_clusters[order]) will become true, if we put too
>> many clusters in si->percpu_cluster.  So, if we want to avoid to scan
>> swap_map[], we can stop add clusters in si->percpu_cluster when swap
>> space runs low.  And maybe take clusters out of si->percpu_cluster
>> sometimes.
>
> One idea after reading your reply. If we run out of the
> nonfull_cluster[order], we should be able to use other cpu's
> si->percpu_cluster[] as well. That is a very small win for Android,

This will be useful in general.  The number of CPUs may be large, and
multiple orders may be used.

> because android does not have too many cpu. We are talking about a
> handful of clusters, which might not justify the code complexity. It
> does not change the behavior that order 0 can pollut higher order.

I have a feeling that you don't really know why swap_map[] is scanned.
I suggest you do more testing and tracing to find out the reason.  I
suspect that there are some non-full cluster collection issues.

>> Another issue is nonfull_cluster[order1] cannot be used for
>> nonfull_cluster[order2].  In definition, we should not fail order 0
>> allocation, we need to steal nonfull_cluster[order>0] for order 0
>> allocation.  This can avoid to scan swap_map[] too.  This may be not
>> perfect, but it is the simplest first step implementation.  You can
>> optimize based on it further.
>
> Yes, that is listed as the limitation of this cluster order approach.
> Initially we need to support one order well first. We might choose
> what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
> are too big. The sweet spot might be some there in between.  If we can
> support one order well, we can demonstrate the value of the mTHP. We
> can worry about other mix orders later.
>
> Do you have any suggestions for how to prevent the order 0 polluting
> the higher order cluster? If we allow that to happen, then it defeats
> the goal of being able to allocate higher order swap entries. The
> tricky question is we don't know how much swap space we should reserve
> for each order. We can always break higher order clusters to lower
> order, but can't do the reserves. The current patch series lets the
> actual usage determine the percentage of the cluster for each order.
> However that seems not enough for the test case Barry has. When the
> app gets OOM kill that is where a large swing of order 0 swap will
> show up and not enough higher order usage for the brief moment. The
> order 0 swap entry will pollute the high order cluster. We are
> currently debating a "knob" to be able to reserve a certain % of swap
> space for a certain order. Those reservations will be guaranteed and
> order 0 swap entry can't pollute them even when it runs out of swap
> space. That can make the mTHP at least usable for the Android case.

IMO, the bottom line is that order-0 allocation is the first-class
citizen; we must keep it optimized.  And OOM with free swap space isn't
acceptable.  Please consider the policy we used for page allocation.

> Do you see another way to protect the high order cluster polluted by
> lower order one?

If we use high-order page allocation as a reference, we need something
like compaction to eventually guarantee high-order allocation.  But we
are too far from that.

For a specific configuration, I believe that we can get a reasonable
high-order swap entry allocation success rate for specific use cases.
For example, if we only allow a limited maximum number of order-0 swap
entry allocations, can we keep high-order clusters?

>>
>> And, I checked your code again.  It appears that si->percpu_cluster may
>> be put in si->nonfull_cluster[], then be used by another CPU.  Please
>> check it.
>
> Ah, good point. I think it does. Let me take a closer look.

--
Best Regards,
Huang, Ying
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Chris Li 1 year, 8 months ago
On Thu, May 30, 2024 at 7:37 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> > because android does not have too many cpu. We are talking about a
> > handful of clusters, which might not justify the code complexity. It
> > does not change the behavior that order 0 can pollut higher order.
>
> I have a feeling that you don't really know why swap_map[] is scanned.
> I suggest you to do more test and tracing to find out the reason.  I
> suspect that there are some non-full cluster collection issues.

swap_map[] is scanned because we run out of non-full clusters.
This can happen because Android tries to make full use of the swapfile.
However, once the swap_map[] scan happens, the non-full clusters are
polluted.

I currently don't have a local reproduction of the issue Barry reported.
However, here is one data point: with two swap files, one used for
high-order allocation only, this patch shows no fallback.
If there were a non-full cluster collection issue, we should see the
fallback in this case as well.

BTW, with the same setup but without this patch series, the high-order
allocation falls back as well.

>
> >> Another issue is nonfull_cluster[order1] cannot be used for
> >> nonfull_cluster[order2].  In definition, we should not fail order 0
> >> allocation, we need to steal nonfull_cluster[order>0] for order 0
> >> allocation.  This can avoid to scan swap_map[] too.  This may be not
> >> perfect, but it is the simplest first step implementation.  You can
> >> optimize based on it further.
> >
> > Yes, that is listed as the limitation of this cluster order approach.
> > Initially we need to support one order well first. We might choose
> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
> > are too big. The sweet spot might be some there in between.  If we can
> > support one order well, we can demonstrate the value of the mTHP. We
> > can worry about other mix orders later.
> >
> > Do you have any suggestions for how to prevent the order 0 polluting
> > the higher order cluster? If we allow that to happen, then it defeats
> > the goal of being able to allocate higher order swap entries. The
> > tricky question is we don't know how much swap space we should reserve
> > for each order. We can always break higher order clusters to lower
> > order, but can't do the reserves. The current patch series lets the
> > actual usage determine the percentage of the cluster for each order.
> > However that seems not enough for the test case Barry has. When the
> > app gets OOM kill that is where a large swing of order 0 swap will
> > show up and not enough higher order usage for the brief moment. The
> > order 0 swap entry will pollute the high order cluster. We are
> > currently debating a "knob" to be able to reserve a certain % of swap
> > space for a certain order. Those reservations will be guaranteed and
> > order 0 swap entry can't pollute them even when it runs out of swap
> > space. That can make the mTHP at least usable for the Android case.
>
> IMO, the bottom line is that order-0 allocation is the first class
> citizen, we must keep it optimized.  And, OOM with free swap space isn't
> acceptable.  Please consider the policy we used for page allocation.

We need to make both order-0 and high-order allocation work after
the initial pass of allocating from empty clusters.
Having only order-0 allocation work is not good enough.

On the page allocation side, we have hugetlbfs, which reserves some
memory for high-order pages.
We should have something similar to allow reserving some high-order
swap entries without them getting polluted by low-order ones.

>
> > Do you see another way to protect the high order cluster polluted by
> > lower order one?
>
> If we use high-order page allocation as reference, we need something
> like compaction to guarantee high-order allocation finally.  But we are
> too far from that.

We should consider reservation for high-order swap entry allocation,
similar to hugetlbfs for memory.
Swap compaction will be very complicated because it needs to scan the
PTEs to migrate the swap entries. It might be easier to support writing
a folio out to compound discontiguous swap entries. That is another way
to address the fragmentation issue. We are also too far from that
right now.

>
> For specific configuration, I believe that we can get reasonable
> high-order swap entry allocation success rate for specific use cases.
> For example, if we only do limited maximum number order-0 swap entries
> allocation, can we keep high-order clusters?

Yes, we can, by having a knob to reserve some high-order swap space.
Limiting order 0 is the same as having some high-order swap entries
reserved.

That is a short-term solution.

Chris
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Huang, Ying 1 year, 8 months ago
Chris Li <chrisl@kernel.org> writes:

> On Thu, May 30, 2024 at 7:37 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > because android does not have too many cpu. We are talking about a
>> > handful of clusters, which might not justify the code complexity. It
>> > does not change the behavior that order 0 can pollut higher order.
>>
>> I have a feeling that you don't really know why swap_map[] is scanned.
>> I suggest you to do more test and tracing to find out the reason.  I
>> suspect that there are some non-full cluster collection issues.
>
> Swap_map[] is scanned because of running out of non full clusters.
> This can happen because Android tries to make full use of the swapfile.
> However, once the swap_map[] scan happens, the non full cluster is polluted.
>
> I currently don't have a local reproduction of the issue Barry reported.
> However here is some data point:
> Two swap files, one for high order allocation only with this patch. No
> fall back.
> If there is a non-full cluster collection issue, we should see the
> fall back in this case as well.
>
> BTW, same setup without this patch series it will fall back on the
> high order allocation as well.
>
>>
>> >> Another issue is nonfull_cluster[order1] cannot be used for
>> >> nonfull_cluster[order2].  In definition, we should not fail order 0
>> >> allocation, we need to steal nonfull_cluster[order>0] for order 0
>> >> allocation.  This can avoid to scan swap_map[] too.  This may be not
>> >> perfect, but it is the simplest first step implementation.  You can
>> >> optimize based on it further.
>> >
>> > Yes, that is listed as the limitation of this cluster order approach.
>> > Initially we need to support one order well first. We might choose
>> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
>> > are too big. The sweet spot might be some there in between.  If we can
>> > support one order well, we can demonstrate the value of the mTHP. We
>> > can worry about other mix orders later.
>> >
>> > Do you have any suggestions for how to prevent the order 0 polluting
>> > the higher order cluster? If we allow that to happen, then it defeats
>> > the goal of being able to allocate higher order swap entries. The
>> > tricky question is we don't know how much swap space we should reserve
>> > for each order. We can always break higher order clusters to lower
>> > order, but can't do the reserves. The current patch series lets the
>> > actual usage determine the percentage of the cluster for each order.
>> > However that seems not enough for the test case Barry has. When the
>> > app gets OOM kill that is where a large swing of order 0 swap will
>> > show up and not enough higher order usage for the brief moment. The
>> > order 0 swap entry will pollute the high order cluster. We are
>> > currently debating a "knob" to be able to reserve a certain % of swap
>> > space for a certain order. Those reservations will be guaranteed and
>> > order 0 swap entry can't pollute them even when it runs out of swap
>> > space. That can make the mTHP at least usable for the Android case.
>>
>> IMO, the bottom line is that order-0 allocation is the first class
>> citizen, we must keep it optimized.  And, OOM with free swap space isn't
>> acceptable.  Please consider the policy we used for page allocation.
>
> We need to make order-0 and high order allocation both can work after
> the initial pass of allocating from empty clusters.
> Only order-0 allocation work is not good enough.
>
> In the page allocation side, we have the hugetlbfs which reserve some
> memory for high order pages.
> We should have similar things to allow reserve some high order swap
> entries without getting polluted by low order one.

TBH, I don't like the idea of high-order swap entry reservation.  If
that's really important for you, I think that it's better to design
something like hugetlbfs vs. core mm, that is, kept separate from the
normal swap subsystem as much as possible.

>>
>> > Do you see another way to protect the high order cluster polluted by
>> > lower order one?
>>
>> If we use high-order page allocation as reference, we need something
>> like compaction to guarantee high-order allocation finally.  But we are
>> too far from that.
>
> We should consider reservation for high-order swap entry allocation
> similar to hugetlbfs for memory.
> Swap compaction will be very complicated because it needs to scan the
> PTE to migrate the swap entry. It might be easier to support folio
> write out compound discontiguous swap entries. That is another way to
> address the fragmentation issue. We are also too far from that as
> right now.

Writing out compound discontiguous swap entries isn't easy either.
For example, how would we put such folios in the swap cache?

>>
>> For specific configuration, I believe that we can get reasonable
>> high-order swap entry allocation success rate for specific use cases.
>> For example, if we only do limited maximum number order-0 swap entries
>> allocation, can we keep high-order clusters?
>
> Yes we can. Having a knob to reserve some high order swap space.
> Limiting order 0 is the same as having some high order swap entries
> reserved.
>
> That is a short term solution.

--
Best Regards,
Huang, Ying
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Chris Li 1 year, 8 months ago
On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>

> > In the page allocation side, we have the hugetlbfs which reserve some
> > memory for high order pages.
> > We should have similar things to allow reserve some high order swap
> > entries without getting polluted by low order one.
>
> TBH, I don't like the idea of high order swap entries reservation.
May I know more about why you don't like the idea? I understand this
can be controversial, because previously we liked to treat THP as a
best-effort approach: if for some reason we can't make a THP, we use
order 0 as the fallback.

For discussion purposes, I want to break it down into smaller steps:

First, can we agree that the following use case is reasonable:
as Barry has shown, zsmalloc compressing sizes bigger than 4K can give
both a better compression ratio and a CPU performance gain.
https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/

So the goal is to make THP/mTHP have some reasonable success rate
running with mixed-size swap allocation, even after either low-order or
high-order swap requests have overflowed the swap file size. The
allocator can still recover from that after some swap entries get
freed.

Please let me know if you think the above use case and goal are not
reasonable for the kernel.

> that's really important for you, I think that it's better to design
> something like hugetlbfs vs core mm, that is, be separated from the
> normal swap subsystem as much as possible.

I am citing hugetlbfs just to make the point that reservation, or
isolation of the resource, prevents the mixed-order fragmentation that
exists in core mm.
I am not suggesting copying the hugetlbfs implementation into the swap
system. Unlike hugetlbfs, swap allocation is typically done from
the kernel; it is transparent to the application. I don't think
separating it from the swap subsystem is a good way to go.

This comes down to why you don't like the reservation. E.g. if we used
two swapfiles, and one swapfile were purely allocated for high order,
would that be better?
>
> >>
> >> > Do you see another way to protect the high order cluster polluted by
> >> > lower order one?
> >>
> >> If we use high-order page allocation as reference, we need something
> >> like compaction to guarantee high-order allocation finally.  But we are
> >> too far from that.
> >
> > We should consider reservation for high-order swap entry allocation
> > similar to hugetlbfs for memory.
> > Swap compaction will be very complicated because it needs to scan the
> > PTE to migrate the swap entry. It might be easier to support folio
> > write out compound discontiguous swap entries. That is another way to
> > address the fragmentation issue. We are also too far from that as
> > right now.
>
> That's not easy to write out compound discontiguous swap entries too.
> For example, how to put folios in swap cache?

I proposed the idea in the recent LSF/MM discussion; the last few
slides are about discontiguous swap, and it has the discontiguous
entries in the swap cache.
https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view

Agreed, it is not an easy change. The swap cache would have to drop
the assumption that all offsets are contiguous.
For swap, we kind of have some in-memory data associated with each
offset already, so it might provide an opportunity to combine the
offset-related data structures for swap together. Another alternative
might be using an xarray without the multi-entry property, just
treating each offset like a single entry. I haven't dug deep into this
direction yet.

We can have more discussion, maybe arrange an upstream alignment
meeting if there is interest.

Chris
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Huang, Ying 1 year, 8 months ago
Chris Li <chrisl@kernel.org> writes:

> On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>
>> > In the page allocation side, we have the hugetlbfs which reserve some
>> > memory for high order pages.
>> > We should have similar things to allow reserve some high order swap
>> > entries without getting polluted by low order one.
>>
>> TBH, I don't like the idea of high order swap entries reservation.
> May I know more if you don't like the idea? I understand this can be
> controversial, because previously we like to take the THP as the best
> effort approach. If there is some reason we can't make THP, we use the
> order 0 as fall back.
>
> For discussion purpose, I want break it down to smaller steps:
>
> First, can we agree that the following usage case is reasonable:
> The usage case is that, as Barry has shown, zsmalloc compresses bigger
> size than 4K and can have both better compress ratio and CPU
> performance gain.
> https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
>
> So the goal is to make THP/mTHP have some reasonable success rate
> running in the mix size swap allocation, after either low order or
> high order swap requests can overflow the swap file size. The allocate
> can still recover from that, after some swap entries got free.
>
> Please let me know if you think the above usage case and goal are not
> reasonable for the kernel.

I think that it's reasonable to improve the success rate of high-order
swap entry allocation.  I just think that it's hard to use the
reservation-based method.  For example, how much should be reserved?
Why would the system OOM when there's still swap space available?  And
so forth.  So, I prefer the transparent methods.  Just like THP vs.
hugetlbfs.

>> that's really important for you, I think that it's better to design
>> something like hugetlbfs vs core mm, that is, be separated from the
>> normal swap subsystem as much as possible.
>
> I am giving hugetlbfs just to make the point using reservation, or
> isolation of the resource to prevent mixing fragmentation existing in
> core mm.
> I am not suggesting copying the hugetlbfs implementation to the swap
> system. Unlike hugetlbfs, the swap allocation is typically done from
> the kernel, it is transparent from the application. I don't think
> separate from the swap subsystem is a good way to go.
>
> This comes down to why you don't like the reservation. e.g. if we use
> two swapfile, one swapfile is purely allocate for high order, would
> that be better?

Sorry, my words weren't accurate.  Personally, I just think that it's
better to make the reservation-related code not too intrusive.

And, before reservation, we need to consider something else first.
Is it generally good to swap in at the swap-out order?  Should we
consider memory wastage too?  One static policy doesn't fit all; we may
need either a dynamic policy, or to make the policy configurable.

In general, I think that we need to do this step by step.

>> >> > Do you see another way to protect the high order cluster polluted by
>> >> > lower order one?
>> >>
>> >> If we use high-order page allocation as reference, we need something
>> >> like compaction to guarantee high-order allocation finally.  But we are
>> >> too far from that.
>> >
>> > We should consider reservation for high-order swap entry allocation
>> > similar to hugetlbfs for memory.
>> > Swap compaction will be very complicated because it needs to scan the
>> > PTE to migrate the swap entry. It might be easier to support folio
>> > write out compound discontiguous swap entries. That is another way to
>> > address the fragmentation issue. We are also too far from that as
>> > right now.
>>
>> That's not easy to write out compound discontiguous swap entries too.
>> For example, how to put folios in swap cache?
>
> I propose the idea in the recent LSF/MM discussion, the last few
> slides are for the discontiguous swap and it has the discontiguous
> entries in swap cache.
> https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
>
> Agree it is not an easy change. The cache cache would have to change
> the assumption all offset are contiguous.
> For swap, we kind of have some in memory data associated with per
> offset already, so it might provide an opportunity to combine the
> offset related data structure for swap together. Another alternative
> might be using xarray without the multi entry property. , just treat
> each offset like a single entry. I haven't dug deep into this
> direction yet.

Thanks!  I will study your idea.

> We can have more discussion, maybe arrange an upstream alignment
> meeting if there is interest.

Sure.

--
Best Regards,
Huang, Ying
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Chris Li 1 year, 8 months ago
On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >
> >> > In the page allocation side, we have the hugetlbfs which reserve some
> >> > memory for high order pages.
> >> > We should have similar things to allow reserve some high order swap
> >> > entries without getting polluted by low order one.
> >>
> >> TBH, I don't like the idea of high order swap entries reservation.
> > May I know more if you don't like the idea? I understand this can be
> > controversial, because previously we like to take the THP as the best
> > effort approach. If there is some reason we can't make THP, we use the
> > order 0 as fall back.
> >
> > For discussion purpose, I want break it down to smaller steps:
> >
> > First, can we agree that the following usage case is reasonable:
> > The usage case is that, as Barry has shown, zsmalloc compresses bigger
> > size than 4K and can have both better compress ratio and CPU
> > performance gain.
> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> >
> > So the goal is to make THP/mTHP have some reasonable success rate
> > running in the mix size swap allocation, after either low order or
> > high order swap requests can overflow the swap file size. The allocate
> > can still recover from that, after some swap entries got free.
> >
> > Please let me know if you think the above usage case and goal are not
> > reasonable for the kernel.
>
> I think that it's reasonable to improve the success rate of high-order

Glad to hear that.

> swap entries allocation.  I just think that it's hard to use the
> reservation based method.  For example, how much should be reserved?

Understood, it is harder to use than a fully transparent method, but
still better than no solution at all. The alternative right now is that
we can't do it at all.

Regarding how much we should reserve: similarly, how do you choose
your swap file size? If you choose N, why not N*120% or N*80%?
That did not stop us from having a swapfile, right?

> Why system OOM when there's still swap space available?  And so forth.

Keep in mind that the reservation is an option. If you prefer the old
behavior, you don't have to use the reservation. That shouldn't be a
reason to stop others who want to use it. We don't have an alternative
solution for long-running mixed-size allocation yet. If there is one,
I'd like to hear it.

> So, I prefer the transparent methods.  Just like THP vs. hugetlbfs.

Me too. I prefer transparent over reservation if it can achieve the
same goal. Do we have a fully transparent method specced out? How do we
achieve full transparency and also avoid the fragmentation caused by
mixed-order allocation/free?

Keep in mind that we are still in the early stage of mTHP swap
development, so I can have the reservation patch relatively easily. If
you come up with a better transparent method patch which can achieve
the same goal later, we can use it instead.

>
> >> that's really important for you, I think that it's better to design
> >> something like hugetlbfs vs core mm, that is, be separated from the
> >> normal swap subsystem as much as possible.
> >
> > I am giving hugetlbfs just to make the point using reservation, or
> > isolation of the resource to prevent mixing fragmentation existing in
> > core mm.
> > I am not suggesting copying the hugetlbfs implementation to the swap
> > system. Unlike hugetlbfs, the swap allocation is typically done from
> > the kernel, it is transparent from the application. I don't think
> > separate from the swap subsystem is a good way to go.
> >
> > This comes down to why you don't like the reservation. e.g. if we use
> > two swapfile, one swapfile is purely allocate for high order, would
> > that be better?
>
> Sorry, my words weren't accurate.  Personally, I just think that it's
> better to make reservation related code not too intrusive.

Yes. I will try to make it not too intrusive.

> And, before reservation, we need to consider something else firstly.
> Whether is it generally good to swap-in with swap-out order?  Should we

When we have the reservation patch (or other means to sustain
mixed-size swap allocation/free), we can test it out to get more data
to reason about it.
I consider the swap-in size policy an orthogonal issue.

> consider memory wastage too?  One static policy doesn't fit all, we may
> need either a dynamic policy, or make policy configurable.
> In general, I think that we need to do this step by step.

The core swap layer needs to be able to sustain mixed-size swap
allocation and free in the long run. Without that, the swap-in size
policy is meaningless.

Yes, that is the step-by-step approach: allowing long-run mixed-size
swap allocation is the first step.

> >> >> > Do you see another way to protect the high order cluster polluted by
> >> >> > lower order one?
> >> >>
> >> >> If we use high-order page allocation as reference, we need something
> >> >> like compaction to guarantee high-order allocation finally.  But we are
> >> >> too far from that.
> >> >
> >> > We should consider reservation for high-order swap entry allocation
> >> > similar to hugetlbfs for memory.
> >> > Swap compaction will be very complicated because it needs to scan the
> >> > PTE to migrate the swap entry. It might be easier to support folio
> >> > write out compound discontiguous swap entries. That is another way to
> >> > address the fragmentation issue. We are also too far from that as
> >> > right now.
> >>
> >> That's not easy to write out compound discontiguous swap entries too.
> >> For example, how to put folios in swap cache?
> >
> > I propose the idea in the recent LSF/MM discussion, the last few
> > slides are for the discontiguous swap and it has the discontiguous
> > entries in swap cache.
> > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
> >
> > Agree it is not an easy change. The cache cache would have to change
> > the assumption all offset are contiguous.
> > For swap, we kind of have some in memory data associated with per
> > offset already, so it might provide an opportunity to combine the
> > offset related data structure for swap together. Another alternative
> > might be using xarray without the multi entry property. , just treat
> > each offset like a single entry. I haven't dug deep into this
> > direction yet.
>
> Thanks!  I will study your idea.
>

I am happy to discuss if you have any questions.

> > We can have more discussion, maybe arrange an upstream alignment
> > meeting if there is interest.
>
> Sure.

Ideally, if we can resolve our differences over the mailing list, then
we don't need to have a separate meeting :-)

Chris
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Huang, Ying 1 year, 8 months ago
Chris Li <chrisl@kernel.org> writes:

> On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Chris Li <chrisl@kernel.org> writes:
>> >>
>> >
>> >> > In the page allocation side, we have the hugetlbfs which reserve some
>> >> > memory for high order pages.
>> >> > We should have similar things to allow reserve some high order swap
>> >> > entries without getting polluted by low order one.
>> >>
>> >> TBH, I don't like the idea of high order swap entries reservation.
>> > May I know more if you don't like the idea? I understand this can be
>> > controversial, because previously we like to take the THP as the best
>> > effort approach. If there is some reason we can't make THP, we use the
>> > order 0 as fall back.
>> >
>> > For discussion purpose, I want break it down to smaller steps:
>> >
>> > First, can we agree that the following usage case is reasonable:
>> > The usage case is that, as Barry has shown, zsmalloc compresses bigger
>> > size than 4K and can have both better compress ratio and CPU
>> > performance gain.
>> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
>> >
>> > So the goal is to make THP/mTHP have some reasonable success rate
>> > running in the mix size swap allocation, after either low order or
>> > high order swap requests can overflow the swap file size. The allocate
>> > can still recover from that, after some swap entries got free.
>> >
>> > Please let me know if you think the above usage case and goal are not
>> > reasonable for the kernel.
>>
>> I think that it's reasonable to improve the success rate of high-order
>
> Glad to hear that.
>
>> swap entries allocation.  I just think that it's hard to use the
>> reservation based method.  For example, how much should be reserved?
>
> Understand, it is harder to use than a fully transparent method, but
> still better than no solution at all. The alternative right now is we
> can't do it.
>
> Regarding how much we should reserve. Similarly, how much should you
> choose your swap file size? If you choose N, why not N*120% or N*80%?
> That did not stop us from having a swapfile, right?
>
>> Why system OOM when there's still swap space available?  And so forth.
>
> Keep in mind that the reservation is an option. If you prefer the old
> behavior, you don't have to use the reservation. That shouldn't be a
> reason to stop others who want to use it. We don't have an alternative
> solution for the long run mix size allocation yet. If there is, I like
> to hear it.

It's not enough to make it optional.  When you run into an issue, you
need to debug it.  And you may be debugging an issue on a system that
was configured by someone else.

>> So, I prefer the transparent methods.  Just like THP vs. hugetlbfs.
>
> Me too. I prefer transparent over reservation if it can achieve the
> same goal. Do we have a fully transparent method spec out? How to
> achieve fully transparent and also avoid fragmentation caused by mix
> order allocation/free?
>
> Keep in mind that we are still in the early stage of the mTHP swap
> development, I can have the reservation patch relatively easily. If
> you come up with a better transparent method patch which can achieve
> the same goal later, we can use it instead.

Because we are still in the early stage, I think that we should try to
improve the transparent solution first.  Personally, what I don't like
is not working on the transparent solution because we already have the
reservation solution.

>>
>> >> that's really important for you, I think that it's better to design
>> >> something like hugetlbfs vs core mm, that is, be separated from the
>> >> normal swap subsystem as much as possible.
>> >
>> > I am giving hugetlbfs just to make the point using reservation, or
>> > isolation of the resource to prevent mixing fragmentation existing in
>> > core mm.
>> > I am not suggesting copying the hugetlbfs implementation to the swap
>> > system. Unlike hugetlbfs, the swap allocation is typically done from
>> > the kernel, it is transparent from the application. I don't think
>> > separate from the swap subsystem is a good way to go.
>> >
>> > This comes down to why you don't like the reservation. e.g. if we use
>> > two swapfile, one swapfile is purely allocate for high order, would
>> > that be better?
>>
>> Sorry, my words weren't accurate.  Personally, I just think that it's
>> better to make reservation related code not too intrusive.
>
> Yes. I will try to make it not too intrusive.
>
>> And, before reservation, we need to consider something else firstly.
>> Whether is it generally good to swap-in with swap-out order?  Should we
>
> When we have the reservation patch (or other means to sustain mix size
> swap allocation/free), we can test it out to get more data to reason
> about it.
> I consider the swap in size policy an orthogonal issue.

No, I don't think so.  If you swap out at a higher order but swap in at
a lower order, you make the swap clusters fragmented.
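
A tiny standalone illustration of the effect: a run is swapped out at
order 4, then one 4K swap-in frees a single slot, leaving a hole that
only order 0 can reuse:

#include <stdio.h>

#define NR_SLOTS 16

int main(void)
{
	unsigned char map[NR_SLOTS];
	int i;

	for (i = 0; i < NR_SLOTS; i++)	/* order-4 swap-out fills a run */
		map[i] = 1;
	map[3] = 0;			/* one 4K swap-in frees one slot */

	for (i = 0; i < NR_SLOTS; i++)	/* X = used, . = free */
		putchar(map[i] ? 'X' : '.');
	putchar('\n');			/* prints XXX.XXXXXXXXXXXX */
	return 0;
}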

>> consider memory wastage too?  One static policy doesn't fit all, we may
>> need either a dynamic policy, or make policy configurable.
>> In general, I think that we need to do this step by step.
>
> The core swap layer needs to be able to sustain mix size swap
> allocation free in the long run. Without that the swap in size policy
> is meaningless.
>
> Yes, that is the step by step approach. Allowing long run mix size
> swap allocation as the first step.
>
>> >> >> > Do you see another way to protect the high order cluster polluted by
>> >> >> > lower order one?
>> >> >>
>> >> >> If we use high-order page allocation as reference, we need something
>> >> >> like compaction to guarantee high-order allocation finally.  But we are
>> >> >> too far from that.
>> >> >
>> >> > We should consider reservation for high-order swap entry allocation
>> >> > similar to hugetlbfs for memory.
>> >> > Swap compaction will be very complicated because it needs to scan the
>> >> > PTE to migrate the swap entry. It might be easier to support folio
>> >> > write out compound discontiguous swap entries. That is another way to
>> >> > address the fragmentation issue. We are also too far from that as
>> >> > right now.
>> >>
>> >> That's not easy to write out compound discontiguous swap entries too.
>> >> For example, how to put folios in swap cache?
>> >
>> > I propose the idea in the recent LSF/MM discussion, the last few
>> > slides are for the discontiguous swap and it has the discontiguous
>> > entries in swap cache.
>> > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
>> >
>> > Agree it is not an easy change. The cache cache would have to change
>> > the assumption all offset are contiguous.
>> > For swap, we kind of have some in memory data associated with per
>> > offset already, so it might provide an opportunity to combine the
>> > offset related data structure for swap together. Another alternative
>> > might be using xarray without the multi entry property. , just treat
>> > each offset like a single entry. I haven't dug deep into this
>> > direction yet.
>>
>> Thanks!  I will study your idea.
>>
>
> I am happy to discuss if you have any questions.
>
>> > We can have more discussion, maybe arrange an upstream alignment
>> > meeting if there is interest.
>>
>> Sure.
>
> Ideally, if we can resolve our differences over the mail list then we
> don't need to have a separate meeting :-)
>

--
Best Regards,
Huang, Ying
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Chris Li 1 year, 8 months ago
On Thu, Jun 13, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Chris Li <chrisl@kernel.org> writes:
> >> >>
> >> >
> >> >> > In the page allocation side, we have the hugetlbfs which reserve some
> >> >> > memory for high order pages.
> >> >> > We should have similar things to allow reserve some high order swap
> >> >> > entries without getting polluted by low order one.
> >> >>
> >> >> TBH, I don't like the idea of high order swap entries reservation.
> >> > May I know more if you don't like the idea? I understand this can be
> >> > controversial, because previously we like to take the THP as the best
> >> > effort approach. If there is some reason we can't make THP, we use the
> >> > order 0 as fall back.
> >> >
> >> > For discussion purpose, I want break it down to smaller steps:
> >> >
> >> > First, can we agree that the following usage case is reasonable:
> >> > The usage case is that, as Barry has shown, zsmalloc compresses bigger
> >> > size than 4K and can have both better compress ratio and CPU
> >> > performance gain.
> >> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> >> >
> >> > So the goal is to make THP/mTHP have some reasonable success rate
> >> > running in the mix size swap allocation, after either low order or
> >> > high order swap requests can overflow the swap file size. The allocate
> >> > can still recover from that, after some swap entries got free.
> >> >
> >> > Please let me know if you think the above usage case and goal are not
> >> > reasonable for the kernel.
> >>
> >> I think that it's reasonable to improve the success rate of high-order
> >
> > Glad to hear that.
> >
> >> swap entries allocation.  I just think that it's hard to use the
> >> reservation based method.  For example, how much should be reserved?
> >
> > Understand, it is harder to use than a fully transparent method, but
> > still better than no solution at all. The alternative right now is we
> > can't do it.
> >
> > Regarding how much we should reserve. Similarly, how much should you
> > choose your swap file size? If you choose N, why not N*120% or N*80%?
> > That did not stop us from having a swapfile, right?
> >
> >> Why system OOM when there's still swap space available?  And so forth.
> >
> > Keep in mind that the reservation is an option. If you prefer the old
> > behavior, you don't have to use the reservation. That shouldn't be a
> > reason to stop others who want to use it. We don't have an alternative
> > solution for the long run mix size allocation yet. If there is, I like
> > to hear it.
>
> It's not enough to make it optional.  When you run into issue, you need
> to debug it.  And you may debug an issue on a system that is configured
> by someone else.

That is true in general of all kernel development, regardless of
whether options are used. If there is a bug in my patch, I will need to
debug and fix it, or the patch might be reverted.

I don't see that as a reason to take the option path or not. The
option just means the user taking it will need to understand the
trade-off and accept the defined behavior of that option.

>
> >> So, I prefer the transparent methods.  Just like THP vs. hugetlbfs.
> >
> > Me too. I prefer transparent over reservation if it can achieve the
> > same goal. Do we have a fully transparent method spec out? How to
> > achieve fully transparent and also avoid fragmentation caused by mix
> > order allocation/free?
> >
> > Keep in mind that we are still in the early stage of the mTHP swap
> > development, I can have the reservation patch relatively easily. If
> > you come up with a better transparent method patch which can achieve
> > the same goal later, we can use it instead.
>
> Because we are still in the early stage, I think that we should try to
> improve transparent solution firstly.  Personally, what I don't like is
> that we don't work on the transparent solution because we have the
> reservation solution.

Do you have a roadmap or a design for the transparent solution that you
can share?
I am interested to know the short-term step (e.g. a month out) in
this transparent solution you have in mind, so we can compare the
different approaches. I can't reason much just from the name
"transparent solution" itself; I need more technical details.

Right now we have a clear use case we want to support: swapping mTHP
in/out with bigger zsmalloc buffers. We can start with the limited use
case first, then move to more general ones.

> >> >> that's really important for you, I think that it's better to design
> >> >> something like hugetlbfs vs core mm, that is, be separated from the
> >> >> normal swap subsystem as much as possible.
> >> >
> >> > I am giving hugetlbfs just to make the point using reservation, or
> >> > isolation of the resource to prevent mixing fragmentation existing in
> >> > core mm.
> >> > I am not suggesting copying the hugetlbfs implementation to the swap
> >> > system. Unlike hugetlbfs, the swap allocation is typically done from
> >> > the kernel, it is transparent from the application. I don't think
> >> > separate from the swap subsystem is a good way to go.
> >> >
> >> > This comes down to why you don't like the reservation. e.g. if we use
> >> > two swapfile, one swapfile is purely allocate for high order, would
> >> > that be better?
> >>
> >> Sorry, my words weren't accurate.  Personally, I just think that it's
> >> better to make reservation related code not too intrusive.
> >
> > Yes. I will try to make it not too intrusive.
> >
> >> And, before reservation, we need to consider something else firstly.
> >> Whether is it generally good to swap-in with swap-out order?  Should we
> >
> > When we have the reservation patch (or other means to sustain mix size
> > swap allocation/free), we can test it out to get more data to reason
> > about it.
> > I consider the swap in size policy an orthogonal issue.
>
> No.  I don't think so.  If you swap-out in higher order, but swap-in in
> lower order, you make the swap clusters fragmented.

Sounds like that is the reason to swap in at the same order as the
swap-out.
In any case, my original point still stands. We need the ability to
allocate high-order swap entries with a reasonable success rate
*before* we have the option to choose which size to swap in. If
allocating a high-order swap entry always fails, we will be forced to
use the low-order one; there is no option to choose from. We can't
evaluate "is it generally good to swap in at the swap-out order?" with
actual runs.

Chris
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Huang, Ying 1 year, 8 months ago
Chris Li <chrisl@kernel.org> writes:

> On Thu, Jun 13, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Chris Li <chrisl@kernel.org> writes:
>> >>
>> >> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Chris Li <chrisl@kernel.org> writes:
>> >> >>
>> >> >
>> >> >> > In the page allocation side, we have the hugetlbfs which reserve some
>> >> >> > memory for high order pages.
>> >> >> > We should have similar things to allow reserve some high order swap
>> >> >> > entries without getting polluted by low order one.
>> >> >>
>> >> >> TBH, I don't like the idea of high order swap entries reservation.
>> >> > May I know more if you don't like the idea? I understand this can be
>> >> > controversial, because previously we like to take the THP as the best
>> >> > effort approach. If there is some reason we can't make THP, we use the
>> >> > order 0 as fall back.
>> >> >
>> >> > For discussion purpose, I want break it down to smaller steps:
>> >> >
>> >> > First, can we agree that the following usage case is reasonable:
>> >> > The usage case is that, as Barry has shown, zsmalloc compresses bigger
>> >> > size than 4K and can have both better compress ratio and CPU
>> >> > performance gain.
>> >> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
>> >> >
>> >> > So the goal is to make THP/mTHP have some reasonable success rate
>> >> > running in the mix size swap allocation, after either low order or
>> >> > high order swap requests can overflow the swap file size. The allocate
>> >> > can still recover from that, after some swap entries got free.
>> >> >
>> >> > Please let me know if you think the above usage case and goal are not
>> >> > reasonable for the kernel.
>> >>
>> >> I think that it's reasonable to improve the success rate of high-order
>> >
>> > Glad to hear that.
>> >
>> >> swap entries allocation.  I just think that it's hard to use the
>> >> reservation based method.  For example, how much should be reserved?
>> >
>> > Understand, it is harder to use than a fully transparent method, but
>> > still better than no solution at all. The alternative right now is we
>> > can't do it.
>> >
>> > Regarding how much we should reserve. Similarly, how much should you
>> > choose your swap file size? If you choose N, why not N*120% or N*80%?
>> > That did not stop us from having a swapfile, right?
>> >
>> >> Why system OOM when there's still swap space available?  And so forth.
>> >
>> > Keep in mind that the reservation is an option. If you prefer the old
>> > behavior, you don't have to use the reservation. That shouldn't be a
>> > reason to stop others who want to use it. We don't have an alternative
>> > solution for the long run mix size allocation yet. If there is, I like
>> > to hear it.
>>
>> It's not enough to make it optional.  When you run into issue, you need
>> to debug it.  And you may debug an issue on a system that is configured
>> by someone else.
>
> That is in general true with all kernel development regardless of
> using options or not. If there is a bug in my patch, I will need to
> debug and fix it or the patch might be reverted.
>
> I don't see that as a reason to take the option path or not. The
> option just means the user taking this option will need to understand
> the trade off and accept the defined behavior of that option.

User configuration knobs are not forbidden in the Linux kernel.  But we
are more careful about them because they introduce an ABI which we need
to maintain forever.  And they are hard for users to use.  Optimizing
automatically is generally the better solution.  So, I suggest you
think more about the automatic solution before diving into a new
option.

>>
>> >> So, I prefer the transparent methods.  Just like THP vs. hugetlbfs.
>> >
>> > Me too. I prefer transparent over reservation if it can achieve the
>> > same goal. Do we have a fully transparent method spec out? How to
>> > achieve fully transparent and also avoid fragmentation caused by mix
>> > order allocation/free?
>> >
>> > Keep in mind that we are still in the early stage of the mTHP swap
>> > development, I can have the reservation patch relatively easily. If
>> > you come up with a better transparent method patch which can achieve
>> > the same goal later, we can use it instead.
>>
>> Because we are still in the early stage, I think that we should try to
>> improve transparent solution firstly.  Personally, what I don't like is
>> that we don't work on the transparent solution because we have the
>> reservation solution.
>
> Do you have a road map or the design for the transparent solution you can share?
> I am interested to know what is the short term step(e.g. a month)  in
> this transparent solution you have in mind, so we can compare the
> different approaches. I can't reason much just by the name
> "transparent solution" itself. Need more technical details.
>
> Right now we have a clear usage case we want to support, the swap
> in/out mTHP with bigger zsmalloc buffers. We can start with the
> limited usage case first then move to more general ones.

TBH, this is what I don't like.  It appears that you refuse to think
about the transparent (or automatic) solution.

I haven't thought about these thoroughly, but at least we may think
about:

- promoting a low-order non-full cluster when we find a free high-order
  run of swap entries in it.

- stealing a low-order non-full cluster with a low usage count for
  high-order allocation (see the sketch after this list).

- freeing more swap entries when swap devices become fragmented.
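
Sketching the stealing idea as a standalone toy: scan one cluster's
slice of swap_map[] for a naturally aligned free run big enough for the
requested order.  The names are illustrative, not from this series:

#include <stdio.h>

#define SWAPFILE_CLUSTER 512

static unsigned char swap_map[SWAPFILE_CLUSTER];	/* 0 == free slot */

/* Find an aligned run of 2^order free slots, or -1 if none. */
static int find_free_run(int order)
{
	int nr = 1 << order;
	int base;

	for (base = 0; base + nr <= SWAPFILE_CLUSTER; base += nr) {
		int i;

		for (i = 0; i < nr; i++)
			if (swap_map[base + i])
				break;
		if (i == nr)
			return base;	/* aligned free run found */
	}
	return -1;
}

int main(void)
{
	swap_map[0] = 1;	/* one used slot pushes the run to base 16 */
	printf("order-4 run at offset %d\n", find_free_run(4));
	return 0;
}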

>> >> >> that's really important for you, I think that it's better to design
>> >> >> something like hugetlbfs vs core mm, that is, be separated from the
>> >> >> normal swap subsystem as much as possible.
>> >> >
>> >> > I am giving hugetlbfs just to make the point using reservation, or
>> >> > isolation of the resource to prevent mixing fragmentation existing in
>> >> > core mm.
>> >> > I am not suggesting copying the hugetlbfs implementation to the swap
>> >> > system. Unlike hugetlbfs, the swap allocation is typically done from
>> >> > the kernel, it is transparent from the application. I don't think
>> >> > separate from the swap subsystem is a good way to go.
>> >> >
>> >> > This comes down to why you don't like the reservation. e.g. if we use
>> >> > two swapfile, one swapfile is purely allocate for high order, would
>> >> > that be better?
>> >>
>> >> Sorry, my words weren't accurate.  Personally, I just think that it's
>> >> better to make reservation related code not too intrusive.
>> >
>> > Yes. I will try to make it not too intrusive.
>> >
>> >> And, before reservation, we need to consider something else firstly.
>> >> Whether is it generally good to swap-in with swap-out order?  Should we
>> >
>> > When we have the reservation patch (or other means to sustain mix size
>> > swap allocation/free), we can test it out to get more data to reason
>> > about it.
>> > I consider the swap in size policy an orthogonal issue.
>>
>> No.  I don't think so.  If you swap-out in higher order, but swap-in in
>> lower order, you make the swap clusters fragmented.
>
> Sounds like that is the reason to apply swap-in the same order of the swap out.
> In any case, my original point still stands. We need to have the
> ability to allocate high order swap entries with reasonable success
> rate *before* we have the option to choose which size to swap in. If
> allocating a high order swap always fails, we will be forced to use
> the low order one, there is no option to choose from. We can't evalute
> "is it generally good to swap-in with swap-out order?" by actual runs.

I think we don't need to fight over that.  Just prove the value of your
patchset with reasonable use cases and normal workloads.  Data will
persuade people.

--
Best Regards,
Huang, Ying
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Chris Li 1 year, 8 months ago
On Mon, Jun 17, 2024 at 11:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > That is in general true with all kernel development regardless of
> > using options or not. If there is a bug in my patch, I will need to
> > debug and fix it or the patch might be reverted.
> >
> > I don't see that as a reason to take the option path or not. The
> > option just means the user taking this option will need to understand
> > the trade off and accept the defined behavior of that option.
>
> User configuration knobs are not forbidden for the Linux kernel.  But we are
> more careful about them because they will introduce an ABI which we need to
> maintain forever.  And they are hard for users to use.  Optimizing
> automatically is generally the better solution.  So, I suggest you
> think more about the automatic solution before diving into a new
> option.

I did, see my reply. Right now there are just no other options.

>
> >>
> >> >> So, I prefer the transparent methods.  Just like THP vs. hugetlbfs.
> >> >
> >> > Me too. I prefer transparent over reservation if it can achieve the
> >> > same goal. Do we have a fully transparent method specced out? How do we
> >> > achieve full transparency and also avoid the fragmentation caused by
> >> > mixed order allocation/free?
> >> >
> >> > Keep in mind that we are still in the early stage of the mTHP swap
> >> > development; I can have the reservation patch relatively easily. If
> >> > you come up with a better transparent method patch which can achieve
> >> > the same goal later, we can use it instead.
> >>
> >> Because we are still in the early stage, I think that we should try to
> >> improve the transparent solution first.  Personally, what I don't like is
> >> that we don't work on the transparent solution because we have the
> >> reservation solution.
> >
> > Do you have a road map or design for the transparent solution you can share?
> > I am interested to know what the short term step (e.g. a month) is in
> > this transparent solution you have in mind, so we can compare the
> > different approaches. I can't reason much just from the name
> > "transparent solution" itself. I need more technical details.
> >
> > Right now we have a clear usage case we want to support, the swap
> > in/out mTHP with bigger zsmalloc buffers. We can start with the
> > limited usage case first then move to more general ones.
>
> TBH, this is what I don't like.  It appears that you refuse to think
> about the transparent (or automatic) solution.

Actually, that is not true; you are making a wrong assumption about what I
have considered. I want to find out what you have in mind so I can compare
the near-term solutions.

In my recent LSF/MM slides I already listed 3 options to address this
fragmentation problem.
From easy to hard:
1) Assign the cluster an order on allocation and remember the cluster
order (short term).
That is this patch series.
2) Buddy allocation on the swap entries (longer term).
3) Folios write out compound discontiguous swap entries (ultimate).

I also considered 4), which I did not put into the slides because it
is less effective than 3):
4) Migrating the swap entries, which requires scanning page table entries.
I briefly mentioned it during the session.

3) might qualify as your transparent solution. It is just much
harder to implement.
Even when we have 3), having some form of 1) can be beneficial as
well (lower IO count, no indirection layer for the swap offsets).
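
To make 1) concrete, here is a compressed sketch of the data structures it
implies (struct and field names are illustrative, not the actual ones in
the patch):

/*
 * Sketch of option 1): remember which order a cluster serves, and keep
 * per-order lists of non-full clusters.  Simplified for illustration.
 */
struct swap_cluster_info {
	struct list_head list;		/* on a free / nonfull / full list */
	unsigned int count;		/* slots currently allocated */
	unsigned int order;		/* allocation order this cluster serves */
};

struct swap_device_sketch {
	struct list_head free_clusters;			/* fully empty clusters */
	struct list_head nonfull_clusters[SWAP_NR_ORDERS]; /* partly used, per order */
};

/* Prefer a non-full cluster of the same order, then an empty cluster. */
static struct swap_cluster_info *
pick_cluster(struct swap_device_sketch *si, int order)
{
	if (!list_empty(&si->nonfull_clusters[order]))
		return list_first_entry(&si->nonfull_clusters[order],
					struct swap_cluster_info, list);
	if (!list_empty(&si->free_clusters))
		return list_first_entry(&si->free_clusters,
					struct swap_cluster_info, list);
	return NULL;	/* caller falls back to a smaller order */
}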

>
> I haven't thought about them thoroughly, but at least we may think about
>
> - promoting a low-order non-full cluster when we find free high-order
>   swap entries in it.
>
> - stealing a low order non-full cluster with low usage count for
>   high-order allocation.

Now we are talking.
These two above fall well within 2), the buddy allocator.
But the buddy allocator will not be able to address all fragmentation
issues, because the allocator does not control the life cycle of
the swap entries.
It will not help Barry's zsmalloc use case much because Android
likes to keep the swapfile full. I can already see that.

> - freeing more swap entries when swap devices become fragmented.

That requires scanning page tables to free the swap entries, which is basically 4).

It is all about investment and return. 1) is relatively easy to
implement and delivers a good improvement for the effort.

Chris

Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Huang, Ying 1 year, 7 months ago
Chris Li <chrisl@kernel.org> writes:

> On Mon, Jun 17, 2024 at 11:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > That is in general true with all kernel development regardless of
>> > using options or not. If there is a bug in my patch, I will need to
>> > debug and fix it or the patch might be reverted.
>> >
>> > I don't see that as a reason to take the option path or not. The
>> > option just means the user taking this option will need to understand
>> > the trade off and accept the defined behavior of that option.
>>
>> User configuration knobs are not forbidden for the Linux kernel.  But we are
>> more careful about them because they will introduce an ABI which we need to
>> maintain forever.  And they are hard for users to use.  Optimizing
>> automatically is generally the better solution.  So, I suggest you
>> think more about the automatic solution before diving into a new
>> option.
>
> I did, see my reply. Right now there are just no other options.
>
>>
>> >>
>> >> >> So, I prefer the transparent methods.  Just like THP vs. hugetlbfs.
>> >> >
>> >> > Me too. I prefer transparent over reservation if it can achieve the
>> >> > same goal. Do we have a fully transparent method specced out? How do we
>> >> > achieve full transparency and also avoid the fragmentation caused by
>> >> > mixed order allocation/free?
>> >> >
>> >> > Keep in mind that we are still in the early stage of the mTHP swap
>> >> > development; I can have the reservation patch relatively easily. If
>> >> > you come up with a better transparent method patch which can achieve
>> >> > the same goal later, we can use it instead.
>> >>
>> >> Because we are still in the early stage, I think that we should try to
>> >> improve the transparent solution first.  Personally, what I don't like is
>> >> that we don't work on the transparent solution because we have the
>> >> reservation solution.
>> >
>> > Do you have a road map or design for the transparent solution you can share?
>> > I am interested to know what the short term step (e.g. a month) is in
>> > this transparent solution you have in mind, so we can compare the
>> > different approaches. I can't reason much just from the name
>> > "transparent solution" itself. I need more technical details.
>> >
>> > Right now we have a clear usage case we want to support, the swap
>> > in/out mTHP with bigger zsmalloc buffers. We can start with the
>> > limited usage case first then move to more general ones.
>>
>> TBH, this is what I don't like.  It appears that you refuse to think
>> about the transparent (or automatic) solution.
>
> Actually, that is not true; you are making a wrong assumption about what I
> have considered. I want to find out what you have in mind so I can compare
> the near-term solutions.

Sorry about my wrong assumption.

> In my recent LSF/MM slides I already listed 3 options to address this
> fragmentation problem.
> From easy to hard:
> 1) Assign the cluster an order on allocation and remember the cluster
> order (short term).
> That is this patch series.
> 2) Buddy allocation on the swap entries (longer term).
> 3) Folios write out compound discontiguous swap entries (ultimate).
>
> I also considered 4), which I did not put into the slides because it
> is less effective than 3):
> 4) Migrating the swap entries, which requires scanning page table entries.
> I briefly mentioned it during the session.

Or you need something like an rmap, which isn't easy.

> 3) might qualify as your transparent solution. It is just much
> harder to implement.
> Even when we have 3), having some form of 1) can be beneficial as
> well (lower IO count, no indirection layer for the swap offsets).
>
>>
>> I haven't thought about them thoroughly, but at least we may think about
>>
>> - promoting a low-order non-full cluster when we find free high-order
>>   swap entries in it.
>>
>> - stealing a low order non-full cluster with low usage count for
>>   high-order allocation.
>
> Now we are talking.
> These two above fall well within 2), the buddy allocator.
> But the buddy allocator will not be able to address all fragmentation
> issues, because the allocator does not control the life cycle of
> the swap entries.
> It will not help Barry's zsmalloc use case much because Android
> likes to keep the swapfile full. I can already see that.

I think that a buddy-like allocator (not exactly the buddy algorithm) will
help with fragmentation.  And it will help more users because it works
automatically.

I don't think they are too hard to implement.  We can try to find
some simple solution first.  So, I think that we don't need to push
them to the long term.  At least, they can be done before introducing a
high-order cluster reservation ABI.  Then, we can evaluate the benefit
and overhead of the reservation ABI.

>> - freeing more swap entries when swap devices become fragmented.
>
> That requires scanning page tables to free the swap entries, which is basically 4).

No.  You can just scan the page table of the current process in
do_swap_page() and try to swap in and free more swap entries.  That
doesn't work well for shared pages.  However, I think that it can
help quite a few workloads.
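
For concreteness, a sketch of that scan (purely hypothetical, not the real
do_swap_page() logic; the helper name and window handling are invented for
illustration):

/*
 * Hypothetical sketch: around a faulting swap PTE, count neighbouring
 * PTEs of the current process that point into the same swap cluster.
 * Those would be the candidates to swap in and free together.  The
 * caller must hold the PTE lock and ensure [pte - window, pte + window]
 * stays inside one mapped page table.
 */
static int count_same_cluster_neighbors(pte_t *pte, swp_entry_t entry,
					int window)
{
	unsigned long target = swp_offset(entry) / SWAPFILE_CLUSTER;
	int i, nr = 0;

	for (i = -window; i <= window; i++) {
		pte_t p = ptep_get(pte + i);
		swp_entry_t e;

		if (!is_swap_pte(p))
			continue;	/* present or none */
		e = pte_to_swp_entry(p);
		if (non_swap_entry(e))
			continue;	/* migration, hwpoison, etc. */
		if (swp_offset(e) / SWAPFILE_CLUSTER == target)
			nr++;	/* shares the faulting entry's cluster */
	}
	return nr;
}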

> It is all about investment and return. 1) is relatively easy to
> implement and delivers a good improvement for the effort.

[snip]

--
Best Regards,
Huang, Ying
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Kairui Song 1 year, 8 months ago
On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > Hi Ying,
> >> >
> >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Chris Li <chrisl@kernel.org> writes:
> >> >>
> >> >> > I am spinning a new version for this series to address two issues
> >> >> > found in this series:
> >> >> >
> >> >> > 1) Oppo discovered a bug in the following line:
> >> >> > +               ci = si->cluster_info + tmp;
> >> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> >> > That is a serious bug but trivial to fix.
> >> >> >
> >> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> >> >> > the cluster->order.
> >> >>
> >> >> IIUC, now, we scan swap_map[] only if
> >> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> >> >> That is, if you don't run low on free swap space, you will not do that.
> >> >
> >> > You can still have free swap space in order 0 clusters while order 4 runs out of
> >> > free_cluster
> >> > or nonfull_clusters[order]. For Android that is a common case.
> >>
> >> When we fail to allocate order 4, we will fall back to order 0.  We still
> >> don't need to scan swap_map[].  But after looking at your reply below, I
> >> realized that the swap space is almost full at most times in your cases.
> >> Then, it's possible that we run into scanning swap_map[].
> >> list_empty(&si->free_clusters) &&
> >> list_empty(&si->nonfull_clusters[order]) will become true if we put too
> >> many clusters in si->percpu_cluster.  So, if we want to avoid scanning
> >> swap_map[], we can stop adding clusters to si->percpu_cluster when swap
> >> space runs low.  And maybe take clusters out of si->percpu_cluster
> >> sometimes.
> >
> > One idea came after reading your reply. If we run out of the
> > nonfull_cluster[order], we should be able to use other CPUs'
> > si->percpu_cluster[] as well. That is a very small win for Android,
>
> This will be useful in general.  The number of CPUs may be large, and
> multiple orders may be used.
>
> > because Android does not have too many CPUs. We are talking about a
> > handful of clusters, which might not justify the code complexity. It
> > does not change the behavior that order 0 can pollute higher orders.
>
> I have a feeling that you don't really know why swap_map[] is scanned.
> I suggest you do more testing and tracing to find out the reason.  I
> suspect that there are some non-full cluster collection issues.
>
> >> Another issue is that nonfull_cluster[order1] cannot be used for
> >> nonfull_cluster[order2].  By definition, we should not fail order 0
> >> allocation; we need to steal nonfull_cluster[order>0] for order 0
> >> allocation.  This can avoid scanning swap_map[] too.  This may not be
> >> perfect, but it is the simplest first-step implementation.  You can
> >> optimize based on it further.
> >
> > Yes, that is listed as the limitation of this cluster order approach.
> > Initially we need to support one order well first. We might choose
> > what order that is, a 16K or 64K folio. 4K pages are too small, 2M pages
> > are too big. The sweet spot might be somewhere in between.  If we can
> > support one order well, we can demonstrate the value of the mTHP. We
> > can worry about other mix orders later.
> >
> > Do you have any suggestions for how to prevent the order 0 polluting
> > the higher order cluster? If we allow that to happen, then it defeats
> > the goal of being able to allocate higher order swap entries. The
> > tricky question is we don't know how much swap space we should reserve
> > for each order. We can always break higher order clusters to lower
> > order, but can't do the reserves. The current patch series lets the
> > actual usage determine the percentage of the cluster for each order.
> > However, that seems not enough for the test case Barry has. When the
> > apps get OOM killed, that is where a large swing of order 0 swap will
> > show up, with not enough higher order usage for that brief moment. The
> > order 0 swap entry will pollute the high order cluster. We are
> > currently debating a "knob" to be able to reserve a certain % of swap
> > space for a certain order. Those reservations will be guaranteed and
> > order 0 swap entry can't pollute them even when it runs out of swap
> > space. That can make the mTHP at least usable for the Android case.
>
> IMO, the bottom line is that order-0 allocation is the first-class
> citizen; we must keep it optimized.  And OOM with free swap space isn't
> acceptable.  Please consider the policy we used for page allocation.
>
> > Do you see another way to protect the high order clusters from being
> > polluted by the lower order ones?
>
> If we use high-order page allocation as a reference, we need something
> like compaction to eventually guarantee high-order allocation.  But we are
> too far from that.
>
> For a specific configuration, I believe that we can get a reasonable
> high-order swap entry allocation success rate for specific use cases.
> For example, if we only do a limited maximum number of order-0 swap
> entry allocations, can we keep high-order clusters?

Doesn't limiting order-0 allocation break the bottom line that order-0
allocation is a first-class citizen and should not fail if there is
space?

Just my two cents...

I had a try locally based on Chris's work, allowing order 0 to use
nonfull_clusters as Ying has suggested: start with the low order and
increase the order until nonfull_clusters[order] is not empty. That
way the higher orders are better protected, because unless we run out
of free_clusters and nonfull_clusters, the direct scan won't happen.

More concretely, I applied the following changes, which didn't change
the code much:
- In scan_swap_map_try_ssd_cluster, check nonfull_clusters first, then
free_clusters, then discard_clusters.
- If it's order 0, also check nonfull_clusters[i] for (int i = 0;
i < SWAP_NR_ORDERS; ++i) before scan_swap_map_try_ssd_cluster
returns false (see the sketch below).
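
The fallback itself is tiny; a sketch (the helper name is made up, the
nonfull_clusters lists are the ones added by this series):

/*
 * Sketch of the order-0 fallback described above: take a cluster from
 * the lowest-order non-empty nonfull list before resorting to the
 * swap_map[] scan, so the high-order lists are consumed last.
 */
static struct swap_cluster_info *
fallback_nonfull_cluster(struct swap_info_struct *si, int order)
{
	int i;

	if (order != 0)
		return NULL;	/* only order 0 may take any order's cluster */

	for (i = 0; i < SWAP_NR_ORDERS; i++) {
		if (!list_empty(&si->nonfull_clusters[i]))
			return list_first_entry(&si->nonfull_clusters[i],
						struct swap_cluster_info,
						list);
	}
	return NULL;	/* nothing left; caller falls back to scanning */
}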

A quick test still using the memtier test, but decreased the swap
device size from 10G to 8g for higher pressure.

Before:
hugepages-32kB/stats/swpout:34013
hugepages-32kB/stats/swpout_fallback:266
hugepages-512kB/stats/swpout:0
hugepages-512kB/stats/swpout_fallback:77
hugepages-2048kB/stats/swpout:0
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35088
hugepages-64kB/stats/swpout_fallback:66
hugepages-16kB/stats/swpout:31848
hugepages-16kB/stats/swpout_fallback:402
hugepages-256kB/stats/swpout:390
hugepages-256kB/stats/swpout_fallback:7244
hugepages-128kB/stats/swpout:28573
hugepages-128kB/stats/swpout_fallback:474

After:
hugepages-32kB/stats/swpout:31448
hugepages-32kB/stats/swpout_fallback:3354
hugepages-512kB/stats/swpout:30
hugepages-512kB/stats/swpout_fallback:33
hugepages-2048kB/stats/swpout:2
hugepages-2048kB/stats/swpout_fallback:0
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:31255
hugepages-64kB/stats/swpout_fallback:3112
hugepages-16kB/stats/swpout:29931
hugepages-16kB/stats/swpout_fallback:3397
hugepages-256kB/stats/swpout:5223
hugepages-256kB/stats/swpout_fallback:2351
hugepages-128kB/stats/swpout:25600
hugepages-128kB/stats/swpout_fallback:2194

The high order (256k) swapout rate is significantly higher, and 512k is now
possible, which indicates the high orders are better protected; the lower
orders are sacrificed, but it seems worth it.
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Chris Li 1 year, 8 months ago
On Fri, May 31, 2024 at 5:40 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@intel.com> wrote:

> >
> > For a specific configuration, I believe that we can get a reasonable
> > high-order swap entry allocation success rate for specific use cases.
> > For example, if we only do a limited maximum number of order-0 swap
> > entry allocations, can we keep high-order clusters?
>
> Doesn't limiting order-0 allocation break the bottom line that order-0
> allocation is a first-class citizen and should not fail if there is
> space?

We need to have both high order and low order swap allocation working, and
be able to recover from the swapfile-full case.

>
> Just my two cents...
>
> I had a try locally based on Chris's work, allowing order 0 to use
> nonfull_clusters as Ying has suggested: start with the low order and
> increase the order until nonfull_clusters[order] is not empty. That
> way the higher orders are better protected, because unless we run out
> of free_clusters and nonfull_clusters, the direct scan won't happen.

That does not help the Android test case Barry is running, because
Android tries to keep the swapfile full.
It will hit the case where both the empty and nonfull lists are all used up.
When it performs the low memory kill, there will be a big change in
the ratio of low vs high order swap.
Allocating high order swap entries should be able to recover from that.

>
> More concretely, I applied the following changes, which didn't change
> the code much:
> - In scan_swap_map_try_ssd_cluster, check nonfull_clusters first, then
> free_clusters, then discard_clusters.

I considered ranking the nonfull list higher than the empty list. The current
allocation tries to make the HAS_CACHE-only swap entry stay on the
disk for a longer time before recycling it. If the folio is still in
the swap cache and not dirty, it can skip the write out and directly reuse
the swap slot during reclaim. I am not sure this code path is
important now; it seems that when the swap slot is freed, it will remove the
HAS_CACHE as well. BTW, I noticed that the cluster discard doesn't
check whether the swap cache has a folio pointing to it. After discarding, it
just sets the swap_map to 0. I wonder if a folio in the swap cache for a
discarded slot would hit the skip-writeback logic. If that is
triggerable, it would be a corruption bug.
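
If someone wanted to probe whether it is triggerable, a debug check along
these lines could work (hypothetical helper; assumes the swap cache address
spaces are indexed by swp_offset(), as in kernels of this era):

/*
 * Hypothetical debug check: after a cluster is discarded, warn if any
 * folio in the swap cache still points at a slot in it.
 */
static void check_discarded_cluster(struct swap_info_struct *si,
				    unsigned long cluster_idx)
{
	unsigned long base = cluster_idx * SWAPFILE_CLUSTER;
	unsigned long i;

	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
		swp_entry_t entry = swp_entry(si->type, base + i);
		struct folio *folio =
			filemap_get_folio(swap_address_space(entry),
					  swp_offset(entry));

		if (!IS_ERR(folio)) {
			WARN_ONCE(1, "swap cache folio in discarded slot");
			folio_put(folio);
		}
	}
}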

The current SSD allocation code also has some comments saying old SSDs can
benefit from not writing to the same block too many times, to help
wear leveling. I don't think that is a big deal now; even cheap SD
cards have wear leveling nowadays.

> - If it's order 0, also check nonfull_clusters[i] for (int i = 0;
> i < SWAP_NR_ORDERS; ++i) before scan_swap_map_try_ssd_cluster
> returns false.

Ideally we would have some option to reserve high order swap space so
order 0 can't pollute the high order clusters.
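
A minimal sketch of such an option (both the tunable and the free-cluster
counter below are hypothetical fields, not part of this series):

/*
 * Hypothetical reservation: keep a floor of free clusters that only
 * high-order allocations may consume.  An order-0 request must leave
 * the reserved free clusters untouched.
 */
static bool may_take_free_cluster(struct swap_info_struct *si, int order)
{
	/* Hypothetical per-device tunable, e.g. set via sysfs. */
	unsigned int reserve = READ_ONCE(si->high_order_reserve);

	if (order > 0)
		return si->free_cluster_cnt > 0;	/* hypothetical counter */

	return si->free_cluster_cnt > reserve;
}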


Chris

Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Huang, Ying 1 year, 8 months ago
Kairui Song <ryncsn@gmail.com> writes:

> On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Chris Li <chrisl@kernel.org> writes:
>> >>
>> >> > Hi Ying,
>> >> >
>> >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Chris Li <chrisl@kernel.org> writes:
>> >> >>
>> >> >> > I am spinning a new version for this series to address two issues
>> >> >> > found in this series:
>> >> >> >
>> >> >> > 1) Oppo discovered a bug in the following line:
>> >> >> > +               ci = si->cluster_info + tmp;
>> >> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
>> >> >> > That is a serious bug but trivial to fix.
>> >> >> >
>> >> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
>> >> >> > the cluster->order.
>> >> >>
>> >> >> IIUC, now, we scan swap_map[] only if
>> >> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
>> >> >> That is, if you don't run low on free swap space, you will not do that.
>> >> >
>> >> > You can still have free swap space in order 0 clusters while order 4 runs out of
>> >> > free_cluster
>> >> > or nonfull_clusters[order]. For Android that is a common case.
>> >>
>> >> When we fail to allocate order 4, we will fall back to order 0.  We still
>> >> don't need to scan swap_map[].  But after looking at your reply below, I
>> >> realized that the swap space is almost full at most times in your cases.
>> >> Then, it's possible that we run into scanning swap_map[].
>> >> list_empty(&si->free_clusters) &&
>> >> list_empty(&si->nonfull_clusters[order]) will become true if we put too
>> >> many clusters in si->percpu_cluster.  So, if we want to avoid scanning
>> >> swap_map[], we can stop adding clusters to si->percpu_cluster when swap
>> >> space runs low.  And maybe take clusters out of si->percpu_cluster
>> >> sometimes.
>> >
>> > One idea came after reading your reply. If we run out of the
>> > nonfull_cluster[order], we should be able to use other CPUs'
>> > si->percpu_cluster[] as well. That is a very small win for Android,
>>
>> This will be useful in general.  The number of CPUs may be large, and
>> multiple orders may be used.
>>
>> > because Android does not have too many CPUs. We are talking about a
>> > handful of clusters, which might not justify the code complexity. It
>> > does not change the behavior that order 0 can pollute higher orders.
>>
>> I have a feeling that you don't really know why swap_map[] is scanned.
>> I suggest you do more testing and tracing to find out the reason.  I
>> suspect that there are some non-full cluster collection issues.
>>
>> >> Another issue is that nonfull_cluster[order1] cannot be used for
>> >> nonfull_cluster[order2].  By definition, we should not fail order 0
>> >> allocation; we need to steal nonfull_cluster[order>0] for order 0
>> >> allocation.  This can avoid scanning swap_map[] too.  This may not be
>> >> perfect, but it is the simplest first-step implementation.  You can
>> >> optimize based on it further.
>> >
>> > Yes, that is listed as the limitation of this cluster order approach.
>> > Initially we need to support one order well first. We might choose
>> > what order that is, a 16K or 64K folio. 4K pages are too small, 2M pages
>> > are too big. The sweet spot might be somewhere in between.  If we can
>> > support one order well, we can demonstrate the value of the mTHP. We
>> > can worry about other mix orders later.
>> >
>> > Do you have any suggestions for how to prevent the order 0 polluting
>> > the higher order cluster? If we allow that to happen, then it defeats
>> > the goal of being able to allocate higher order swap entries. The
>> > tricky question is we don't know how much swap space we should reserve
>> > for each order. We can always break higher order clusters to lower
>> > order, but can't do the reserves. The current patch series lets the
>> > actual usage determine the percentage of the cluster for each order.
>> > However, that seems not enough for the test case Barry has. When the
>> > apps get OOM killed, that is where a large swing of order 0 swap will
>> > show up, with not enough higher order usage for that brief moment. The
>> > order 0 swap entry will pollute the high order cluster. We are
>> > currently debating a "knob" to be able to reserve a certain % of swap
>> > space for a certain order. Those reservations will be guaranteed and
>> > order 0 swap entry can't pollute them even when it runs out of swap
>> > space. That can make the mTHP at least usable for the Android case.
>>
>> IMO, the bottom line is that order-0 allocation is the first-class
>> citizen; we must keep it optimized.  And OOM with free swap space isn't
>> acceptable.  Please consider the policy we used for page allocation.
>>
>> > Do you see another way to protect the high order clusters from being
>> > polluted by the lower order ones?
>>
>> If we use high-order page allocation as a reference, we need something
>> like compaction to eventually guarantee high-order allocation.  But we are
>> too far from that.
>>
>> For a specific configuration, I believe that we can get a reasonable
>> high-order swap entry allocation success rate for specific use cases.
>> For example, if we only do a limited maximum number of order-0 swap
>> entry allocations, can we keep high-order clusters?
>
> Doesn't limiting order-0 allocation break the bottom line that order-0
> allocation is a first-class citizen and should not fail if there is
> space?

Sorry for the confusing words.  I mean limiting the maximum number of order-0
swap entry allocations in the workloads, instead of limiting that in the kernel.

> Just my two cents...
>
> I had a try locally based on Chris's work, allowing order 0 to use
> nonfull_clusters as Ying has suggested: start with the low order and
> increase the order until nonfull_clusters[order] is not empty. That
> way the higher orders are better protected, because unless we run out
> of free_clusters and nonfull_clusters, the direct scan won't happen.
>
> More concretely, I applied the following changes, which didn't change
> the code much:
> - In scan_swap_map_try_ssd_cluster, check nonfull_clusters first, then
> free_clusters, then discard_clusters.
> - If it's order 0, also check nonfull_clusters[i] for (int i = 0;
> i < SWAP_NR_ORDERS; ++i) before scan_swap_map_try_ssd_cluster
> returns false.
>
> A quick test still using the memtier test, but decreased the swap
> device size from 10G to 8g for higher pressure.
>
> Before:
> hugepages-32kB/stats/swpout:34013
> hugepages-32kB/stats/swpout_fallback:266
> hugepages-512kB/stats/swpout:0
> hugepages-512kB/stats/swpout_fallback:77
> hugepages-2048kB/stats/swpout:0
> hugepages-2048kB/stats/swpout_fallback:1
> hugepages-1024kB/stats/swpout:0
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:35088
> hugepages-64kB/stats/swpout_fallback:66
> hugepages-16kB/stats/swpout:31848
> hugepages-16kB/stats/swpout_fallback:402
> hugepages-256kB/stats/swpout:390
> hugepages-256kB/stats/swpout_fallback:7244
> hugepages-128kB/stats/swpout:28573
> hugepages-128kB/stats/swpout_fallback:474
>
> After:
> hugepages-32kB/stats/swpout:31448
> hugepages-32kB/stats/swpout_fallback:3354
> hugepages-512kB/stats/swpout:30
> hugepages-512kB/stats/swpout_fallback:33
> hugepages-2048kB/stats/swpout:2
> hugepages-2048kB/stats/swpout_fallback:0
> hugepages-1024kB/stats/swpout:0
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:31255
> hugepages-64kB/stats/swpout_fallback:3112
> hugepages-16kB/stats/swpout:29931
> hugepages-16kB/stats/swpout_fallback:3397
> hugepages-256kB/stats/swpout:5223
> hugepages-256kB/stats/swpout_fallback:2351
> hugepages-128kB/stats/swpout:25600
> hugepages-128kB/stats/swpout_fallback:2194
>
> The high order (256k) swapout rate is significantly higher, and 512k is now
> possible, which indicates the high orders are better protected; the lower
> orders are sacrificed, but it seems worth it.

Yes.  I think that this reflects another aspect of the problem.  In some
situations, it's better to steal one high-order cluster and use it for
order-0 allocation instead of scattering order-0 allocation in random
high-order clusters.

--
Best Regards,
Huang, Ying
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Chris Li 1 year, 8 months ago
On Tue, Jun 4, 2024 at 12:29 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@intel.com> wrote:
> > Doesn't limiting order-0 allocation break the bottom line that order-0
> > allocation is a first-class citizen and should not fail if there is
> > space?
>
> Sorry for the confusing words.  I mean limiting the maximum number of order-0
> swap entry allocations in the workloads, instead of limiting that in the kernel.

What interface does it use to limit the order 0 swap entries?

I was thinking the kernel would enforce the high order swap space
reservation just like hugetlbfs does on huge pages.
We will need to introduce some interface to specify the reservation.

>
> > Just my two cents...
> >
> > I had a try locally based on Chris's work, allowing order 0 to use
> > nonfull_clusters as Ying has suggested: start with the low order and
> > increase the order until nonfull_clusters[order] is not empty. That
> > way the higher orders are better protected, because unless we run out
> > of free_clusters and nonfull_clusters, the direct scan won't happen.
> >
> > More concretely, I applied the following changes, which didn't change
> > the code much:
> > - In scan_swap_map_try_ssd_cluster, check nonfull_clusters first, then
> > free_clusters, then discard_clusters.
> > - If it's order 0, also check nonfull_clusters[i] for (int i = 0;
> > i < SWAP_NR_ORDERS; ++i) before scan_swap_map_try_ssd_cluster
> > returns false.
> >
> > A quick test still using the memtier test, but decreased the swap
> > device size from 10G to 8g for higher pressure.
> >
> > Before:
> > hugepages-32kB/stats/swpout:34013
> > hugepages-32kB/stats/swpout_fallback:266
> > hugepages-512kB/stats/swpout:0
> > hugepages-512kB/stats/swpout_fallback:77
> > hugepages-2048kB/stats/swpout:0
> > hugepages-2048kB/stats/swpout_fallback:1
> > hugepages-1024kB/stats/swpout:0
> > hugepages-1024kB/stats/swpout_fallback:0
> > hugepages-64kB/stats/swpout:35088
> > hugepages-64kB/stats/swpout_fallback:66
> > hugepages-16kB/stats/swpout:31848
> > hugepages-16kB/stats/swpout_fallback:402
> > hugepages-256kB/stats/swpout:390
> > hugepages-256kB/stats/swpout_fallback:7244
> > hugepages-128kB/stats/swpout:28573
> > hugepages-128kB/stats/swpout_fallback:474
> >
> > After:
> > hugepages-32kB/stats/swpout:31448
> > hugepages-32kB/stats/swpout_fallback:3354
> > hugepages-512kB/stats/swpout:30
> > hugepages-512kB/stats/swpout_fallback:33
> > hugepages-2048kB/stats/swpout:2
> > hugepages-2048kB/stats/swpout_fallback:0
> > hugepages-1024kB/stats/swpout:0
> > hugepages-1024kB/stats/swpout_fallback:0
> > hugepages-64kB/stats/swpout:31255
> > hugepages-64kB/stats/swpout_fallback:3112
> > hugepages-16kB/stats/swpout:29931
> > hugepages-16kB/stats/swpout_fallback:3397
> > hugepages-256kB/stats/swpout:5223
> > hugepages-256kB/stats/swpout_fallback:2351
> > hugepages-128kB/stats/swpout:25600
> > hugepages-128kB/stats/swpout_fallback:2194
> >
> > The high order (256k) swapout rate is significantly higher, and 512k is now
> > possible, which indicates the high orders are better protected; the lower
> > orders are sacrificed, but it seems worth it.
>
> Yes.  I think that this reflects another aspect of the problem.  In some
> situations, it's better to steal one high-order cluster and use it for
> order-0 allocation instead of scattering order-0 allocation in random
> high-order clusters.

Agreed, the scan loop on swap_map[] causes the worst pollution.

Chris
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Kairui Song 1 year, 8 months ago
On Thu, May 30, 2024 at 10:54 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > Hi Ying,
> >
> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > I am spinning a new version for this series to address two issues
> >> > found in this series:
> >> >
> >> > 1) Oppo discovered a bug in the following line:
> >> > +               ci = si->cluster_info + tmp;
> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> > That is a serious bug but trivial to fix.
> >> >
> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> >> > the cluster->order.
> >>
> >> IIUC, now, we scan swap_map[] only if
> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> >> That is, if you don't run low on free swap space, you will not do that.
> >
> > You can still have free swap space in order 0 clusters while order 4 runs out of
> > free_cluster
> > or nonfull_clusters[order]. For Android that is a common case.
>
> When we fail to allocate order 4, we will fall back to order 0.  We still
> don't need to scan swap_map[].  But after looking at your reply below, I
> realized that the swap space is almost full at most times in your cases.
> Then, it's possible that we run into scanning swap_map[].
> list_empty(&si->free_clusters) &&
> list_empty(&si->nonfull_clusters[order]) will become true if we put too
> many clusters in si->percpu_cluster.  So, if we want to avoid scanning
> swap_map[], we can stop adding clusters to si->percpu_cluster when swap
> space runs low.  And maybe take clusters out of si->percpu_cluster
> sometimes.

Stopping the adding when space runs low seems too late; there could still
be a free cluster stuck on a CPU and not getting scanned, right?
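
One conceivable fix is an explicit drain when the shared lists run dry; a
hedged sketch, assuming the percpu_cluster layout from the mTHP order patch
(next[SWAP_NR_ORDERS], SWAP_NEXT_INVALID):

/*
 * Hypothetical sketch: invalidate every CPU's cached cluster position
 * so the stranded space goes back to the shared lists on the next
 * allocation.  Real code would also need to synchronize against
 * allocations running concurrently on those CPUs.
 */
static void drain_percpu_clusters(struct swap_info_struct *si)
{
	int cpu, i;

	for_each_possible_cpu(cpu) {
		struct percpu_cluster *pc =
			per_cpu_ptr(si->percpu_cluster, cpu);

		for (i = 0; i < SWAP_NR_ORDERS; i++)
			pc->next[i] = SWAP_NEXT_INVALID;
	}
}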

> Another issue is that nonfull_cluster[order1] cannot be used for
> nonfull_cluster[order2].  By definition, we should not fail order 0
> allocation; we need to steal nonfull_cluster[order>0] for order 0
> allocation.  This can avoid scanning swap_map[] too.  This may not be
> perfect, but it is the simplest first-step implementation.  You can
> optimize based on it further.

This can be extended to allow any order < MAX_ORDER to steal from a
higher order (sketched below), though that might increase fragmentation.

So this is looking more and more like a buddy allocator, and that
should be the long-term solution.
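
Sketched out, the generalized stealing could look like this (illustrative
helper, not proposed code):

/*
 * Illustrative sketch: if nonfull_clusters[order] is empty, borrow a
 * cluster from the smallest higher order that has one.  Re-labelling
 * the cluster to a lower order is exactly the fragmentation risk
 * mentioned above.
 */
static struct swap_cluster_info *
steal_higher_order(struct swap_info_struct *si, int order)
{
	struct swap_cluster_info *ci;
	int i;

	for (i = order + 1; i < SWAP_NR_ORDERS; i++) {
		if (list_empty(&si->nonfull_clusters[i]))
			continue;
		ci = list_first_entry(&si->nonfull_clusters[i],
				      struct swap_cluster_info, list);
		list_move(&ci->list, &si->nonfull_clusters[order]);
		ci->order = order;	/* the cluster now serves this order */
		return ci;
	}
	return NULL;
}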
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Chris Li 1 year, 8 months ago
On Thu, May 30, 2024 at 1:08 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, May 30, 2024 at 10:54 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Chris Li <chrisl@kernel.org> writes:
> >
> > > Hi Ying,
> > >
> > > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >>
> > >> Chris Li <chrisl@kernel.org> writes:
> > >>
> > >> > I am spinning a new version for this series to address two issues
> > >> > found in this series:
> > >> >
> > >> > 1) Oppo discovered a bug in the following line:
> > >> > +               ci = si->cluster_info + tmp;
> > >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> > >> > That is a serious bug but trivial to fix.
> > >> >
> > >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> > >> > the cluster->order.
> > >>
> > >> IIUC, now, we scan swap_map[] only if
> > >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> > >> That is, if you don't run low on free swap space, you will not do that.
> > >
> > > You can still have free swap space in order 0 clusters while order 4 runs out of
> > > free_cluster
> > > or nonfull_clusters[order]. For Android that is a common case.
> >
> > When we fail to allocate order 4, we will fall back to order 0.  We still
> > don't need to scan swap_map[].  But after looking at your reply below, I
> > realized that the swap space is almost full at most times in your cases.
> > Then, it's possible that we run into scanning swap_map[].
> > list_empty(&si->free_clusters) &&
> > list_empty(&si->nonfull_clusters[order]) will become true if we put too
> > many clusters in si->percpu_cluster.  So, if we want to avoid scanning
> > swap_map[], we can stop adding clusters to si->percpu_cluster when swap
> > space runs low.  And maybe take clusters out of si->percpu_cluster
> > sometimes.
>
> Stopping the adding when space runs low seems too late; there could still
> be a free cluster stuck on a CPU and not getting scanned, right?

The free clusters stuck on the CPUs are small in number, only a handful
of clusters. Preventing low order swap from polluting the high order
clusters is more urgent.

>
> > Another issue is that nonfull_cluster[order1] cannot be used for
> > nonfull_cluster[order2].  By definition, we should not fail order 0
> > allocation; we need to steal nonfull_cluster[order>0] for order 0
> > allocation.  This can avoid scanning swap_map[] too.  This may not be
> > perfect, but it is the simplest first-step implementation.  You can
> > optimize based on it further.
>
> This can be extended to allow any order < MAX_ORDER to steal from a
> higher order, though that might increase fragmentation.

Stealing from a higher order is a bad thing, because the value of the
allocator is being able to allocate from a higher order.
High to low is always trivial; low to high is impossible. See the
other email about having a "knob" to reserve some swap space for high order
allocations. That is not perfect but more useful.

>
> So this is looking more and more like a buddy allocator, and that
> should be the long term solution.
>
In Barry's test case, there is a huge swing between order 0 and order 4
allocation caused by the low memory killer. Apps get killed, and it takes a
while for the apps to relaunch and swap out high order entries. The buddy
allocator will be of limited help there, because once a cluster is used
for order 0, the fragmentation will prevent higher order allocation. We
do need a way to swap out large folios using discontiguous swap
entries. That is the longer term solution to Barry's usage situation.

Chris
Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
Posted by Barry Song 1 year, 8 months ago
On Sat, May 25, 2024 at 5:17 AM Chris Li <chrisl@kernel.org> wrote:
>
> This is the short term solution "swap cluster order" listed
> in my "Swap Abstraction" discussion slide 8 in the recent
> LSF/MM conference.
>
> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" was introduced, it only allocated the mTHP swap entries
> from the new empty cluster list. That works well for PMD size THP,
> but it has a serious fragmentation issue reported by Barry.
>
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>
> The mTHP allocation failure rate rises to almost 100% after a few
> hours in Barry's test run.
>
> The reason is that all the empty clusters have been exhausted while
> there are plenty of free swap entries in the clusters that are
> not 100% free.
>
> Address this by remembering the swap allocation order in the cluster.
> Keep track of the per order non-full cluster lists for later allocation.
>
> This greatly improves the success rate of the mTHP swap allocation.
> While I am still waiting for Barry's test results, I paste Kairui's test

Hi Chris,

Attached are the test results from a real phone using order-4 mTHP. The results
seem better overall, but after 7 hours, especially when the swap device becomes
full (soon some apps are killed to free memory and swap), the fallback ratio
still reaches 100%.

I haven't debugged this, but my guess is that a cluster's order can
shift between order 4 and order 0. Sometimes they all shift to order 0,
and they can hardly get back to order 4.

> result here:
>
> I'm able to reproduce such an issue with a simple script (enabling all order of mthp):
>
> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
> swapoff -a
> mkswap /dev/ram0
> swapon /dev/ram0
>
> rmdir /sys/fs/cgroup/benchmark
> mkdir -p /sys/fs/cgroup/benchmark
> cd /sys/fs/cgroup/benchmark
> echo 8G > memory.max
> echo $$ > cgroup.procs
>
> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
>
> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
>         -P memcache_binary -n allkeys --key-minimum=1 \
>         --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
>         --ratio 1:0 --pipeline 8 -d 1024
>
> Before:
> Totals      48805.63         0.00         0.00         5.26045         1.19100        38.91100        59.64700     51063.98
> After:
> Totals      71098.84         0.00         0.00         3.60585         0.71100        26.36700        39.16700     74388.74
>
> And the fallback ratio dropped by a lot:
> Before:
> hugepages-32kB/stats/anon_swpout_fallback:15997
> hugepages-32kB/stats/anon_swpout:18712
> hugepages-512kB/stats/anon_swpout_fallback:192
> hugepages-512kB/stats/anon_swpout:0
> hugepages-2048kB/stats/anon_swpout_fallback:2
> hugepages-2048kB/stats/anon_swpout:0
> hugepages-1024kB/stats/anon_swpout_fallback:0
> hugepages-1024kB/stats/anon_swpout:0
> hugepages-64kB/stats/anon_swpout_fallback:18246
> hugepages-64kB/stats/anon_swpout:17644
> hugepages-16kB/stats/anon_swpout_fallback:13701
> hugepages-16kB/stats/anon_swpout:18234
> hugepages-256kB/stats/anon_swpout_fallback:8642
> hugepages-256kB/stats/anon_swpout:93
> hugepages-128kB/stats/anon_swpout_fallback:21497
> hugepages-128kB/stats/anon_swpout:7596
>
> (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)
>
> After:
> hugepages-32kB/stats/swpout:34445
> hugepages-32kB/stats/swpout_fallback:0
> hugepages-512kB/stats/swpout:1
> hugepages-512kB/stats/swpout_fallback:134
> hugepages-2048kB/stats/swpout:1
> hugepages-2048kB/stats/swpout_fallback:1
> hugepages-1024kB/stats/swpout:6
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:35495
> hugepages-64kB/stats/swpout_fallback:0
> hugepages-16kB/stats/swpout:32441
> hugepages-16kB/stats/swpout_fallback:0
> hugepages-256kB/stats/swpout:2223
> hugepages-256kB/stats/swpout_fallback:6278
> hugepages-128kB/stats/swpout:29136
> hugepages-128kB/stats/swpout_fallback:52
>
> Reported-by: Barry Song <21cnbao@gmail.com>
> Tested-by: Kairui Song <kasong@tencent.com>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> Chris Li (2):
>       mm: swap: swap cluster switch to double link list
>       mm: swap: mTHP allocate swap entries from nonfull list
>
>  include/linux/swap.h |  18 ++--
>  mm/swapfile.c        | 252 +++++++++++++++++----------------------------------
>  2 files changed, 93 insertions(+), 177 deletions(-)
> ---
> base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
> change-id: 20240523-swap-allocator-1534c480ece4
>
> Best regards,
> --
> Chris Li <chrisl@kernel.org>
>

Thanks
Barry