This is the short term solution "swap cluster order" listed
in my "Swap Abstraction" discussion slide 8 at the recent
LSF/MM conference.
When commit 845982eb264bc "mm: swap: allow storage of all mTHP
orders" was introduced, it only allocated mTHP swap entries
from the empty cluster list. That works well for PMD size THP,
but it has a serious fragmentation issue reported by Barry.
https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
The mTHP allocation failure rate rises to almost 100% after a few
hours in Barry's test run.
The reason is that all the empty clusters have been exhausted while
there are plenty of free swap entries in the clusters that are
not 100% free.
Address this by remembering the swap allocation order in the cluster.
Keep track of a per-order nonfull cluster list for later allocation.
This greatly improves the success rate of mTHP swap allocation.
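For illustration, a minimal sketch of the data-structure shape this
describes (field and list names here are assumptions based on the two
patch titles below, not the actual diff):

    /*
     * Sketch only: clusters become list nodes ("double link list")
     * and record the order they serve; partially free clusters are
     * kept on a per-order nonfull list for later reuse.
     */
    struct swap_cluster_info {
            unsigned int count;     /* allocated entries in this cluster */
            unsigned int order;     /* allocation order this cluster serves */
            struct list_head list;  /* on free_clusters or nonfull_clusters */
    };

    struct swap_info_struct {
            /* ... */
            struct list_head free_clusters; /* completely empty clusters */
            /* partially free clusters, one list per allocation order */
            struct list_head nonfull_clusters[SWAP_NR_ORDERS];
            /* ... */
    };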
While I am still waiting for Barry's test results, I paste Kairui's
test results here:
I'm able to reproduce such an issue with a simple script (enabling all orders of mTHP):
modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
rmdir /sys/fs/cgroup/benchmark
mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark
echo 8G > memory.max
echo $$ > cgroup.procs
memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
/usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
--ratio 1:0 --pipeline 8 -d 1024
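(For scale: brd's rd_size is given in KiB, so the script above creates a
roughly 10 GiB ram-backed swap device, and the write-only workload of
about 18M keys with 1 KiB values under an 8G memory.max forces heavy
swapout.)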
(The "Totals" rows are memtier output; the columns are Ops/sec, Hits/sec,
Misses/sec, Avg. Latency, p50/p99/p99.9 Latency, and KB/sec.)
Before:
Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
After:
Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74
And the fallback ratio dropped by a lot:
Before:
hugepages-32kB/stats/anon_swpout_fallback:15997
hugepages-32kB/stats/anon_swpout:18712
hugepages-512kB/stats/anon_swpout_fallback:192
hugepages-512kB/stats/anon_swpout:0
hugepages-2048kB/stats/anon_swpout_fallback:2
hugepages-2048kB/stats/anon_swpout:0
hugepages-1024kB/stats/anon_swpout_fallback:0
hugepages-1024kB/stats/anon_swpout:0
hugepages-64kB/stats/anon_swpout_fallback:18246
hugepages-64kB/stats/anon_swpout:17644
hugepages-16kB/stats/anon_swpout_fallback:13701
hugepages-16kB/stats/anon_swpout:18234
hugepages-256kB/stats/anon_swpout_fallback:8642
hugepages-256kB/stats/anon_swpout:93
hugepages-128kB/stats/anon_swpout_fallback:21497
hugepages-128kB/stats/anon_swpout:7596
(Still collecting more data; the successful swpouts mostly happened early, then the fallbacks began to increase, nearing a 100% failure rate)
After:
hugepages-32kB/stats/swpout:34445
hugepages-32kB/stats/swpout_fallback:0
hugepages-512kB/stats/swpout:1
hugepages-512kB/stats/swpout_fallback:134
hugepages-2048kB/stats/swpout:1
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:6
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35495
hugepages-64kB/stats/swpout_fallback:0
hugepages-16kB/stats/swpout:32441
hugepages-16kB/stats/swpout_fallback:0
hugepages-256kB/stats/swpout:2223
hugepages-256kB/stats/swpout_fallback:6278
hugepages-128kB/stats/swpout:29136
hugepages-128kB/stats/swpout_fallback:52
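(Worked out for one bucket, using fallback / (fallback + swpout): for
64kB the ratio goes from 18246 / (18246 + 17644), about 51%, before the
series to 0 / 35495 = 0% after.)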
Reported-by: Barry Song <21cnbao@gmail.com>
Tested-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Chris Li <chrisl@kernel.org>
---
Chris Li (2):
mm: swap: swap cluster switch to double link list
mm: swap: mTHP allocate swap entries from nonfull list
include/linux/swap.h | 18 ++--
mm/swapfile.c | 252 +++++++++++++++++----------------------------------
2 files changed, 93 insertions(+), 177 deletions(-)
---
base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
change-id: 20240523-swap-allocator-1534c480ece4
Best regards,
--
Chris Li <chrisl@kernel.org>
Sorry I'm late to the discussion - I've been out for the last 3.5 weeks and just getting through my mail now...

On 24/05/2024 18:17, Chris Li wrote:
> This is the short term solution "swap cluster order" listed
> in my "Swap Abstraction" discussion slide 8 at the recent
> LSF/MM conference.

I've read the article on lwn and look forward to watching the video once available. The longer term plans look interesting.

> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" was introduced, it only allocated mTHP swap entries
> from the empty cluster list. That works well for PMD size THP,
> but it has a serious fragmentation issue reported by Barry.

Yes, that was a deliberate initial approach to be conservative, just like the original PMD-size THP support. I'm glad to see work to improve the situation!

> [snip]
>
> Address this by remembering the swap allocation order in the cluster.
> Keep track of a per-order nonfull cluster list for later allocation.

I don't immediately see how this helps because memory is swapped back in per-page (currently), so just because a given cluster was initially filled with entries of a given order, doesn't mean that those entries are freed in atomic units; only specific pages could have been swapped back in, meaning the holes are not of the required order. Additionally, scanning could lead to order-0 pages being populated in random places.

My naive assumption was that the obvious way to solve this problem in the short term would be to extend the scanning logic to be able to scan for an arbitrary order. That way you could find an allocation of the required order in any of the clusters, even a cluster that was not originally allocated for the required order.

I guess I should read your patches to understand exactly what you are doing rather than making assumptions...

Thanks,
Ryan

[snip]
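(For concreteness, the "scan for an arbitrary order" idea could look
roughly like the following. This is a hypothetical sketch rather than
the kernel's actual scanner: walk swap_map[] in naturally aligned steps
of 1 << order and take the first fully free run.)

    /* Hypothetical sketch of order-aware scanning; swap_map[] entries
     * are nonzero when the slot is in use. */
    static long scan_for_order(const unsigned char *swap_map,
                               unsigned long max, int order)
    {
            unsigned long nr = 1UL << order;
            unsigned long base, i;

            for (base = 0; base + nr <= max; base += nr) {
                    for (i = 0; i < nr; i++)
                            if (swap_map[base + i]) /* slot busy */
                                    break;
                    if (i == nr)
                            return base;    /* free aligned run found */
            }
            return -1;      /* no free run of this order */
    }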
On Fri, Jun 7, 2024 at 2:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Sorry I'm late to the discussion - I've been out for the last 3.5 weeks and just
> getting through my mail now...

No problem at all, please take it easy.

> On 24/05/2024 18:17, Chris Li wrote:
> > [snip]
>
> I don't immediately see how this helps because memory is swapped back in
> per-page (currently), so just because a given cluster was initially filled with

That is not the case for Barry's setup. He has another patch series to swap in mTHP as a whole, especially for mTHP stored in zsmalloc in units bigger than 4K.

https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/

> entries of a given order, doesn't mean that those entries are freed in atomic
> units; only specific pages could have been swapped back in, meaning the holes
> are not of the required order. Additionally, scanning could lead to order-0
> pages being populated in random places.

Yes, that is a problem we need to address. The proposed short term solution is to have an isolation scheme preventing the high order swap entries from mixing with the lower order ones inside one cluster. That is easy to do, and there are some test results confirming the reservation/isolation effect.

> My naive assumption was that the obvious way to solve this problem in the short
> term would be to extend the scanning logic to be able to scan for an arbitrary
> order. That way you could find an allocation of the required order in any of the
> clusters, even a cluster that was not originally allocated for the required order.

Scanning is not enough. We need some way to prevent the fragmentation from happening in the first place. Once the fragmentation has happened, it can't easily be reversed; scanning does not help the fragmentation aspect.

Chris

[snip]
I am spinning a new version for this series to address two issues found in this series:

1) Oppo discovered a bug in the following line:
   + ci = si->cluster_info + tmp;
   Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
   That is a serious bug but trivial to fix.

2) order 0 allocation currently blindly scans swap_map disregarding the cluster->order. Given enough order 0 swap allocations (close to the swap file size), the order 0 allocation head will eventually sweep across the whole swapfile and destroy other cluster order allocations.

The short term fix is just skipping clusters that are already assigned to higher orders.

In the long term, I want to unify the non-SSD path to use clusters for locking and allocations as well, and just try to follow the last allocation (less seeking) as much as possible.

Chris

On Fri, May 24, 2024 at 10:17 AM Chris Li <chrisl@kernel.org> wrote:
>
> [snip]
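(The fix for issue 1 above, spelled out: "tmp" is a slot offset into the
swap file, so it must be converted to a cluster index before indexing
the per-cluster array.)

    /* buggy: indexes cluster_info[] with a slot offset */
    ci = si->cluster_info + tmp;

    /* fixed: convert the slot offset to a cluster index first */
    ci = si->cluster_info + tmp / SWAPFILE_CLUSTER;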
On Wed, May 29, 2024 at 9:04 AM Chris Li <chrisl@kernel.org> wrote:
>
> [snip]

Hi Chris,

I am sharing some new test results with you. This time, we used two zRAM devices by modifying get_swap_pages().

zram0 -> dedicated for order-0 swpout
zram1 -> dedicated for order-4 swpout

We allocate a generous amount of space for zRAM1 to ensure it never gets full and always has ample free space. However, we found that Ryan's approach does not perform well even in this straightforward scenario. Despite zRAM1 having 80% of its space remaining, we still experience issues obtaining contiguous swap slots and encounter a high swpout_fallback ratio.

Sorry for the report, Ryan :-)

In contrast, with your patch, we consistently see the thp_swpout_fallback ratio at 0%, indicating a significant improvement in the situation.

Although your patch still has issues supporting the mixing of order-0 and order-4 pages in a swap device, it represents a significant improvement.

I would be delighted to witness your approach advancing with Ying Huang's assistance. However, due to my current commitments, I regret that I am unable to allocate time for debugging.
[snip]

Thanks
Barry
On 30/05/2024 08:49, Barry Song wrote:
> On Wed, May 29, 2024 at 9:04 AM Chris Li <chrisl@kernel.org> wrote:
>> [snip]
>
> Hi Chris,
>
> I am sharing some new test results with you. This time, we used two
> zRAM devices by modifying get_swap_pages().
>
> zram0 -> dedicated for order-0 swpout
> zram1 -> dedicated for order-4 swpout
>
> We allocate a generous amount of space for zRAM1 to ensure it never gets full
> and always has ample free space. However, we found that Ryan's approach
> does not perform well even in this straightforward scenario. Despite zRAM1
> having 80% of its space remaining, we still experience issues obtaining
> contiguous swap slots and encounter a high swpout_fallback ratio.
>
> Sorry for the report, Ryan :-)

No problem; clearly it needs to be fixed, and I'll help where I can. I'm pretty sure that this is due to fragmentation preventing clusters from being freed back to the free list.

> In contrast, with your patch, we consistently see the thp_swpout_fallback ratio
> at 0%, indicating a significant improvement in the situation.

Unless I've misunderstood something critical, Chris's change is just allowing a cpu to steal a block from another cpu's current cluster for that order. So it just takes longer (approx by a factor of the number of cpus in the system) to get to the state where fragmentation is causing fallbacks? As I said in the other thread, I think the more robust solution is to implement scanning for high order blocks.

[snip]
On Fri, Jun 7, 2024 at 3:49 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> [snip]
>
> Unless I've misunderstood something critical, Chris's change is just allowing a
> cpu to steal a block from another cpu's current cluster for that order. So it

No, that is not the main change. The main change is to allow a CPU to allocate from the nonfull, non-empty clusters, which are not any CPU's current cluster and not on the empty list either. The current patch does not prevent a CPU from stealing from another CPU's current order; that will get addressed in V2.

> just takes longer (approx by a factor of the number of cpus in the system) to
> get to the state where fragmentation is causing fallbacks? As I said in the
> other thread, I think the more robust solution is to implement scanning for high
> order blocks.

That will introduce more fragmentation into the high order clusters, and will make it harder to allocate high order swap entries later. Please see my previous email for the use case and the goal of the change:

https://lore.kernel.org/linux-mm/CANeU7QnVzqGKXp9pKDLWiuhqTvBxXupgFCRXejYhshAjw6uDyQ@mail.gmail.com/T/#mf431a743e458896c2ab4a4077b103341313c9cf4

Let's discuss whether the use case and the goal make sense or not.

Chris
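(To make the distinction concrete, the allocation preference being
described is roughly the following; this is paraphrased from the
discussion, not code from the patches:

1. fill this CPU's current per-cpu cluster for the requested order;
2. otherwise take a cluster from si->nonfull_clusters[order], i.e. a
   partially free cluster already assigned to this order that is not
   any CPU's current cluster;
3. otherwise take an empty cluster from si->free_clusters and assign
   it the requested order;
4. otherwise fail the high-order request, while order 0 may still fall
   back to scanning swap_map[].)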
Chris Li <chrisl@kernel.org> writes:

> I am spinning a new version for this series to address two issues
> found in this series:
>
> 1) Oppo discovered a bug in the following line:
>    + ci = si->cluster_info + tmp;
>    Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
>    That is a serious bug but trivial to fix.
>
> 2) order 0 allocation currently blindly scans swap_map disregarding
> the cluster->order.

IIUC, now, we only scan swap_map[] if list_empty(&si->free_clusters) && list_empty(&si->nonfull_clusters[order]). That is, if you don't run low on free swap space, you will not do that.

> Given enough order 0 swap allocations (close to the
> swap file size), the order 0 allocation head will eventually sweep
> across the whole swapfile and destroy other cluster order allocations.
>
> The short term fix is just skipping clusters that are already assigned
> to higher orders.

Better to do any further optimization on top of the simpler one. Need to evaluate whether it's necessary to add more complexity.

> In the long term, I want to unify the non-SSD path to use clusters for
> locking and allocations as well, and just try to follow the last
> allocation (less seeking) as much as possible.

I have thought about that too. Personally, I think that it's good to remove swap_map[] scanning. The implementation can be simplified too. I don't know whether we need to consider the performance of HDD swap now.

--
Best Regards,
Huang, Ying

[snip]
Hi Ying,

On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > [snip]
> >
> > 2) order 0 allocation currently blindly scans swap_map disregarding
> > the cluster->order.
>
> IIUC, now, we only scan swap_map[] if
> list_empty(&si->free_clusters) && list_empty(&si->nonfull_clusters[order]).
> That is, if you don't run low on free swap space, you will not do that.

You can still have free swap space in the order 0 clusters while order 4 runs out of free_clusters or nonfull_clusters[order]. For Android that is a common case.

> > Given enough order 0 swap allocations (close to the
> > swap file size), the order 0 allocation head will eventually sweep
> > across the whole swapfile and destroy other cluster order allocations.
> >
> > The short term fix is just skipping clusters that are already assigned
> > to higher orders.
>
> Better to do any further optimization on top of the simpler one. Need
> to evaluate whether it's necessary to add more complexity.

I agree this needs more careful planning and discussion. In Android's use case, the swapfile is almost always full. It will run into this situation after a long enough swap time. Once the order 0 swap entries start to pollute a higher order cluster, there is no going back (until the whole cluster is 100% free).

> > In the long term, I want to unify the non-SSD path to use clusters for
> > locking and allocations as well, and just try to follow the last
> > allocation (less seeking) as much as possible.
>
> I have thought about that too. Personally, I think that it's good to
> remove swap_map[] scanning. The implementation can be simplified too.

Agree. I looked at the commit that introduced the SSD cluster. The commit message indicates a lot of CPU time was spent in swap_map scanning, especially when the swapfile is almost full. The main motivation to introduce the cluster on HDD is to simplify and unify the code.

> I don't know whether we need to consider the performance of HDD swap
> now.

I am not sure about that either. We can make the best effort to reduce the seeking.

Chris

[snip]
Chris Li <chrisl@kernel.org> writes:

> Hi Ying,
>
> On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote:
>> [snip]
>>
>> IIUC, now, we only scan swap_map[] if
>> list_empty(&si->free_clusters) && list_empty(&si->nonfull_clusters[order]).
>> That is, if you don't run low on free swap space, you will not do that.
>
> You can still have free swap space in the order 0 clusters while order 4
> runs out of free_clusters or nonfull_clusters[order]. For Android that is
> a common case.

When we fail to allocate order 4, we will fall back to order 0; we still don't need to scan swap_map[]. But after looking at your reply below, I realized that the swap space is almost full most of the time in your cases. Then, it's possible that we run into scanning swap_map[]: list_empty(&si->free_clusters) && list_empty(&si->nonfull_clusters[order]) will become true if we put too many clusters in si->percpu_cluster. So, if we want to avoid scanning swap_map[], we can stop adding clusters to si->percpu_cluster when swap space runs low, and maybe take clusters out of si->percpu_cluster sometimes.

Another issue is that nonfull_cluster[order1] cannot be used for nonfull_cluster[order2]. By definition, we should not fail order 0 allocation; we need to steal nonfull_cluster[order>0] for order 0 allocation. This can avoid scanning swap_map[] too. This may not be perfect, but it is the simplest first step implementation. You can optimize based on it further.

And, I checked your code again. It appears that si->percpu_cluster may be put in si->nonfull_cluster[], then be used by another CPU. Please check it.

--
Best Regards,
Huang, Ying

[snip]
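(Ying's "steal from higher orders for order 0" suggestion, sketched as
hypothetical code; none of these helpers are in the series, and only
the stealing idea itself comes from the discussion:)

    /* order-0 must not fail while any cluster still has a free slot */
    for (o = 1; o < SWAP_NR_ORDERS; o++) {
            if (!list_empty(&si->nonfull_clusters[o])) {
                    ci = list_first_entry(&si->nonfull_clusters[o],
                                          struct swap_cluster_info, list);
                    /* taking one slot here mixes orders in the cluster,
                     * trading fragmentation for a guaranteed order-0 hit */
                    return alloc_slot_from_cluster(si, ci); /* hypothetical */
            }
    }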
On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
> > [snip]
>
> When we fail to allocate order 4, we will fall back to order 0; we still
> don't need to scan swap_map[]. But after looking at your reply below, I
> realized that the swap space is almost full most of the time in your cases.
> Then, it's possible that we run into scanning swap_map[]. [snip] So, if we
> want to avoid scanning swap_map[], we can stop adding clusters to
> si->percpu_cluster when swap space runs low, and maybe take clusters out
> of si->percpu_cluster sometimes.

One idea after reading your reply: if we run out of nonfull_cluster[order], we should be able to use other CPUs' si->percpu_cluster[] as well. That is a very small win for Android, because Android does not have too many CPUs. We are talking about a handful of clusters, which might not justify the code complexity. It does not change the behavior that order 0 can pollute higher orders.

> Another issue is that nonfull_cluster[order1] cannot be used for
> nonfull_cluster[order2]. By definition, we should not fail order 0
> allocation; we need to steal nonfull_cluster[order>0] for order 0
> allocation. This can avoid scanning swap_map[] too. This may not be
> perfect, but it is the simplest first step implementation. You can
> optimize based on it further.

Yes, that is listed as a limitation of this cluster order approach. Initially we need to support one order well first. We might choose what order that is: a 16K or 64K folio. 4K pages are too small and 2M pages are too big; the sweet spot might be somewhere in between. If we can support one order well, we can demonstrate the value of mTHP. We can worry about other mixed orders later.

Do you have any suggestions for how to prevent order 0 from polluting the higher order clusters? If we allow that to happen, then it defeats the goal of being able to allocate higher order swap entries. The tricky question is that we don't know how much swap space we should reserve for each order. We can always break higher order clusters into lower orders, but can't do the reverse. The current patch series lets the actual usage determine the percentage of the clusters for each order. However, that seems not enough for the test case Barry has. When an app gets OOM killed, that is where a large swing of order 0 swap shows up, with not enough higher order usage for that brief moment. The order 0 swap entries will pollute the high order clusters. We are currently debating a "knob" to be able to reserve a certain % of swap space for a certain order. Those reservations would be guaranteed, and order 0 swap entries couldn't pollute them even when swap space runs out. That could make mTHP at least usable for the Android case.

Do you see another way to protect the high order clusters from being polluted by the lower order ones?

> And, I checked your code again. It appears that si->percpu_cluster may
> be put in si->nonfull_cluster[], then be used by another CPU. Please
> check it.

Ah, good point. I think it does. Let me take a closer look.

Chris
Chris Li <chrisl@kernel.org> writes: > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > Hi Ying, >> > >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> >> > I am spinning a new version for this series to address two issues >> >> > found in this series: >> >> > >> >> > 1) Oppo discovered a bug in the following line: >> >> > + ci = si->cluster_info + tmp; >> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". >> >> > That is a serious bug but trivial to fix. >> >> > >> >> > 2) order 0 allocation currently blindly scans swap_map disregarding >> >> > the cluster->order. >> >> >> >> IIUC, now, we only scan swap_map[] only if >> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]). >> >> That is, if you doesn't run low swap free space, you will not do that. >> > >> > You can still swap space in order 0 clusters while order 4 runs out of >> > free_cluster >> > or nonfull_clusters[order]. For Android that is a common case. >> >> When we fail to allocate order 4, we will fallback to order 0. Still >> don't need to scan swap_map[]. But after looking at your below reply, I >> realized that the swap space is almost full at most times in your cases. >> Then, it's possible that we run into scanning swap_map[]. >> list_empty(&si->free_clusters) && >> list_empty(&si->nonfull_clusters[order]) will become true, if we put too >> many clusters in si->percpu_cluster. So, if we want to avoid to scan >> swap_map[], we can stop add clusters in si->percpu_cluster when swap >> space runs low. And maybe take clusters out of si->percpu_cluster >> sometimes. > > One idea after reading your reply. If we run out of the > nonfull_cluster[order], we should be able to use other cpu's > si->percpu_cluster[] as well. That is a very small win for Android, This will be useful in general. The number CPU may be large, and multiple orders may be used. > because android does not have too many cpu. We are talking about a > handful of clusters, which might not justify the code complexity. It > does not change the behavior that order 0 can pollut higher order. I have a feeling that you don't really know why swap_map[] is scanned. I suggest you to do more test and tracing to find out the reason. I suspect that there are some non-full cluster collection issues. >> Another issue is nonfull_cluster[order1] cannot be used for >> nonfull_cluster[order2]. In definition, we should not fail order 0 >> allocation, we need to steal nonfull_cluster[order>0] for order 0 >> allocation. This can avoid to scan swap_map[] too. This may be not >> perfect, but it is the simplest first step implementation. You can >> optimize based on it further. > > Yes, that is listed as the limitation of this cluster order approach. > Initially we need to support one order well first. We might choose > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages > are too big. The sweet spot might be some there in between. If we can > support one order well, we can demonstrate the value of the mTHP. We > can worry about other mix orders later. > > Do you have any suggestions for how to prevent the order 0 polluting > the higher order cluster? If we allow that to happen, then it defeats > the goal of being able to allocate higher order swap entries. The > tricky question is we don't know how much swap space we should reserve > for each order. 
> We can always break higher order clusters to lower > order, but can't do the reserves. The current patch series lets the > actual usage determine the percentage of the cluster for each order. > However that seems not enough for the test case Barry has. When the > app gets OOM kill that is where a large swing of order 0 swap will > show up and not enough higher order usage for the brief moment. The > order 0 swap entry will pollute the high order cluster. We are > currently debating a "knob" to be able to reserve a certain % of swap > space for a certain order. Those reservations will be guaranteed and > order 0 swap entry can't pollute them even when it runs out of swap > space. That can make the mTHP at least usable for the Android case. IMO, the bottom line is that order-0 allocation is the first-class citizen; we must keep it optimized. And, OOM with free swap space isn't acceptable. Please consider the policy we used for page allocation. > Do you see another way to protect the high order cluster polluted by > lower order one? If we use high-order page allocation as a reference, we need something like compaction to eventually guarantee high-order allocation. But we are too far from that. For a specific configuration, I believe that we can get a reasonable high-order swap entry allocation success rate for specific use cases. For example, if we only allow a limited maximum number of order-0 swap entry allocations, can we keep high-order clusters? >> >> And, I checked your code again. It appears that si->percpu_cluster may >> be put in si->nonfull_cluster[], then be used by another CPU. Please >> check it. > > Ah, good point. I think it does. Let me take a closer look. -- Best Regards, Huang, Ying
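Ying's earlier suggestion, quoted above, is that order 0 must never fail: instead of falling into the swap_map[] scan, it should steal a nonfull cluster registered for a higher order. A few lines of userspace C can sketch that policy; the names and the per-order counters are made up for the example, not the kernel's data structures:

#include <stdio.h>

#define MAX_ORDER 5

/* stand-in for the per-order nonfull_clusters[] lists in the series */
static int nonfull[MAX_ORDER + 1];

static int alloc_from_cluster(int order)
{
	if (nonfull[order] > 0) {
		nonfull[order]--;
		return order;	/* served from its own order's list */
	}
	if (order == 0) {
		/* order 0 must not fail: steal from any higher order */
		for (int o = 1; o <= MAX_ORDER; o++) {
			if (nonfull[o] > 0) {
				nonfull[o]--;
				return o;	/* stolen cluster */
			}
		}
	}
	return -1;	/* high order fails; caller falls back to order 0 */
}

int main(void)
{
	nonfull[4] = 2;	/* only order-4 nonfull clusters remain */

	printf("order 4 -> cluster of order %d\n", alloc_from_cluster(4));
	printf("order 0 -> cluster of order %d (stolen)\n", alloc_from_cluster(0));
	printf("order 2 -> %d (fails, falls back to order 0)\n", alloc_from_cluster(2));
	return 0;
}

The trade-off Chris raises is visible even in this toy: the steal keeps order 0 alive, but the stolen cluster is now mixed-order, i.e. polluted.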
On Thu, May 30, 2024 at 7:37 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote: > > because android does not have too many cpu. We are talking about a > > handful of clusters, which might not justify the code complexity. It > > does not change the behavior that order 0 can pollut higher order. > > I have a feeling that you don't really know why swap_map[] is scanned. > I suggest you to do more test and tracing to find out the reason. I > suspect that there are some non-full cluster collection issues. swap_map[] is scanned because we run out of non-full clusters. This can happen because Android tries to make full use of the swapfile. However, once the swap_map[] scan happens, the non-full cluster is polluted. I currently don't have a local reproduction of the issue Barry reported. However, here is one data point: two swap files, one for high order allocation only, with this patch. No fallback. If there is a non-full cluster collection issue, we should see the fallback in this case as well. BTW, with the same setup without this patch series, the high order allocation falls back as well. > > >> Another issue is nonfull_cluster[order1] cannot be used for > >> nonfull_cluster[order2]. In definition, we should not fail order 0 > >> allocation, we need to steal nonfull_cluster[order>0] for order 0 > >> allocation. This can avoid to scan swap_map[] too. This may be not > >> perfect, but it is the simplest first step implementation. You can > >> optimize based on it further. > > > > Yes, that is listed as the limitation of this cluster order approach. > > Initially we need to support one order well first. We might choose > > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages > > are too big. The sweet spot might be some there in between. If we can > > support one order well, we can demonstrate the value of the mTHP. We > > can worry about other mix orders later. > > > > Do you have any suggestions for how to prevent the order 0 polluting > > the higher order cluster? If we allow that to happen, then it defeats > > the goal of being able to allocate higher order swap entries. The > > tricky question is we don't know how much swap space we should reserve > > for each order. We can always break higher order clusters to lower > > order, but can't do the reserves. The current patch series lets the > > actual usage determine the percentage of the cluster for each order. > > However that seems not enough for the test case Barry has. When the > > app gets OOM kill that is where a large swing of order 0 swap will > > show up and not enough higher order usage for the brief moment. The > > order 0 swap entry will pollute the high order cluster. We are > > currently debating a "knob" to be able to reserve a certain % of swap > > space for a certain order. Those reservations will be guaranteed and > > order 0 swap entry can't pollute them even when it runs out of swap > > space. That can make the mTHP at least usable for the Android case. > > IMO, the bottom line is that order-0 allocation is the first class > citizen, we must keep it optimized. And, OOM with free swap space isn't > acceptable. Please consider the policy we used for page allocation. We need to make order-0 and high order allocation both work after the initial pass of allocating from empty clusters. Order-0 allocation alone working is not good enough.
On the page allocation side, we have hugetlbfs, which reserves some memory for high order pages. We should have something similar to allow reserving some high order swap entries without getting them polluted by low order ones. > > > Do you see another way to protect the high order cluster polluted by > > lower order one? > > If we use high-order page allocation as reference, we need something > like compaction to guarantee high-order allocation finally. But we are > too far from that. We should consider reservation for high-order swap entry allocation, similar to hugetlbfs for memory. Swap compaction will be very complicated because it needs to scan the PTEs to migrate the swap entries. It might be easier to support folio write-out with compound discontiguous swap entries. That is another way to address the fragmentation issue. We are also too far from that right now. > > For specific configuration, I believe that we can get reasonable > high-order swap entry allocation success rate for specific use cases. > For example, if we only do limited maximum number order-0 swap entries > allocation, can we keep high-order clusters? Yes we can, by having a knob to reserve some high order swap space. Limiting order 0 is the same as having some high order swap entries reserved. That is a short term solution. Chris
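The reservation being debated can be pictured with a small model. This is only a hypothetical sketch of the policy: the knob, the field names, and the 25% figure are all invented for illustration, and no such interface exists in the series:

#include <stdio.h>

struct toy_swapdev {
	int free_clusters;	/* empty clusters usable by anyone */
	int reserved_hi;	/* clusters held back for high order */
};

static int take_empty_cluster(struct toy_swapdev *d, int order)
{
	int usable = d->free_clusters;

	if (order == 0)
		usable -= d->reserved_hi; /* order 0 can't touch the reserve */
	if (usable <= 0)
		return -1;
	d->free_clusters--;
	return 0;
}

int main(void)
{
	/* e.g. a knob reserving 25% of an 8-cluster device for high order */
	struct toy_swapdev d = { .free_clusters = 8, .reserved_hi = 2 };
	int taken = 0;

	while (take_empty_cluster(&d, 0) == 0)
		taken++;	/* order 0 churns until it hits the reserve */
	printf("order 0 took %d clusters, %d still reserved\n",
	       taken, d.free_clusters);
	printf("order 4 allocation succeeds: %s\n",
	       take_empty_cluster(&d, 4) == 0 ? "yes" : "no");
	return 0;
}

This also shows the cost Ying objects to below: once the reserve is in place, order 0 can fail (and potentially OOM) while free swap space still exists.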
Chris Li <chrisl@kernel.org> writes: > On Thu, May 30, 2024 at 7:37 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote: >> > because android does not have too many cpu. We are talking about a >> > handful of clusters, which might not justify the code complexity. It >> > does not change the behavior that order 0 can pollut higher order. >> >> I have a feeling that you don't really know why swap_map[] is scanned. >> I suggest you to do more test and tracing to find out the reason. I >> suspect that there are some non-full cluster collection issues. > > Swap_map[] is scanned because of running out of non full clusters. > This can happen because Android tries to make full use of the swapfile. > However, once the swap_map[] scan happens, the non full cluster is polluted. > > I currently don't have a local reproduction of the issue Barry reported. > However here is some data point: > Two swap files, one for high order allocation only with this patch. No > fall back. > If there is a non-full cluster collection issue, we should see the > fall back in this case as well. > > BTW, same setup without this patch series it will fall back on the > high order allocation as well. > >> >> >> Another issue is nonfull_cluster[order1] cannot be used for >> >> nonfull_cluster[order2]. In definition, we should not fail order 0 >> >> allocation, we need to steal nonfull_cluster[order>0] for order 0 >> >> allocation. This can avoid to scan swap_map[] too. This may be not >> >> perfect, but it is the simplest first step implementation. You can >> >> optimize based on it further. >> > >> > Yes, that is listed as the limitation of this cluster order approach. >> > Initially we need to support one order well first. We might choose >> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages >> > are too big. The sweet spot might be some there in between. If we can >> > support one order well, we can demonstrate the value of the mTHP. We >> > can worry about other mix orders later. >> > >> > Do you have any suggestions for how to prevent the order 0 polluting >> > the higher order cluster? If we allow that to happen, then it defeats >> > the goal of being able to allocate higher order swap entries. The >> > tricky question is we don't know how much swap space we should reserve >> > for each order. We can always break higher order clusters to lower >> > order, but can't do the reserves. The current patch series lets the >> > actual usage determine the percentage of the cluster for each order. >> > However that seems not enough for the test case Barry has. When the >> > app gets OOM kill that is where a large swing of order 0 swap will >> > show up and not enough higher order usage for the brief moment. The >> > order 0 swap entry will pollute the high order cluster. We are >> > currently debating a "knob" to be able to reserve a certain % of swap >> > space for a certain order. Those reservations will be guaranteed and >> > order 0 swap entry can't pollute them even when it runs out of swap >> > space. That can make the mTHP at least usable for the Android case. >> >> IMO, the bottom line is that order-0 allocation is the first class >> citizen, we must keep it optimized. And, OOM with free swap space isn't >> acceptable. Please consider the policy we used for page allocation. 
> > We need to make order-0 and high order allocation both can work after > the initial pass of allocating from empty clusters. > Only order-0 allocation work is not good enough. > > In the page allocation side, we have the hugetlbfs which reserve some > memory for high order pages. > We should have similar things to allow reserve some high order swap > entries without getting polluted by low order one. TBH, I don't like the idea of high order swap entries reservation. If that's really important for you, I think that it's better to design it like hugetlbfs vs. core mm, that is, separated from the normal swap subsystem as much as possible. >> >> > Do you see another way to protect the high order cluster polluted by >> > lower order one? >> >> If we use high-order page allocation as reference, we need something >> like compaction to guarantee high-order allocation finally. But we are >> too far from that. > > We should consider reservation for high-order swap entry allocation > similar to hugetlbfs for memory. > Swap compaction will be very complicated because it needs to scan the > PTE to migrate the swap entry. It might be easier to support folio > write out compound discontiguous swap entries. That is another way to > address the fragmentation issue. We are also too far from that as > right now. It's not easy to write out compound discontiguous swap entries either. For example, how do we put such folios in the swap cache? >> >> For specific configuration, I believe that we can get reasonable >> high-order swap entry allocation success rate for specific use cases. >> For example, if we only do limited maximum number order-0 swap entries >> allocation, can we keep high-order clusters? > > Yes we can. Having a knob to reserve some high order swap space. > Limiting order 0 is the same as having some high order swap entries > reserved. > > That is a short term solution. -- Best Regards, Huang, Ying
On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > In the page allocation side, we have the hugetlbfs which reserve some > > memory for high order pages. > > We should have similar things to allow reserve some high order swap > > entries without getting polluted by low order one. > > TBH, I don't like the idea of high order swap entries reservation. May I know more about why you don't like the idea? I understand this can be controversial, because previously we liked to treat THP as a best effort approach. If there is some reason we can't make THP, we use order 0 as the fallback. For discussion purposes, I want to break it down into smaller steps: First, can we agree that the following usage case is reasonable: The usage case is that, as Barry has shown, zsmalloc compresses sizes bigger than 4K and can get both a better compression ratio and a CPU performance gain. https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ So the goal is to make THP/mTHP have some reasonable success rate running with mixed size swap allocation, after either low order or high order swap requests overflow the swap file size. The allocator can still recover from that after some swap entries get freed. Please let me know if you think the above usage case and goal are not reasonable for the kernel. > that's really important for you, I think that it's better to design > something like hugetlbfs vs core mm, that is, be separated from the > normal swap subsystem as much as possible. I am citing hugetlbfs just to make the point of using reservation, or isolation of the resource, to prevent the mixed-order fragmentation existing in core mm. I am not suggesting copying the hugetlbfs implementation to the swap system. Unlike hugetlbfs, the swap allocation is typically done from the kernel; it is transparent to the application. I don't think separating it from the swap subsystem is a good way to go. This comes down to why you don't like the reservation. e.g. if we use two swapfiles, and one swapfile is purely allocated for high order, would that be better? > > >> > >> > Do you see another way to protect the high order cluster polluted by > >> > lower order one? > >> > >> If we use high-order page allocation as reference, we need something > >> like compaction to guarantee high-order allocation finally. But we are > >> too far from that. > > > > We should consider reservation for high-order swap entry allocation > > similar to hugetlbfs for memory. > > Swap compaction will be very complicated because it needs to scan the > > PTE to migrate the swap entry. It might be easier to support folio > > write out compound discontiguous swap entries. That is another way to > > address the fragmentation issue. We are also too far from that as > > right now. > > That's not easy to write out compound discontiguous swap entries too. > For example, how to put folios in swap cache? I proposed the idea in the recent LSF/MM discussion; the last few slides are about the discontiguous swap, and it has the discontiguous entries in the swap cache. https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view Agreed, it is not an easy change. The swap cache would have to change the assumption that all offsets are contiguous. For swap, we kind of have some in-memory data associated with each offset already, so it might provide an opportunity to combine the offset-related data structures for swap together. Another alternative might be using the xarray without the multi entry property,
just treating each offset like a single entry. I haven't dug deep into this direction yet. We can have more discussion, and maybe arrange an upstream alignment meeting if there is interest. Chris
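For a flavor of the "xarray without the multi entry property" direction, here is a heavily simplified userspace sketch: every swap offset gets its own cache slot recording the folio and the subpage it backs, so nothing assumes a folio's offsets are contiguous. Plain arrays stand in for the xarray, and all names are invented for the example:

#include <stdio.h>

#define NR_OFFSETS 16

struct toy_folio { int id; int nr_pages; };

struct cache_slot {
	struct toy_folio *folio;	/* NULL if offset unused */
	int subpage;			/* which page of the folio */
};

static struct cache_slot swap_cache[NR_OFFSETS];

/* Insert one folio at arbitrary, non-contiguous offsets. */
static void cache_add(struct toy_folio *f, const int *offsets)
{
	for (int i = 0; i < f->nr_pages; i++) {
		swap_cache[offsets[i]].folio = f;
		swap_cache[offsets[i]].subpage = i;
	}
}

int main(void)
{
	struct toy_folio f = { .id = 1, .nr_pages = 4 };
	int offsets[4] = { 3, 7, 8, 12 };	/* deliberately scattered */

	cache_add(&f, offsets);
	/* Lookup by offset works without any contiguity assumption. */
	printf("offset 8 -> folio %d, subpage %d\n",
	       swap_cache[8].folio->id, swap_cache[8].subpage);
	return 0;
}

The cost, as the thread notes, is one cache entry per offset instead of one multi-index entry per folio, plus keeping the per-offset slots of a folio consistent.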
Chris Li <chrisl@kernel.org> writes: > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> > >> > In the page allocation side, we have the hugetlbfs which reserve some >> > memory for high order pages. >> > We should have similar things to allow reserve some high order swap >> > entries without getting polluted by low order one. >> >> TBH, I don't like the idea of high order swap entries reservation. > May I know more if you don't like the idea? I understand this can be > controversial, because previously we like to take the THP as the best > effort approach. If there is some reason we can't make THP, we use the > order 0 as fall back. > > For discussion purpose, I want break it down to smaller steps: > > First, can we agree that the following usage case is reasonable: > The usage case is that, as Barry has shown, zsmalloc compresses bigger > size than 4K and can have both better compress ratio and CPU > performance gain. > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ > > So the goal is to make THP/mTHP have some reasonable success rate > running in the mix size swap allocation, after either low order or > high order swap requests can overflow the swap file size. The allocate > can still recover from that, after some swap entries got free. > > Please let me know if you think the above usage case and goal are not > reasonable for the kernel. I think that it's reasonable to improve the success rate of high-order swap entry allocation. I just think that it's hard to use the reservation-based method. For example, how much should be reserved? Why does the system OOM when there's still swap space available? And so forth. So, I prefer the transparent methods. Just like THP vs. hugetlbfs. >> that's really important for you, I think that it's better to design >> something like hugetlbfs vs core mm, that is, be separated from the >> normal swap subsystem as much as possible. > > I am giving hugetlbfs just to make the point using reservation, or > isolation of the resource to prevent mixing fragmentation existing in > core mm. > I am not suggesting copying the hugetlbfs implementation to the swap > system. Unlike hugetlbfs, the swap allocation is typically done from > the kernel, it is transparent from the application. I don't think > separate from the swap subsystem is a good way to go. > > This comes down to why you don't like the reservation. e.g. if we use > two swapfile, one swapfile is purely allocate for high order, would > that be better? Sorry, my words weren't accurate. Personally, I just think that it's better to make the reservation-related code not too intrusive. And, before reservation, we need to consider something else first. Is it generally good to swap in with the swap-out order? Should we consider memory wastage too? One static policy doesn't fit all; we may need either a dynamic policy, or to make the policy configurable. In general, I think that we need to do this step by step. >> >> > Do you see another way to protect the high order cluster polluted by >> >> > lower order one? >> >> >> >> If we use high-order page allocation as reference, we need something >> >> like compaction to guarantee high-order allocation finally. But we are >> >> too far from that. >> > >> > We should consider reservation for high-order swap entry allocation >> > similar to hugetlbfs for memory. >> > Swap compaction will be very complicated because it needs to scan the >> > PTE to migrate the swap entry.
>> > It might be easier to support folio >> > write out compound discontiguous swap entries. That is another way to >> > address the fragmentation issue. We are also too far from that as >> > right now. >> >> That's not easy to write out compound discontiguous swap entries too. >> For example, how to put folios in swap cache? > > I propose the idea in the recent LSF/MM discussion, the last few > slides are for the discontiguous swap and it has the discontiguous > entries in swap cache. > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view > > Agree it is not an easy change. The cache cache would have to change > the assumption all offset are contiguous. > For swap, we kind of have some in memory data associated with per > offset already, so it might provide an opportunity to combine the > offset related data structure for swap together. Another alternative > might be using xarray without the multi entry property. , just treat > each offset like a single entry. I haven't dug deep into this > direction yet. Thanks! I will study your idea. > We can have more discussion, maybe arrange an upstream alignment > meeting if there is interest. Sure. -- Best Regards, Huang, Ying
On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Chris Li <chrisl@kernel.org> writes: > >> > > > >> > In the page allocation side, we have the hugetlbfs which reserve some > >> > memory for high order pages. > >> > We should have similar things to allow reserve some high order swap > >> > entries without getting polluted by low order one. > >> > >> TBH, I don't like the idea of high order swap entries reservation. > > May I know more if you don't like the idea? I understand this can be > > controversial, because previously we like to take the THP as the best > > effort approach. If there is some reason we can't make THP, we use the > > order 0 as fall back. > > > > For discussion purpose, I want break it down to smaller steps: > > > > First, can we agree that the following usage case is reasonable: > > The usage case is that, as Barry has shown, zsmalloc compresses bigger > > size than 4K and can have both better compress ratio and CPU > > performance gain. > > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ > > > > So the goal is to make THP/mTHP have some reasonable success rate > > running in the mix size swap allocation, after either low order or > > high order swap requests can overflow the swap file size. The allocate > > can still recover from that, after some swap entries got free. > > > > Please let me know if you think the above usage case and goal are not > > reasonable for the kernel. > > I think that it's reasonable to improve the success rate of high-order Glad to hear that. > swap entries allocation. I just think that it's hard to use the > reservation based method. For example, how much should be reserved? Understood, it is harder to use than a fully transparent method, but still better than no solution at all. The alternative right now is we can't do it. Regarding how much we should reserve: similarly, how do you choose your swap file size? If you choose N, why not N*120% or N*80%? That did not stop us from having a swapfile, right? > Why system OOM when there's still swap space available? And so forth. Keep in mind that the reservation is an option. If you prefer the old behavior, you don't have to use the reservation. That shouldn't be a reason to stop others who want to use it. We don't have an alternative solution for long-run mixed size allocation yet. If there is one, I would like to hear it. > So, I prefer the transparent methods. Just like THP vs. hugetlbfs. Me too. I prefer transparent over reservation if it can achieve the same goal. Do we have a fully transparent method specced out? How do we achieve full transparency and also avoid the fragmentation caused by mixed order allocation/free? Keep in mind that we are still in the early stage of the mTHP swap development; I can have the reservation patch ready relatively easily. If you come up with a better transparent method patch which can achieve the same goal later, we can use it instead. > > >> that's really important for you, I think that it's better to design > >> something like hugetlbfs vs core mm, that is, be separated from the > >> normal swap subsystem as much as possible. > > > > I am giving hugetlbfs just to make the point using reservation, or > > isolation of the resource to prevent mixing fragmentation existing in > > core mm. > > I am not suggesting copying the hugetlbfs implementation to the swap > > system.
> > Unlike hugetlbfs, the swap allocation is typically done from > > the kernel, it is transparent from the application. I don't think > > separate from the swap subsystem is a good way to go. > > > > This comes down to why you don't like the reservation. e.g. if we use > > two swapfile, one swapfile is purely allocate for high order, would > > that be better? > > Sorry, my words weren't accurate. Personally, I just think that it's > better to make reservation related code not too intrusive. Yes. I will try to make it not too intrusive. > And, before reservation, we need to consider something else firstly. > Whether is it generally good to swap-in with swap-out order? Should we When we have the reservation patch (or other means to sustain mixed size swap allocation/free), we can test it out to get more data to reason about it. I consider the swap-in size policy an orthogonal issue. > consider memory wastage too? One static policy doesn't fit all, we may > need either a dynamic policy, or make policy configurable. > In general, I think that we need to do this step by step. The core swap layer needs to be able to sustain mixed size swap allocation and free in the long run. Without that, the swap-in size policy is meaningless. Yes, that is the step by step approach, allowing long-run mixed size swap allocation as the first step. > >> >> > Do you see another way to protect the high order cluster polluted by > >> >> > lower order one? > >> >> > >> >> If we use high-order page allocation as reference, we need something > >> >> like compaction to guarantee high-order allocation finally. But we are > >> >> too far from that. > >> > > >> > We should consider reservation for high-order swap entry allocation > >> > similar to hugetlbfs for memory. > >> > Swap compaction will be very complicated because it needs to scan the > >> > PTE to migrate the swap entry. It might be easier to support folio > >> > write out compound discontiguous swap entries. That is another way to > >> > address the fragmentation issue. We are also too far from that as > >> > right now. > >> > >> That's not easy to write out compound discontiguous swap entries too. > >> For example, how to put folios in swap cache? > > > > I propose the idea in the recent LSF/MM discussion, the last few > > slides are for the discontiguous swap and it has the discontiguous > > entries in swap cache. > > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view > > > > Agree it is not an easy change. The cache cache would have to change > > the assumption all offset are contiguous. > > For swap, we kind of have some in memory data associated with per > > offset already, so it might provide an opportunity to combine the > > offset related data structure for swap together. Another alternative > > might be using xarray without the multi entry property. , just treat > > each offset like a single entry. I haven't dug deep into this > > direction yet. > > Thanks! I will study your idea. > I am happy to discuss if you have any questions. > > We can have more discussion, maybe arrange an upstream alignment > > meeting if there is interest. > > Sure. Ideally, if we can resolve our differences over the mailing list, then we don't need to have a separate meeting :-) Chris
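The "long run mixed size allocation" problem Chris keeps returning to is easy to demonstrate in a few lines of C. This deterministic toy is a model of the fragmentation effect only, not of the kernel allocator: order-0 churn leaves half of the slots free, yet no aligned order-2 run survives:

#include <stdio.h>

#define SLOTS 256
static char map[SLOTS];	/* 1 = slot in use */

/* allocate a naturally aligned run of len free slots, first fit */
static int alloc_run(int len)
{
	for (int i = 0; i + len <= SLOTS; i += len) {
		int j;

		for (j = 0; j < len && !map[i + j]; j++)
			;
		if (j == len) {
			while (j--)
				map[i + j] = 1;
			return i;
		}
	}
	return -1;
}

int main(void)
{
	while (alloc_run(1) >= 0)	/* order-0 entries fill everything */
		;
	for (int i = 0; i < SLOTS; i += 2)
		map[i] = 0;	/* scattered order-0 frees: single-slot holes */

	int free_slots = 0;
	for (int i = 0; i < SLOTS; i++)
		free_slots += !map[i];
	printf("%d of %d slots free, order-2 run: %s\n", free_slots, SLOTS,
	       alloc_run(4) >= 0 ? "found" : "none (fragmented)");
	return 0;
}

Half the device is free, but every hole is one slot wide, which is exactly the near-100% mTHP fallback state Barry reported.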
Chris Li <chrisl@kernel.org> writes: > On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> > >> >> > In the page allocation side, we have the hugetlbfs which reserve some >> >> > memory for high order pages. >> >> > We should have similar things to allow reserve some high order swap >> >> > entries without getting polluted by low order one. >> >> >> >> TBH, I don't like the idea of high order swap entries reservation. >> > May I know more if you don't like the idea? I understand this can be >> > controversial, because previously we like to take the THP as the best >> > effort approach. If there is some reason we can't make THP, we use the >> > order 0 as fall back. >> > >> > For discussion purpose, I want break it down to smaller steps: >> > >> > First, can we agree that the following usage case is reasonable: >> > The usage case is that, as Barry has shown, zsmalloc compresses bigger >> > size than 4K and can have both better compress ratio and CPU >> > performance gain. >> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ >> > >> > So the goal is to make THP/mTHP have some reasonable success rate >> > running in the mix size swap allocation, after either low order or >> > high order swap requests can overflow the swap file size. The allocate >> > can still recover from that, after some swap entries got free. >> > >> > Please let me know if you think the above usage case and goal are not >> > reasonable for the kernel. >> >> I think that it's reasonable to improve the success rate of high-order > > Glad to hear that. > >> swap entries allocation. I just think that it's hard to use the >> reservation based method. For example, how much should be reserved? > > Understand, it is harder to use than a fully transparent method, but > still better than no solution at all. The alternative right now is we > can't do it. > > Regarding how much we should reserve. Similarly, how much should you > choose your swap file size? If you choose N, why not N*120% or N*80%? > That did not stop us from having a swapfile, right? > >> Why system OOM when there's still swap space available? And so forth. > > Keep in mind that the reservation is an option. If you prefer the old > behavior, you don't have to use the reservation. That shouldn't be a > reason to stop others who want to use it. We don't have an alternative > solution for the long run mix size allocation yet. If there is, I like > to hear it. It's not enough to make it optional. When you run into an issue, you need to debug it. And you may debug an issue on a system that is configured by someone else. >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs. > > Me too. I prefer transparent over reservation if it can achieve the > same goal. Do we have a fully transparent method spec out? How to > achieve fully transparent and also avoid fragmentation caused by mix > order allocation/free? > > Keep in mind that we are still in the early stage of the mTHP swap > development, I can have the reservation patch relatively easily. If > you come up with a better transparent method patch which can achieve > the same goal later, we can use it instead. Because we are still in the early stage, I think that we should try to improve the transparent solution first.
Personally, what I don't like is that we don't work on the transparent solution because we have the reservation solution. >> >> >> that's really important for you, I think that it's better to design >> >> something like hugetlbfs vs core mm, that is, be separated from the >> >> normal swap subsystem as much as possible. >> > >> > I am giving hugetlbfs just to make the point using reservation, or >> > isolation of the resource to prevent mixing fragmentation existing in >> > core mm. >> > I am not suggesting copying the hugetlbfs implementation to the swap >> > system. Unlike hugetlbfs, the swap allocation is typically done from >> > the kernel, it is transparent from the application. I don't think >> > separate from the swap subsystem is a good way to go. >> > >> > This comes down to why you don't like the reservation. e.g. if we use >> > two swapfile, one swapfile is purely allocate for high order, would >> > that be better? >> >> Sorry, my words weren't accurate. Personally, I just think that it's >> better to make reservation related code not too intrusive. > > Yes. I will try to make it not too intrusive. > >> And, before reservation, we need to consider something else firstly. >> Whether is it generally good to swap-in with swap-out order? Should we > > When we have the reservation patch (or other means to sustain mix size > swap allocation/free), we can test it out to get more data to reason > about it. > I consider the swap in size policy an orthogonal issue. No. I don't think so. If you swap out in a higher order but swap in in a lower order, you make the swap clusters fragmented. >> consider memory wastage too? One static policy doesn't fit all, we may >> need either a dynamic policy, or make policy configurable. >> In general, I think that we need to do this step by step. > > The core swap layer needs to be able to sustain mix size swap > allocation free in the long run. Without that the swap in size policy > is meaningless. > > Yes, that is the step by step approach. Allowing long run mix size > swap allocation as the first step. > >> >> >> > Do you see another way to protect the high order cluster polluted by >> >> >> > lower order one? >> >> >> >> >> >> If we use high-order page allocation as reference, we need something >> >> >> like compaction to guarantee high-order allocation finally. But we are >> >> >> too far from that. >> >> > >> >> > We should consider reservation for high-order swap entry allocation >> >> > similar to hugetlbfs for memory. >> >> > Swap compaction will be very complicated because it needs to scan the >> >> > PTE to migrate the swap entry. It might be easier to support folio >> >> > write out compound discontiguous swap entries. That is another way to >> >> > address the fragmentation issue. We are also too far from that as >> >> > right now. >> >> >> >> That's not easy to write out compound discontiguous swap entries too. >> >> For example, how to put folios in swap cache? >> > >> > I propose the idea in the recent LSF/MM discussion, the last few >> > slides are for the discontiguous swap and it has the discontiguous >> > entries in swap cache. >> > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view >> > >> > Agree it is not an easy change. The cache cache would have to change >> > the assumption all offset are contiguous. >> > For swap, we kind of have some in memory data associated with per >> > offset already, so it might provide an opportunity to combine the >> > offset related data structure for swap together.
>> > Another alternative >> > might be using xarray without the multi entry property. , just treat >> > each offset like a single entry. I haven't dug deep into this >> > direction yet. >> >> Thanks! I will study your idea. >> > > I am happy to discuss if you have any questions. > >> > We can have more discussion, maybe arrange an upstream alignment >> > meeting if there is interest. >> >> Sure. > > Ideally, if we can resolve our differences over the mail list then we > don't need to have a separate meeting :-) > -- Best Regards, Huang, Ying
On Thu, Jun 13, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Chris Li <chrisl@kernel.org> writes: > >> > >> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Chris Li <chrisl@kernel.org> writes: > >> >> > >> > > >> >> > In the page allocation side, we have the hugetlbfs which reserve some > >> >> > memory for high order pages. > >> >> > We should have similar things to allow reserve some high order swap > >> >> > entries without getting polluted by low order one. > >> >> > >> >> TBH, I don't like the idea of high order swap entries reservation. > >> > May I know more if you don't like the idea? I understand this can be > >> > controversial, because previously we like to take the THP as the best > >> > effort approach. If there is some reason we can't make THP, we use the > >> > order 0 as fall back. > >> > > >> > For discussion purpose, I want break it down to smaller steps: > >> > > >> > First, can we agree that the following usage case is reasonable: > >> > The usage case is that, as Barry has shown, zsmalloc compresses bigger > >> > size than 4K and can have both better compress ratio and CPU > >> > performance gain. > >> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ > >> > > >> > So the goal is to make THP/mTHP have some reasonable success rate > >> > running in the mix size swap allocation, after either low order or > >> > high order swap requests can overflow the swap file size. The allocate > >> > can still recover from that, after some swap entries got free. > >> > > >> > Please let me know if you think the above usage case and goal are not > >> > reasonable for the kernel. > >> > >> I think that it's reasonable to improve the success rate of high-order > > > > Glad to hear that. > > > >> swap entries allocation. I just think that it's hard to use the > >> reservation based method. For example, how much should be reserved? > > > > Understand, it is harder to use than a fully transparent method, but > > still better than no solution at all. The alternative right now is we > > can't do it. > > > > Regarding how much we should reserve. Similarly, how much should you > > choose your swap file size? If you choose N, why not N*120% or N*80%? > > That did not stop us from having a swapfile, right? > > > >> Why system OOM when there's still swap space available? And so forth. > > > > Keep in mind that the reservation is an option. If you prefer the old > > behavior, you don't have to use the reservation. That shouldn't be a > > reason to stop others who want to use it. We don't have an alternative > > solution for the long run mix size allocation yet. If there is, I like > > to hear it. > > It's not enough to make it optional. When you run into issue, you need > to debug it. And you may debug an issue on a system that is configured > by someone else. That is in general true with all kernel development regardless of using options or not. If there is a bug in my patch, I will need to debug and fix it or the patch might be reverted. I don't see that as a reason to take the option path or not. The option just means the user taking this option will need to understand the trade off and accept the defined behavior of that option. > > >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs. > > > > Me too. 
> > I prefer transparent over reservation if it can achieve the > > same goal. Do we have a fully transparent method spec out? How to > > achieve fully transparent and also avoid fragmentation caused by mix > > order allocation/free? > > > > Keep in mind that we are still in the early stage of the mTHP swap > > development, I can have the reservation patch relatively easily. If > > you come up with a better transparent method patch which can achieve > > the same goal later, we can use it instead. > > Because we are still in the early stage, I think that we should try to > improve transparent solution firstly. Personally, what I don't like is > that we don't work on the transparent solution because we have the > reservation solution. Do you have a road map or design for the transparent solution that you can share? I am interested to know the short term step (e.g. a month) of this transparent solution you have in mind, so we can compare the different approaches. I can't reason much just by the name "transparent solution" itself. Need more technical details. Right now we have a clear usage case we want to support: swapping mTHP in/out with bigger zsmalloc buffers. We can start with the limited usage case first, then move to more general ones. > >> >> that's really important for you, I think that it's better to design > >> >> something like hugetlbfs vs core mm, that is, be separated from the > >> >> normal swap subsystem as much as possible. > >> > > >> > I am giving hugetlbfs just to make the point using reservation, or > >> > isolation of the resource to prevent mixing fragmentation existing in > >> > core mm. > >> > I am not suggesting copying the hugetlbfs implementation to the swap > >> > system. Unlike hugetlbfs, the swap allocation is typically done from > >> > the kernel, it is transparent from the application. I don't think > >> > separate from the swap subsystem is a good way to go. > >> > > >> > This comes down to why you don't like the reservation. e.g. if we use > >> > two swapfile, one swapfile is purely allocate for high order, would > >> > that be better? > >> > >> Sorry, my words weren't accurate. Personally, I just think that it's > >> better to make reservation related code not too intrusive. > > > > Yes. I will try to make it not too intrusive. > > > >> And, before reservation, we need to consider something else firstly. > >> Whether is it generally good to swap-in with swap-out order? Should we > > > > When we have the reservation patch (or other means to sustain mix size > > swap allocation/free), we can test it out to get more data to reason > > about it. > > I consider the swap in size policy an orthogonal issue. > > No. I don't think so. If you swap-out in higher order, but swap-in in > lower order, you make the swap clusters fragmented. Sounds like that is the reason to make swap-in use the same order as the swap-out. In any case, my original point still stands. We need to have the ability to allocate high order swap entries with a reasonable success rate *before* we have the option to choose which size to swap in. If allocating a high order swap always fails, we will be forced to use the low order one; there is no option to choose from. We can't evaluate "is it generally good to swap-in with swap-out order?" by actual runs. Chris
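Evaluating "by actual runs" mostly means watching the per-size swpout/swpout_fallback counters quoted earlier in the thread. A small hedged helper for that, assuming the mTHP stats layout under /sys/kernel/mm/transparent_hugepage/ (counter names vary across kernel versions, e.g. anon_swpout vs. swpout, so adjust for your kernel):

#include <stdio.h>

static long read_stat(const char *size, const char *name)
{
	char path[256];
	long v = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/transparent_hugepage/hugepages-%s/stats/%s",
		 size, name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &v) != 1)
		v = -1;
	fclose(f);
	return v;
}

int main(void)
{
	const char *sizes[] = { "16kB", "32kB", "64kB", "128kB", "256kB" };

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		long ok = read_stat(sizes[i], "swpout");
		long fb = read_stat(sizes[i], "swpout_fallback");

		if (ok < 0 || fb < 0)
			continue;	/* size not present on this kernel */
		printf("%-6s swpout=%ld fallback=%ld (%.1f%% fallback)\n",
		       sizes[i], ok, fb,
		       ok + fb ? 100.0 * fb / (ok + fb) : 0.0);
	}
	return 0;
}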
Chris Li <chrisl@kernel.org> writes: > On Thu, Jun 13, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> >> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> >> >> > >> >> >> > In the page allocation side, we have the hugetlbfs which reserve some >> >> >> > memory for high order pages. >> >> >> > We should have similar things to allow reserve some high order swap >> >> >> > entries without getting polluted by low order one. >> >> >> >> >> >> TBH, I don't like the idea of high order swap entries reservation. >> >> > May I know more if you don't like the idea? I understand this can be >> >> > controversial, because previously we like to take the THP as the best >> >> > effort approach. If there is some reason we can't make THP, we use the >> >> > order 0 as fall back. >> >> > >> >> > For discussion purpose, I want break it down to smaller steps: >> >> > >> >> > First, can we agree that the following usage case is reasonable: >> >> > The usage case is that, as Barry has shown, zsmalloc compresses bigger >> >> > size than 4K and can have both better compress ratio and CPU >> >> > performance gain. >> >> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ >> >> > >> >> > So the goal is to make THP/mTHP have some reasonable success rate >> >> > running in the mix size swap allocation, after either low order or >> >> > high order swap requests can overflow the swap file size. The allocate >> >> > can still recover from that, after some swap entries got free. >> >> > >> >> > Please let me know if you think the above usage case and goal are not >> >> > reasonable for the kernel. >> >> >> >> I think that it's reasonable to improve the success rate of high-order >> > >> > Glad to hear that. >> > >> >> swap entries allocation. I just think that it's hard to use the >> >> reservation based method. For example, how much should be reserved? >> > >> > Understand, it is harder to use than a fully transparent method, but >> > still better than no solution at all. The alternative right now is we >> > can't do it. >> > >> > Regarding how much we should reserve. Similarly, how much should you >> > choose your swap file size? If you choose N, why not N*120% or N*80%? >> > That did not stop us from having a swapfile, right? >> > >> >> Why system OOM when there's still swap space available? And so forth. >> > >> > Keep in mind that the reservation is an option. If you prefer the old >> > behavior, you don't have to use the reservation. That shouldn't be a >> > reason to stop others who want to use it. We don't have an alternative >> > solution for the long run mix size allocation yet. If there is, I like >> > to hear it. >> >> It's not enough to make it optional. When you run into issue, you need >> to debug it. And you may debug an issue on a system that is configured >> by someone else. > > That is in general true with all kernel development regardless of > using options or not. If there is a bug in my patch, I will need to > debug and fix it or the patch might be reverted. > > I don't see that as a reason to take the option path or not. The > option just means the user taking this option will need to understand > the trade off and accept the defined behavior of that option. 
User configuration knobs are not forbidden for the Linux kernel. But we are more careful about them because they introduce ABI which we need to maintain forever. And they are hard for users to use. Optimizing automatically is generally the better solution. So, I suggest you think more about the automatic solution before diving into a new option. >> >> >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs. >> > >> > Me too. I prefer transparent over reservation if it can achieve the >> > same goal. Do we have a fully transparent method spec out? How to >> > achieve fully transparent and also avoid fragmentation caused by mix >> > order allocation/free? >> > >> > Keep in mind that we are still in the early stage of the mTHP swap >> > development, I can have the reservation patch relatively easily. If >> > you come up with a better transparent method patch which can achieve >> > the same goal later, we can use it instead. >> >> Because we are still in the early stage, I think that we should try to >> improve transparent solution firstly. Personally, what I don't like is >> that we don't work on the transparent solution because we have the >> reservation solution. > > Do you have a road map or the design for the transparent solution you can share? > I am interested to know what is the short term step(e.g. a month) in > this transparent solution you have in mind, so we can compare the > different approaches. I can't reason much just by the name > "transparent solution" itself. Need more technical details. > > Right now we have a clear usage case we want to support, the swap > in/out mTHP with bigger zsmalloc buffers. We can start with the > limited usage case first then move to more general ones. TBH, this is what I don't like. It appears that you refuse to think about the transparent (or automatic) solution. I haven't thought about them thoroughly, but at least we may think about - promoting a low order non-full cluster when we find free high order swap entries. - stealing a low order non-full cluster with a low usage count for high-order allocation. - freeing more swap entries when swap devices become fragmented. >> >> >> that's really important for you, I think that it's better to design >> >> >> something like hugetlbfs vs core mm, that is, be separated from the >> >> >> normal swap subsystem as much as possible. >> >> > >> >> > I am giving hugetlbfs just to make the point using reservation, or >> >> > isolation of the resource to prevent mixing fragmentation existing in >> >> > core mm. >> >> > I am not suggesting copying the hugetlbfs implementation to the swap >> >> > system. Unlike hugetlbfs, the swap allocation is typically done from >> >> > the kernel, it is transparent from the application. I don't think >> >> > separate from the swap subsystem is a good way to go. >> >> > >> >> > This comes down to why you don't like the reservation. e.g. if we use >> >> > two swapfile, one swapfile is purely allocate for high order, would >> >> > that be better? >> >> >> >> Sorry, my words weren't accurate. Personally, I just think that it's >> >> better to make reservation related code not too intrusive. >> > >> > Yes. I will try to make it not too intrusive. >> > >> >> And, before reservation, we need to consider something else firstly. >> >> Whether is it generally good to swap-in with swap-out order?
>> >> Should we >> > >> > When we have the reservation patch (or other means to sustain mix size >> > swap allocation/free), we can test it out to get more data to reason >> > about it. >> > I consider the swap in size policy an orthogonal issue. >> >> No. I don't think so. If you swap-out in higher order, but swap-in in >> lower order, you make the swap clusters fragmented. > > Sounds like that is the reason to apply swap-in the same order of the swap out. > In any case, my original point still stands. We need to have the > ability to allocate high order swap entries with reasonable success > rate *before* we have the option to choose which size to swap in. If > allocating a high order swap always fails, we will be forced to use > the low order one, there is no option to choose from. We can't evalute > "is it generally good to swap-in with swap-out order?" by actual runs. I think we don't need to fight for that. Just prove the value of your patchset with reasonable use cases and normal workloads. Data will persuade people. -- Best Regards, Huang, Ying
On Mon, Jun 17, 2024 at 11:56 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > That is in general true with all kernel development regardless of > > using options or not. If there is a bug in my patch, I will need to > > debug and fix it or the patch might be reverted. > > > > I don't see that as a reason to take the option path or not. The > > option just means the user taking this option will need to understand > > the trade off and accept the defined behavior of that option. > > User configuration knobs are not forbidden for Linux kernel. But we are > more careful about them because they will introduce ABI which we need to > maintain forever. And they are hard to be used for users. Optimizing > automatically is generally the better solution. So, I suggest you to > think more about the automatically solution before diving into a new > option. I did, see my reply. Right now there are just no other options. > > >> > >> >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs. > >> > > >> > Me too. I prefer transparent over reservation if it can achieve the > >> > same goal. Do we have a fully transparent method spec out? How to > >> > achieve fully transparent and also avoid fragmentation caused by mix > >> > order allocation/free? > >> > > >> > Keep in mind that we are still in the early stage of the mTHP swap > >> > development, I can have the reservation patch relatively easily. If > >> > you come up with a better transparent method patch which can achieve > >> > the same goal later, we can use it instead. > >> > >> Because we are still in the early stage, I think that we should try to > >> improve transparent solution firstly. Personally, what I don't like is > >> that we don't work on the transparent solution because we have the > >> reservation solution. > > > > Do you have a road map or the design for the transparent solution you can share? > > I am interested to know what is the short term step(e.g. a month) in > > this transparent solution you have in mind, so we can compare the > > different approaches. I can't reason much just by the name > > "transparent solution" itself. Need more technical details. > > > > Right now we have a clear usage case we want to support, the swap > > in/out mTHP with bigger zsmalloc buffers. We can start with the > > limited usage case first then move to more general ones. > > TBH, This is what I don't like. It appears that you refuse to think > about the transparent (or automatic) solution. Actually, that is not true; you made the wrong assumption about what I have considered. I want to find out what you have in mind to compare the near term solutions. In my recent LSF slides I already listed 3 options to address this fragmentation problem, from easy to hard: 1) Assign the cluster an order on allocation and remember the cluster order (short term). That is this patch series. 2) Buddy allocation on the swap entries (longer term). 3) Folio write-out of compound discontiguous swap entries (ultimate). I also considered 4), which I did not put into the slides because it is less effective than 3): 4) migrating the swap entries, which requires scanning page table entries. I briefly mentioned it during the session. 3) might qualify as your transparent solution. It is just much harder to implement. Even when we have 3), having some form of 1) can be beneficial as well (lower IO count, no indirection layer for the swap offset).
> > I haven't thought about them thoroughly, but at least we may think about > > - promoting low order non-full cluster when we find a free high order > swap entries. > > - stealing a low order non-full cluster with low usage count for > high-order allocation. Now we are talking. These two above fall well within 2), the buddy allocator. But the buddy allocator will not be able to address all fragmentation issues, due to the allocator not controlling the life cycle of the swap entries. It will not help Barry's zsmalloc usage case much because Android likes to keep the swapfile full. I can already see that. > - freeing more swap entries when swap devices become fragmented. That requires scanning the page table to free the swap entries, basically 4). It is all about investment and return. 1) is relatively easy to implement and comes with good improvement and return. Chris > >> >> >> that's really important for you, I think that it's better to design > >> >> >> something like hugetlbfs vs core mm, that is, be separated from the > >> >> >> normal swap subsystem as much as possible. > >> >> > > >> >> > I am giving hugetlbfs just to make the point using reservation, or > >> >> > isolation of the resource to prevent mixing fragmentation existing in > >> >> > core mm. > >> >> > I am not suggesting copying the hugetlbfs implementation to the swap > >> >> > system. Unlike hugetlbfs, the swap allocation is typically done from > >> >> > the kernel, it is transparent from the application. I don't think > >> >> > separate from the swap subsystem is a good way to go. > >> >> > > >> >> > This comes down to why you don't like the reservation. e.g. if we use > >> >> > two swapfile, one swapfile is purely allocate for high order, would > >> >> > that be better? > >> >> > >> >> Sorry, my words weren't accurate. Personally, I just think that it's > >> >> better to make reservation related code not too intrusive. > >> > > >> > Yes. I will try to make it not too intrusive. > >> > > >> >> And, before reservation, we need to consider something else firstly. > >> >> Whether is it generally good to swap-in with swap-out order? Should we > >> > > >> > When we have the reservation patch (or other means to sustain mix size > >> > swap allocation/free), we can test it out to get more data to reason > >> > about it. > >> > I consider the swap in size policy an orthogonal issue. > >> > >> No. I don't think so. If you swap-out in higher order, but swap-in in > >> lower order, you make the swap clusters fragmented. > > > > Sounds like that is the reason to apply swap-in the same order of the swap out. > > In any case, my original point still stands. We need to have the > > ability to allocate high order swap entries with reasonable success > > rate *before* we have the option to choose which size to swap in. If > > allocating a high order swap always fails, we will be forced to use > > the low order one, there is no option to choose from. We can't evalute > > "is it generally good to swap-in with swap-out order?" by actual runs. > > I think we don't need to fight for that. Just prove the value of your > patchset with reasonable use cases and normal workloads. Data will > persuade people.
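One flavor of option 2): within a single cluster, a buddy-style scan can tell which order the cluster can still serve, and it also shows how one order-0 allocation knocks the servable order down. A toy sketch over a 32-slot free bitmap, illustrative only:

#include <stdio.h>

#define CLUSTER_SLOTS 32

/* bit i set in free_mask = slot i is free */
static int highest_free_order(unsigned long long free_mask)
{
	int order = 0;

	for (int len = 2; len <= CLUSTER_SLOTS; len <<= 1) {
		unsigned long long block = (1ull << len) - 1;
		int found = 0;

		for (int i = 0; i + len <= CLUSTER_SLOTS; i += len)
			if (((free_mask >> i) & block) == block)
				found = 1;
		if (!found)
			break;
		order++;	/* an aligned free run of 2^order exists */
	}
	return order;
}

int main(void)
{
	unsigned long long free_mask = 0xff00;	/* slots 8..15 free */

	printf("largest servable order: %d\n", highest_free_order(free_mask));
	free_mask &= ~(1ull << 12);	/* one order-0 entry lands inside */
	printf("after order-0 pollution: %d\n", highest_free_order(free_mask));
	return 0;
}

As Chris notes, merging buddies only works on slots that happen to be free; since the allocator does not control when entries are freed, a full swapfile leaves the buddy pass nothing to merge.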
Chris Li <chrisl@kernel.org> writes: > On Mon, Jun 17, 2024 at 11:56 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > That is in general true with all kernel development regardless of >> > using options or not. If there is a bug in my patch, I will need to >> > debug and fix it or the patch might be reverted. >> > >> > I don't see that as a reason to take the option path or not. The >> > option just means the user taking this option will need to understand >> > the trade off and accept the defined behavior of that option. >> >> User configuration knobs are not forbidden for Linux kernel. But we are >> more careful about them because they will introduce ABI which we need to >> maintain forever. And they are hard to be used for users. Optimizing >> automatically is generally the better solution. So, I suggest you to >> think more about the automatically solution before diving into a new >> option. > > I did, see my reply. Right now there are just no other options. > >> >> >> >> >> >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs. >> >> > >> >> > Me too. I prefer transparent over reservation if it can achieve the >> >> > same goal. Do we have a fully transparent method spec out? How to >> >> > achieve fully transparent and also avoid fragmentation caused by mix >> >> > order allocation/free? >> >> > >> >> > Keep in mind that we are still in the early stage of the mTHP swap >> >> > development, I can have the reservation patch relatively easily. If >> >> > you come up with a better transparent method patch which can achieve >> >> > the same goal later, we can use it instead. >> >> >> >> Because we are still in the early stage, I think that we should try to >> >> improve transparent solution firstly. Personally, what I don't like is >> >> that we don't work on the transparent solution because we have the >> >> reservation solution. >> > >> > Do you have a road map or the design for the transparent solution you can share? >> > I am interested to know what is the short term step(e.g. a month) in >> > this transparent solution you have in mind, so we can compare the >> > different approaches. I can't reason much just by the name >> > "transparent solution" itself. Need more technical details. >> > >> > Right now we have a clear usage case we want to support, the swap >> > in/out mTHP with bigger zsmalloc buffers. We can start with the >> > limited usage case first then move to more general ones. >> >> TBH, This is what I don't like. It appears that you refuse to think >> about the transparent (or automatic) solution. > > Actually, that is not true, you make the wrong assumption about what I > have considered. I want to find out what you have in mind to compare > the near term solutions. Sorry about my wrong assumption. > In my recent LSF slide I already list 3 options to address this > fragmentation problem. > From easy to hard: > 1) Assign cluster an order on allocation and remember the cluster > order. (short term). > That is this patch series > 2) Buddy allocation on the swap entry (longer term) > 3) Folio write out compound discontinuous swap entry. (ultimate) > > I also considered 4), which I did not put into the slide, because it > is less effective than 3) > 4) migrating the swap entries, which require scan page table entry. > I briefly mentioned it during the session. Or you need something like an rmap, which isn't easy. > 3) should might qualify as your transparent solution. It is just much > harder to implement.
> Even when we have 3), having some form of 1) can be beneficial as
> well. (less IO count, no indirect layer of swap offset)

>> I haven't thought about them thoroughly, but at least we may think about
>>
>> - promoting low order non-full cluster when we find free high order
>> swap entries.
>>
>> - stealing a low order non-full cluster with low usage count for
>> high-order allocation.
>
> Now we are talking.
> These two ideas fall well within 2), the buddy allocator. But a buddy
> allocator will not be able to address all fragmentation issues,
> because the allocator does not control the life cycle of the swap
> entries. It will not help Barry's zsmalloc use case much, because
> Android likes to keep the swapfile full. I can already see that.

I think that a buddy-like allocator (not exactly the buddy algorithm)
will help fragmentation. And it will help more users because it works
automatically. I don't think these are too hard to implement. We can
try to find some simple solution first. So, I think that we don't need
to push them to the long term. At least, they can be done before
introducing a high-order cluster reservation ABI. Then, we can evaluate
the benefit and overhead of the reservation ABI.

>> - freeing more swap entries when swap devices become fragmented.
>
> That requires scanning page tables to free the swap entries, which is
> basically 4).

No. You can just scan the page table of the current process in
do_swap_page() and try to swap in and free more swap entries. That
doesn't work well for shared pages. However, I think that it can help
quite some workloads.

> It is all about investment and return. 1) is relatively easy to
> implement, with a good improvement and return.

[snip]

--
Best Regards,
Huang, Ying
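Ying's "promoting" idea can be sketched concretely, building on the
structures from the earlier sketch. This is a hedged illustration only,
assuming a simplified swap_map where zero means free; try_promote_cluster
and its scan policy are invented for illustration and are not taken from
any posted patch.

/*
 * If a low order cluster still holds a naturally aligned run of free
 * slots large enough for a higher order, reclassify it and move it to
 * that order's non-full list, so a later high order allocation can use it.
 */
static bool try_promote_cluster(struct device_sketch *si,
                                struct cluster_sketch *ci,
                                const unsigned char *map, /* cluster's slice of swap_map[] */
                                unsigned int new_order)
{
        unsigned int run = 1U << new_order;
        unsigned int i, j;

        if (new_order <= ci->order || new_order >= SWAP_NR_ORDERS)
                return false;

        /* look for a naturally aligned run of free slots */
        for (i = 0; i + run <= SWAPFILE_CLUSTER; i += run) {
                for (j = 0; j < run && !map[i + j]; j++)
                        ;
                if (j == run) {
                        ci->order = new_order;
                        list_move_tail(&ci->list,
                                       &si->nonfull_clusters[new_order]);
                        return true;
                }
        }
        return false;
}

The "stealing" idea is the mirror image: take a cluster off a lower
order's non-full list when its usage count is low, instead of waiting for
it to free up on its own.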
On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > Hi Ying,
> >> >
> >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Chris Li <chrisl@kernel.org> writes:
> >> >>
> >> >> > I am spinning a new version for this series to address two issues
> >> >> > found in this series:
> >> >> >
> >> >> > 1) Oppo discovered a bug in the following line:
> >> >> > + ci = si->cluster_info + tmp;
> >> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> >> > That is a serious bug but trivial to fix.
> >> >> >
> >> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> >> >> > the cluster->order.
> >> >>
> >> >> IIUC, now, we only scan swap_map[] only if
> >> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> >> >> That is, if you doesn't run low swap free space, you will not do that.
> >> >
> >> > You can still swap space in order 0 clusters while order 4 runs out of
> >> > free_cluster
> >> > or nonfull_clusters[order]. For Android that is a common case.
> >>
> >> When we fail to allocate order 4, we will fallback to order 0. Still
> >> don't need to scan swap_map[]. But after looking at your below reply, I
> >> realized that the swap space is almost full at most times in your cases.
> >> Then, it's possible that we run into scanning swap_map[].
> >> list_empty(&si->free_clusters) &&
> >> list_empty(&si->nonfull_clusters[order]) will become true, if we put too
> >> many clusters in si->percpu_cluster. So, if we want to avoid to scan
> >> swap_map[], we can stop add clusters in si->percpu_cluster when swap
> >> space runs low. And maybe take clusters out of si->percpu_cluster
> >> sometimes.
> >
> > One idea after reading your reply. If we run out of the
> > nonfull_cluster[order], we should be able to use other cpu's
> > si->percpu_cluster[] as well. That is a very small win for Android,
>
> This will be useful in general. The number of CPUs may be large, and
> multiple orders may be used.
>
> > because android does not have too many cpu. We are talking about a
> > handful of clusters, which might not justify the code complexity. It
> > does not change the behavior that order 0 can pollute higher order.
>
> I have a feeling that you don't really know why swap_map[] is scanned.
> I suggest you do more testing and tracing to find out the reason. I
> suspect that there are some non-full cluster collection issues.
>
> >> Another issue is nonfull_cluster[order1] cannot be used for
> >> nonfull_cluster[order2]. In definition, we should not fail order 0
> >> allocation, we need to steal nonfull_cluster[order>0] for order 0
> >> allocation. This can avoid to scan swap_map[] too. This may be not
> >> perfect, but it is the simplest first step implementation. You can
> >> optimize based on it further.
> >
> > Yes, that is listed as the limitation of this cluster order approach.
> > Initially we need to support one order well first. We might choose
> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
> > are too big. The sweet spot might be somewhere in between. If we can
> > support one order well, we can demonstrate the value of the mTHP. We
> > can worry about other mix orders later.
> >
> > Do you have any suggestions for how to prevent the order 0 polluting
> > the higher order cluster?
If we allow that to happen, then it defeats
> > the goal of being able to allocate higher order swap entries. The
> > tricky question is we don't know how much swap space we should reserve
> > for each order. We can always break higher order clusters to lower
> > order, but can't do the reverse. The current patch series lets the
> > actual usage determine the percentage of the clusters for each order.
> > However that seems not enough for the test case Barry has. When the
> > app gets the OOM kill, that is where a large swing of order 0 swap will
> > show up, with not enough higher order usage for the brief moment. The
> > order 0 swap entries will pollute the high order clusters. We are
> > currently debating a "knob" to be able to reserve a certain % of swap
> > space for a certain order. Those reservations will be guaranteed, and
> > order 0 swap entries can't pollute them even when it runs out of swap
> > space. That can make the mTHP at least usable for the Android case.
>
> IMO, the bottom line is that order-0 allocation is the first class
> citizen, we must keep it optimized. And, OOM with free swap space isn't
> acceptable. Please consider the policy we used for page allocation.
>
> > Do you see another way to protect the high order clusters from being
> > polluted by the lower order one?
>
> If we use high-order page allocation as reference, we need something
> like compaction to guarantee high-order allocation finally. But we are
> too far from that.
>
> For specific configuration, I believe that we can get reasonable
> high-order swap entry allocation success rate for specific use cases.
> For example, if we only do limited maximum number order-0 swap entries
> allocation, can we keep high-order clusters?

Doesn't limiting order-0 allocation break the bottom line that order-0
allocation is the first class citizen, and should not fail if there is
space?

Just my two cents...

I had a try locally based on Chris's work, allowing order 0 to use
nonfull_clusters as Ying has suggested, and starting with the low order
and increasing the order until nonfull_cluster[order] is not empty; that
way the higher orders are just better protected, because unless we run
out of free_cluster and nonfull_cluster, the direct scan won't happen.

More concretely, I applied the following changes, which didn't change
the code much:
- In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then
free_clusters, then discard_cluster.
- If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i)
nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster
returns false.

A quick test still using the memtier test, but decreased the swap
device size from 10G to 8G for higher pressure.
Before:
hugepages-32kB/stats/swpout:34013
hugepages-32kB/stats/swpout_fallback:266
hugepages-512kB/stats/swpout:0
hugepages-512kB/stats/swpout_fallback:77
hugepages-2048kB/stats/swpout:0
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35088
hugepages-64kB/stats/swpout_fallback:66
hugepages-16kB/stats/swpout:31848
hugepages-16kB/stats/swpout_fallback:402
hugepages-256kB/stats/swpout:390
hugepages-256kB/stats/swpout_fallback:7244
hugepages-128kB/stats/swpout:28573
hugepages-128kB/stats/swpout_fallback:474

After:
hugepages-32kB/stats/swpout:31448
hugepages-32kB/stats/swpout_fallback:3354
hugepages-512kB/stats/swpout:30
hugepages-512kB/stats/swpout_fallback:33
hugepages-2048kB/stats/swpout:2
hugepages-2048kB/stats/swpout_fallback:0
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:31255
hugepages-64kB/stats/swpout_fallback:3112
hugepages-16kB/stats/swpout:29931
hugepages-16kB/stats/swpout_fallback:3397
hugepages-256kB/stats/swpout:5223
hugepages-256kB/stats/swpout_fallback:2351
hugepages-128kB/stats/swpout:25600
hugepages-128kB/stats/swpout_fallback:2194

The high order (256kB) swapout rate is significantly higher, and 512kB
is now possible, which indicates the high orders are better protected;
the lower orders are sacrificed, but that seems worth it.
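The two changes Kairui describes map naturally onto the earlier sketch.
Below is a rough illustration, reusing the same simplified structures; the
real function is scan_swap_map_try_ssd_cluster in mm/swapfile.c, while
try_ssd_cluster and its policy details here are assumptions for
illustration only.

/*
 * Prefer non-full clusters of the right order, then free clusters
 * (pick_cluster from the earlier sketch does both), and let order 0
 * fall back to any order's non-full list before the caller resorts
 * to scanning swap_map[].
 */
static struct cluster_sketch *
try_ssd_cluster(struct device_sketch *si, unsigned int order)
{
        struct cluster_sketch *ci;
        unsigned int i;

        /* nonfull_clusters[order] first, then free_clusters */
        ci = pick_cluster(si, order);
        if (ci)
                return ci;

        /* the real allocator would try its discard list here */

        if (order == 0) {
                /* order 0 may take any partially used cluster */
                for (i = 0; i < SWAP_NR_ORDERS; i++) {
                        if (!list_empty(&si->nonfull_clusters[i]))
                                return list_first_entry(
                                        &si->nonfull_clusters[i],
                                        struct cluster_sketch, list);
                }
        }

        return NULL;    /* caller falls back to scanning swap_map[] */
}

The before/after fallback counts above show exactly the trade-off this
policy makes: the high orders gain at the expense of the low orders.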
On Fri, May 31, 2024 at 5:40 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > For specific configuration, I believe that we can get reasonable
> > high-order swap entry allocation success rate for specific use cases.
> > For example, if we only do limited maximum number order-0 swap entries
> > allocation, can we keep high-order clusters?
>
> Doesn't limiting order-0 allocation break the bottom line that order-0
> allocation is the first class citizen, and should not fail if there is
> space?

We need to have both high order and low order swap allocation working,
and be able to recover from the swapfile-full case.

>
> Just my two cents...
>
> I had a try locally based on Chris's work, allowing order 0 to use
> nonfull_clusters as Ying has suggested, and starting with the low order
> and increasing the order until nonfull_cluster[order] is not empty; that
> way the higher orders are just better protected, because unless we run
> out of free_cluster and nonfull_cluster, the direct scan won't happen.

That does not help the Android test case Barry is running, because
Android tries to keep the swapfile full. It will hit the case where
both the empty and nonfull lists are all used up when it performs the
low memory kill. There will be a big swing in the ratio of low vs high
order swap. Allocating high order swap entries should be able to
recover from that.

>
> More concretely, I applied the following changes, which didn't change
> the code much:
> - In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then
> free_clusters, then discard_cluster.

I did consider checking the nonfull list ahead of the empty list. The
current allocation tries to let HAS_CACHE-only swap entries stay on
disk for a longer time before recycling them. If the folio is still in
the swap cache and not dirty, it can skip the write out and directly
reuse the swap slot during reclaim. I am not sure this code path is
important now; it seems that when the swap slot is freed, it will
remove the HAS_CACHE flag as well.

BTW, I noticed that the cluster discard doesn't check whether the swap
cache has a folio pointing to it. After discarding, it just sets the
swap_map to 0. I wonder, if the swap cache has a folio in such a
discarded slot, whether that would hit the skip-writeback logic. If
that is triggerable, it would be a corruption bug.

The current SSD allocation code also has some comments saying old SSDs
can benefit from not writing to the same block too many times, to help
the wear leveling. I don't think that is a big deal now; even cheap SD
cards have wear leveling nowadays.

> - If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i)
> nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster
> returns false.

Ideally we would have some option to reserve some high order swap space
so order 0 can't pollute the high order clusters.

Chris

>
> A quick test still using the memtier test, but decreased the swap
> device size from 10G to 8G for higher pressure.
>
> Before:
> hugepages-32kB/stats/swpout:34013
> hugepages-32kB/stats/swpout_fallback:266
> hugepages-512kB/stats/swpout:0
> hugepages-512kB/stats/swpout_fallback:77
> hugepages-2048kB/stats/swpout:0
> hugepages-2048kB/stats/swpout_fallback:1
> hugepages-1024kB/stats/swpout:0
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:35088
> hugepages-64kB/stats/swpout_fallback:66
> hugepages-16kB/stats/swpout:31848
> hugepages-16kB/stats/swpout_fallback:402
> hugepages-256kB/stats/swpout:390
> hugepages-256kB/stats/swpout_fallback:7244
> hugepages-128kB/stats/swpout:28573
> hugepages-128kB/stats/swpout_fallback:474
>
> After:
> hugepages-32kB/stats/swpout:31448
> hugepages-32kB/stats/swpout_fallback:3354
> hugepages-512kB/stats/swpout:30
> hugepages-512kB/stats/swpout_fallback:33
> hugepages-2048kB/stats/swpout:2
> hugepages-2048kB/stats/swpout_fallback:0
> hugepages-1024kB/stats/swpout:0
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:31255
> hugepages-64kB/stats/swpout_fallback:3112
> hugepages-16kB/stats/swpout:29931
> hugepages-16kB/stats/swpout_fallback:3397
> hugepages-256kB/stats/swpout:5223
> hugepages-256kB/stats/swpout_fallback:2351
> hugepages-128kB/stats/swpout:25600
> hugepages-128kB/stats/swpout_fallback:2194
>
> The high order (256kB) swapout rate is significantly higher, and 512kB
> is now possible, which indicates the high orders are better protected;
> the lower orders are sacrificed, but that seems worth it.
>
Kairui Song <ryncsn@gmail.com> writes: > On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> >> > Hi Ying, >> >> > >> >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> >> >> >> > I am spinning a new version for this series to address two issues >> >> >> > found in this series: >> >> >> > >> >> >> > 1) Oppo discovered a bug in the following line: >> >> >> > + ci = si->cluster_info + tmp; >> >> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". >> >> >> > That is a serious bug but trivial to fix. >> >> >> > >> >> >> > 2) order 0 allocation currently blindly scans swap_map disregarding >> >> >> > the cluster->order. >> >> >> >> >> >> IIUC, now, we only scan swap_map[] only if >> >> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]). >> >> >> That is, if you doesn't run low swap free space, you will not do that. >> >> > >> >> > You can still swap space in order 0 clusters while order 4 runs out of >> >> > free_cluster >> >> > or nonfull_clusters[order]. For Android that is a common case. >> >> >> >> When we fail to allocate order 4, we will fallback to order 0. Still >> >> don't need to scan swap_map[]. But after looking at your below reply, I >> >> realized that the swap space is almost full at most times in your cases. >> >> Then, it's possible that we run into scanning swap_map[]. >> >> list_empty(&si->free_clusters) && >> >> list_empty(&si->nonfull_clusters[order]) will become true, if we put too >> >> many clusters in si->percpu_cluster. So, if we want to avoid to scan >> >> swap_map[], we can stop add clusters in si->percpu_cluster when swap >> >> space runs low. And maybe take clusters out of si->percpu_cluster >> >> sometimes. >> > >> > One idea after reading your reply. If we run out of the >> > nonfull_cluster[order], we should be able to use other cpu's >> > si->percpu_cluster[] as well. That is a very small win for Android, >> >> This will be useful in general. The number CPU may be large, and >> multiple orders may be used. >> >> > because android does not have too many cpu. We are talking about a >> > handful of clusters, which might not justify the code complexity. It >> > does not change the behavior that order 0 can pollut higher order. >> >> I have a feeling that you don't really know why swap_map[] is scanned. >> I suggest you to do more test and tracing to find out the reason. I >> suspect that there are some non-full cluster collection issues. >> >> >> Another issue is nonfull_cluster[order1] cannot be used for >> >> nonfull_cluster[order2]. In definition, we should not fail order 0 >> >> allocation, we need to steal nonfull_cluster[order>0] for order 0 >> >> allocation. This can avoid to scan swap_map[] too. This may be not >> >> perfect, but it is the simplest first step implementation. You can >> >> optimize based on it further. >> > >> > Yes, that is listed as the limitation of this cluster order approach. >> > Initially we need to support one order well first. We might choose >> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages >> > are too big. The sweet spot might be some there in between. If we can >> > support one order well, we can demonstrate the value of the mTHP. 
We >> > can worry about other mix orders later. >> > >> > Do you have any suggestions for how to prevent the order 0 polluting >> > the higher order cluster? If we allow that to happen, then it defeats >> > the goal of being able to allocate higher order swap entries. The >> > tricky question is we don't know how much swap space we should reserve >> > for each order. We can always break higher order clusters to lower >> > order, but can't do the reserves. The current patch series lets the >> > actual usage determine the percentage of the cluster for each order. >> > However that seems not enough for the test case Barry has. When the >> > app gets OOM kill that is where a large swing of order 0 swap will >> > show up and not enough higher order usage for the brief moment. The >> > order 0 swap entry will pollute the high order cluster. We are >> > currently debating a "knob" to be able to reserve a certain % of swap >> > space for a certain order. Those reservations will be guaranteed and >> > order 0 swap entry can't pollute them even when it runs out of swap >> > space. That can make the mTHP at least usable for the Android case. >> >> IMO, the bottom line is that order-0 allocation is the first class >> citizen, we must keep it optimized. And, OOM with free swap space isn't >> acceptable. Please consider the policy we used for page allocation. >> >> > Do you see another way to protect the high order cluster polluted by >> > lower order one? >> >> If we use high-order page allocation as reference, we need something >> like compaction to guarantee high-order allocation finally. But we are >> too far from that. >> >> For specific configuration, I believe that we can get reasonable >> high-order swap entry allocation success rate for specific use cases. >> For example, if we only do limited maximum number order-0 swap entries >> allocation, can we keep high-order clusters? > > Isn't limiting order-0 allocation breaks the bottom line that order-0 > allocation is the first class citizen, and should not fail if there is > space? Sorry for confusing words. I mean limiting maximum number order-0 swap entries allocation in workloads, instead of limiting that in kernel. > Just my two cents... > > I had a try locally based on Chris's work, allowing order 0 to use > nonfull_clusters as Ying has suggested, and starting with low order > and increase the order until nonfull_cluster[order] is not empty, that > way higher order is just better protected, because unless we ran out > of free_cluster and nonfull_cluster, direct scan won't happen. > > More concretely, I applied the following changes, which didn't change > the code much: > - In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then > free_clusters, then discard_cluster. > - If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i) > nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster > returns false. > > A quick test still using the memtier test, but decreased the swap > device size from 10G to 8g for higher pressure. 
>
> Before:
> hugepages-32kB/stats/swpout:34013
> hugepages-32kB/stats/swpout_fallback:266
> hugepages-512kB/stats/swpout:0
> hugepages-512kB/stats/swpout_fallback:77
> hugepages-2048kB/stats/swpout:0
> hugepages-2048kB/stats/swpout_fallback:1
> hugepages-1024kB/stats/swpout:0
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:35088
> hugepages-64kB/stats/swpout_fallback:66
> hugepages-16kB/stats/swpout:31848
> hugepages-16kB/stats/swpout_fallback:402
> hugepages-256kB/stats/swpout:390
> hugepages-256kB/stats/swpout_fallback:7244
> hugepages-128kB/stats/swpout:28573
> hugepages-128kB/stats/swpout_fallback:474
>
> After:
> hugepages-32kB/stats/swpout:31448
> hugepages-32kB/stats/swpout_fallback:3354
> hugepages-512kB/stats/swpout:30
> hugepages-512kB/stats/swpout_fallback:33
> hugepages-2048kB/stats/swpout:2
> hugepages-2048kB/stats/swpout_fallback:0
> hugepages-1024kB/stats/swpout:0
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:31255
> hugepages-64kB/stats/swpout_fallback:3112
> hugepages-16kB/stats/swpout:29931
> hugepages-16kB/stats/swpout_fallback:3397
> hugepages-256kB/stats/swpout:5223
> hugepages-256kB/stats/swpout_fallback:2351
> hugepages-128kB/stats/swpout:25600
> hugepages-128kB/stats/swpout_fallback:2194
>
> The high order (256kB) swapout rate is significantly higher, and 512kB
> is now possible, which indicates the high orders are better protected;
> the lower orders are sacrificed, but that seems worth it.

Yes. I think that this reflects another aspect of the problem. In some
situations, it's better to steal one high-order cluster and use it for
order-0 allocations than to scatter order-0 allocations across random
high-order clusters.

--
Best Regards,
Huang, Ying
On Tue, Jun 4, 2024 at 12:29 AM Huang, Ying <ying.huang@intel.com> wrote: > > Kairui Song <ryncsn@gmail.com> writes: > > > On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@intel.com> wrote: > > Isn't limiting order-0 allocation breaks the bottom line that order-0 > > allocation is the first class citizen, and should not fail if there is > > space? > > Sorry for confusing words. I mean limiting maximum number order-0 swap > entries allocation in workloads, instead of limiting that in kernel. What interface does it use to limit the order 0 swap entries? I was thinking the kernel would enforce the high order swap space reservation just like hugetlbfs does on huge pages. We will need to introduce some interface to specify the reservation. > > > Just my two cents... > > > > I had a try locally based on Chris's work, allowing order 0 to use > > nonfull_clusters as Ying has suggested, and starting with low order > > and increase the order until nonfull_cluster[order] is not empty, that > > way higher order is just better protected, because unless we ran out > > of free_cluster and nonfull_cluster, direct scan won't happen. > > > > More concretely, I applied the following changes, which didn't change > > the code much: > > - In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then > > free_clusters, then discard_cluster. > > - If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i) > > nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster > > returns false. > > > > A quick test still using the memtier test, but decreased the swap > > device size from 10G to 8g for higher pressure. > > > > Before: > > hugepages-32kB/stats/swpout:34013 > > hugepages-32kB/stats/swpout_fallback:266 > > hugepages-512kB/stats/swpout:0 > > hugepages-512kB/stats/swpout_fallback:77 > > hugepages-2048kB/stats/swpout:0 > > hugepages-2048kB/stats/swpout_fallback:1 > > hugepages-1024kB/stats/swpout:0 > > hugepages-1024kB/stats/swpout_fallback:0 > > hugepages-64kB/stats/swpout:35088 > > hugepages-64kB/stats/swpout_fallback:66 > > hugepages-16kB/stats/swpout:31848 > > hugepages-16kB/stats/swpout_fallback:402 > > hugepages-256kB/stats/swpout:390 > > hugepages-256kB/stats/swpout_fallback:7244 > > hugepages-128kB/stats/swpout:28573 > > hugepages-128kB/stats/swpout_fallback:474 > > > > After: > > hugepages-32kB/stats/swpout:31448 > > hugepages-32kB/stats/swpout_fallback:3354 > > hugepages-512kB/stats/swpout:30 > > hugepages-512kB/stats/swpout_fallback:33 > > hugepages-2048kB/stats/swpout:2 > > hugepages-2048kB/stats/swpout_fallback:0 > > hugepages-1024kB/stats/swpout:0 > > hugepages-1024kB/stats/swpout_fallback:0 > > hugepages-64kB/stats/swpout:31255 > > hugepages-64kB/stats/swpout_fallback:3112 > > hugepages-16kB/stats/swpout:29931 > > hugepages-16kB/stats/swpout_fallback:3397 > > hugepages-256kB/stats/swpout:5223 > > hugepages-256kB/stats/swpout_fallback:2351 > > hugepages-128kB/stats/swpout:25600 > > hugepages-128kB/stats/swpout_fallback:2194 > > > > High order (256k) swapout rate are significantly higher, 512k is now > > possible, which indicate high orders are better protected, lower > > orders are sacrificed but seems worth it. > > Yes. I think that this reflects another aspect of the problem. In some > situations, it's better to steal one high-order cluster and use it for > order-0 allocation instead of scattering order-0 allocation in random > high-order clusters. Agree, the scan loop on swap_map[] has the worst pollution. Chris
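For reference, one hypothetical shape the debated reservation "knob" could
take is a per-order floor of empty clusters that lower-order allocations
may not consume. Nothing like this exists in the series; the structure,
names (reserve_sketch, min_free_clusters) and the policy below are pure
assumptions for illustration.

/* Per-order floor of empty clusters that lower orders may not consume. */
struct reserve_sketch {
        unsigned int min_free_clusters[SWAP_NR_ORDERS]; /* set via a knob */
        unsigned int nr_free_clusters;          /* current empty clusters */
};

static bool may_take_free_cluster(const struct reserve_sketch *r,
                                  unsigned int order)
{
        unsigned int reserved = 0;
        unsigned int i;

        /* sum the floors of all orders above the requester */
        for (i = order + 1; i < SWAP_NR_ORDERS; i++)
                reserved += r->min_free_clusters[i];

        /* a low order request must leave 'reserved' empty clusters alone */
        return r->nr_free_clusters > reserved;
}

A check like this would run where an empty cluster is taken off the free
list, so that order-0 fallback can never dip into the clusters held back
for the higher orders, which is the guarantee the discussion above asks for.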
On Thu, May 30, 2024 at 10:54 AM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > Hi Ying, > > > > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Chris Li <chrisl@kernel.org> writes: > >> > >> > I am spinning a new version for this series to address two issues > >> > found in this series: > >> > > >> > 1) Oppo discovered a bug in the following line: > >> > + ci = si->cluster_info + tmp; > >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". > >> > That is a serious bug but trivial to fix. > >> > > >> > 2) order 0 allocation currently blindly scans swap_map disregarding > >> > the cluster->order. > >> > >> IIUC, now, we only scan swap_map[] only if > >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]). > >> That is, if you doesn't run low swap free space, you will not do that. > > > > You can still swap space in order 0 clusters while order 4 runs out of > > free_cluster > > or nonfull_clusters[order]. For Android that is a common case. > > When we fail to allocate order 4, we will fallback to order 0. Still > don't need to scan swap_map[]. But after looking at your below reply, I > realized that the swap space is almost full at most times in your cases. > Then, it's possible that we run into scanning swap_map[]. > list_empty(&si->free_clusters) && > list_empty(&si->nonfull_clusters[order]) will become true, if we put too > many clusters in si->percpu_cluster. So, if we want to avoid to scan > swap_map[], we can stop add clusters in si->percpu_cluster when swap > space runs low. And maybe take clusters out of si->percpu_cluster > sometimes. Stop adding when it runs low seems too late, there could still be a free cluster stuck on a CPU, and not getting scanned, right? > Another issue is nonfull_cluster[order1] cannot be used for > nonfull_cluster[order2]. In definition, we should not fail order 0 > allocation, we need to steal nonfull_cluster[order>0] for order 0 > allocation. This can avoid to scan swap_map[] too. This may be not > perfect, but it is the simplest first step implementation. You can > optimize based on it further. This can be extended to allow any order < MAX_ORDER to steal from higher order, which might increase fragmentation though. So this is looking more and more like a buddy allocator, and that should be the long term solution.
On Thu, May 30, 2024 at 1:08 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, May 30, 2024 at 10:54 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Chris Li <chrisl@kernel.org> writes:
> >
> > > Hi Ying,
> > >
> > > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >>
> > >> Chris Li <chrisl@kernel.org> writes:
> > >>
> > >> > I am spinning a new version for this series to address two issues
> > >> > found in this series:
> > >> >
> > >> > 1) Oppo discovered a bug in the following line:
> > >> > + ci = si->cluster_info + tmp;
> > >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> > >> > That is a serious bug but trivial to fix.
> > >> >
> > >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> > >> > the cluster->order.
> > >>
> > >> IIUC, now, we only scan swap_map[] only if
> > >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> > >> That is, if you doesn't run low swap free space, you will not do that.
> > >
> > > You can still swap space in order 0 clusters while order 4 runs out of
> > > free_cluster
> > > or nonfull_clusters[order]. For Android that is a common case.
> >
> > When we fail to allocate order 4, we will fallback to order 0. Still
> > don't need to scan swap_map[]. But after looking at your below reply, I
> > realized that the swap space is almost full at most times in your cases.
> > Then, it's possible that we run into scanning swap_map[].
> > list_empty(&si->free_clusters) &&
> > list_empty(&si->nonfull_clusters[order]) will become true, if we put too
> > many clusters in si->percpu_cluster. So, if we want to avoid to scan
> > swap_map[], we can stop add clusters in si->percpu_cluster when swap
> > space runs low. And maybe take clusters out of si->percpu_cluster
> > sometimes.
>
> Stop adding when it runs low seems too late, there could still be a
> free cluster stuck on a CPU, and not getting scanned, right?

The free clusters stuck on the CPUs are a small number, only a handful
of clusters. Preventing low order swap from polluting the high order
clusters is more urgent.

>
> > Another issue is nonfull_cluster[order1] cannot be used for
> > nonfull_cluster[order2]. In definition, we should not fail order 0
> > allocation, we need to steal nonfull_cluster[order>0] for order 0
> > allocation. This can avoid to scan swap_map[] too. This may be not
> > perfect, but it is the simplest first step implementation. You can
> > optimize based on it further.
>
> This can be extended to allow any order < MAX_ORDER to steal from
> higher order, which might increase fragmentation though.

Stealing from a higher order is a bad thing, because the value of the
allocator is being able to allocate the higher orders. High to low is
always trivial; low to high is impossible. See the other email about
having a "knob" to reserve some swap space for high order allocations.
That is not perfect, but it is more useful.

>
> So this is looking more and more like a buddy allocator, and that
> should be the long term solution.
>

In Barry's test case, there is a huge swing of order 0 and order 4
allocation caused by the low memory killer. Apps get killed, and it
takes a while for the apps to launch again and swap out high order
entries. The buddy allocator will have limited help there, because
once a cluster is used for order 0, the fragmentation will prevent
higher order allocation. A buddy allocator might not be able to help
much in this situation.

We do need a way to swap out large folios using discontiguous swap
entries. That is the longer-term solution for Barry's use case.

Chris
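The "high to low is always trivial, low to high is impossible" point is
easy to demonstrate with a small standalone program (plain C, illustrative
only; fits() is an invented helper, and 0 means a free slot): an order-0
allocation fits in any free slot, while an order-2 allocation needs an
aligned run of four, so a couple of stray order-0 entries can veto every
run even when most slots are free.

#include <stdbool.h>
#include <stdio.h>

/* Does an aligned free run of 1 << order slots exist? (0 == free) */
static bool fits(const unsigned char *map, int nr, int order)
{
        int run = 1 << order, i, j;

        for (i = 0; i + run <= nr; i += run) {
                for (j = 0; j < run && !map[i + j]; j++)
                        ;
                if (j == run)
                        return true;
        }
        return false;
}

int main(void)
{
        /* 6 of 8 slots free, but strays at slots 0 and 5 break both 4-runs */
        unsigned char map[8] = { 1, 0, 0, 0, 0, 1, 0, 0 };

        /* prints "order 0: 1, order 2: 0" */
        printf("order 0: %d, order 2: %d\n",
               fits(map, 8, 0), fits(map, 8, 2));
        return 0;
}

This is the mechanism behind the nearly 100% fallback rate once order-0
entries have been scattered across the formerly high-order clusters.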
On Sat, May 25, 2024 at 5:17 AM Chris Li <chrisl@kernel.org> wrote: > > This is the short term solutiolns "swap cluster order" listed > in my "Swap Abstraction" discussion slice 8 in the recent > LSF/MM conference. > > When commit 845982eb264bc "mm: swap: allow storage of all mTHP > orders" is introduced, it only allocates the mTHP swap entries > from new empty cluster list. That works well for PMD size THP, > but it has a serius fragmentation issue reported by Barry. > > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ > > The mTHP allocation failure rate raises to almost 100% after a few > hours in Barry's test run. > > The reason is that all the empty cluster has been exhausted while > there are planty of free swap entries to in the cluster that is > not 100% free. > > Address this by remember the swap allocation order in the cluster. > Keep track of the per order non full cluster list for later allocation. > > This greatly improve the sucess rate of the mTHP swap allocation. > While I am still waiting for Barry's test result. I paste Kairui's test Hi Chris, Attached are the test results from a real phone using 4-order mTHP. The results seem better overall, but after 7 hours, especially when the swap device becomes full(soon some apps are killed to free memory and swap), the fallback ratio still reaches 100%. I haven't debugged this, but my guess is that the cluster's order can shift between 4-order and 0-order. Sometimes, they all shift to 0-order, and hardly can they get back to 4-order. > result here: > > I'm able to reproduce such an issue with a simple script (enabling all order of mthp): > > modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024)) > swapoff -a > mkswap /dev/ram0 > swapon /dev/ram0 > > rmdir /sys/fs/cgroup/benchmark > mkdir -p /sys/fs/cgroup/benchmark > cd /sys/fs/cgroup/benchmark > echo 8G > memory.max > echo $$ > cgroup.procs > > memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary & > > /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \ > -P memcache_binary -n allkeys --key-minimum=1 \ > --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \ > --ratio 1:0 --pipeline 8 -d 1024 > > Before: > Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98 > After: > Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74 > > And the fallback ratio dropped by a lot: > Before: > hugepages-32kB/stats/anon_swpout_fallback:15997 > hugepages-32kB/stats/anon_swpout:18712 > hugepages-512kB/stats/anon_swpout_fallback:192 > hugepages-512kB/stats/anon_swpout:0 > hugepages-2048kB/stats/anon_swpout_fallback:2 > hugepages-2048kB/stats/anon_swpout:0 > hugepages-1024kB/stats/anon_swpout_fallback:0 > hugepages-1024kB/stats/anon_swpout:0 > hugepages-64kB/stats/anon_swpout_fallback:18246 > hugepages-64kB/stats/anon_swpout:17644 > hugepages-16kB/stats/anon_swpout_fallback:13701 > hugepages-16kB/stats/anon_swpout:18234 > hugepages-256kB/stats/anon_swpout_fallback:8642 > hugepages-256kB/stats/anon_swpout:93 > hugepages-128kB/stats/anon_swpout_fallback:21497 > hugepages-128kB/stats/anon_swpout:7596 > > (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate) > > After: > hugepages-32kB/stats/swpout:34445 > hugepages-32kB/stats/swpout_fallback:0 > hugepages-512kB/stats/swpout:1 > hugepages-512kB/stats/swpout_fallback:134 > hugepages-2048kB/stats/swpout:1 > hugepages-2048kB/stats/swpout_fallback:1 > 
hugepages-1024kB/stats/swpout:6 > hugepages-1024kB/stats/swpout_fallback:0 > hugepages-64kB/stats/swpout:35495 > hugepages-64kB/stats/swpout_fallback:0 > hugepages-16kB/stats/swpout:32441 > hugepages-16kB/stats/swpout_fallback:0 > hugepages-256kB/stats/swpout:2223 > hugepages-256kB/stats/swpout_fallback:6278 > hugepages-128kB/stats/swpout:29136 > hugepages-128kB/stats/swpout_fallback:52 > > Reported-by: Barry Song <21cnbao@gmail.com> > Tested-by: Kairui Song <kasong@tencent.com> > Signed-off-by: Chris Li <chrisl@kernel.org> > --- > Chris Li (2): > mm: swap: swap cluster switch to double link list > mm: swap: mTHP allocate swap entries from nonfull list > > include/linux/swap.h | 18 ++-- > mm/swapfile.c | 252 +++++++++++++++++---------------------------------- > 2 files changed, 93 insertions(+), 177 deletions(-) > --- > base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed > change-id: 20240523-swap-allocator-1534c480ece4 > > Best regards, > -- > Chris Li <chrisl@kernel.org> > Thanks Barry