From: Kairui Song <kasong@tencent.com>

This is the first phase of the bigger series implementing basic
infrastructures for the Swap Table idea proposed at the LSF/MM/BPF
topic "Integrate swap cache, swap maps with swap allocator" [1].

This phase I contains 9 patches. It introduces the swap table
infrastructure and uses it as the swap cache backend. By doing so, we
get a performance gain of roughly 5-20% in throughput, RPS or build
time for benchmark and workload tests. This is based on Chris Li's idea
of using cluster-sized atomic arrays to implement the swap cache. It has
less contention on swap cache access. The cluster size is much
finer-grained than the 64M address space split, which is removed in
this phase I. It also unifies and cleans up the swap code base.

Each swap cluster will dynamically allocate the swap table, which is an
atomic array covering every swap slot in the cluster. It replaces the
swap cache backed by XArray. In phase I, the statically allocated
swap_map still co-exists with the swap table. The memory usage is about
the same as the original on average. A few exceptional test cases show
about 1% higher memory usage. In the following phases of the series,
swap_map will merge into the swap table without additional memory
allocation. It will result in a net memory reduction compared to the
original swap cache.

Testing has shown that phase I has a significant performance improvement
from 8c/1G ARM machine to 48c96t/128G x86_64 servers in many practical
workloads.

The full picture with a summary can be found at [2]. An older bigger
series of 28 patches is posted at [3].

vm-scalability test:
====================
Test with:
usemem --init-time -O -y -x -n 31 1G (4G memcg, PMEM as swap)
                Before:         After:
System time:    220.86s         160.42s      (-27.36%)
Throughput:     4775.18 MB/s    6381.43 MB/s (+33.63%)
Free latency:   174492 us       132122 us    (-24.28%)

usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
PMEM as swap)
                Before:         After:
System time:    355.23s         295.28s      (-16.87%)
Throughput:     4659.89 MB/s    5765.80 MB/s (+23.73%)
Free latency:   500417 us       477098 us    (-4.66%)

This shows an improvement of more than 20% in most readings.

Build kernel test:
==================
Building the kernel with defconfig on tmpfs with ZSWAP / ZRAM is looking
good. The results below show a test matrix using different memory
pressure and setups. Tests are done with shmem as filesystem and
using the same build config, measuring sys and real time in seconds
(user time is almost identical as expected):

 -j<NR> / Mem  | Sys before / after  | Real before / after
Using 16G ZRAM with memcg limit:
     12 / 256M | 6475 / 6232 -3.75%  | 814 / 793 -2.58%
     24 / 384M | 5904 / 5560 -5.82%  | 413 / 397 -3.87%
     48 / 768M | 4762 / 4242 -10.9%  | 187 / 179 -4.27%
With 64k folio:
     24 / 512M | 4196 / 4062 -3.19%  | 325 / 319 -1.84%
     48 / 1G   | 3622 / 3544 -2.15%  | 148 / 146 -1.37%
With ZSWAP with 3G memcg (using higher limit due to kmem account):
     48 / 3G   | 605 / 571 -5.61%    | 81 / 79 -2.47%

For extremely high global memory pressure, using ZSWAP with 32G
NVMEs in a 48c VM that has 4G memory globally, no memcg limit, system
components take up about 1.5G so the pressure is high, using make -j48:

Before:  sys time: 2061.72s            real time: 135.61s
After:   sys time: 1990.96s (-3.43%)   real time: 134.03s (-1.16%)

All cases are faster, and there is no regression even under heavy
global memory pressure.

Redis / Valkey bench:
=====================
The test machine is an ARM64 VM with 1.5G memory, redis is set to
use 2.5G memory:

Testing with:
redis-benchmark -r 2500000 -n 2500000 -d 1024 -c 12 -P 32 -t get

         no BGSAVE                with BGSAVE
Before:  433015.08 RPS            271421.15 RPS
After:   431537.61 RPS (-0.34%)   290441.79 RPS (+7.0%)

Testing with:
redis-benchmark -r 2000000 -n 2000000 -d 1024 -c 12 -P 32 -t get

         no BGSAVE                with BGSAVE
Before:  446339.45 RPS            274845.19 RPS
After:   442697.29 RPS (-0.81%)   293053.59 RPS (+6.6%)

With BGSAVE enabled, most Redis memory will have a swap count > 1 so
swap cache is heavily in use. We can see a >5% performance gain. No
BGSAVE is very slightly slower (<1%) due to the higher memory pressure
of the co-existence of swap_map and swap table. This will be optimized
into a net gain, and up to a 20% gain in the BGSAVE case, in the
following phases.

Link: https://lore.kernel.org/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com [1]
Link: https://github.com/ryncsn/linux/tree/kasong/devel/swap-table [2]
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]

Kairui Song (9):
  mm, swap: use unified helper for swap cache look up
  mm, swap: always lock and check the swap cache folio before use
  mm, swap: rename and move some swap cluster definition and helpers
  mm, swap: tidy up swap device and cluster info helpers
  mm/shmem, swap: remove redundant error handling for replacing folio
  mm, swap: use the swap table for the swap cache and switch API
  mm, swap: remove contention workaround for swap cache
  mm, swap: implement dynamic allocation of swap table
  mm, swap: use a single page for swap table when the size fits

 MAINTAINERS          |   1 +
 include/linux/swap.h |  42 ----
 mm/filemap.c         |   2 +-
 mm/huge_memory.c     |  16 +-
 mm/memory-failure.c  |   2 +-
 mm/memory.c          |  30 +--
 mm/migrate.c         |  28 +--
 mm/mincore.c         |   3 +-
 mm/page_io.c         |  12 +-
 mm/shmem.c           |  56 ++----
 mm/swap.h            | 268 +++++++++++++++++++++----
 mm/swap_state.c      | 404 +++++++++++++++++++-------------------
 mm/swap_table.h      | 136 +++++++++++++
 mm/swapfile.c        | 456 ++++++++++++++++++++++++++++---------------
 mm/userfaultfd.c     |   5 +-
 mm/vmscan.c          |  20 +-
 mm/zswap.c           |   9 +-
 17 files changed, 954 insertions(+), 536 deletions(-)
 create mode 100644 mm/swap_table.h

---

I was trying some new tools like b4 for branch management, and it seems
a draft version was sent out by accident, but it seems it got rejected.
I'm not sure if anyone is seeing a duplicated or malformed email. If so,
please accept my apology and use this series for review, discussion
or merge.

-- 
2.51.0
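
The cover letter describes the swap table as a per-cluster atomic array with one entry per swap slot. A minimal userspace sketch of that shape follows, written in C11 rather than kernel C. Every name in it (`cluster_sketch`, `cache_lookup`, `cache_add_range`) is hypothetical and not taken from the series, the 512-slot cluster size is borrowed from the discussion in this thread, and the per-cluster lock a real writer would hold is only mentioned in a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed cluster size; the thread mentions 512 entries per cluster. */
#define CLUSTER_SLOTS 512

/* Hypothetical stand-in for a swap cluster: one atomic word per slot. */
struct cluster_sketch {
	_Atomic uintptr_t *table;	/* NULL until the cluster is first used */
};

/* Allocate the table once, when the cluster starts handing out slots. */
static bool cluster_table_alloc(struct cluster_sketch *ci)
{
	_Atomic uintptr_t *t = malloc(CLUSTER_SLOTS * sizeof(*t));

	if (!t)
		return false;
	for (size_t i = 0; i < CLUSTER_SLOTS; i++)
		atomic_init(&t[i], 0);	/* 0 means "nothing cached" */
	ci->table = t;
	return true;
}

static void cluster_table_free(struct cluster_sketch *ci)
{
	free(ci->table);
	ci->table = NULL;
}

/* Lockless lookup: a single atomic load of the slot's word. */
static void *cache_lookup(struct cluster_sketch *ci, size_t off)
{
	assert(ci->table && off < CLUSTER_SLOTS);
	return (void *)atomic_load_explicit(&ci->table[off],
					    memory_order_acquire);
}

/*
 * Point nr consecutive slots at one folio. The caller is assumed to hold
 * a per-cluster lock (not modeled here). Nothing is allocated on this
 * path, so it cannot fail halfway through for lack of memory.
 */
static bool cache_add_range(struct cluster_sketch *ci, size_t off,
			    size_t nr, void *folio)
{
	if (!ci->table || nr == 0 || off >= CLUSTER_SLOTS ||
	    nr > CLUSTER_SLOTS - off)
		return false;
	for (size_t i = off; i < off + nr; i++)
		if (atomic_load_explicit(&ci->table[i], memory_order_relaxed))
			return false;	/* a slot is already occupied */
	for (size_t i = off; i < off + nr; i++)
		atomic_store_explicit(&ci->table[i], (uintptr_t)folio,
				      memory_order_release);
	return true;
}
```

In this sketch a reader never takes a lock, and a writer only touches the words of one cluster, which is the contention argument the cover letter makes against the coarser 64M split.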

Hi Kairui,

I have given your series one pass of review already and Acked a portion
of it. I expect some clarification or update on the rest.

I especially wanted to double check how the swap cache atomically sets a
range of swap entries to a folio. I want to make sure this bug does not
happen to the swap table:
https://lore.kernel.org/linux-mm/5bee194c-9cd3-47e7-919b-9f352441f855@kernel.dk/

I just double checked, and the swap table should be fine in this regard.
The bug is triggered by a memory allocation failure in the middle of
inserting a folio. The swap table is already populated when the swap
entry is allocated and handed out to the caller. We don't do memory
allocation when inserting a folio into the swap cache, which is a good
thing. We should not have that bug.

I also want an extra pair of eyes on those subtle behavior change
patches. I expect you to split them out in the next version. I will need
to go through the split-out patches one more time as well. In this pass
I only caught the behavior changes and haven't had a chance to reason
about whether they are indeed fine. If you can defer those split-out
patches, that will save me some time reasoning about them in the next
round. Your call.

Oh, I also want to write a design document for the swap table idea. I
will send it your way to incorporate into the next version of the
series.

Thanks for the great work! I am very excited about this.

Chris

On Sat, Aug 30, 2025 at 2:57 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> I give one pass of review on your series already. I Ack a portion of
> it. I expect some clarification or update on the rest.
>
> I especially want to double check the swap cache atomic set a range of
> swap entries to folio.
> I want to make sure this bug does not happen to swap table:
> https://lore.kernel.org/linux-mm/5bee194c-9cd3-47e7-919b-9f352441f855@kernel.dk/
>
> I just double checked, the swap table should be fine in this regard.
> The bug is triggered by memory allocation failure in the middle of
> insert folio. Swap tables already populated the table when the swap
> entry is allocated and handed out to the caller. We don't do memory
> allocation when inserting folio into swap cache, which is a good
> thing. We should not have that bug.
>
> I also want some extra pair of eyes on those subtle behavior change
> patches, I expect you to split them out in the next version.
> I will need to go through the split out subtle patch one more time as
> well. This pass I only catch the behavior change, haven't got a chance
> to reason those behavior changes patches are indeed fine. If you can
> defer those split out patches, that will save me some time to reason
> them on the next round. Your call.

Thanks a lot for the review and for raising the concern about the
robustness of phase 1.

I just added more atomic runtime checks and ran another few days of
stress and performance tests. So far I don't think there is a race or
bug in the code, as I have been testing the longer series for months.
Even with more checks, we are still a lot faster than before, and much
less error prone. So it seems very reasonable and acceptable to keep
them, even in the long term, as this is quite an important part.
That will surely help catch any potential new or historical issue.

V2 will have a few more patches split from the old ones, so it should
be cleaner. The code should still be basically the same though. Some
parts, like the whole new infrastructure, are really hard to split
as they are supposed to work as a whole.

>
> Oh, I also want to write a design document for the swap table idea. I
> will send it your way to incorporate into your next version of the
> series.
>
> Thanks for the great work! I am very excited about this.

Later phases will also be exciting, where we start to trim down the LOC
and long-existing issues, with net gains :)

On Thu, Sep 4, 2025 at 9:36 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Aug 30, 2025 at 2:57 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Kairui,
> >
> > I give one pass of review on your series already. I Ack a portion of
> > it. I expect some clarification or update on the rest.
> >
> > I especially want to double check the swap cache atomic set a range of
> > swap entries to folio.
> > I want to make sure this bug does not happen to swap table:
> > https://lore.kernel.org/linux-mm/5bee194c-9cd3-47e7-919b-9f352441f855@kernel.dk/
> >
> > I just double checked, the swap table should be fine in this regard.
> > The bug is triggered by memory allocation failure in the middle of
> > insert folio. Swap tables already populated the table when the swap
> > entry is allocated and handed out to the caller. We don't do memory
> > allocation when inserting folio into swap cache, which is a good
> > thing. We should not have that bug.
> >
> > I also want some extra pair of eyes on those subtle behavior change
> > patches, I expect you to split them out in the next version.
> > I will need to go through the split out subtle patch one more time as
> > well. This pass I only catch the behavior change, haven't got a chance
> > to reason those behavior changes patches are indeed fine. If you can
> > defer those split out patches, that will save me some time to reason
> > them on the next round. Your call.
>
> Thanks a lot for the review and raising concern about robustness of phase 1.
>
> I just added more atomic runtime checks and ran another few days of
> stress and performance tests. So far I don't think there is a race or
> bug in the code, as I have been testing the longer series for months.
> But with more checks, we are still a lot faster than before, and much
> less error prone. So it seems very reasonable and acceptable to have
> them as this is a quite important part, even for a long term.

Yes, that is what I want it to be. Every time an entry gets messed up in
the swap table, that is one page worth of corrupted data. I definitely
don't want the corrupted memory to propagate to another place, e.g. be
written to the hard drive. That is much worse than hitting a BUG() and
stopping the machine there.

> That will surely help catch any potential new or historical issue.

Yes, let it rip in mm-unstable, and hopefully we don't see any
trigger of that in the wild.

> V2 would have a few more patches splitted from old ones so it should
> be cleaner. The code should be basically still the same though. Some
> parts like the whole new infrastructure are really hard to split though
> as they are supposed to work as a whole.

That is great. Looking forward to it.

> > Oh, I also want to write a design document for the swap table idea. I
> > will send it your way to incorporate into your next version of the
> > series.
> >
> > Thanks for the great work! I am very excited about this.
>
> Later phases will also be exciting where we start to trim down the LOC
> and long existing issues, with net gains :)

Same feeling here. Also looking forward to it.

Chris
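
The thread does not show the runtime checks themselves. One common shape for such a check, given here purely as an assumption, is to update a table entry with a compare-and-exchange against the value the caller believes is there, and to stop hard on a mismatch instead of writing over an unexpected entry. The function names below are hypothetical, and `abort()` stands in for the kernel's `BUG()`.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Replace one table entry only if it still holds the expected value.
 * On failure, *seen receives the value that was actually found.
 */
static bool entry_try_replace(_Atomic uintptr_t *slot, uintptr_t expected,
			      uintptr_t new_val, uintptr_t *seen)
{
	*seen = expected;
	return atomic_compare_exchange_strong(slot, seen, new_val);
}

/*
 * Checked variant: a mismatch means the table no longer agrees with the
 * caller's view, so stop here rather than let the bad entry spread.
 */
static void entry_replace_checked(_Atomic uintptr_t *slot,
				  uintptr_t expected, uintptr_t new_val)
{
	uintptr_t seen;

	if (!entry_try_replace(slot, expected, new_val, &seen)) {
		fprintf(stderr,
			"swap table sketch: expected %#" PRIxPTR
			", found %#" PRIxPTR "\n", expected, seen);
		abort();	/* userspace stand-in for BUG() */
	}
}
```

The cost is one atomic read-modify-write where a plain store might have sufficed, which fits the remark above that the series stays faster than before even with the extra checks in place.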

On Fri, Aug 22, 2025 at 12:20 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This is the first phase of the bigger series implementing basic
> infrastructures for the Swap Table idea proposed at the LSF/MM/BPF
> topic "Integrate swap cache, swap maps with swap allocator" [1].
>
> This phase I contains 9 patches, introduces the swap table infrastructure
> and uses it as the swap cache backend. By doing so, we have up to ~5-20%
> performance gain in throughput, RPS or build time for benchmark and
> workload tests. This is based on Chris Li's idea of using cluster size
> atomic arrays to implement swap cache. It has less contention on the swap
> cache access. The cluster size is much finer-grained than the 64M address
> space split, which is removed in this phase I. It also unifies and cleans
> up the swap code base.

Thanks for making this happen. It has come a long way from my early
messy experimental patches on replacing the xarray in the swap cache.
Beating the original swap_map in terms of memory usage is particularly
hard. I once received feedback from Matthew that whoever wants to
replace the swap cache is asking for a lot of pain and suffering. He is
absolutely right.

I am so glad that we are finally seeing the light at the other end of
the tunnel. We are close to a state that is able to beat the original
swap layer in terms of both memory usage and CPU performance.

Just to recap: the current swap layer's per-slot memory usage is 3 + 8
bytes. The 3 bytes are static, allocated up front: 1 from the swap map
and 2 from the swap cgroup. The 8 bytes are dynamically allocated by the
xarray of the swap cache. At the end of the full series (27+ patches) we
can completely get rid of the 3 bytes of up-front allocation and only
dynamically allocate 8 bytes per slot entry. That is a straight win in
terms of memory allocation; no compromise was made there.

The reason we can beat the previous CPU usage is that each cluster has
512 entries, much smaller than the 64M xarray tree. The cluster lock
is a much smaller lock than the xarray tree lock. We can also do
lockless atomic lookups on the swap cache, which is pretty cool as well.

I will do one more review pass on this series again soon.

Very exciting.

Chris
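
The per-slot numbers in the recap can be turned into a few lines of arithmetic. The sketch below takes the byte counts and the 512-entry cluster size from the message above; the 4 KiB page size is an assumption, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES		4096u	/* assumption: 4 KiB pages */
#define CLUSTER_SLOTS		512u	/* entries per cluster, per the thread */
#define ENTRY_BYTES		8u	/* one table word per slot */
#define SWAP_MAP_BYTES		1u	/* static, per slot, today */
#define SWAP_CGROUP_BYTES	2u	/* static, per slot, today */

/* Under these assumptions one cluster's table fills exactly one page. */
static_assert(CLUSTER_SLOTS * ENTRY_BYTES == PAGE_BYTES,
	      "512 entries of 8 bytes should be one 4 KiB page");

/* Up-front cost of a swap device today, paid whether slots are used or not. */
static uint64_t static_bytes_today(uint64_t nr_slots)
{
	return nr_slots * (SWAP_MAP_BYTES + SWAP_CGROUP_BYTES);
}

/* Size of one cluster's table. */
static uint64_t table_bytes_per_cluster(void)
{
	return (uint64_t)CLUSTER_SLOTS * ENTRY_BYTES;
}

/* Slots covered by one of the old 64M address space splits. */
static uint64_t slots_per_64m_split(void)
{
	return (64ull << 20) / PAGE_BYTES;
}
```

With 4 KiB pages, a 32 GiB swap device has 8388608 slots, so the 3 static bytes per slot come to 24 MiB reserved up front. One 64M split covers 16384 slots, or 32 clusters of 512, so a lock scoped to one cluster covers a 32 times smaller range than a lock scoped to one split.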