[PATCH v4 01/15] docs/mm: add document for swap table

Kairui Song posted 15 patches 2 weeks, 1 day ago
Posted by Kairui Song 2 weeks, 1 day ago
From: Chris Li <chrisl@kernel.org>

Swap table is the new swap cache.

Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 Documentation/mm/index.rst      |  1 +
 Documentation/mm/swap-table.rst | 72 +++++++++++++++++++++++++++++++++
 MAINTAINERS                     |  1 +
 3 files changed, 74 insertions(+)
 create mode 100644 Documentation/mm/swap-table.rst

diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
index fb45acba16ac..828ad9b019b3 100644
--- a/Documentation/mm/index.rst
+++ b/Documentation/mm/index.rst
@@ -57,6 +57,7 @@ documentation, or deleted if it has served its purpose.
    page_table_check
    remap_file_pages
    split_page_table_lock
+   swap-table
    transhuge
    unevictable-lru
    vmalloced-kernel-stacks
diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
new file mode 100644
index 000000000000..acae6ceb4f7b
--- /dev/null
+++ b/Documentation/mm/swap-table.rst
@@ -0,0 +1,72 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
+
+==========
+Swap Table
+==========
+
+Swap table implements swap cache as a per-cluster swap cache value array.
+
+Swap Entry
+----------
+
+A swap entry contains the information required to serve the anonymous page
+fault.
+
+Swap entry is encoded as two parts: swap type and swap offset.
+
+The swap type indicates which swap device to use.
+The swap offset is the offset of the swap file to read the page data from.
+
+Swap Cache
+----------
+
+Swap cache is a map to look up folios using swap entry as the key. The result
+value can have three possible types depending on which stage of this swap entry
+was in.
+
+1. NULL: This swap entry is not used.
+
+2. folio: A folio has been allocated and bound to this swap entry. This is
+   the transient state of swap out or swap in. The folio data can be in
+   the folio or swap file, or both.
+
+3. shadow: The shadow contains the working set information of the swapped
+   out folio. This is the normal state for a swapped out page.
+
+Swap Table Internals
+--------------------
+
+The previous swap cache is implemented by XArray. The XArray is a tree
+structure. Each lookup will go through multiple nodes. Can we do better?
+
+Notice that most of the time when we look up the swap cache, we are either
+in a swap in or swap out path. We should already have the swap cluster,
+which contains the swap entry.
+
+If we have a per-cluster array to store swap cache value in the cluster.
+Swap cache lookup within the cluster can be a very simple array lookup.
+
+We give such a per-cluster swap cache value array a name: the swap table.
+
+Each swap cluster contains 512 entries, so a swap table stores one cluster
+worth of swap cache values, which is exactly one page. This is not
+coincidental because the cluster size is determined by the huge page size.
+The swap table is holding an array of pointers. The pointer has the same
+size as the PTE. The size of the swap table should match to the second
+last level of the page table page, exactly one page.
+
+With swap table, swap cache lookup can achieve great locality, simpler,
+and faster.
+
+Locking
+-------
+
+Swap table modification requires taking the cluster lock. If a folio
+is being added to or removed from the swap table, the folio must be
+locked prior to the cluster lock. After adding or removing is done, the
+folio shall be unlocked.
+
+Swap table lookup is protected by RCU and atomic read. If the lookup
+returns a folio, the user must lock the folio before use.
diff --git a/MAINTAINERS b/MAINTAINERS
index 68d29f0220fc..3d113bfc3c82 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16225,6 +16225,7 @@ R:	Barry Song <baohua@kernel.org>
 R:	Chris Li <chrisl@kernel.org>
 L:	linux-mm@kvack.org
 S:	Maintained
+F:	Documentation/mm/swap-table.rst
 F:	include/linux/swap.h
 F:	include/linux/swapfile.h
 F:	include/linux/swapops.h
-- 
2.51.0
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by SeongJae Park 2 weeks ago
On Wed, 17 Sep 2025 00:00:46 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> From: Chris Li <chrisl@kernel.org>
> 
> Swap table is the new swap cache.

Thank you very much for doing this great work!

> 
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  Documentation/mm/index.rst      |  1 +
>  Documentation/mm/swap-table.rst | 72 +++++++++++++++++++++++++++++++++
>  MAINTAINERS                     |  1 +
>  3 files changed, 74 insertions(+)
>  create mode 100644 Documentation/mm/swap-table.rst
> 
> diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> index fb45acba16ac..828ad9b019b3 100644
> --- a/Documentation/mm/index.rst
> +++ b/Documentation/mm/index.rst
> @@ -57,6 +57,7 @@ documentation, or deleted if it has served its purpose.
>     page_table_check
>     remap_file_pages
>     split_page_table_lock
> +   swap-table
>     transhuge
>     unevictable-lru
>     vmalloced-kernel-stacks

The above diff is adding this great document on 'Unsorted Documentation'
section.  Could we add the document on the main section?  I think swap.rst on
the main section could be a good place, and wondering what others think.


Thanks,
SJ

[...]
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Chris Li 2 weeks ago
On Wed, Sep 17, 2025 at 9:14 AM SeongJae Park <sj@kernel.org> wrote:
>
> On Wed, 17 Sep 2025 00:00:46 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > From: Chris Li <chrisl@kernel.org>
> >
> > Swap table is the new swap cache.
>
> Thank you very much for doing this great work!

I only did the prototype. Kairui did the heavy lifting.

> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  Documentation/mm/index.rst      |  1 +
> >  Documentation/mm/swap-table.rst | 72 +++++++++++++++++++++++++++++++++
> >  MAINTAINERS                     |  1 +
> >  3 files changed, 74 insertions(+)
> >  create mode 100644 Documentation/mm/swap-table.rst
> >
> > diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> > index fb45acba16ac..828ad9b019b3 100644
> > --- a/Documentation/mm/index.rst
> > +++ b/Documentation/mm/index.rst
> > @@ -57,6 +57,7 @@ documentation, or deleted if it has served its purpose.
> >     page_table_check
> >     remap_file_pages
> >     split_page_table_lock
> > +   swap-table
> >     transhuge
> >     unevictable-lru
> >     vmalloced-kernel-stacks
>
> The above diff is adding this great document on 'Unsorted Documentation'
> section.  Could we add the document on the main section?  I think swap.rst on
> the main section could be a good place, and wondering what others think.

That is a good idea. Will do together with another minor fix reported
by Barry. I will move it after "swap.rst".

Thanks for the suggestion.

Chris
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Barry Song 2 weeks, 1 day ago
On Wed, Sep 17, 2025 at 12:01 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Chris Li <chrisl@kernel.org>
>
> Swap table is the new swap cache.
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  Documentation/mm/index.rst      |  1 +
>  Documentation/mm/swap-table.rst | 72 +++++++++++++++++++++++++++++++++
>  MAINTAINERS                     |  1 +
>  3 files changed, 74 insertions(+)
>  create mode 100644 Documentation/mm/swap-table.rst
>
> diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> index fb45acba16ac..828ad9b019b3 100644
> --- a/Documentation/mm/index.rst
> +++ b/Documentation/mm/index.rst
> @@ -57,6 +57,7 @@ documentation, or deleted if it has served its purpose.
>     page_table_check
>     remap_file_pages
>     split_page_table_lock
> +   swap-table
>     transhuge
>     unevictable-lru
>     vmalloced-kernel-stacks
> diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
> new file mode 100644
> index 000000000000..acae6ceb4f7b
> --- /dev/null
> +++ b/Documentation/mm/swap-table.rst
> @@ -0,0 +1,72 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
> +
> +==========
> +Swap Table
> +==========
> +
> +Swap table implements swap cache as a per-cluster swap cache value array.
> +
> +Swap Entry
> +----------
> +
> +A swap entry contains the information required to serve the anonymous page
> +fault.
> +
> +Swap entry is encoded as two parts: swap type and swap offset.
> +
> +The swap type indicates which swap device to use.
> +The swap offset is the offset of the swap file to read the page data from.
> +
> +Swap Cache
> +----------
> +
> +Swap cache is a map to look up folios using swap entry as the key. The result
> +value can have three possible types depending on which stage of this swap entry
> +was in.
> +
> +1. NULL: This swap entry is not used.
> +
> +2. folio: A folio has been allocated and bound to this swap entry. This is
> +   the transient state of swap out or swap in. The folio data can be in
> +   the folio or swap file, or both.

This doesn’t look quite right.

the folio’s data must reside within the folio itself?
The data might also be in a swap file, or not.

> +
> +3. shadow: The shadow contains the working set information of the swapped
> +   out folio. This is the normal state for a swapped out page.
> +
> +Swap Table Internals
> +--------------------
> +
> +The previous swap cache is implemented by XArray. The XArray is a tree
> +structure. Each lookup will go through multiple nodes. Can we do better?
> +
> +Notice that most of the time when we look up the swap cache, we are either
> +in a swap in or swap out path. We should already have the swap cluster,
> +which contains the swap entry.
> +
> +If we have a per-cluster array to store swap cache value in the cluster.
> +Swap cache lookup within the cluster can be a very simple array lookup.
> +
> +We give such a per-cluster swap cache value array a name: the swap table.
> +
> +Each swap cluster contains 512 entries, so a swap table stores one cluster
> +worth of swap cache values, which is exactly one page. This is not
> +coincidental because the cluster size is determined by the huge page size.
> +The swap table is holding an array of pointers. The pointer has the same
> +size as the PTE. The size of the swap table should match to the second
> +last level of the page table page, exactly one page.

On a 32-bit system, I’m guessing the swap table is 2 KB, which is about
half of a page?

> +
> +With swap table, swap cache lookup can achieve great locality, simpler,
> +and faster.
> +

Thanks
Barry
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Chris Li 2 weeks, 1 day ago
On Tue, Sep 16, 2025 at 3:00 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Sep 17, 2025 at 12:01 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Chris Li <chrisl@kernel.org>
> >
> > Swap table is the new swap cache.
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  Documentation/mm/index.rst      |  1 +
> >  Documentation/mm/swap-table.rst | 72 +++++++++++++++++++++++++++++++++
> >  MAINTAINERS                     |  1 +
> >  3 files changed, 74 insertions(+)
> >  create mode 100644 Documentation/mm/swap-table.rst
> >
> > diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> > index fb45acba16ac..828ad9b019b3 100644
> > --- a/Documentation/mm/index.rst
> > +++ b/Documentation/mm/index.rst
> > @@ -57,6 +57,7 @@ documentation, or deleted if it has served its purpose.
> >     page_table_check
> >     remap_file_pages
> >     split_page_table_lock
> > +   swap-table
> >     transhuge
> >     unevictable-lru
> >     vmalloced-kernel-stacks
> > diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
> > new file mode 100644
> > index 000000000000..acae6ceb4f7b
> > --- /dev/null
> > +++ b/Documentation/mm/swap-table.rst
> > @@ -0,0 +1,72 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
> > +
> > +==========
> > +Swap Table
> > +==========
> > +
> > +Swap table implements swap cache as a per-cluster swap cache value array.
> > +
> > +Swap Entry
> > +----------
> > +
> > +A swap entry contains the information required to serve the anonymous page
> > +fault.
> > +
> > +Swap entry is encoded as two parts: swap type and swap offset.
> > +
> > +The swap type indicates which swap device to use.
> > +The swap offset is the offset of the swap file to read the page data from.
> > +
> > +Swap Cache
> > +----------
> > +
> > +Swap cache is a map to look up folios using swap entry as the key. The result
> > +value can have three possible types depending on which stage of this swap entry
> > +was in.
> > +
> > +1. NULL: This swap entry is not used.
> > +
> > +2. folio: A folio has been allocated and bound to this swap entry. This is
> > +   the transient state of swap out or swap in. The folio data can be in
> > +   the folio or swap file, or both.
>
> This doesn’t look quite right.
>
> the folio’s data must reside within the folio itself?

For swap out cases that is true. In the swap in case you allocate the
folio first, then read data from the swap file into the folio. There is a
window where the swap file has the data and the folio does not.

> The data might also be in a swap file, or not.

The case of data only in the swap file is covered by "data can be in the
folio or swap file"; it is an OR relationship.

I think my previous statement still stands correct considering both
swap out and swap in. Of course there is always room for improvement
to make it more clear. But "the folio always has the data" is not true
for swap in. If you have other ways to improve it, please feel free to
suggest.


> On a 32-bit system, I’m guessing the swap table is 2 KB, which is about
> half of a page?

Yes, true. I considered that but decided to leave it out of the document.
There are a lot of other implementation details the document does not
cover, not just this aspect. This document provides a simple
abstracted view (it might not cover all the detailed cases). One way to
address that is to add a qualification "on a 64 bit system". What do you
say? I don't want to talk about the 32 bit system having half of a
page in this document; I consider that too much detail. 32 bit
systems are pretty rare nowadays.

Chris
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Barry Song 2 weeks, 1 day ago
On Wed, Sep 17, 2025 at 6:42 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Sep 16, 2025 at 3:00 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Sep 17, 2025 at 12:01 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Chris Li <chrisl@kernel.org>
> > >
> > > Swap table is the new swap cache.
> > >
> > > Signed-off-by: Chris Li <chrisl@kernel.org>
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  Documentation/mm/index.rst      |  1 +
> > >  Documentation/mm/swap-table.rst | 72 +++++++++++++++++++++++++++++++++
> > >  MAINTAINERS                     |  1 +
> > >  3 files changed, 74 insertions(+)
> > >  create mode 100644 Documentation/mm/swap-table.rst
> > >
> > > diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> > > index fb45acba16ac..828ad9b019b3 100644
> > > --- a/Documentation/mm/index.rst
> > > +++ b/Documentation/mm/index.rst
> > > @@ -57,6 +57,7 @@ documentation, or deleted if it has served its purpose.
> > >     page_table_check
> > >     remap_file_pages
> > >     split_page_table_lock
> > > +   swap-table
> > >     transhuge
> > >     unevictable-lru
> > >     vmalloced-kernel-stacks
> > > diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
> > > new file mode 100644
> > > index 000000000000..acae6ceb4f7b
> > > --- /dev/null
> > > +++ b/Documentation/mm/swap-table.rst
> > > @@ -0,0 +1,72 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +
> > > +:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
> > > +
> > > +==========
> > > +Swap Table
> > > +==========
> > > +
> > > +Swap table implements swap cache as a per-cluster swap cache value array.
> > > +
> > > +Swap Entry
> > > +----------
> > > +
> > > +A swap entry contains the information required to serve the anonymous page
> > > +fault.
> > > +
> > > +Swap entry is encoded as two parts: swap type and swap offset.
> > > +
> > > +The swap type indicates which swap device to use.
> > > +The swap offset is the offset of the swap file to read the page data from.
> > > +
> > > +Swap Cache
> > > +----------
> > > +
> > > +Swap cache is a map to look up folios using swap entry as the key. The result
> > > +value can have three possible types depending on which stage of this swap entry
> > > +was in.
> > > +
> > > +1. NULL: This swap entry is not used.
> > > +
> > > +2. folio: A folio has been allocated and bound to this swap entry. This is
> > > +   the transient state of swap out or swap in. The folio data can be in
> > > +   the folio or swap file, or both.
> >
> > This doesn’t look quite right.
> >
> > the folio’s data must reside within the folio itself?
>
> For swap out cases that is true. The swap in case you allocate the
> folio first then read data from swap file to folio. There is a window
> swap file that has the data and folio does not.
>
> > The data might also be in a swap file, or not.
>
> The data only in swap file is covered by "data can be in the folio or
> swap file", it is an OR relationship.
>
> I think my previous statement still stands correct considering both
> swap out and swap in. Of course there is always room for improvement
> to make it more clear. But folio always has the data is not true for
> swap in. If you have other ways to improve it, please feel free to
> suggest.

I assume you’re referring to the swapin case where a folio has been
allocated and added to the swap cache, but it’s still being read and
hasn’t been updated yet?

I assume it could be something like:
The data may be in the folio or will be placed there later. It could
also reside in the swap file.

Alternatively, leave it unchanged.

>
>
> > On a 32-bit system, I’m guessing the swap table is 2 KB, which is about
> > half of a page?
>
> Yes, true. I consider that but decide to leave it out of the document.
> There are a lot of other implementation details the document does not
> cover, not just this aspect. This document provides a simple
> abstracted view (might not cover all the detail cases). One way to
> address that is add a qualification "on a 64 bit system". What do you
> say? I don't want to talk about the 32 bit system having half of a
> page in this document, I consider that too much detail. The 32 bit
> system is pretty rare nowadays.

I’d prefer that we remove all descriptions about matching PAGE_SIZE,
since we would need to double-check every case, like 16 KB or 64 KB pages.

For ARM64 with a 16 KB page size, the last-level index uses 24:14.
For ARM64 with a 64 KB page size, it uses 28:16[1]. For them, 512 entries
are not one PAGE.

[1] https://developer.arm.com/documentation/101811/0104/Translation-granule

Thanks
Barry
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Chris Li 2 weeks, 1 day ago
On Tue, Sep 16, 2025 at 4:09 PM Barry Song <21cnbao@gmail.com> wrote:
> > I think my previous statement still stands correct considering both
> > swap out and swap in. Of course there is always room for improvement
> > to make it more clear. But folio always has the data is not true for
> > swap in. If you have other ways to improve it, please feel free to
> > suggest.
>
> I assume you’re referring to the swapin case where a folio has been
> allocated and added to the swap cache, but it’s still being read and
> hasn’t been updated yet?

Right. That is the case where the swapfile has the data and the folio does not.

>
> I assume it could be something like:
> The data may be in the folio or will be placed there later. It could
This is for swap in only; it does not describe the swap out case.

> also reside in the swap file.

Right, and it does not have the same coverage of data that can be in
both the folio and the swapfile. Sorry for being pedantic. If we want
to improve it, we might want to cover the same level of detail.

> Alternatively, leave it unchanged.
I think, considering both the swap out and swap in cases, the original is
fine. The reader will need to make some effort to map it to what the
code does, but at least the description is correct.

>
> >
> >
> > > On a 32-bit system, I’m guessing the swap table is 2 KB, which is about
> > > half of a page?
> >
> > Yes, true. I consider that but decide to leave it out of the document.
> > There are a lot of other implementation details the document does not
> > cover, not just this aspect. This document provides a simple
> > abstracted view (might not cover all the detail cases). One way to
> > address that is add a qualification "on a 64 bit system". What do you
> > say? I don't want to talk about the 32 bit system having half of a
> > page in this document, I consider that too much detail. The 32 bit
> > system is pretty rare nowadays.
>
> I’d prefer that we remove all descriptions about matching PAGE_SIZE,

I am fine with that as well.

> since we would need to double-check every case, like 16 KB or 64 KB pages.

The cluster size is determined by the second-to-last level page table
page size. I fail to see why 16KB matters here for the cluster. Are
you saying that in the 16KB page size case, the cluster size is
512/4 = 128 swap entries per cluster?

> For ARM64 with a 16 KB page size, the last-level index uses 24:14.
> For ARM64 with a 64 KB page size, it uses 28:16[1]. For them, 512 entries
> are not one PAGE.

Now you've got me curious.

In your two examples above, what is the respective swap cluster size in swap entries?
In other words, how many entries does one swap cluster hold?

Sorry I am not very familiar with the ARM page tables.

Chris
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Barry Song 2 weeks, 1 day ago
On Wed, Sep 17, 2025 at 7:29 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Sep 16, 2025 at 4:09 PM Barry Song <21cnbao@gmail.com> wrote:
> > > I think my previous statement still stands correct considering both
> > > swap out and swap in. Of course there is always room for improvement
> > > to make it more clear. But folio always has the data is not true for
> > > swap in. If you have other ways to improve it, please feel free to
> > > suggest.
> >
> > I assume you’re referring to the swapin case where a folio has been
> > allocated and added to the swap cache, but it’s still being read and
> > hasn’t been updated yet?
>
> Right. That is the case swapfile has the data and folio does not.
>
> >
> > I assume it could be something like:
> > The data may be in the folio or will be placed there later. It could
> This is for swap in only, does not describe the swap out case.
>
> > also reside in the swap file.
>
> Right and it did not have the same coverage about data that can be
> both in the folio and swapfile. Sorry about the pedantic. If we want
> to improve it, we might want to cover the same level of detail.
>
> > Alternatively, leave it unchanged.
> I think considering the swap out and swap in case, the original is
> fine. The reader will need to make some effort to map to what it does
> in the code, at least the description is correct.

Ok.

>
> >
> > >
> > >
> > > > On a 32-bit system, I’m guessing the swap table is 2 KB, which is about
> > > > half of a page?
> > >
> > > Yes, true. I consider that but decide to leave it out of the document.
> > > There are a lot of other implementation details the document does not
> > > cover, not just this aspect. This document provides a simple
> > > abstracted view (might not cover all the detail cases). One way to
> > > address that is add a qualification "on a 64 bit system". What do you
> > > say? I don't want to talk about the 32 bit system having half of a
> > > page in this document, I consider that too much detail. The 32 bit
> > > system is pretty rare nowadays.
> >
> > I’d prefer that we remove all descriptions about matching PAGE_SIZE,
>
> I am fine with that as well.
>
> > since we would need to double-check every case, like 16 KB or 64 KB pages.
>
> > The cluster size is determined by the second-to-last level page table
> > page size. I fail to see why 16KB matters here for the cluster. Are
> > you saying that in the 16KB page size case, the cluster size is
> > 512/4 = 128 swap entries per cluster?
>
> > For ARM64 with a 16 KB page size, the last-level index uses 24:14.
> > For ARM64 with a 64 KB page size, it uses 28:16[1]. For them, 512 entries
> > are not one PAGE.
>
> > Now you've got me curious.
> >
> > In your two examples above, what is the respective swap cluster size in swap entries?
> > In other words, how many entries does one swap cluster hold?
>
> Sorry I am not very familiar with the ARM page tables.

Oh, my mistake—I recalculated:

For a 16 KB page size, SWAPCLUSTER_SIZE will be 2^11 = 2048, so the swap
table is 2048 * 8 = 16 KB.
For a 64 KB page size, SWAPCLUSTER_SIZE will be 2^13 = 8192, so the swap
table is 8192 * 8 = 64 KB.

This approach still seems to work, so the 32-bit system appears to be
the only exception. However, I’m not entirely sure that your description
of “the second last level” is correct. I believe it refers to the PTE,
which corresponds to the last level, not the second-to-last.
In other words, how do you define the second-to-last level page table?

Thanks
Barry
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Chris Li 2 weeks ago
On Tue, Sep 16, 2025 at 4:48 PM Barry Song <21cnbao@gmail.com> wrote:
> > In your two examples above, what is the respective swap cluster size in swap entries?
> > In other words, how many entries does one swap cluster hold?
> >
> > Sorry I am not very familiar with the ARM page tables.
>
> Oh, my mistake—I recalculated:
>
> For a 16 KB page size, SWAPCLUSTER_SIZE will be 2^11 = 2048, so the swap
> table is 2048 * 8 = 16 KB.

So my original description is correct in the sense that, with a 16KB page
size, the swap table is 16KB; this is not coincidental.

> For a 64 KB page size, SWAPCLUSTER_SIZE will be 2^13 = 8192, so the swap
> table is 8192 * 8 = 64 KB.

Same here. For a 64 KB page size, the swap table is 64KB, as you just told me.
I am just trying to give a bit of a glimpse of where I get the
intuition for swap tables.

>
> This approach still seems to work, so the 32-bit system appears to be
> the only exception. However, I’m not entirely sure that your description
> of “the second last level” is correct. I believe it refers to the PTE,
> which corresponds to the last level, not the second-to-last.
> In other words, how do you define the second-to-last level page table?

The second-to-last level page table page holds the PMDs. The last level
page table holds the PTEs.
Cluster size is HPAGE_PMD_NR = 1<<HPAGE_PMD_ORDER.
I was thinking of a PMD entry, but the actual page table page it points
to is the last level.
That is a good catch. Let me see how to fix it.

What I am trying to say is that the swap table size should match the
PTE page table page size, which determines the cluster size. An
alternative way to understand the swap table is that it is a
shadow PTE page table containing the shadow PTEs matching the pages
that get swapped out to the swapfile. It is arranged in swapfile
swap offset order. The intuition is simple once you find the right
angle to view it. However it might be a mouthful to explain.

I am fine with removing it. On the other hand, that removes the only bit
of secret sauce, with which I try to give the reader a glimpse of my
intuition of the swap table.

Thanks for catching that.

Chris
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Barry Song 2 weeks ago
> > This approach still seems to work, so the 32-bit system appears to be
> > the only exception. However, I’m not entirely sure that your description
> > of “the second last level” is correct. I believe it refers to the PTE,
> > which corresponds to the last level, not the second-to-last.
> > In other words, how do you define the second-to-last level page table?
>
> The second-to-last level page table page holds the PMD. The last level
> page table holds PTE.
> Cluster size is HPAGE_PMD_NR = 1<<HPAGE_PMD_ORDER
> I was thinking of a PMD entry but the actual page table page it points
> to is the last level.
> That is a good catch. Let me see how to fix it.
>
> What I am trying to say is that, swap table size should match to the
> PTE page table page size which determines the cluster size. An
> alternative to understanding the swap table is that swap table is a
> shadow PTE page table containing the shadow PTE matching to the page
> that gets swapped out to the swapfile. It is arranged in the swapfile
> swap offset order. The intuition is simple once you find the right
> angle to view it. However it might be a mouthful to explain.
>
> I am fine with removing it, on the other hand it removes the only bit
> of secret sauce which I try to give the reader a glimpse of my
> intuition of the swap table.

Perhaps you could describe the swap table as similar to a PTE page table
representing the swap cache mapping.
That is correct for most 32-bit and 64-bit systems,
but not for every machine.

The only exception is a 32-bit system with a 64-bit physical address
(Large Physical Address Extension, LPAE), which uses a 4 KB PTE table
but a 2 KB swap table because the pointer is 32 bit while each page
table entry is 64 bit.

Maybe we can simply say that the number of entries in the swap table
is the same as in a PTE page table?

>
> Thanks for catching that.
>
> Chris

Thanks
Barry
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Chris Li 2 weeks ago
On Wed, Sep 17, 2025 at 4:38 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > > This approach still seems to work, so the 32-bit system appears to be
> > > the only exception. However, I’m not entirely sure that your description
> > > of “the second last level” is correct. I believe it refers to the PTE,
> > > which corresponds to the last level, not the second-to-last.
> > > In other words, how do you define the second-to-last level page table?
> >
> > The second-to-last level page table page holds the PMD. The last level
> > page table holds PTE.
> > Cluster size is HPAGE_PMD_NR = 1<<HPAGE_PMD_ORDER
> > I was thinking of a PMD entry but the actual page table page it points
> > to is the last level.
> > That is a good catch. Let me see how to fix it.
> >
> > What I am trying to say is that, swap table size should match to the
> > PTE page table page size which determines the cluster size. An
> > alternative to understanding the swap table is that swap table is a
> > shadow PTE page table containing the shadow PTE matching to the page
> > that gets swapped out to the swapfile. It is arranged in the swapfile
> > swap offset order. The intuition is simple once you find the right
> > angle to view it. However it might be a mouthful to explain.
> >
> > I am fine with removing it, on the other hand it removes the only bit
> > of secret sauce which I try to give the reader a glimpse of my
> > intuition of the swap table.
>
> Perhaps you could describe the swap table as similar to a PTE page table
> representing the swap cache mapping.

It is hard to qualify what "similar" means, and in what way it is similar.
Different readers will have different interpretations of what similar
means to them.

> That is correct for most 32-bit and 64-bit systems,
> but not for every machine.

I think I will leave it as: for most 64 bit systems, the swap table
size is exactly one page table page size, and that is not coincidental.

> The only exception is a 32-bit system with a 64-bit physical address
> (Large Physical Address Extension, LPAE), which uses a 4 KB PTE table
> but a 2 KB swap table because the pointer is 32 bit while each page
> table entry is 64 bit.

I feel that is very much a corner case, so I will leave it out of the
document. I want to present a simplified, abstract view. There is
always more detail that distracts from the simple abstract view. That
is why we have physics.

> Maybe we can simply say that the number of entries in the swap table
> is the same as in a PTE page table?

Yes, that is what I want to say, for most modern 64 bit systems.

Chris
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Chris Li 2 weeks ago
Hi Barry,

How about this:

A swap table stores one cluster worth of swap cache values, which is
exactly one page table page on most modern 64-bit systems. This is not
coincidental because the cluster size is determined by the huge page size.
The swap table is holding an array of pointers, which have the same
size as the PTE. The size of the swap table should match the page table
page.

If that sounds OK, I will send an incremental patch to Andrew.

Chris

Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Barry Song 2 weeks ago
On Thu, Sep 18, 2025 at 3:03 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Barry,
>
> How about this:
>
> A swap table stores one cluster worth of swap cache values, which is
> exactly one page table page on most modern 64-bit systems. This is not
> coincidental because the cluster size is determined by the huge page size.

I’d phrase it as “PMD huge page,” since we also have “PUD huge page.”

> The swap table is holding an array of pointers, which have the same
> size as the PTE. The size of the swap table should match the page table
> page.
>

I’m not entirely sure what you mean by “page table page.”
My understanding is that you’re saying:
The swap table contains an array of pointers, each the same size as a PTE,
so its total size typically matches a PTE page table—one page on modern
64-bit systems.

> If that sounds OK, I will send an incremental patch to Andrew.
>
> Chris

Thanks
Barry
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Chris Li 1 week, 6 days ago
On Thu, Sep 18, 2025 at 1:59 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Sep 18, 2025 at 3:03 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Barry,
> >
> > How about this:
> >
> > A swap table stores one cluster worth of swap cache values, which is
> > exactly one page table page on most modern 64-bit systems. This is not
> > coincidental because the cluster size is determined by the huge page size.
>
> I’d phrase it as “PMD huge page,” since we also have “PUD huge page.”

Good point. Will do.

>
> > The swap table is holding an array of pointers, which have the same
> > size as the PTE. The size of the swap table should match the page table
> > page.
> >
>
> I’m not entirely sure what you mean by “page table page.”

The page that is pointed to by the page table, or the page that holds the PTEs.

> My understanding is that you’re saying:
> The swap table contains an array of pointers, each the same size as a PTE,
> so its total size typically matches a PTE page table—one page on modern
> 64-bit systems.

That sounds good. Thanks for the suggestion.
I took your suggestion with some small modifications, mostly to
clarify that the size in question is the size of the swap table for
one cluster. The total size of all swap tables in a swap file is much
bigger.

How about this:

A swap table is an array of pointers. Each pointer is the same size as a PTE.
The size of a swap table for one swap cluster typically matches a PTE page
table, which is one page on modern 64-bit systems.
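
As a rough illustration of the size arithmetic discussed in this thread, here
is a minimal sketch. It assumes 4 KB pages and a PMD order of 9 (512 entries
per cluster), and the helper name table_bytes() is hypothetical, not a kernel
API. It only multiplies the number of entries by the entry size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel API): size in bytes of a table
 * holding (1 << order) entries of entry_size bytes each.
 */
static size_t table_bytes(unsigned int order, size_t entry_size)
{
	/* Keep the shift well defined on any platform. */
	assert(order < 8 * sizeof(size_t));
	return ((size_t)1 << order) * entry_size;
}

/* Assumed value for illustration: HPAGE_PMD_ORDER is 9 with 4 KB pages. */
enum { EXAMPLE_PMD_ORDER = 9 };

/* Swap table for one cluster: one pointer-sized slot per swap entry. */
static size_t swap_table_bytes(size_t pointer_size)
{
	return table_bytes(EXAMPLE_PMD_ORDER, pointer_size);
}

/* PTE page table: one PTE per page covered by a PMD entry. */
static size_t pte_table_bytes(size_t pte_size)
{
	return table_bytes(EXAMPLE_PMD_ORDER, pte_size);
}
```

With 8-byte pointers and 8-byte PTEs both tables come to 4096 bytes, which is
one page. With 4-byte pointers and 8-byte PTEs (the LPAE case raised earlier),
the swap table is 2048 bytes while the PTE table stays at 4096 bytes.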

Chris
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Barry Song 1 week, 6 days ago
> > I’m not entirely sure what you mean by “page table page.”
>
> The page that gets pointed by the page table, or the page that holds the PTE.
>
> > My understanding is that you’re saying:
> > The swap table contains an array of pointers, each the same size as a PTE,
> > so its total size typically matches a PTE page table—one page on modern
> > 64-bit systems.
>
> That sounds good. Thanks for the suggestion.
> I take your suggestion with some small modifications, mostly to
> clarify the total size is the total size of one cluster of swap
> tables. The total size of all swap tables in a swap file is much
> bigger.
>
> How about this:
>
> A swap table is an array of pointers. Each pointer is the same size as a PTE.
> The size of a swap table for one swap cluster typically matches a PTE page
> table, which is one page on modern 64-bit systems.

Acked.

>
> Chris

Thanks
Barry
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Chris Li 1 week, 4 days ago
Hi Andrew,

Can you please apply this incremental fix up commit on the document
patch of the swap table series?
Just folding it into the original patch is fine.

Here is the change log:
- Move the swap table document to the mm main section. [SeongJae Park]
- Rewrite the swap table size sentence to make it easier to understand. [Barry]

Thanks

Chris

Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Barry Song 2 weeks ago
> Perhaps you could describe the swap table as similar to a PTE page table
> representing the swap cache mapping.
> That is correct for most 32-bit and 64-bit systems,
> but not for every machine.
>
> The only exception is a 32-bit system with a 64-bit physical address
> (Large Physical Address Extension, LPAE), which uses a 4 KB PTE table
> but a 2 KB swap table because the pointer is 32 bit while each page
> table entry is 64 bit.
>
> Maybe we can simply say that the number of entries in the swap table
> is the same as in a PTE page table?

BTW, as Kairui mentioned, you plan to store the PFN instead of a
pointer in phase 2.

I wonder whether we need to switch to atomic64_t on systems where the
physical address is 64 bit but the virtual address is 32 bit :-)
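
To make the width question concrete, here is a rough sketch. It is only an
illustration: the helper names are hypothetical and not kernel APIs, and how
many bits a swap table entry would reserve for flags is not specified in this
thread. It computes a PFN from a physical address and checks whether that PFN
fits in a given number of bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: page frame number of a physical address. */
static uint64_t phys_to_pfn(uint64_t phys, unsigned int page_shift)
{
	assert(page_shift < 64);
	return phys >> page_shift;
}

/* Hypothetical helper: nonzero if pfn is representable in `bits` bits. */
static int pfn_fits(uint64_t pfn, unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	/* Shifting a 64-bit value by 64 is undefined, so handle it first. */
	return bits == 64 || (pfn >> bits) == 0;
}
```

For example, with 4 KB pages (page_shift of 12) the highest address in a
40-bit physical address space has a PFN that fits in 32 bits, while an address
at 1 << 44 does not. Any bits an entry reserves for other purposes would
reduce the space left for the PFN.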

Thanks
Barry
Re: [PATCH v4 01/15] docs/mm: add document for swap table
Posted by Chris Li 2 weeks ago
On Wed, Sep 17, 2025 at 4:50 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > Perhaps you could describe the swap table as similar to a PTE page table
> > representing the swap cache mapping.
> > That is correct for most 32-bit and 64-bit systems,
> > but not for every machine.
> >
> > The only exception is a 32-bit system with a 64-bit physical address
> > (Large Physical Address Extension, LPAE), which uses a 4 KB PTE table
> > but a 2 KB swap table because the pointer is 32 bit while each page
> > table entry is 64 bit.
> >
> > Maybe we can simply say that the number of entries in the swap table
> > is the same as in a PTE page table?
>
> BTW, as Kairui mentioned, you plan to store the PFN instead of a
> pointer in phase 2.

Yes, let's update the document then and only then. Otherwise the
document will not match the code and will confuse the reader.

>
> I wonder whether we need to switch to atomic64_t on systems where the
> physical address is 64 bit but the virtual address is 32 bit :-)

It is possible we need 64 bit for the swap cache anyway for other
reasons when we get into the later phases. Again, let's deal with it
later.

Chris