mm, swap: never bypass swap cache and cleanup flags (swap table phase II)

[PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices

Posted by Kairui Song 3 months, 1 week ago

From: Kairui Song <kasong@tencent.com>

Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side
effect is that a folio may stay in swap cache for a longer time due to
lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios
are being swapped out very frequently right after swapin, hence improving
the performance. But the long pinning of swap slots also increases the
fragmentation rate of the swap device significantly, and currently,
all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also
causes the backing memory to be pinned, increasing the memory pressure.

So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices
after swapin finishes. Swap cache has served its role as a
synchronization layer to prevent any parallel swapin from wasting
CPU or memory allocation, and the redundant IO is not a major concern
for SWP_SYNCHRONOUS_IO devices.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 9a43d4811781..78457347ae60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
-static inline bool should_try_to_free_swap(struct folio *folio,
+static inline bool should_try_to_free_swap(struct swap_info_struct *si,
+					   struct folio *folio,
 					   struct vm_area_struct *vma,
 					   unsigned int fault_flags)
 {
 	if (!folio_test_swapcache(folio))
 		return false;
+	/*
+	 * Try to free swap cache for SWP_SYNCHRONOUS_IO devices.
+	 * Redundant IO is unlikely to be an issue for them, but a
+	 * slot being pinned by swap cache may cause more fragmentation
+	 * and delayed freeing of swap metadata.
+	 */
+	if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+		return true;
 	if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
 	    folio_test_mlocked(folio))
 		return true;
@@ -4935,7 +4944,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * yet.
 	 */
 	swap_free_nr(entry, nr_pages);
-	if (should_try_to_free_swap(folio, vma, vmf->flags))
+	if (should_try_to_free_swap(si, folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
 	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);

-- 
2.51.1

Re: [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices

Posted by Barry Song 3 months, 1 week ago

On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side
> effect is that a folio may stay in swap cache for a longer time due to
> lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios
> are being swapped out very frequently right after swapin, hence improving
> the performance. But the long pinning of swap slots also increases the
> fragmentation rate of the swap device significantly, and currently,
> all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also
> causes the backing memory to be pinned, increasing the memory pressure.
>
> So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices
> after swapin finishes. Swap cache has served its role as a
> synchronization layer to prevent any parallel swapin from wasting
> CPU or memory allocation, and the redundant IO is not a major concern
> for SWP_SYNCHRONOUS_IO devices.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9a43d4811781..78457347ae60 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>         return 0;
>  }
>
> -static inline bool should_try_to_free_swap(struct folio *folio,
> +static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> +                                          struct folio *folio,
>                                            struct vm_area_struct *vma,
>                                            unsigned int fault_flags)
>  {
>         if (!folio_test_swapcache(folio))
>                 return false;
> +       /*
> +        * Try to free swap cache for SWP_SYNCHRONOUS_IO devices.
> +        * Redundant IO is unlikely to be an issue for them, but a
> +        * slot being pinned by swap cache may cause more fragmentation
> +        * and delayed freeing of swap metadata.
> +        */

I don’t like the claim about “redundant I/O” — it sounds misleading. Those
I/Os are not redundant; they are simply saved by swapcache, which prevents
some swap-out I/O when a recently swap-in folio is swapped out again.

So, could we make it a bit more specific in both the comment and the commit
message?

Thanks
Barry

Re: [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices

Posted by Barry Song 3 months, 1 week ago

On Tue, Nov 4, 2025 at 12:19 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side
> > effect is that a folio may stay in swap cache for a longer time due to
> > lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios
> > are being swapped out very frequently right after swapin, hence improving
> > the performance. But the long pinning of swap slots also increases the
> > fragmentation rate of the swap device significantly, and currently,
> > all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also
> > causes the backing memory to be pinned, increasing the memory pressure.
> >
> > So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices
> > after swapin finishes. Swap cache has served its role as a
> > synchronization layer to prevent any parallel swapin from wasting
> > CPU or memory allocation, and the redundant IO is not a major concern
> > for SWP_SYNCHRONOUS_IO devices.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/memory.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9a43d4811781..78457347ae60 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> >         return 0;
> >  }
> >
> > -static inline bool should_try_to_free_swap(struct folio *folio,
> > +static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > +                                          struct folio *folio,
> >                                            struct vm_area_struct *vma,
> >                                            unsigned int fault_flags)
> >  {
> >         if (!folio_test_swapcache(folio))
> >                 return false;
> > +       /*
> > +        * Try to free swap cache for SWP_SYNCHRONOUS_IO devices.
> > +        * Redundant IO is unlikely to be an issue for them, but a
> > +        * slot being pinned by swap cache may cause more fragmentation
> > +        * and delayed freeing of swap metadata.
> > +        */
>
> I don’t like the claim about “redundant I/O” — it sounds misleading. Those
> I/Os are not redundant; they are simply saved by swapcache, which prevents
> some swap-out I/O when a recently swap-in folio is swapped out again.
>
> So, could we make it a bit more specific in both the comment and the commit
> message?

Sorry, on second thought—consider a case where process A mmaps 100 MB and writes
to it to populate memory, then forks process B. If that 100 MB gets swapped out,
and A and B later swap it in separately for reading, with this change it seems
they would each get their own 100 MB copy (total 2 × 100 MB), whereas previously
they could share the same 100 MB?

Thanks
Barry

Re: [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices

Posted by Kairui Song 3 months, 1 week ago

On Tue, Nov 4, 2025 at 4:27 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Nov 4, 2025 at 12:19 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side
> > > effect is that a folio may stay in swap cache for a longer time due to
> > > lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios
> > > are being swapped out very frequently right after swapin, hence improving
> > > the performance. But the long pinning of swap slots also increases the
> > > fragmentation rate of the swap device significantly, and currently,
> > > all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also
> > > causes the backing memory to be pinned, increasing the memory pressure.
> > >
> > > So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices
> > > after swapin finishes. Swap cache has served its role as a
> > > synchronization layer to prevent any parallel swapin from wasting
> > > CPU or memory allocation, and the redundant IO is not a major concern
> > > for SWP_SYNCHRONOUS_IO devices.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  mm/memory.c | 13 +++++++++++--
> > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 9a43d4811781..78457347ae60 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > >         return 0;
> > >  }
> > >
> > > -static inline bool should_try_to_free_swap(struct folio *folio,
> > > +static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > > +                                          struct folio *folio,
> > >                                            struct vm_area_struct *vma,
> > >                                            unsigned int fault_flags)
> > >  {
> > >         if (!folio_test_swapcache(folio))
> > >                 return false;
> > > +       /*
> > > +        * Try to free swap cache for SWP_SYNCHRONOUS_IO devices.
> > > +        * Redundant IO is unlikely to be an issue for them, but a
> > > +        * slot being pinned by swap cache may cause more fragmentation
> > > +        * and delayed freeing of swap metadata.
> > > +        */
> >
> > I don’t like the claim about “redundant I/O” — it sounds misleading. Those
> > I/Os are not redundant; they are simply saved by swapcache, which prevents
> > some swap-out I/O when a recently swap-in folio is swapped out again.
> >
> > So, could we make it a bit more specific in both the comment and the commit
> > message?
>
> Sorry, on second thought—consider a case where process A mmaps 100 MB and writes
> to it to populate memory, then forks process B. If that 100 MB gets swapped out,
> and A and B later swap it in separately for reading, with this change it seems
> they would each get their own 100 MB copy (total 2 × 100 MB), whereas previously
> they could share the same 100 MB?

It's a bit tricky here, folio_free_swap only frees the swap cache if a
folio's swap count is 0, so if A swapin these folios first, the swap
cache won't be freed until B also mapped these folios and reduced the
swap count.

And this function is called should_try_to_free_swap: it's only trying
to free the swap cache if swap count == 0. I think I can add some
comments on that.