mm, swap: free the cluster extend table on teardown

[PATCH] mm, swap: free the cluster extend table on teardown

Posted by David Carlier 5 days, 5 hours ago

swap_cluster_free_table() frees every per-cluster side table but
ci->extend_table. That table is only released by
swap_extend_table_try_free(), which the teardown path never calls, so a
cluster can be freed with an extend table still attached.

It can also linger while the cluster is live. swap_dup_entries_cluster()
drops the lock to allocate an extend table when a slot reaches
SWP_TB_COUNT_MAX - 1, then retries. If the count dropped in the meantime,
the retry takes the normal path and leaves the table behind, all entries
zero; only the failure path frees it.

Since a swap_cluster_info is reused in place and swap_extend_table_alloc()
skips allocation when ci->extend_table is set, the next user of the
cluster inherits the stale table and its leftover counts, corrupting the
swap count of any slot that overflows. CONFIG_DEBUG_VM catches the
dangling table in swap_cluster_assert_empty(); otherwise it is silent.

Free it in swap_cluster_free_table(), and also on the
swap_dup_entries_cluster() success path to match the failure path.

Reported-by: syzbot+deedf22929084640666f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=deedf22929084640666f
Fixes: 0d6af9bcf383 ("mm, swap: use the swap table to track the swap count")
Cc: <stable@vger.kernel.org>
Signed-off-by: David Carlier <devnexen@gmail.com>
---
 mm/swapfile.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 615d90867111..a69a26aec4c0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -432,6 +432,9 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
 	ci->zero_bitmap = NULL;
 #endif
 
+	kfree(ci->extend_table);
+	ci->extend_table = NULL;
+
 	table = (struct swap_table *)rcu_access_pointer(ci->table);
 	if (!table)
 		return;
@@ -1711,6 +1714,7 @@ static int swap_dup_entries_cluster(struct swap_info_struct *si,
 			goto failed;
 		}
 	} while (++ci_off < ci_end);
+	swap_extend_table_try_free(ci);
 	swap_cluster_unlock(ci);
 	return 0;
 failed:
-- 
2.53.0
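
The retry window described in the commit message can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel code: `struct dup_model`, `model_dup()`, `model_extend_try_free()`, `MODEL_SLOTS` and `MODEL_COUNT_MAX` are hypothetical names, the lock is not modelled, and the `racing_put` flag stands in for another task lowering the count while the lock is dropped.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_SLOTS	8	/* hypothetical cluster size */
#define MODEL_COUNT_MAX	3	/* hypothetical stand-in for SWP_TB_COUNT_MAX */

/* Hypothetical, heavily simplified stand-in for struct swap_cluster_info. */
struct dup_model {
	unsigned char count[MODEL_SLOTS];	/* inline per-slot counts */
	unsigned long *extend_table;		/* overflow counts, allocated on demand */
};

/* Free the extend table only when every entry is zero. */
static void model_extend_try_free(struct dup_model *ci)
{
	if (!ci->extend_table)
		return;
	for (int i = 0; i < MODEL_SLOTS; i++)
		if (ci->extend_table[i])
			return;
	free(ci->extend_table);
	ci->extend_table = NULL;
}

/*
 * Duplicate one slot. racing_put simulates a concurrent put during the
 * unlocked allocation; free_on_success models the extra call the patch
 * adds on the success path.
 */
static int model_dup(struct dup_model *ci, int slot, bool racing_put,
		     bool free_on_success)
{
retry:
	if (ci->count[slot] == MODEL_COUNT_MAX - 1 && !ci->extend_table) {
		/* The real code drops the cluster lock around the allocation. */
		ci->extend_table = calloc(MODEL_SLOTS, sizeof(*ci->extend_table));
		if (!ci->extend_table)
			return -1;
		if (racing_put) {
			ci->count[slot]--;	/* count dropped while unlocked */
			racing_put = false;
		}
		goto retry;
	}

	if (ci->count[slot] < MODEL_COUNT_MAX - 1)
		ci->count[slot]++;		/* normal path: table untouched */
	else
		ci->extend_table[slot]++;	/* overflow path: table in use */

	if (free_on_success)
		model_extend_try_free(ci);
	return 0;
}
```

In this model, a racing put with `free_on_success` false leaves an all-zero table attached to the cluster. With `free_on_success` true the table is released again, and a table that holds a real overflow count is kept.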

Re: [PATCH] mm, swap: free the cluster extend table on teardown

Posted by Andrew Morton 4 days, 7 hours ago

On Tue,  2 Jun 2026 23:23:57 +0100 David Carlier <devnexen@gmail.com> wrote:

> swap_cluster_free_table() frees every per-cluster side table but
> ci->extend_table. That table is only released by
> swap_extend_table_try_free(), which the teardown path never calls, so a
> cluster can be freed with an extend table still attached.
> 
> It can also linger while the cluster is live. swap_dup_entries_cluster()
> drops the lock to allocate an extend table when a slot reaches
> SWP_TB_COUNT_MAX - 1, then retries. If the count dropped in the meantime,
> the retry takes the normal path and leaves the table behind, all entries
> zero; only the failure path frees it.
> 
> Since a swap_cluster_info is reused in place and swap_extend_table_alloc()
> skips allocation when ci->extend_table is set, the next user of the
> cluster inherits the stale table and its leftover counts, corrupting the
> swap count of any slot that overflows. CONFIG_DEBUG_VM catches the
> dangling table in swap_cluster_assert_empty(); otherwise it is silent.
> 
> Free it in swap_cluster_free_table(), and also on the
> swap_dup_entries_cluster() success path to match the failure path.

This all sounds rather horrid.  We have no description of how this all
manifests for the user, but I assume "badly"?

> Reported-by: syzbot+deedf22929084640666f@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=deedf22929084640666f
> Fixes: 0d6af9bcf383 ("mm, swap: use the swap table to track the swap count")
> Cc: <stable@vger.kernel.org>

First merged in 7.1-rc1 so no cc:stable should be needed, if we upstream a fix
promptly.

> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -432,6 +432,9 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
>  	ci->zero_bitmap = NULL;
>  #endif
>  
> +	kfree(ci->extend_table);
> +	ci->extend_table = NULL;
> +
>  	table = (struct swap_table *)rcu_access_pointer(ci->table);
>  	if (!table)
>  		return;
> @@ -1711,6 +1714,7 @@ static int swap_dup_entries_cluster(struct swap_info_struct *si,
>  			goto failed;
>  		}
>  	} while (++ci_off < ci_end);
> +	swap_extend_table_try_free(ci);
>  	swap_cluster_unlock(ci);
>  	return 0;
>  failed:

AI review flagged a possible issue:
	https://sashiko.dev/#/patchset/20260602222358.49061-1-devnexen@gmail.com

Re: [PATCH] mm, swap: free the cluster extend table on teardown

Posted by Kairui Song 5 days, 1 hour ago

On Wed, Jun 3, 2026 at 6:27 AM David Carlier <devnexen@gmail.com> wrote:
>
> swap_cluster_free_table() frees every per-cluster side table but
> ci->extend_table. That table is only released by
> swap_extend_table_try_free(), which the teardown path never calls, so a
> cluster can be freed with an extend table still attached.
>
> It can also linger while the cluster is live. swap_dup_entries_cluster()
> drops the lock to allocate an extend table when a slot reaches
> SWP_TB_COUNT_MAX - 1, then retries. If the count dropped in the meantime,
> the retry takes the normal path and leaves the table behind, all entries
> zero; only the failure path frees it.
>
> Since a swap_cluster_info is reused in place and swap_extend_table_alloc()
> skips allocation when ci->extend_table is set, the next user of the
> cluster inherits the stale table and its leftover counts, corrupting the
> swap count of any slot that overflows. CONFIG_DEBUG_VM catches the

There won't be a corruption, extend_table is all zero at this point,
the leak on swapoff is real though.

> dangling table in swap_cluster_assert_empty(); otherwise it is silent.
>
> Free it in swap_cluster_free_table(), and also on the
> swap_dup_entries_cluster() success path to match the failure path.
>
> Reported-by: syzbot+deedf22929084640666f@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=deedf22929084640666f
> Fixes: 0d6af9bcf383 ("mm, swap: use the swap table to track the swap count")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Carlier <devnexen@gmail.com>
> ---
>  mm/swapfile.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 615d90867111..a69a26aec4c0 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -432,6 +432,9 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
>         ci->zero_bitmap = NULL;
>  #endif
>
> +       kfree(ci->extend_table);
> +       ci->extend_table = NULL;
> +

Still a bit too late to avoid the WARN? The WARN is already triggered
at this point, swap_cluster_free_table is called after
swap_cluster_assert_empty.

>         table = (struct swap_table *)rcu_access_pointer(ci->table);
>         if (!table)
>                 return;
> @@ -1711,6 +1714,7 @@ static int swap_dup_entries_cluster(struct swap_info_struct *si,
>                         goto failed;
>                 }
>         } while (++ci_off < ci_end);
> +       swap_extend_table_try_free(ci);
>         swap_cluster_unlock(ci);
>         return 0;
>  failed:
> --
> 2.53.0

I think we have already fixed this?
https://lore.kernel.org/all/6a1eac8e.fbc46276.3c3783.0008.GAE@google.com/T/
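
The ordering remark above (the emptiness check runs before the table is freed) can be illustrated with a minimal userspace sketch. All names here (`struct teardown_model`, `teardown_warn_count`, `teardown_assert_empty()`, `teardown_free_table()`, `teardown_cluster()`) are hypothetical stand-ins. The call order is assumed from the comment in this thread, not taken from the kernel source.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the part of the cluster that matters here. */
struct teardown_model {
	unsigned long *extend_table;
};

/* Counts how often the modelled debug check would have warned. */
static int teardown_warn_count;

/* Models the debug check: a still-attached extend table triggers a warning. */
static void teardown_assert_empty(const struct teardown_model *ci)
{
	if (ci->extend_table)
		teardown_warn_count++;
}

/* Models the patched free helper: releases the extend table. */
static void teardown_free_table(struct teardown_model *ci)
{
	free(ci->extend_table);
	ci->extend_table = NULL;
}

/* Assumed order: check first, free second. */
static void teardown_cluster(struct teardown_model *ci)
{
	teardown_assert_empty(ci);
	teardown_free_table(ci);
}
```

With this order, freeing inside the free helper plugs the leak but cannot suppress the warning, because the check has already seen the table. That is consistent with addressing the problem earlier, at the allocation site.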

Re: [PATCH] mm, swap: free the cluster extend table on teardown

Posted by David CARLIER 4 days, 7 hours ago

On Wed, 3 Jun 2026 at 03:42, Kairui Song <ryncsn@gmail.com> wrote:
>
> On Wed, Jun 3, 2026 at 6:27 AM David Carlier <devnexen@gmail.com> wrote:
> >
> > swap_cluster_free_table() frees every per-cluster side table but
> > ci->extend_table. That table is only released by
> > swap_extend_table_try_free(), which the teardown path never calls, so a
> > cluster can be freed with an extend table still attached.
> >
> > It can also linger while the cluster is live. swap_dup_entries_cluster()
> > drops the lock to allocate an extend table when a slot reaches
> > SWP_TB_COUNT_MAX - 1, then retries. If the count dropped in the meantime,
> > the retry takes the normal path and leaves the table behind, all entries
> > zero; only the failure path frees it.
> >
> > Since a swap_cluster_info is reused in place and swap_extend_table_alloc()
> > skips allocation when ci->extend_table is set, the next user of the
> > cluster inherits the stale table and its leftover counts, corrupting the
> > swap count of any slot that overflows. CONFIG_DEBUG_VM catches the
>
> There won't be a corruption, extend_table is all zero at this point,
> the leak on swapoff is real though.
>
> > dangling table in swap_cluster_assert_empty(); otherwise it is silent.
> >
> > Free it in swap_cluster_free_table(), and also on the
> > swap_dup_entries_cluster() success path to match the failure path.
> >
> > Reported-by: syzbot+deedf22929084640666f@syzkaller.appspotmail.com
> > Closes: https://syzkaller.appspot.com/bug?extid=deedf22929084640666f
> > Fixes: 0d6af9bcf383 ("mm, swap: use the swap table to track the swap count")
> > Cc: <stable@vger.kernel.org>
> > Signed-off-by: David Carlier <devnexen@gmail.com>
> > ---
> >  mm/swapfile.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 615d90867111..a69a26aec4c0 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -432,6 +432,9 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
> >         ci->zero_bitmap = NULL;
> >  #endif
> >
> > +       kfree(ci->extend_table);
> > +       ci->extend_table = NULL;
> > +
>
> Still a bit too late to avoid the WARN? The WARN is already triggered
> at this point, swap_cluster_free_table is called after
> swap_cluster_assert_empty.
>
> >         table = (struct swap_table *)rcu_access_pointer(ci->table);
> >         if (!table)
> >                 return;
> > @@ -1711,6 +1714,7 @@ static int swap_dup_entries_cluster(struct swap_info_struct *si,
> >                         goto failed;
> >                 }
> >         } while (++ci_off < ci_end);
> > +       swap_extend_table_try_free(ci);
> >         swap_cluster_unlock(ci);
> >         return 0;
> >  failed:
> > --
> > 2.53.0
>
> I think we have already fixed this?
> https://lore.kernel.org/all/6a1eac8e.fbc46276.3c3783.0008.GAE@google.com/T/


Thanks for the review.

Agreed on all counts. 0475fde0f68d already addresses both the warning
and the swapoff leak at the allocation site, so this patch is
redundant. Please drop it.

Andrew, you're right that no cc:stable was warranted here.

Cheers.