From: Kairui Song <kasong@tencent.com>
Instead of checking the poison flag only in the fast swap cache lookup
path, always check the poison flag after locking a swap cache folio.

There are two reasons to do so.

First, until it is locked, the folio is unstable and can be removed
from the swap cache at any time, so it is entirely possible that the
folio is no longer the backing folio of the swap entry and is instead
an unrelated poisoned folio. We might then mistakenly kill the faulting
process.

Second, it is possible, and even common, for the slow swap-in path
(swapin_readahead) to bring in a cached folio. That cached folio could
be poisoned, too. Checking the poison flag only in the fast path will
miss such folios.

The race window in the first case is tiny, so it is very unlikely to
happen in practice. While at it, also add an unlikely() annotation.
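
With that, the relevant part of do_swap_page() ends up looking roughly
like the following (a simplified sketch, not the exact resulting code):

        folio = swap_cache_get_folio(entry);    /* fast path lookup */
        if (!folio)
                folio = swapin_readahead(...);  /* slow path, may also return a cached folio */

        ret |= folio_lock_or_retry(folio, vmf); /* folio is only stable once locked */
        page = folio_file_page(folio, swp_offset(entry));

        if (swapcache) {
                /* re-check that the locked folio still backs this swap entry */
                if (unlikely(!folio_test_swapcache(folio) ||
                             page_swap_entry(page).val != entry.val))
                        goto out_page;
                /* only now is the poison flag known to belong to our page */
                if (unlikely(PageHWPoison(page))) {
                        ret = VM_FAULT_HWPOISON;
                        goto out_page;
                }
        }
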
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 10ef528a5f44..94a5928e8ace 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4661,10 +4661,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                 goto out;
 
         folio = swap_cache_get_folio(entry);
-        if (folio) {
+        if (folio)
                 swap_update_readahead(folio, vma, vmf->address);
-                page = folio_file_page(folio, swp_offset(entry));
-        }
         swapcache = folio;
 
         if (!folio) {
@@ -4735,20 +4733,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                 ret = VM_FAULT_MAJOR;
                 count_vm_event(PGMAJFAULT);
                 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
-                page = folio_file_page(folio, swp_offset(entry));
-        } else if (PageHWPoison(page)) {
-                /*
-                 * hwpoisoned dirty swapcache pages are kept for killing
-                 * owner processes (which may be unknown at hwpoison time)
-                 */
-                ret = VM_FAULT_HWPOISON;
-                goto out_release;
         }
 
         ret |= folio_lock_or_retry(folio, vmf);
         if (ret & VM_FAULT_RETRY)
                 goto out_release;
 
+        page = folio_file_page(folio, swp_offset(entry));
         if (swapcache) {
                 /*
                  * Make sure folio_free_swap() or swapoff did not release the
@@ -4761,6 +4752,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                           page_swap_entry(page).val != entry.val))
                         goto out_page;
 
+                if (unlikely(PageHWPoison(page))) {
+                        /*
+                         * hwpoisoned dirty swapcache pages are kept for killing
+                         * owner processes (which may be unknown at hwpoison time)
+                         */
+                        ret = VM_FAULT_HWPOISON;
+                        goto out_page;
+                }
+
                 /*
                  * KSM sometimes has to copy on read faults, for example, if
                  * folio->index of non-ksm folios would be nonlinear inside the
--
2.51.0
On 05.09.25 21:13, Kairui Song wrote:
> [...]

LGTM, but I was wondering whether we just want to check that even when
we just allocated a fresh folio for simplicity. The check is cheap ...

--
Cheers

David / dhildenb
On Mon, Sep 8, 2025 at 8:40 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 05.09.25 21:13, Kairui Song wrote:
> > [...]
>
> LGTM, but I was wondering whether we just want to check that even when

Thanks for checking the patch.

> we just allocated a fresh folio for simplicity. The check is cheap ...

Maybe not for now? This patch expects folio_test_swapcache to filter
out potentially irrelevant folios, so moving the check before that is
in theory not correct. And the folio_test_swapcache check won't work
for a freshly allocated folio here...

I'm planning to remove the whole `if (swapcache)` check in phase 2, as
all swapin will go through the swap cache. By that time all checks will
always be applied, and the simplification will be done in a cleaner way.
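
In other words, roughly (a simplified sketch, not the exact code):

        swapcache = folio;      /* NULL if nothing was found in the swap cache */
        if (!folio) {
                /* the path that allocates a fresh folio leaves swapcache == NULL,
                 * so the validation block below is skipped entirely */
        }
        ...
        if (swapcache) {
                /* 1) folio_test_swapcache() + swap entry check: make sure the
                 *    locked folio still backs this entry */
                /* 2) only then PageHWPoison(): a fresh folio was never in the
                 *    swap cache, so step 1 cannot validate it */
        }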
On 09.09.25 16:54, Kairui Song wrote:
> On Mon, Sep 8, 2025 at 8:40 PM David Hildenbrand <david@redhat.com> wrote:
>> [...]
>> LGTM, but I was wondering whether we just want to check that even when
>> we just allocated a fresh folio for simplicity. The check is cheap ...
>
> Maybe not for now? This patch expects folio_test_swapcache to filter
> out potentially irrelevant folios, so moving the check before that is
> in theory not correct. And the folio_test_swapcache check won't work
> for a freshly allocated folio here...
>
> I'm planning to remove the whole `if (swapcache)` check in phase 2, as
> all swapin will go through the swap cache. By that time all checks will
> always be applied, and the simplification will be done in a cleaner way.

Fair enough :)

--
Cheers

David / dhildenb
Separating this patch out makes it easier to read for me. Thank you.
The last V1 mixed diff was very messy. My last attempt missed that
out_page will unlock the HWPoison page as well. Now it is obviously
correct to me.

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
> [...]
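
For reference, the error labels in do_swap_page() roughly fall through
like this (abridged from my reading of mm/memory.c, details may vary
between kernel versions), which is why taking out_page matters: it
unlocks the folio before the reference is dropped, while jumping to
out_release alone would leave the poisoned folio locked:

        out_page:
                folio_unlock(folio);
        out_release:
                folio_put(folio);
                ...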