From: Kairui Song <kasong@tencent.com>
If a shmem read request's index points to the middle of a large swap
entry, shmem swapin will do the swap cache lookup using the large
swap entry's starting value (which is the first sub swap entry of this
large entry). This leads to false positive lookup results if only
the first few swap entries are cached but the actual requested swap
entry pointed to by the index is uncached. This is not a rare event, as
swap readahead always tries to cache order 0 folios when possible.

Currently, when this occurs, shmem splits the large entry, aborts
due to a mismatching folio swap value, then retries the swapin from
the beginning. This wastes CPU and adds wrong info to the readahead
statistics.

This can be optimized easily by doing the lookup using the correct
sub swap entry value.
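
For intuition, the calculation this patch adds can be modeled as a
stand-alone toy program (round_down() written out by hand, all values
made up; this is an illustration, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int order = 4;			/* 16-page large entry */
	unsigned long index = 5;		/* faulting page index */
	unsigned long head_val = 0x4000;	/* large entry's first sub entry */

	/* offset = index - round_down(index, 1 << order) */
	unsigned long offset = index - (index & ~((1UL << order) - 1));

	/* the sub swap entry the index actually refers to */
	printf("lookup value: 0x%lx\n", head_val + offset);	/* 0x4005 */
	return 0;
}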
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 217264315842..2ab214e2771c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2274,14 +2274,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	pgoff_t offset;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
-	swap = index_entry = radix_to_swp_entry(*foliop);
+	index_entry = radix_to_swp_entry(*foliop);
+	swap = index_entry;
 	*foliop = NULL;
 
-	if (is_poisoned_swp_entry(swap))
+	if (is_poisoned_swp_entry(index_entry))
 		return -EIO;
 
-	si = get_swap_device(swap);
-	order = shmem_confirm_swap(mapping, index, swap);
+	si = get_swap_device(index_entry);
+	order = shmem_confirm_swap(mapping, index, index_entry);
 	if (unlikely(!si)) {
 		if (order < 0)
 			return -EEXIST;
@@ -2293,6 +2294,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		return -EEXIST;
 	}
 
+	/* index may point to the middle of a large entry, get the sub entry */
+	if (order) {
+		offset = index - round_down(index, 1 << order);
+		swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+	}
+
 	/* Look it up and read it in.. */
 	folio = swap_cache_get_folio(swap, NULL, 0);
 	if (!folio) {
@@ -2305,8 +2312,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
 		/* Skip swapcache for synchronous device. */
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
-			folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
+			folio = shmem_swap_alloc_folio(inode, vma, index,
+						       index_entry, order, gfp);
 			if (!IS_ERR(folio)) {
+				swap = index_entry;
 				skip_swapcache = true;
 				goto alloced;
 			}
@@ -2320,17 +2329,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			if (error == -EEXIST)
 				goto failed;
 		}
-
-		/*
-		 * Now swap device can only swap in order 0 folio, it is
-		 * necessary to recalculate the new swap entry based on
-		 * the offset, as the swapin index might be unalgined.
-		 */
-		if (order) {
-			offset = index - round_down(index, 1 << order);
-			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
-		}
-
+		/* Cached swapin with readahead, only supports order 0 */
 		folio = shmem_swapin_cluster(swap, gfp, info, index);
 		if (!folio) {
 			error = -ENOMEM;
--
2.50.0
Hi Kairui,

On 2025/7/5 02:17, Kairui Song wrote:
[...]
> +	/* index may point to the middle of a large entry, get the sub entry */
> +	if (order) {
> +		offset = index - round_down(index, 1 << order);
> +		swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> +	}
> +
>  	/* Look it up and read it in.. */
>  	folio = swap_cache_get_folio(swap, NULL, 0);

Please drop this patch, which will cause a swapin fault dead loop.

Assume an order-4 shmem folio has been swapped out, and the swap cache
holds this order-4 folio (assuming index == 0, swap.val == 0x4000).

During swapin, if the index is 1, the recalculation of the swap value
here will result in 'swap.val == 0x4001'. This will cause the
subsequent 'folio->swap.val != swap.val' check to fail, continuously
triggering a dead-loop swapin fault, ultimately causing the CPU to hang.
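
The arithmetic of this scenario, as a stand-alone toy program (not
kernel code; the swap cache keying and the mismatch check are only
modeled here):

#include <stdio.h>

int main(void)
{
	unsigned int order = 4;
	unsigned long index = 1;
	unsigned long folio_swap_val = 0x4000;	/* cached order-4 folio's head */

	/* the recalculation from the patch: offset within the large entry */
	unsigned long offset = index - (index & ~((1UL << order) - 1));
	unsigned long swap_val = folio_swap_val + offset;	/* 0x4001 */

	/* the cache lookup still returns the order-4 folio, but the
	 * 'folio->swap.val != swap.val' check now always fails */
	if (folio_swap_val != swap_val)
		printf("mismatch (0x%lx != 0x%lx) -> endless retry\n",
		       folio_swap_val, swap_val);
	return 0;
}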
On Mon, Jul 7, 2025 at 3:53 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Please drop this patch, which will cause a swapin fault dead loop.
>
> Assume an order-4 shmem folio has been swapped out, and the swap cache
> holds this order-4 folio (assuming index == 0, swap.val == 0x4000).
>
> During swapin, if the index is 1, the recalculation of the swap value
> here will result in 'swap.val == 0x4001'. This will cause the
> subsequent 'folio->swap.val != swap.val' check to fail, continuously
> triggering a dead-loop swapin fault, ultimately causing the CPU to hang.

Oh, thanks for catching that.

Clearly I wasn't thinking carefully enough about this. The problem goes
away if we calculate the `swap.val` based on folio_order rather than
split_order, which is currently done in patch 8.

Previously there were only 4 patches, so I never expected this
problem... I can try to reorganize the patch order. I was hoping they
could be merged as one patch; some designs are supposed to work
together, so splitting the patch may cause intermediate problems like
this.

Perhaps you could help take a look at the later patches, to see if we
can just merge them into one? e.g. merge or move patch 8 into this one,
or maybe I need to move this patch later. The performance / object size
/ stack usage improvements are shown in the commit message.
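
As a hedged sketch of that idea (the helper name below is hypothetical,
not the actual patch 8 code): validating a cached large folio by
rounding the sub entry down by the folio's own order makes the order-4
example above match again.

#include <stdio.h>

/* hypothetical illustration, not the patch 8 implementation */
static int folio_covers_entry(unsigned long folio_swap_val,
			      unsigned int folio_order,
			      unsigned long swap_val)
{
	unsigned long nr = 1UL << folio_order;
	return folio_swap_val == (swap_val & ~(nr - 1));
}

int main(void)
{
	/* order-4 folio cached at 0x4000, fault looks up sub entry 0x4001 */
	printf("%s\n", folio_covers_entry(0x4000, 4, 0x4001) ?
	       "match" : "mismatch");	/* prints "match" */
	return 0;
}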
On 2025/7/7 16:04, Kairui Song wrote:
[...]
> Clearly I wasn't thinking carefully enough about this. The problem goes
> away if we calculate the `swap.val` based on folio_order rather than
> split_order, which is currently done in patch 8.

OK. I saw patch 8. After patch 8, the logic seems correct.

> Previously there were only 4 patches, so I never expected this
> problem... I can try to reorganize the patch order. I was hoping they
> could be merged as one patch; some designs are supposed to work
> together, so splitting the patch may cause intermediate problems like
> this.

Again, please do not combine different changes into one huge patch,
which is _really_ hard to review and discuss. Please split your patches
properly and ensure each patch has been tested.

> Perhaps you could help take a look at the later patches, to see if we
> can just merge them into one? e.g. merge or move patch 8 into this one,
> or maybe I need to move this patch later.

It seems that patch 5 depends on the cleanup in patch 8. If there's no
better way to split them, I suggest merging patch 5 into patch 8.