From: Kairui Song <kasong@tencent.com>
There are some problems with the code implementation of the THP fallback.
suitable_orders could be zero, and calling highest_order() on a zero value
returns an invalid order. The order check loop also updates the index on
every iteration, which may leave the index aligned to a larger value while
the loop shrinks the order. It also forgot to try order 0 after the final
loop.

This is usually fine because shmem_add_to_page_cache() ensures the shmem
mapping is still sane, but it might cause many potential issues, like
allocating folios at random positions in the mapping or returning -ENOMEM
by accident. This triggered some strange userspace errors [1], and
shouldn't have happened in the first place.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/ [1]
Fixes: e7a2ab7b3bb5d ("mm: shmem: add mTHP support for anonymous shmem")
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index b50ce7dbc84a..25303711f123 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1824,6 +1824,9 @@ static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault
         unsigned long pages;
         int order;
 
+        if (!orders)
+                return 0;
+
         if (vma) {
                 orders = thp_vma_suitable_orders(vma, vmf->address, orders);
                 if (!orders)
@@ -1888,27 +1891,28 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
         if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
                 orders = 0;
 
-        if (orders > 0) {
-                suitable_orders = shmem_suitable_orders(inode, vmf,
-                                                        mapping, index, orders);
+        suitable_orders = shmem_suitable_orders(inode, vmf,
+                                                mapping, index, orders);
 
+        if (suitable_orders) {
                 order = highest_order(suitable_orders);
-                while (suitable_orders) {
+                do {
                         pages = 1UL << order;
-                        index = round_down(index, pages);
-                        folio = shmem_alloc_folio(gfp, order, info, index);
-                        if (folio)
+                        folio = shmem_alloc_folio(gfp, order, info, round_down(index, pages));
+                        if (folio) {
+                                index = round_down(index, pages);
                                 goto allocated;
+                        }
 
                         if (pages == HPAGE_PMD_NR)
                                 count_vm_event(THP_FILE_FALLBACK);
                         count_mthp_stat(order, MTHP_STAT_SHMEM_FALLBACK);
                         order = next_order(&suitable_orders, order);
-                }
-        } else {
-                pages = 1;
-                folio = shmem_alloc_folio(gfp, 0, info, index);
+                } while (suitable_orders);
         }
+
+        pages = 1;
+        folio = shmem_alloc_folio(gfp, 0, info, index);
         if (!folio)
                 return ERR_PTR(-ENOMEM);
 
--
2.51.0
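
To make the index-alignment problem described in the commit message concrete, here is a
minimal userspace sketch; it is not kernel code, round_down() is a simplified stand-in,
and the fault index and fallback orders are made-up values. It walks the orders the way
the current loop does, rounding the index down on every iteration, and shows that once a
large-order attempt has clamped the index, a later small-order attempt lands at a
position that no longer covers the faulting index:

#include <stdio.h>

/* Userspace stand-in for the kernel's round_down() (power-of-two sizes only). */
#define round_down(x, y)  ((x) & ~((unsigned long)(y) - 1))

int main(void)
{
	const unsigned long fault_index = 20;	/* page index the fault is for */
	const int orders[] = { 4, 2, 0 };	/* shrinking fallback orders */
	unsigned long index = fault_index;
	unsigned long pages = 1;

	/* Buggy loop shape: index is rounded down on every iteration. */
	for (int i = 0; i < 3; i++) {
		pages = 1UL << orders[i];
		index = round_down(index, pages);
		printf("order %d: would insert folio at [%lu, %lu)\n",
		       orders[i], index, index + pages);
	}

	/* Suppose only the final (order 0) attempt succeeded. */
	if (fault_index < index || fault_index >= index + pages)
		printf("bug: folio [%lu, %lu) does not cover fault index %lu\n",
		       index, index + pages, fault_index);
	return 0;
}

With fault_index = 20 and orders 4/2/0, the failed order-4 attempt clamps the index to
16, so the later attempts try [16, 20) and [16, 17), neither of which contains page 20.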
On 2025/10/22 03:04, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> There are some problems with the code implementation of the THP fallback.
> suitable_orders could be zero, and calling highest_order() on a zero value
> returns an invalid order. The order check loop also updates the index on
> every iteration, which may leave the index aligned to a larger value while
> the loop shrinks the order.

No, this is not true. Although 'suitable_orders' might be 0, it will not
enter the 'while (suitable_orders)' loop, and 'order' will not be used
(this is also how the highest_order() function is used in other places).

> It also forgot to try order 0 after the final loop.

This is also not true. We will fall back to an order-0 allocation in
shmem_get_folio_gfp() if the large order allocation fails.

> This is usually fine because shmem_add_to_page_cache() ensures the shmem
> mapping is still sane, but it might cause many potential issues, like
> allocating folios at random positions in the mapping or returning -ENOMEM
> by accident. This triggered some strange userspace errors [1], and
> shouldn't have happened in the first place.

I tested tmpfs with swap on ZRAM in the mm-new branch, and no issues
were encountered. However, it is worth mentioning that, after commit
69e0a3b49003 ("mm: shmem: fix the strategy for the tmpfs 'huge='
options"), tmpfs may consume more memory (because we no longer allocate
large folios based on the write size), which might cause your test case
to run out of memory (OOM) and trigger some errors. You may need to
adjust your swap size or memcg limit.
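
For reference, a small userspace sketch of the order-walk helpers being discussed; the
fls_long()-based definitions below are assumed stand-ins for the kernel's
highest_order()/next_order(), not copies of the headers. It shows that highest_order(0)
evaluates to -1, which is harmless as long as the while (suitable_orders) guard keeps
the value from ever being used; the if (!orders) check added by the patch only avoids
the redundant call.

#include <stdio.h>

/* Userspace stand-ins for the kernel helpers (assumed fls_long()-based). */
static int fls_long(unsigned long x)
{
	return x ? (int)(8 * sizeof(unsigned long)) - __builtin_clzl(x) : 0;
}

static int highest_order(unsigned long orders)
{
	return fls_long(orders) - 1;
}

static int next_order(unsigned long *orders, int prev)
{
	*orders &= ~(1UL << prev);
	return highest_order(*orders);
}

int main(void)
{
	unsigned long suitable_orders = 0;
	int order = highest_order(suitable_orders);

	/* With a zero mask the result is -1 ... */
	printf("highest_order(0) = %d\n", order);

	/* ... but the loop guard means it is never used as a shift count. */
	while (suitable_orders) {
		printf("trying order %d\n", order);
		order = next_order(&suitable_orders, order);
	}

	/* A non-zero mask is walked from the highest set bit downwards. */
	suitable_orders = 0x16;	/* bits 4, 2, 1 */
	order = highest_order(suitable_orders);
	while (suitable_orders) {
		printf("trying order %d\n", order);
		order = next_order(&suitable_orders, order);
	}
	return 0;
}

Running it prints -1 for the zero mask and then walks the non-zero mask as 4, 2, 1.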
On Wed, Oct 22, 2025 at 9:25 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> On 2025/10/22 03:04, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > There are some problems with the code implementation of the THP fallback.
> > suitable_orders could be zero, and calling highest_order() on a zero value
> > returns an invalid order. The order check loop also updates the index on
> > every iteration, which may leave the index aligned to a larger value while
> > the loop shrinks the order.
>
> No, this is not true. Although 'suitable_orders' might be 0, it will not
> enter the 'while (suitable_orders)' loop, and 'order' will not be used
> (this is also how the highest_order() function is used in other places).

Maybe I shouldn't have mixed the trivial issue with the real issue here;
sorry, my bad, I was in a hurry :P. I mean that if suitable_orders is zero
we should just skip calling highest_order(), since that returns a negative
value. It's not causing an issue, it's just redundant.

> > It also forgot to try order 0 after the final loop.
>
> This is also not true. We will fall back to an order-0 allocation in
> shmem_get_folio_gfp() if the large order allocation fails.

I thought that after the fix we could simplify the code, and maybe reduce
the calls to shmem_alloc_and_add_folio to only one so it will be inlined
by the compiler. On second thought, some more changes are needed to
respect the huge_gfp. Maybe I should send a series that splits the hot
fix from the cleanups.

But the index being modified during the loop does need a fix, I think,
so for the fix part we just need:

diff --git a/mm/shmem.c b/mm/shmem.c
index 29e1eb690125..e89ae4dd6859 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1895,10 +1895,11 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
                 order = highest_order(suitable_orders);
                 while (suitable_orders) {
                         pages = 1UL << order;
-                        index = round_down(index, pages);
-                        folio = shmem_alloc_folio(gfp, order, info, index);
-                        if (folio)
+                        folio = shmem_alloc_folio(gfp, order, info, round_down(index, pages));
+                        if (folio) {
+                                index = round_down(index, pages);
                                 goto allocated;
+                        }
 
                         if (pages == HPAGE_PMD_NR)
                                 count_vm_event(THP_FILE_FALLBACK);

> > This is usually fine because shmem_add_to_page_cache() ensures the shmem
> > mapping is still sane, but it might cause many potential issues, like
> > allocating folios at random positions in the mapping or returning -ENOMEM
> > by accident. This triggered some strange userspace errors [1], and
> > shouldn't have happened in the first place.
>
> I tested tmpfs with swap on ZRAM in the mm-new branch, and no issues
> were encountered. However, it is worth mentioning that, after commit
> 69e0a3b49003 ("mm: shmem: fix the strategy for the tmpfs 'huge='
> options"), tmpfs may consume more memory (because we no longer allocate
> large folios based on the write size), which might cause your test case
> to run out of memory (OOM) and trigger some errors. You may need to
> adjust your swap size or memcg limit.

I'm not seeing any OOM issue, and I tested with different memory sizes;
the error occurs with every setup. If I compare the built objects with
the ones from before 69e0a3b49003, I can see that the build output is
corrupted. Compare using hexdump & diff on any .o generated by gcc:

--- GOOD    2025-10-21 12:17:44.121000287 +0000
+++ BAD     2025-10-21 12:18:01.094000870 +0000
@@ -3371,425 +3371,7 @@
...
<summary: whole chunk of data & symbols missing>
...
The GCC compile didn't fail because the problem occurs when writing the
objects; instead LD failed, and not due to OOM. The reproduction rate is
very high with different setups; I'm not sure why it didn't occur on your
setup. I suspect it's related to how different gcc versions handle write
(not fault)? With 69e0a3b49003 reverted, the build is OK. 69e0a3b49003
isn't the root cause, but it triggered this. The minimized fix posted
above also fixes the problem just fine.
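
As a counterpart to the earlier sketch, here is the same made-up scenario with the loop
shape of the minimized fix: the index is only rounded down once an allocation actually
succeeds. alloc_succeeds() is a fake policy (only order 2 succeeds) used purely for
illustration; the point is that the successful folio is placed at
round_down(fault_index, pages) and therefore always covers the faulting index.

#include <stdio.h>
#include <stdbool.h>

#define round_down(x, y)  ((x) & ~((unsigned long)(y) - 1))

/* Hypothetical allocation policy: pretend only the order-2 attempt succeeds. */
static bool alloc_succeeds(int order)
{
	return order == 2;
}

int main(void)
{
	const unsigned long fault_index = 20;
	const int orders[] = { 4, 2, 0 };
	unsigned long index = fault_index;

	/* Fixed loop shape: index is only updated once an allocation succeeds. */
	for (int i = 0; i < 3; i++) {
		unsigned long pages = 1UL << orders[i];
		unsigned long aligned = round_down(index, pages);

		if (alloc_succeeds(orders[i])) {
			index = aligned;
			printf("order %d succeeded: folio [%lu, %lu) covers fault index %lu\n",
			       orders[i], index, index + pages, fault_index);
			break;
		}
		printf("order %d failed, falling back\n", orders[i]);
	}
	return 0;
}

Here the order-4 attempt fails and the order-2 attempt places the folio at [20, 24),
which contains the faulting page 20.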