[RFC PATCH] mm: Drain PCP during direct reclaim

Posted by Wupeng Ma 6 months, 2 weeks ago
Memory retained in Per-CPU Pages (PCP) caches can prevent hugepage
allocations from succeeding despite sufficient free system memory. This
occurs because:
1. Hugepage allocations don't actively trigger PCP draining
2. Direct reclaim path fails to trigger drain_all_pages() when:
   a) All zone pages are free/hugetlb (!did_some_progress)
   b) Compaction skips due to costly order watermarks (COMPACT_SKIPPED)

Reproduction:
  - Allocate a page and free it via put_page() so it is released to the PCP
  - Observe hugepage reservation failure

Solution:
  Actively drain the PCP during direct reclaim for memory allocations.
  This increases the page allocation success rate by making stranded
  pages available to allocations of any order.

Verification:
  This issue can be reproduced easily in a movable zone with the
  following steps:

w/o this patch
  # numactl -m 2 dd if=/dev/urandom of=/dev/shm/testfile bs=4k count=64
  # rm -f /dev/shm/testfile
  # sync
  # echo 3 > /proc/sys/vm/drop_caches
  # echo 2048 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
  # cat /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
    2029

w/ this patch
  # numactl -m 2 dd if=/dev/urandom of=/dev/shm/testfile bs=4k count=64
  # rm -f /dev/shm/testfile
  # sync
  # echo 3 > /proc/sys/vm/drop_caches
  # echo 2048 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
  # cat /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
    2047

Signed-off-by: Wupeng Ma <mawupeng1@huawei.com>
---
 mm/page_alloc.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ef3c07266b3..464f2e48651e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4137,28 +4137,22 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 {
 	struct page *page = NULL;
 	unsigned long pflags;
-	bool drained = false;
 
 	psi_memstall_enter(&pflags);
 	*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
-	if (unlikely(!(*did_some_progress)))
-		goto out;
-
-retry:
-	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+	if (likely(*did_some_progress))
+		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
 	 * pages are pinned on the per-cpu lists or in high alloc reserves.
 	 * Shrink them and try again
 	 */
-	if (!page && !drained) {
+	if (!page) {
 		unreserve_highatomic_pageblock(ac, false);
 		drain_all_pages(NULL);
-		drained = true;
-		goto retry;
+		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 	}
-out:
 	psi_memstall_leave(&pflags);
 
 	return page;
-- 
2.43.0
Re: [RFC PATCH] mm: Drain PCP during direct reclaim
Posted by Raghavendra K T 6 months, 1 week ago
++
On 6/6/2025 12:29 PM, Wupeng Ma wrote:
> Memory retained in Per-CPU Pages (PCP) caches can prevent hugepage
> allocations from succeeding despite sufficient free system memory. This
> occurs because:
> 1. Hugepage allocations don't actively trigger PCP draining
> 2. Direct reclaim path fails to trigger drain_all_pages() when:
>     a) All zone pages are free/hugetlb (!did_some_progress)
>     b) Compaction skips due to costly order watermarks (COMPACT_SKIPPED)
> 
> Reproduction:
>    - Alloc page and free the page via put_page to release to pcp
>    - Observe hugepage reservation failure
> 
> Solution:
>    Actively drain PCP during direct reclaim for memory allocations.
>    This increases page allocation success rate by making stranded pages
>    available to any order allocations.
> 
> Verification:
>    This issue can be reproduce easily in zone movable with the following
>    step:
> 
> w/o this patch
>    # numactl -m 2 dd if=/dev/urandom of=/dev/shm/testfile bs=4k count=64
>    # rm -f /dev/shm/testfile
>    # sync
>    # echo 3 > /proc/sys/vm/drop_caches
>    # echo 2048 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
>    # cat /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
>      2029
> 
> w/ this patch
>    # numactl -m 2 dd if=/dev/urandom of=/dev/shm/testfile bs=4k count=64
>    # rm -f /dev/shm/testfile
>    # sync
>    # echo 3 > /proc/sys/vm/drop_caches
>    # echo 2048 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
>    # cat /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
>      2047
> 

Hello Wupeng Ma,

Can you also post iperf/netperf results for this patch in the future?

Thanks and Regards
- Raghu
Re: [RFC PATCH] mm: Drain PCP during direct reclaim
Posted by Johannes Weiner 6 months, 2 weeks ago
On Fri, Jun 06, 2025 at 02:59:30PM +0800, Wupeng Ma wrote:
> Memory retained in Per-CPU Pages (PCP) caches can prevent hugepage
> allocations from succeeding despite sufficient free system memory. This
> occurs because:
> 1. Hugepage allocations don't actively trigger PCP draining
> 2. Direct reclaim path fails to trigger drain_all_pages() when:
>    a) All zone pages are free/hugetlb (!did_some_progress)
>    b) Compaction skips due to costly order watermarks (COMPACT_SKIPPED)

This doesn't sound quite right. Direct reclaim skips when compaction
is suitable. Compaction says COMPACT_SKIPPED when it *isn't* suitable.

So if direct reclaim didn't drain, presumably compaction ran but
returned COMPLETE or PARTIAL_SKIPPED because the freelist checks in
__compact_finished() never succeed due to the pcp?

> @@ -4137,28 +4137,22 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  {
>  	struct page *page = NULL;
>  	unsigned long pflags;
> -	bool drained = false;
>  
>  	psi_memstall_enter(&pflags);
>  	*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
> -	if (unlikely(!(*did_some_progress)))
> -		goto out;
> -
> -retry:
> -	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> +	if (likely(*did_some_progress))
> +		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  
>  	/*
>  	 * If an allocation failed after direct reclaim, it could be because
>  	 * pages are pinned on the per-cpu lists or in high alloc reserves.
>  	 * Shrink them and try again
>  	 */
> -	if (!page && !drained) {
> +	if (!page) {
>  		unreserve_highatomic_pageblock(ac, false);
>  		drain_all_pages(NULL);
> -		drained = true;
> -		goto retry;
> +		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

This seems like the wrong place to fix the issue.

Kcompactd has a drain_all_pages() call. Move that to compact_zone(),
so that it also applies to the try_to_compact_pages() path?
Re: [RFC PATCH] mm: Drain PCP during direct reclaim
Posted by mawupeng 6 months, 1 week ago

On 2025/6/6 19:19, Johannes Weiner wrote:
> On Fri, Jun 06, 2025 at 02:59:30PM +0800, Wupeng Ma wrote:
>> Memory retained in Per-CPU Pages (PCP) caches can prevent hugepage
>> allocations from succeeding despite sufficient free system memory. This
>> occurs because:
>> 1. Hugepage allocations don't actively trigger PCP draining
>> 2. Direct reclaim path fails to trigger drain_all_pages() when:
>>    a) All zone pages are free/hugetlb (!did_some_progress)
>>    b) Compaction skips due to costly order watermarks (COMPACT_SKIPPED)
> 
> This doesn't sound quite right. Direct reclaim skips when compaction
> is suitable. Compaction says COMPACT_SKIPPED when it *isn't* suitable.
> 
> So if direct reclaim didn't drain, presumably compaction ran but
> returned COMPLETE or PARTIAL_SKIPPED because the freelist checks in
> __compact_finished() never succeed due to the pcp?

Yes, compaction does run. However, since all pages in this movable node
are free or in the PCP, there is no way for compaction to free a page.

> 
>> @@ -4137,28 +4137,22 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>>  {
>>  	struct page *page = NULL;
>>  	unsigned long pflags;
>> -	bool drained = false;
>>  
>>  	psi_memstall_enter(&pflags);
>>  	*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
>> -	if (unlikely(!(*did_some_progress)))
>> -		goto out;
>> -
>> -retry:
>> -	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>> +	if (likely(*did_some_progress))
>> +		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>>  
>>  	/*
>>  	 * If an allocation failed after direct reclaim, it could be because
>>  	 * pages are pinned on the per-cpu lists or in high alloc reserves.
>>  	 * Shrink them and try again
>>  	 */
>> -	if (!page && !drained) {
>> +	if (!page) {
>>  		unreserve_highatomic_pageblock(ac, false);
>>  		drain_all_pages(NULL);
>> -		drained = true;
>> -		goto retry;
>> +		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> 
> This seems like the wrong place to fix the issue.
> 
> Kcompactd has a drain_all_pages() call. Move that to compact_zone(),
> so that it also applies to the try_to_compact_pages() path?

Since no pages are isolated during isolate_migratepages(), isn't it
strange to call drain_all_pages() there?