[PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number

Vernon Yang posted 5 patches 1 week ago
There is a newer version of this series
[PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by Vernon Yang 1 week ago
From: Vernon Yang <yanglincheng@kylinos.cn>

Currently, each scan always increases "progress" by HPAGE_PMD_NR,
even if only scanning a single PTE/PMD entry.

- When only scanning a sigle PTE entry, let me provide a detailed
  example:

static int hpage_collapse_scan_pmd()
{
	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
	     _pte++, addr += PAGE_SIZE) {
		pte_t pteval = ptep_get(_pte);
		...
		if (pte_uffd_wp(pteval)) { <-- first scan hit
			result = SCAN_PTE_UFFD_WP;
			goto out_unmap;
		}
	}
}

During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
directly. In practice, only one PTE is scanned before termination.
Here, "progress += 1" reflects the actual number of PTEs scanned, but
previously "progress += HPAGE_PMD_NR" always.

- When the memory has been collapsed to PMD, let me provide a detailed
  example:

The following data is traced by bpftrace on a desktop system. After
the system has been left idle for 10 minutes upon booting, a lot of
SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
by khugepaged.

From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
following statuses were observed, with frequency mentioned next to them:

SCAN_SUCCEED          : 1
SCAN_EXCEED_SHARED_PTE: 2
SCAN_PMD_MAPPED       : 142
SCAN_NO_PTE_TABLE     : 178
total progress size   : 674 MB
Total time            : 419 seconds, include khugepaged_scan_sleep_millisecs

The khugepaged_scan list save all task that support collapse into hugepage,
as long as the task is not destroyed, khugepaged will not remove it from
the khugepaged_scan list. This exist a phenomenon where task has already
collapsed all memory regions into hugepage, but khugepaged continues to
scan it, which wastes CPU time and invalid, and due to
khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
scanning a large number of invalid task, so scanning really valid task
is later.

After applying this patch, when the memory is either SCAN_PMD_MAPPED or
SCAN_NO_PTE_TABLE, just skip it, as follow:

SCAN_EXCEED_SHARED_PTE: 2
SCAN_PMD_MAPPED       : 147
SCAN_NO_PTE_TABLE     : 173
total progress size   : 45 MB
Total time            : 20 seconds

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 mm/khugepaged.c | 41 +++++++++++++++++++++++++++++++----------
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d94b34e10bdf..df22b2274d92 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -68,7 +68,10 @@ enum scan_result {
 static struct task_struct *khugepaged_thread __read_mostly;
 static DEFINE_MUTEX(khugepaged_mutex);
 
-/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
+/*
+ * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
+ * every 10 second.
+ */
 static unsigned int khugepaged_pages_to_scan __read_mostly;
 static unsigned int khugepaged_pages_collapsed;
 static unsigned int khugepaged_full_scans;
@@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 }
 
 static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
+		struct vm_area_struct *vma, unsigned long start_addr,
+		bool *mmap_locked, unsigned int *cur_progress,
 		struct collapse_control *cc)
 {
 	pmd_t *pmd;
@@ -1256,13 +1260,18 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
 	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
 
 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
-	if (result != SCAN_SUCCEED)
+	if (result != SCAN_SUCCEED) {
+		if (cur_progress)
+			*cur_progress = 1;
 		goto out;
+	}
 
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
 	if (!pte) {
+		if (cur_progress)
+			*cur_progress = 1;
 		result = SCAN_NO_PTE_TABLE;
 		goto out;
 	}
@@ -1396,6 +1405,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
 		result = SCAN_SUCCEED;
 	}
 out_unmap:
+	if (cur_progress) {
+		if (_pte >= pte + HPAGE_PMD_NR)
+			*cur_progress = HPAGE_PMD_NR;
+		else
+			*cur_progress = _pte - pte + 1;
+	}
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
 		result = collapse_huge_page(mm, start_addr, referenced,
@@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
-		struct file *file, pgoff_t start, struct collapse_control *cc)
+static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
+		unsigned long addr, struct file *file, pgoff_t start,
+		unsigned int *cur_progress, struct collapse_control *cc)
 {
 	struct folio *folio = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
 			cond_resched_rcu();
 		}
 	}
+	if (cur_progress)
+		*cur_progress = max(xas.xa_index - start, 1UL);
 	rcu_read_unlock();
 
 	if (result == SCAN_SUCCEED) {
@@ -2455,6 +2473,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 
 		while (khugepaged_scan.address < hend) {
 			bool mmap_locked = true;
+			unsigned int cur_progress = 0;
 
 			cond_resched();
 			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
@@ -2471,7 +2490,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 				mmap_read_unlock(mm);
 				mmap_locked = false;
 				*result = hpage_collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
+					khugepaged_scan.address, file, pgoff,
+					&cur_progress, cc);
 				fput(file);
 				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
 					mmap_read_lock(mm);
@@ -2485,7 +2505,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 				}
 			} else {
 				*result = hpage_collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
+					khugepaged_scan.address, &mmap_locked,
+					&cur_progress, cc);
 			}
 
 			if (*result == SCAN_SUCCEED)
@@ -2493,7 +2514,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
-			progress += HPAGE_PMD_NR;
+			progress += cur_progress;
 			if (!mmap_locked)
 				/*
 				 * We released mmap_lock so break loop.  Note
@@ -2816,7 +2837,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			mmap_locked = false;
 			*lock_dropped = true;
 			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
-							  cc);
+							  NULL, cc);
 
 			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
 			    mapping_can_writeback(file->f_mapping)) {
@@ -2831,7 +2852,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			fput(file);
 		} else {
 			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
+							 &mmap_locked, NULL, cc);
 		}
 		if (!mmap_locked)
 			*lock_dropped = true;
-- 
2.51.0
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by David Hildenbrand (arm) 3 days, 22 hours ago
[...]

> +	if (cur_progress) {
> +		if (_pte >= pte + HPAGE_PMD_NR)
> +			*cur_progress = HPAGE_PMD_NR;
> +		else
> +			*cur_progress = _pte - pte + 1;

*cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);

?

It's still a bit nasty, though.

Can't we just add one at the beginning of the loop and let the compiler
optimize that? ;)

> +	}
>   	pte_unmap_unlock(pte, ptl);
>   	if (result == SCAN_SUCCEED) {
>   		result = collapse_huge_page(mm, start_addr, referenced,
> @@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>   	return result;
>   }
>   
> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> -		struct file *file, pgoff_t start, struct collapse_control *cc)
> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> +		unsigned long addr, struct file *file, pgoff_t start,
> +		unsigned int *cur_progress, struct collapse_control *cc)
>   {
>   	struct folio *folio = NULL;
>   	struct address_space *mapping = file->f_mapping;
> @@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
>   			cond_resched_rcu();
>   		}
>   	}
> +	if (cur_progress)
> +		*cur_progress = max(xas.xa_index - start, 1UL);
I would really just keep it simple here and do a

*cur_progress = HPAGE_PMD_NR;

This stuff is hard to reason about, so I would just leave the file case 
essentially unchanged.

IIRC, it would not affect the numbers you report in the patch description?

-- 
Cheers,

David
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by Vernon Yang 3 days, 13 hours ago
On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>
> [...]
>
> > +     if (cur_progress) {
> > +             if (_pte >= pte + HPAGE_PMD_NR)
> > +                     *cur_progress = HPAGE_PMD_NR;
> > +             else
> > +                     *cur_progress = _pte - pte + 1;
>
> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);

I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().

> ?
>
> It's still a bit nasty, though.
>
> Can't we just add one at the beginning of the loop and let the compiler
> optimize that? ;)

I'm also worried that the compiler can't optimize this since the body of
the loop is complex, as with Dev's opinion [1].

[1] https://lore.kernel.org/linux-mm/7c4b5933-7bbd-4ad7-baef-830304a09485@arm.com

If you have a strong recommendation for this, please let me know, Thanks!

> > +     }
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> >               result = collapse_huge_page(mm, start_addr, referenced,
> > @@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> >       return result;
> >   }
> >
> > -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > -             struct file *file, pgoff_t start, struct collapse_control *cc)
> > +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> > +             unsigned long addr, struct file *file, pgoff_t start,
> > +             unsigned int *cur_progress, struct collapse_control *cc)
> >   {
> >       struct folio *folio = NULL;
> >       struct address_space *mapping = file->f_mapping;
> > @@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
> >                       cond_resched_rcu();
> >               }
> >       }
> > +     if (cur_progress)
> > +             *cur_progress = max(xas.xa_index - start, 1UL);
> I would really just keep it simple here and do a
>
> *cur_progress = HPAGE_PMD_NR;
>
> This stuff is hard to reason about, so I would just leave the file case
> essentially unchanged.
>
> IIRC, it would not affect the numbers you report in the patch description?

Yes, Let's keep it simple, always equal to HPAGE_PMD_NR in file case.

--
Thanks,
Vernon
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by David Hildenbrand (arm) 3 days, 7 hours ago
On 2/5/26 07:08, Vernon Yang wrote:
> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>>
>> [...]
>>
>>> +     if (cur_progress) {
>>> +             if (_pte >= pte + HPAGE_PMD_NR)
>>> +                     *cur_progress = HPAGE_PMD_NR;
>>> +             else
>>> +                     *cur_progress = _pte - pte + 1;
>>
>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
> 
> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().

Yes!

> 
>> ?
>>
>> It's still a bit nasty, though.
>>
>> Can't we just add one at the beginning of the loop and let the compiler
>> optimize that? ;)
> 
> I'm also worried that the compiler can't optimize this since the body of
> the loop is complex, as with Dev's opinion [1].

Why do we even have to optimize this? :)

Premature ... ? :)

-- 
Cheers,

David
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by Dev Jain 3 days, 5 hours ago
On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
> On 2/5/26 07:08, Vernon Yang wrote:
>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>> <david@kernel.org> wrote:
>>>
>>> [...]
>>>
>>>> +     if (cur_progress) {
>>>> +             if (_pte >= pte + HPAGE_PMD_NR)
>>>> +                     *cur_progress = HPAGE_PMD_NR;
>>>> +             else
>>>> +                     *cur_progress = _pte - pte + 1;
>>>
>>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>>
>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>
> Yes!
>
>>
>>> ?
>>>
>>> It's still a bit nasty, though.
>>>
>>> Can't we just add one at the beginning of the loop and let the compiler
>>> optimize that? ;)
>>
>> I'm also worried that the compiler can't optimize this since the body of
>> the loop is complex, as with Dev's opinion [1].
>
> Why do we even have to optimize this? :)
>
> Premature ... ? :)


I mean .... we don't, but the alternate is a one liner using max().

The objective is to compute the number of iterations of the for-loop.

It just seems weird to me to track that in the loop, when we have the

loop iterator, which *literally* does that only.



Anyhow, I won't shout in any case : ) If you deem incrementing in the

loop prettier, that's fine.

Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by David Hildenbrand (Arm) 2 days, 10 hours ago
On 2/5/26 15:25, Dev Jain wrote:
> 
> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>> On 2/5/26 07:08, Vernon Yang wrote:
>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>> <david@kernel.org> wrote:
>>>
>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>
>> Yes!
>>
>>>
>>>
>>> I'm also worried that the compiler can't optimize this since the body of
>>> the loop is complex, as with Dev's opinion [1].
>>
>> Why do we even have to optimize this? :)
>>
>> Premature ... ? :)
> 
> 
> I mean .... we don't, but the alternate is a one liner using max().

I'm fine with the max(), but it still seems like adding complexity to 
optimize something that is nowhere prove to really be a problem.

-- 
Cheers,

David
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by Vernon Yang 2 days, 8 hours ago
On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
> On 2/5/26 15:25, Dev Jain wrote:
> >
> > On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
> > > On 2/5/26 07:08, Vernon Yang wrote:
> > > > On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
> > > > <david@kernel.org> wrote:
> > > >
> > > > I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
> > >
> > > Yes!
> > >
> > > >
> > > >
> > > > I'm also worried that the compiler can't optimize this since the body of
> > > > the loop is complex, as with Dev's opinion [1].
> > >
> > > Why do we even have to optimize this? :)
> > >
> > > Premature ... ? :)
> >
> >
> > I mean .... we don't, but the alternate is a one liner using max().
>
> I'm fine with the max(), but it still seems like adding complexity to
> optimize something that is nowhere prove to really be a problem.

Hi David, Dev,

I use "*cur_progress += 1" at the beginning of the loop, the compiler
optimize that. Assembly as follows:

60c1:	4d 29 ca        sub    %r9,%r10		// r10 is _pte, r9 is pte, r10 = _pte - pte
60c4:	b8 00 02 00 00  mov    $0x200,%eax	// eax = HPAGE_PMD_NR
60c9:	44 89 5c 24 10  mov    %r11d,0x10(%rsp)	//
60ce:	49 c1 fa 03     sar    $0x3,%r10	//
60d2:	49 83 c2 01     add    $0x1,%r10	// r10 += 1
60d6:	49 39 c2        cmp    %rax,%r10	// r10 = min(r10, eax)
60d9:	4c 0f 4f d0     cmovg  %rax,%r10	//
60dd:	44 89 55 00     mov    %r10d,0x0(%rbp)	// *cur_progress = r10

To make the code simpler, Let us use "*cur_progress += 1".

--
Thanks,
Vernon
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by Dev Jain 10 hours ago
On 06/02/26 4:42 pm, Vernon Yang wrote:
> On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
>> On 2/5/26 15:25, Dev Jain wrote:
>>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>>> On 2/5/26 07:08, Vernon Yang wrote:
>>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>>>> <david@kernel.org> wrote:
>>>>>
>>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>>> Yes!
>>>>
>>>>>
>>>>> I'm also worried that the compiler can't optimize this since the body of
>>>>> the loop is complex, as with Dev's opinion [1].
>>>> Why do we even have to optimize this? :)
>>>>
>>>> Premature ... ? :)
>>>
>>> I mean .... we don't, but the alternate is a one liner using max().
>> I'm fine with the max(), but it still seems like adding complexity to
>> optimize something that is nowhere prove to really be a problem.
> Hi David, Dev,
>
> I use "*cur_progress += 1" at the beginning of the loop, the compiler
> optimize that. Assembly as follows:
>
> 60c1:	4d 29 ca        sub    %r9,%r10		// r10 is _pte, r9 is pte, r10 = _pte - pte
> 60c4:	b8 00 02 00 00  mov    $0x200,%eax	// eax = HPAGE_PMD_NR
> 60c9:	44 89 5c 24 10  mov    %r11d,0x10(%rsp)	//
> 60ce:	49 c1 fa 03     sar    $0x3,%r10	//
> 60d2:	49 83 c2 01     add    $0x1,%r10	// r10 += 1
> 60d6:	49 39 c2        cmp    %rax,%r10	// r10 = min(r10, eax)
> 60d9:	4c 0f 4f d0     cmovg  %rax,%r10	//
> 60dd:	44 89 55 00     mov    %r10d,0x0(%rbp)	// *cur_progress = r10
>
> To make the code simpler, Let us use "*cur_progress += 1".

Wow! Wasn't expecting that. What's your gcc version? I checked with
gcc 11.4.0 (looks pretty old) with both x86 and arm64, and it couldn't
optimize.

> --
> Thanks,
> Vernon
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by Vernon Yang 6 hours ago
On Sun, Feb 8, 2026 at 5:05 PM Dev Jain <dev.jain@arm.com> wrote:
>
> On 06/02/26 4:42 pm, Vernon Yang wrote:
> > On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
> >> On 2/5/26 15:25, Dev Jain wrote:
> >>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
> >>>> On 2/5/26 07:08, Vernon Yang wrote:
> >>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
> >>>>> <david@kernel.org> wrote:
> >>>>>
> >>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
> >>>> Yes!
> >>>>
> >>>>>
> >>>>> I'm also worried that the compiler can't optimize this since the body of
> >>>>> the loop is complex, as with Dev's opinion [1].
> >>>> Why do we even have to optimize this? :)
> >>>>
> >>>> Premature ... ? :)
> >>>
> >>> I mean .... we don't, but the alternate is a one liner using max().
> >> I'm fine with the max(), but it still seems like adding complexity to
> >> optimize something that is nowhere prove to really be a problem.
> > Hi David, Dev,
> >
> > I use "*cur_progress += 1" at the beginning of the loop, the compiler
> > optimize that. Assembly as follows:
> >
> > 60c1: 4d 29 ca        sub    %r9,%r10         // r10 is _pte, r9 is pte, r10 = _pte - pte
> > 60c4: b8 00 02 00 00  mov    $0x200,%eax      // eax = HPAGE_PMD_NR
> > 60c9: 44 89 5c 24 10  mov    %r11d,0x10(%rsp) //
> > 60ce: 49 c1 fa 03     sar    $0x3,%r10        //
> > 60d2: 49 83 c2 01     add    $0x1,%r10        // r10 += 1
> > 60d6: 49 39 c2        cmp    %rax,%r10        // r10 = min(r10, eax)
> > 60d9: 4c 0f 4f d0     cmovg  %rax,%r10        //
> > 60dd: 44 89 55 00     mov    %r10d,0x0(%rbp)  // *cur_progress = r10
> >
> > To make the code simpler, Let us use "*cur_progress += 1".
>
> Wow! Wasn't expecting that. What's your gcc version? I checked with
> gcc 11.4.0 (looks pretty old) with both x86 and arm64, and it couldn't
> optimize.

$ gcc --version
gcc (GCC) 15.2.1 20250808 (Red Hat 15.2.1-1)

Above is my gcc version. However, I performed the assembly again without
any optimization :(

I suspect that I might have messed up the environment earlier, failing to
compile the newly modified code successfully, which resulted is assembly
old_khugepaged.o.
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by Lance Yang 10 hours ago

On 2026/2/8 17:05, Dev Jain wrote:
> 
> On 06/02/26 4:42 pm, Vernon Yang wrote:
>> On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
>>> On 2/5/26 15:25, Dev Jain wrote:
>>>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>>>> On 2/5/26 07:08, Vernon Yang wrote:
>>>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>>>>> <david@kernel.org> wrote:
>>>>>>
>>>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>>>> Yes!
>>>>>
>>>>>>
>>>>>> I'm also worried that the compiler can't optimize this since the body of
>>>>>> the loop is complex, as with Dev's opinion [1].
>>>>> Why do we even have to optimize this? :)
>>>>>
>>>>> Premature ... ? :)
>>>>
>>>> I mean .... we don't, but the alternate is a one liner using max().
>>> I'm fine with the max(), but it still seems like adding complexity to
>>> optimize something that is nowhere prove to really be a problem.
>> Hi David, Dev,
>>
>> I use "*cur_progress += 1" at the beginning of the loop, the compiler
>> optimize that. Assembly as follows:
>>
>> 60c1:	4d 29 ca        sub    %r9,%r10		// r10 is _pte, r9 is pte, r10 = _pte - pte
>> 60c4:	b8 00 02 00 00  mov    $0x200,%eax	// eax = HPAGE_PMD_NR
>> 60c9:	44 89 5c 24 10  mov    %r11d,0x10(%rsp)	//
>> 60ce:	49 c1 fa 03     sar    $0x3,%r10	//
>> 60d2:	49 83 c2 01     add    $0x1,%r10	// r10 += 1
>> 60d6:	49 39 c2        cmp    %rax,%r10	// r10 = min(r10, eax)
>> 60d9:	4c 0f 4f d0     cmovg  %rax,%r10	//
>> 60dd:	44 89 55 00     mov    %r10d,0x0(%rbp)	// *cur_progress = r10
>>
>> To make the code simpler, Let us use "*cur_progress += 1".
> 
> Wow! Wasn't expecting that. What's your gcc version? I checked with
> gcc 11.4.0 (looks pretty old) with both x86 and arm64, and it couldn't
> optimize.

FWIW, 11.4.0 is newer that the minimum GCC version (8.1) required by
kernel. See Documentation/process/changes.rst

The optimization might just be version-dependent :)
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by Lance Yang 2 days, 5 hours ago

On 2026/2/6 19:12, Vernon Yang wrote:
> On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
>> On 2/5/26 15:25, Dev Jain wrote:
>>>
>>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>>> On 2/5/26 07:08, Vernon Yang wrote:
>>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>>>> <david@kernel.org> wrote:
>>>>>
>>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>>>
>>>> Yes!
>>>>
>>>>>
>>>>>
>>>>> I'm also worried that the compiler can't optimize this since the body of
>>>>> the loop is complex, as with Dev's opinion [1].
>>>>
>>>> Why do we even have to optimize this? :)
>>>>
>>>> Premature ... ? :)
>>>
>>>
>>> I mean .... we don't, but the alternate is a one liner using max().
>>
>> I'm fine with the max(), but it still seems like adding complexity to
>> optimize something that is nowhere prove to really be a problem.
> 
> Hi David, Dev,
> 
> I use "*cur_progress += 1" at the beginning of the loop, the compiler
> optimize that. Assembly as follows:
> 
> 60c1:	4d 29 ca        sub    %r9,%r10		// r10 is _pte, r9 is pte, r10 = _pte - pte
> 60c4:	b8 00 02 00 00  mov    $0x200,%eax	// eax = HPAGE_PMD_NR
> 60c9:	44 89 5c 24 10  mov    %r11d,0x10(%rsp)	//
> 60ce:	49 c1 fa 03     sar    $0x3,%r10	//
> 60d2:	49 83 c2 01     add    $0x1,%r10	// r10 += 1
> 60d6:	49 39 c2        cmp    %rax,%r10	// r10 = min(r10, eax)
> 60d9:	4c 0f 4f d0     cmovg  %rax,%r10	//
> 60dd:	44 89 55 00     mov    %r10d,0x0(%rbp)	// *cur_progress = r10
> 
> To make the code simpler, Let us use "*cur_progress += 1".

Cool! Compiler did the right thing and the heavy lifting after all - we get
to keep it simple :p

Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by Dev Jain 2 days, 9 hours ago
On 06/02/26 2:32 pm, David Hildenbrand (Arm) wrote:
> On 2/5/26 15:25, Dev Jain wrote:
>>
>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>> On 2/5/26 07:08, Vernon Yang wrote:
>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>>> <david@kernel.org> wrote:
>>>>
>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>>
>>> Yes!
>>>
>>>>
>>>>
>>>> I'm also worried that the compiler can't optimize this since the body of
>>>> the loop is complex, as with Dev's opinion [1].
>>>
>>> Why do we even have to optimize this? :)
>>>
>>> Premature ... ? :)
>>
>>
>> I mean .... we don't, but the alternate is a one liner using max().
>
> I'm fine with the max(), but it still seems like adding complexity to
> optimize something that is nowhere prove to really be a problem. 

Agreed. Vernon, let us do the increment in the loop then.
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by David Hildenbrand (Arm) 2 days, 8 hours ago
On 2/6/26 11:00, Dev Jain wrote:
> 
> On 06/02/26 2:32 pm, David Hildenbrand (Arm) wrote:
>> On 2/5/26 15:25, Dev Jain wrote:
>>>
>>>
>>>
>>> I mean .... we don't, but the alternate is a one liner using max().
>>
>> I'm fine with the max(), but it still seems like adding complexity to
>> optimize something that is nowhere prove to really be a problem.
> 
> Agreed. Vernon, let us do the increment in the loop then.

I'm fine with the min(), so if you both think it's better, let's do that!

It makes it slightly harder to understand what's happening, but 
fortunately, if we mess up slightly nobody will really notice :)

-- 
Cheers,

David
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by Dev Jain 3 days, 5 hours ago
On 05/02/26 7:55 pm, Dev Jain wrote:
> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>> On 2/5/26 07:08, Vernon Yang wrote:
>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>> <david@kernel.org> wrote:
>>>> [...]
>>>>
>>>>> +     if (cur_progress) {
>>>>> +             if (_pte >= pte + HPAGE_PMD_NR)
>>>>> +                     *cur_progress = HPAGE_PMD_NR;
>>>>> +             else
>>>>> +                     *cur_progress = _pte - pte + 1;
>>>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>> Yes!
>>
>>>> ?
>>>>
>>>> It's still a bit nasty, though.
>>>>
>>>> Can't we just add one at the beginning of the loop and let the compiler
>>>> optimize that? ;)
>>> I'm also worried that the compiler can't optimize this since the body of
>>> the loop is complex, as with Dev's opinion [1].
>> Why do we even have to optimize this? :)
>>
>> Premature ... ? :)
>
> I mean .... we don't, but the alternate is a one liner using max().
>
> The objective is to compute the number of iterations of the for-loop.
>
> It just seems weird to me to track that in the loop, when we have the
>
> loop iterator, which *literally* does that only.

I realize I shouldn't have bolded out the "literally" - below I wrote that
I won't shout, but the bold seems like shouting :)

>
>
>
> Anyhow, I won't shout in any case : ) If you deem incrementing in the
>
> loop prettier, that's fine.
>
>
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by David Hildenbrand (Arm) 2 days, 10 hours ago
On 2/5/26 15:30, Dev Jain wrote:
> 
> On 05/02/26 7:55 pm, Dev Jain wrote:
>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>> Yes!
>>>
>>> Why do we even have to optimize this? :)
>>>
>>> Premature ... ? :)
>>
>> I mean .... we don't, but the alternate is a one liner using max().
>>
>> The objective is to compute the number of iterations of the for-loop.
>>
>> It just seems weird to me to track that in the loop, when we have the
>>
>> loop iterator, which *literally* does that only.
> 
> I realize I shouldn't have bolded out the "literally" - below I wrote that
> I won't shout, but the bold seems like shouting :)

Heh.

The thing is that the loop iterator does not quite what we want, 
otherwise we wouldn't have to mess with max() etc.

-- 
Cheers,

David
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by Dev Jain 3 days, 7 hours ago
On 05/02/26 11:38 am, Vernon Yang wrote:
> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>> [...]
>>
>>> +     if (cur_progress) {
>>> +             if (_pte >= pte + HPAGE_PMD_NR)
>>> +                     *cur_progress = HPAGE_PMD_NR;
>>> +             else
>>> +                     *cur_progress = _pte - pte + 1;
>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>
>> ?
>>
>> It's still a bit nasty, though.
>>
>> Can't we just add one at the beginning of the loop and let the compiler
>> optimize that? ;)
> I'm also worried that the compiler can't optimize this since the body of
> the loop is complex, as with Dev's opinion [1].
>
> [1] https://lore.kernel.org/linux-mm/7c4b5933-7bbd-4ad7-baef-830304a09485@arm.com
>
> If you have a strong recommendation for this, please let me know, Thanks!

I haven't explicitly checked with assembly, but I am fairly sure this won't get optimized.
There are two cases where it could have been optimized:

1) Had the compiler inlined hpage_collapse_scan_pmd
2) Had the compiler done something like
   if (p) -> foo(), where foo() contains the complete for loop, with the increment
   else -> bar(), where bar() contains the complete for loop, without the increment

Both of which are highly unlikely because of the complexity of the function.

>
>>> +     }
>>>       pte_unmap_unlock(pte, ptl);
>>>       if (result == SCAN_SUCCEED) {
>>>               result = collapse_huge_page(mm, start_addr, referenced,
>>> @@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>       return result;
>>>   }
>>>
>>> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>>> -             struct file *file, pgoff_t start, struct collapse_control *cc)
>>> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
>>> +             unsigned long addr, struct file *file, pgoff_t start,
>>> +             unsigned int *cur_progress, struct collapse_control *cc)
>>>   {
>>>       struct folio *folio = NULL;
>>>       struct address_space *mapping = file->f_mapping;
>>> @@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
>>>                       cond_resched_rcu();
>>>               }
>>>       }
>>> +     if (cur_progress)
>>> +             *cur_progress = max(xas.xa_index - start, 1UL);
>> I would really just keep it simple here and do a
>>
>> *cur_progress = HPAGE_PMD_NR;
>>
>> This stuff is hard to reason about, so I would just leave the file case
>> essentially unchanged.
>>
>> IIRC, it would not affect the numbers you report in the patch description?
> Yes, Let's keep it simple, always equal to HPAGE_PMD_NR in file case.
>
> --
> Thanks,
> Vernon
Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
Posted by David Hildenbrand (Arm) 3 days, 7 hours ago
On 2/5/26 13:07, Dev Jain wrote:
> 
> On 05/02/26 11:38 am, Vernon Yang wrote:
>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>>> [...]
>>>
>>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>
>>> ?
>>>
>>> It's still a bit nasty, though.
>>>
>>> Can't we just add one at the beginning of the loop and let the compiler
>>> optimize that? ;)
>> I'm also worried that the compiler can't optimize this since the body of
>> the loop is complex, as with Dev's opinion [1].
>>
>> [1] https://lore.kernel.org/linux-mm/7c4b5933-7bbd-4ad7-baef-830304a09485@arm.com
>>
>> If you have a strong recommendation for this, please let me know, Thanks!
> 
> I haven't explicitly checked with assembly, but I am fairly sure this won't get optimized.
> There are two cases where it could have been optimized:
> 
> 1) Had the compiler inlined hpage_collapse_scan_pmd

Yeah, there are two callers so that likely does not happen.

> 2) Had the compiler done something like
>     if (p) -> foo(), where foo() contains the complete for loop, with the increment
>     else -> bar(), where bar() contains the complete for loop, without the increment
> 
> Both of which are highly unlikely because of the complexity of the function.

Not sure if the compiler would be to optimize this out also in 
non-inlined cases. In any case, I wonder if this must be optimized at 
all ...

-- 
Cheers,

David