From: Vernon Yang <yanglincheng@kylinos.cn>
Currently, each scan always increases "progress" by HPAGE_PMD_NR,
even if only scanning a single PTE/PMD entry.
- When only scanning a sigle PTE entry, let me provide a detailed
example:
static int hpage_collapse_scan_pmd()
{
for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
_pte++, addr += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
...
if (pte_uffd_wp(pteval)) { <-- first scan hit
result = SCAN_PTE_UFFD_WP;
goto out_unmap;
}
}
}
During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
directly. In practice, only one PTE is scanned before termination.
Here, "progress += 1" reflects the actual number of PTEs scanned, but
previously "progress += HPAGE_PMD_NR" always.
- When the memory has been collapsed to PMD, let me provide a detailed
example:
The following data is traced by bpftrace on a desktop system. After
the system has been left idle for 10 minutes upon booting, a lot of
SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
by khugepaged.
From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
following statuses were observed, with frequency mentioned next to them:
SCAN_SUCCEED : 1
SCAN_EXCEED_SHARED_PTE: 2
SCAN_PMD_MAPPED : 142
SCAN_NO_PTE_TABLE : 178
total progress size : 674 MB
Total time : 419 seconds, include khugepaged_scan_sleep_millisecs
The khugepaged_scan list save all task that support collapse into hugepage,
as long as the task is not destroyed, khugepaged will not remove it from
the khugepaged_scan list. This exist a phenomenon where task has already
collapsed all memory regions into hugepage, but khugepaged continues to
scan it, which wastes CPU time and invalid, and due to
khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
scanning a large number of invalid task, so scanning really valid task
is later.
After applying this patch, when the memory is either SCAN_PMD_MAPPED or
SCAN_NO_PTE_TABLE, just skip it, as follow:
SCAN_EXCEED_SHARED_PTE: 2
SCAN_PMD_MAPPED : 147
SCAN_NO_PTE_TABLE : 173
total progress size : 45 MB
Total time : 20 seconds
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
mm/khugepaged.c | 41 +++++++++++++++++++++++++++++++----------
1 file changed, 31 insertions(+), 10 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d94b34e10bdf..df22b2274d92 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -68,7 +68,10 @@ enum scan_result {
static struct task_struct *khugepaged_thread __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
-/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
+/*
+ * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
+ * every 10 second.
+ */
static unsigned int khugepaged_pages_to_scan __read_mostly;
static unsigned int khugepaged_pages_collapsed;
static unsigned int khugepaged_full_scans;
@@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
}
static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
+ struct vm_area_struct *vma, unsigned long start_addr,
+ bool *mmap_locked, unsigned int *cur_progress,
struct collapse_control *cc)
{
pmd_t *pmd;
@@ -1256,13 +1260,18 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
- if (result != SCAN_SUCCEED)
+ if (result != SCAN_SUCCEED) {
+ if (cur_progress)
+ *cur_progress = 1;
goto out;
+ }
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
if (!pte) {
+ if (cur_progress)
+ *cur_progress = 1;
result = SCAN_NO_PTE_TABLE;
goto out;
}
@@ -1396,6 +1405,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
result = SCAN_SUCCEED;
}
out_unmap:
+ if (cur_progress) {
+ if (_pte >= pte + HPAGE_PMD_NR)
+ *cur_progress = HPAGE_PMD_NR;
+ else
+ *cur_progress = _pte - pte + 1;
+ }
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
result = collapse_huge_page(mm, start_addr, referenced,
@@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
return result;
}
-static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
- struct file *file, pgoff_t start, struct collapse_control *cc)
+static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
+ unsigned long addr, struct file *file, pgoff_t start,
+ unsigned int *cur_progress, struct collapse_control *cc)
{
struct folio *folio = NULL;
struct address_space *mapping = file->f_mapping;
@@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
cond_resched_rcu();
}
}
+ if (cur_progress)
+ *cur_progress = max(xas.xa_index - start, 1UL);
rcu_read_unlock();
if (result == SCAN_SUCCEED) {
@@ -2455,6 +2473,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
while (khugepaged_scan.address < hend) {
bool mmap_locked = true;
+ unsigned int cur_progress = 0;
cond_resched();
if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
@@ -2471,7 +2490,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
mmap_read_unlock(mm);
mmap_locked = false;
*result = hpage_collapse_scan_file(mm,
- khugepaged_scan.address, file, pgoff, cc);
+ khugepaged_scan.address, file, pgoff,
+ &cur_progress, cc);
fput(file);
if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
mmap_read_lock(mm);
@@ -2485,7 +2505,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
}
} else {
*result = hpage_collapse_scan_pmd(mm, vma,
- khugepaged_scan.address, &mmap_locked, cc);
+ khugepaged_scan.address, &mmap_locked,
+ &cur_progress, cc);
}
if (*result == SCAN_SUCCEED)
@@ -2493,7 +2514,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
- progress += HPAGE_PMD_NR;
+ progress += cur_progress;
if (!mmap_locked)
/*
* We released mmap_lock so break loop. Note
@@ -2816,7 +2837,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
mmap_locked = false;
*lock_dropped = true;
result = hpage_collapse_scan_file(mm, addr, file, pgoff,
- cc);
+ NULL, cc);
if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
mapping_can_writeback(file->f_mapping)) {
@@ -2831,7 +2852,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
fput(file);
} else {
result = hpage_collapse_scan_pmd(mm, vma, addr,
- &mmap_locked, cc);
+ &mmap_locked, NULL, cc);
}
if (!mmap_locked)
*lock_dropped = true;
--
2.51.0
[...]
> + if (cur_progress) {
> + if (_pte >= pte + HPAGE_PMD_NR)
> + *cur_progress = HPAGE_PMD_NR;
> + else
> + *cur_progress = _pte - pte + 1;
*cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
?
It's still a bit nasty, though.
Can't we just add one at the beginning of the loop and let the compiler
optimize that? ;)
> + }
> pte_unmap_unlock(pte, ptl);
> if (result == SCAN_SUCCEED) {
> result = collapse_huge_page(mm, start_addr, referenced,
> @@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> return result;
> }
>
> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> - struct file *file, pgoff_t start, struct collapse_control *cc)
> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> + unsigned long addr, struct file *file, pgoff_t start,
> + unsigned int *cur_progress, struct collapse_control *cc)
> {
> struct folio *folio = NULL;
> struct address_space *mapping = file->f_mapping;
> @@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
> cond_resched_rcu();
> }
> }
> + if (cur_progress)
> + *cur_progress = max(xas.xa_index - start, 1UL);
I would really just keep it simple here and do a
*cur_progress = HPAGE_PMD_NR;
This stuff is hard to reason about, so I would just leave the file case
essentially unchanged.
IIRC, it would not affect the numbers you report in the patch description?
--
Cheers,
David
On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>
> [...]
>
> > + if (cur_progress) {
> > + if (_pte >= pte + HPAGE_PMD_NR)
> > + *cur_progress = HPAGE_PMD_NR;
> > + else
> > + *cur_progress = _pte - pte + 1;
>
> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
> ?
>
> It's still a bit nasty, though.
>
> Can't we just add one at the beginning of the loop and let the compiler
> optimize that? ;)
I'm also worried that the compiler can't optimize this since the body of
the loop is complex, as with Dev's opinion [1].
[1] https://lore.kernel.org/linux-mm/7c4b5933-7bbd-4ad7-baef-830304a09485@arm.com
If you have a strong recommendation for this, please let me know, Thanks!
> > + }
> > pte_unmap_unlock(pte, ptl);
> > if (result == SCAN_SUCCEED) {
> > result = collapse_huge_page(mm, start_addr, referenced,
> > @@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> > return result;
> > }
> >
> > -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > - struct file *file, pgoff_t start, struct collapse_control *cc)
> > +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> > + unsigned long addr, struct file *file, pgoff_t start,
> > + unsigned int *cur_progress, struct collapse_control *cc)
> > {
> > struct folio *folio = NULL;
> > struct address_space *mapping = file->f_mapping;
> > @@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
> > cond_resched_rcu();
> > }
> > }
> > + if (cur_progress)
> > + *cur_progress = max(xas.xa_index - start, 1UL);
> I would really just keep it simple here and do a
>
> *cur_progress = HPAGE_PMD_NR;
>
> This stuff is hard to reason about, so I would just leave the file case
> essentially unchanged.
>
> IIRC, it would not affect the numbers you report in the patch description?
Yes, Let's keep it simple, always equal to HPAGE_PMD_NR in file case.
--
Thanks,
Vernon
On 2/5/26 07:08, Vernon Yang wrote:
> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>>
>> [...]
>>
>>> + if (cur_progress) {
>>> + if (_pte >= pte + HPAGE_PMD_NR)
>>> + *cur_progress = HPAGE_PMD_NR;
>>> + else
>>> + *cur_progress = _pte - pte + 1;
>>
>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>
> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
Yes!
>
>> ?
>>
>> It's still a bit nasty, though.
>>
>> Can't we just add one at the beginning of the loop and let the compiler
>> optimize that? ;)
>
> I'm also worried that the compiler can't optimize this since the body of
> the loop is complex, as with Dev's opinion [1].
Why do we even have to optimize this? :)
Premature ... ? :)
--
Cheers,
David
On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
> On 2/5/26 07:08, Vernon Yang wrote:
>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>> <david@kernel.org> wrote:
>>>
>>> [...]
>>>
>>>> + if (cur_progress) {
>>>> + if (_pte >= pte + HPAGE_PMD_NR)
>>>> + *cur_progress = HPAGE_PMD_NR;
>>>> + else
>>>> + *cur_progress = _pte - pte + 1;
>>>
>>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>>
>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>
> Yes!
>
>>
>>> ?
>>>
>>> It's still a bit nasty, though.
>>>
>>> Can't we just add one at the beginning of the loop and let the compiler
>>> optimize that? ;)
>>
>> I'm also worried that the compiler can't optimize this since the body of
>> the loop is complex, as with Dev's opinion [1].
>
> Why do we even have to optimize this? :)
>
> Premature ... ? :)
I mean .... we don't, but the alternate is a one liner using max().
The objective is to compute the number of iterations of the for-loop.
It just seems weird to me to track that in the loop, when we have the
loop iterator, which *literally* does that only.
Anyhow, I won't shout in any case : ) If you deem incrementing in the
loop prettier, that's fine.
On 2/5/26 15:25, Dev Jain wrote: > > On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote: >> On 2/5/26 07:08, Vernon Yang wrote: >>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) >>> <david@kernel.org> wrote: >>> >>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max(). >> >> Yes! >> >>> >>> >>> I'm also worried that the compiler can't optimize this since the body of >>> the loop is complex, as with Dev's opinion [1]. >> >> Why do we even have to optimize this? :) >> >> Premature ... ? :) > > > I mean .... we don't, but the alternate is a one liner using max(). I'm fine with the max(), but it still seems like adding complexity to optimize something that is nowhere prove to really be a problem. -- Cheers, David
On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote: > On 2/5/26 15:25, Dev Jain wrote: > > > > On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote: > > > On 2/5/26 07:08, Vernon Yang wrote: > > > > On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) > > > > <david@kernel.org> wrote: > > > > > > > > I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max(). > > > > > > Yes! > > > > > > > > > > > > > > > I'm also worried that the compiler can't optimize this since the body of > > > > the loop is complex, as with Dev's opinion [1]. > > > > > > Why do we even have to optimize this? :) > > > > > > Premature ... ? :) > > > > > > I mean .... we don't, but the alternate is a one liner using max(). > > I'm fine with the max(), but it still seems like adding complexity to > optimize something that is nowhere prove to really be a problem. Hi David, Dev, I use "*cur_progress += 1" at the beginning of the loop, the compiler optimize that. Assembly as follows: 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) // 60ce: 49 c1 fa 03 sar $0x3,%r10 // 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax) 60d9: 4c 0f 4f d0 cmovg %rax,%r10 // 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10 To make the code simpler, Let us use "*cur_progress += 1". -- Thanks, Vernon
On 06/02/26 4:42 pm, Vernon Yang wrote: > On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote: >> On 2/5/26 15:25, Dev Jain wrote: >>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote: >>>> On 2/5/26 07:08, Vernon Yang wrote: >>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) >>>>> <david@kernel.org> wrote: >>>>> >>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max(). >>>> Yes! >>>> >>>>> >>>>> I'm also worried that the compiler can't optimize this since the body of >>>>> the loop is complex, as with Dev's opinion [1]. >>>> Why do we even have to optimize this? :) >>>> >>>> Premature ... ? :) >>> >>> I mean .... we don't, but the alternate is a one liner using max(). >> I'm fine with the max(), but it still seems like adding complexity to >> optimize something that is nowhere prove to really be a problem. > Hi David, Dev, > > I use "*cur_progress += 1" at the beginning of the loop, the compiler > optimize that. Assembly as follows: > > 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte > 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR > 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) // > 60ce: 49 c1 fa 03 sar $0x3,%r10 // > 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1 > 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax) > 60d9: 4c 0f 4f d0 cmovg %rax,%r10 // > 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10 > > To make the code simpler, Let us use "*cur_progress += 1". Wow! Wasn't expecting that. What's your gcc version? I checked with gcc 11.4.0 (looks pretty old) with both x86 and arm64, and it couldn't optimize. > -- > Thanks, > Vernon
On Sun, Feb 8, 2026 at 5:05 PM Dev Jain <dev.jain@arm.com> wrote: > > On 06/02/26 4:42 pm, Vernon Yang wrote: > > On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote: > >> On 2/5/26 15:25, Dev Jain wrote: > >>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote: > >>>> On 2/5/26 07:08, Vernon Yang wrote: > >>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) > >>>>> <david@kernel.org> wrote: > >>>>> > >>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max(). > >>>> Yes! > >>>> > >>>>> > >>>>> I'm also worried that the compiler can't optimize this since the body of > >>>>> the loop is complex, as with Dev's opinion [1]. > >>>> Why do we even have to optimize this? :) > >>>> > >>>> Premature ... ? :) > >>> > >>> I mean .... we don't, but the alternate is a one liner using max(). > >> I'm fine with the max(), but it still seems like adding complexity to > >> optimize something that is nowhere prove to really be a problem. > > Hi David, Dev, > > > > I use "*cur_progress += 1" at the beginning of the loop, the compiler > > optimize that. Assembly as follows: > > > > 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte > > 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR > > 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) // > > 60ce: 49 c1 fa 03 sar $0x3,%r10 // > > 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1 > > 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax) > > 60d9: 4c 0f 4f d0 cmovg %rax,%r10 // > > 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10 > > > > To make the code simpler, Let us use "*cur_progress += 1". > > Wow! Wasn't expecting that. What's your gcc version? I checked with > gcc 11.4.0 (looks pretty old) with both x86 and arm64, and it couldn't > optimize. $ gcc --version gcc (GCC) 15.2.1 20250808 (Red Hat 15.2.1-1) Above is my gcc version. However, I performed the assembly again without any optimization :( I suspect that I might have messed up the environment earlier, failing to compile the newly modified code successfully, which resulted is assembly old_khugepaged.o.
On 2026/2/8 17:05, Dev Jain wrote: > > On 06/02/26 4:42 pm, Vernon Yang wrote: >> On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote: >>> On 2/5/26 15:25, Dev Jain wrote: >>>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote: >>>>> On 2/5/26 07:08, Vernon Yang wrote: >>>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) >>>>>> <david@kernel.org> wrote: >>>>>> >>>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max(). >>>>> Yes! >>>>> >>>>>> >>>>>> I'm also worried that the compiler can't optimize this since the body of >>>>>> the loop is complex, as with Dev's opinion [1]. >>>>> Why do we even have to optimize this? :) >>>>> >>>>> Premature ... ? :) >>>> >>>> I mean .... we don't, but the alternate is a one liner using max(). >>> I'm fine with the max(), but it still seems like adding complexity to >>> optimize something that is nowhere prove to really be a problem. >> Hi David, Dev, >> >> I use "*cur_progress += 1" at the beginning of the loop, the compiler >> optimize that. Assembly as follows: >> >> 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte >> 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR >> 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) // >> 60ce: 49 c1 fa 03 sar $0x3,%r10 // >> 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1 >> 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax) >> 60d9: 4c 0f 4f d0 cmovg %rax,%r10 // >> 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10 >> >> To make the code simpler, Let us use "*cur_progress += 1". > > Wow! Wasn't expecting that. What's your gcc version? I checked with > gcc 11.4.0 (looks pretty old) with both x86 and arm64, and it couldn't > optimize. FWIW, 11.4.0 is newer that the minimum GCC version (8.1) required by kernel. See Documentation/process/changes.rst The optimization might just be version-dependent :)
On 2026/2/6 19:12, Vernon Yang wrote: > On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote: >> On 2/5/26 15:25, Dev Jain wrote: >>> >>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote: >>>> On 2/5/26 07:08, Vernon Yang wrote: >>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) >>>>> <david@kernel.org> wrote: >>>>> >>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max(). >>>> >>>> Yes! >>>> >>>>> >>>>> >>>>> I'm also worried that the compiler can't optimize this since the body of >>>>> the loop is complex, as with Dev's opinion [1]. >>>> >>>> Why do we even have to optimize this? :) >>>> >>>> Premature ... ? :) >>> >>> >>> I mean .... we don't, but the alternate is a one liner using max(). >> >> I'm fine with the max(), but it still seems like adding complexity to >> optimize something that is nowhere prove to really be a problem. > > Hi David, Dev, > > I use "*cur_progress += 1" at the beginning of the loop, the compiler > optimize that. Assembly as follows: > > 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte > 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR > 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) // > 60ce: 49 c1 fa 03 sar $0x3,%r10 // > 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1 > 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax) > 60d9: 4c 0f 4f d0 cmovg %rax,%r10 // > 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10 > > To make the code simpler, Let us use "*cur_progress += 1". Cool! Compiler did the right thing and the heavy lifting after all - we get to keep it simple :p
On 06/02/26 2:32 pm, David Hildenbrand (Arm) wrote: > On 2/5/26 15:25, Dev Jain wrote: >> >> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote: >>> On 2/5/26 07:08, Vernon Yang wrote: >>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) >>>> <david@kernel.org> wrote: >>>> >>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max(). >>> >>> Yes! >>> >>>> >>>> >>>> I'm also worried that the compiler can't optimize this since the body of >>>> the loop is complex, as with Dev's opinion [1]. >>> >>> Why do we even have to optimize this? :) >>> >>> Premature ... ? :) >> >> >> I mean .... we don't, but the alternate is a one liner using max(). > > I'm fine with the max(), but it still seems like adding complexity to > optimize something that is nowhere prove to really be a problem. Agreed. Vernon, let us do the increment in the loop then.
On 2/6/26 11:00, Dev Jain wrote: > > On 06/02/26 2:32 pm, David Hildenbrand (Arm) wrote: >> On 2/5/26 15:25, Dev Jain wrote: >>> >>> >>> >>> I mean .... we don't, but the alternate is a one liner using max(). >> >> I'm fine with the max(), but it still seems like adding complexity to >> optimize something that is nowhere prove to really be a problem. > > Agreed. Vernon, let us do the increment in the loop then. I'm fine with the min(), so if you both think it's better, let's do that! It makes it slightly harder to understand what's happening, but fortunately, if we mess up slightly nobody will really notice :) -- Cheers, David
On 05/02/26 7:55 pm, Dev Jain wrote:
> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>> On 2/5/26 07:08, Vernon Yang wrote:
>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>> <david@kernel.org> wrote:
>>>> [...]
>>>>
>>>>> + if (cur_progress) {
>>>>> + if (_pte >= pte + HPAGE_PMD_NR)
>>>>> + *cur_progress = HPAGE_PMD_NR;
>>>>> + else
>>>>> + *cur_progress = _pte - pte + 1;
>>>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>> Yes!
>>
>>>> ?
>>>>
>>>> It's still a bit nasty, though.
>>>>
>>>> Can't we just add one at the beginning of the loop and let the compiler
>>>> optimize that? ;)
>>> I'm also worried that the compiler can't optimize this since the body of
>>> the loop is complex, as with Dev's opinion [1].
>> Why do we even have to optimize this? :)
>>
>> Premature ... ? :)
>
> I mean .... we don't, but the alternate is a one liner using max().
>
> The objective is to compute the number of iterations of the for-loop.
>
> It just seems weird to me to track that in the loop, when we have the
>
> loop iterator, which *literally* does that only.
I realize I shouldn't have bolded out the "literally" - below I wrote that
I won't shout, but the bold seems like shouting :)
>
>
>
> Anyhow, I won't shout in any case : ) If you deem incrementing in the
>
> loop prettier, that's fine.
>
>
On 2/5/26 15:30, Dev Jain wrote: > > On 05/02/26 7:55 pm, Dev Jain wrote: >> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote: >>> Yes! >>> >>> Why do we even have to optimize this? :) >>> >>> Premature ... ? :) >> >> I mean .... we don't, but the alternate is a one liner using max(). >> >> The objective is to compute the number of iterations of the for-loop. >> >> It just seems weird to me to track that in the loop, when we have the >> >> loop iterator, which *literally* does that only. > > I realize I shouldn't have bolded out the "literally" - below I wrote that > I won't shout, but the bold seems like shouting :) Heh. The thing is that the loop iterator does not quite what we want, otherwise we wouldn't have to mess with max() etc. -- Cheers, David
On 05/02/26 11:38 am, Vernon Yang wrote:
> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>> [...]
>>
>>> + if (cur_progress) {
>>> + if (_pte >= pte + HPAGE_PMD_NR)
>>> + *cur_progress = HPAGE_PMD_NR;
>>> + else
>>> + *cur_progress = _pte - pte + 1;
>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>
>> ?
>>
>> It's still a bit nasty, though.
>>
>> Can't we just add one at the beginning of the loop and let the compiler
>> optimize that? ;)
> I'm also worried that the compiler can't optimize this since the body of
> the loop is complex, as with Dev's opinion [1].
>
> [1] https://lore.kernel.org/linux-mm/7c4b5933-7bbd-4ad7-baef-830304a09485@arm.com
>
> If you have a strong recommendation for this, please let me know, Thanks!
I haven't explicitly checked with assembly, but I am fairly sure this won't get optimized.
There are two cases where it could have been optimized:
1) Had the compiler inlined hpage_collapse_scan_pmd
2) Had the compiler done something like
if (p) -> foo(), where foo() contains the complete for loop, with the increment
else -> bar(), where bar() contains the complete for loop, without the increment
Both of which are highly unlikely because of the complexity of the function.
>
>>> + }
>>> pte_unmap_unlock(pte, ptl);
>>> if (result == SCAN_SUCCEED) {
>>> result = collapse_huge_page(mm, start_addr, referenced,
>>> @@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>> return result;
>>> }
>>>
>>> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>>> - struct file *file, pgoff_t start, struct collapse_control *cc)
>>> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
>>> + unsigned long addr, struct file *file, pgoff_t start,
>>> + unsigned int *cur_progress, struct collapse_control *cc)
>>> {
>>> struct folio *folio = NULL;
>>> struct address_space *mapping = file->f_mapping;
>>> @@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
>>> cond_resched_rcu();
>>> }
>>> }
>>> + if (cur_progress)
>>> + *cur_progress = max(xas.xa_index - start, 1UL);
>> I would really just keep it simple here and do a
>>
>> *cur_progress = HPAGE_PMD_NR;
>>
>> This stuff is hard to reason about, so I would just leave the file case
>> essentially unchanged.
>>
>> IIRC, it would not affect the numbers you report in the patch description?
> Yes, Let's keep it simple, always equal to HPAGE_PMD_NR in file case.
>
> --
> Thanks,
> Vernon
On 2/5/26 13:07, Dev Jain wrote: > > On 05/02/26 11:38 am, Vernon Yang wrote: >> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote: >>> [...] >>> >>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR); >> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max(). >> >>> ? >>> >>> It's still a bit nasty, though. >>> >>> Can't we just add one at the beginning of the loop and let the compiler >>> optimize that? ;) >> I'm also worried that the compiler can't optimize this since the body of >> the loop is complex, as with Dev's opinion [1]. >> >> [1] https://lore.kernel.org/linux-mm/7c4b5933-7bbd-4ad7-baef-830304a09485@arm.com >> >> If you have a strong recommendation for this, please let me know, Thanks! > > I haven't explicitly checked with assembly, but I am fairly sure this won't get optimized. > There are two cases where it could have been optimized: > > 1) Had the compiler inlined hpage_collapse_scan_pmd Yeah, there are two callers so that likely does not happen. > 2) Had the compiler done something like > if (p) -> foo(), where foo() contains the complete for loop, with the increment > else -> bar(), where bar() contains the complete for loop, without the increment > > Both of which are highly unlikely because of the complexity of the function. Not sure if the compiler would be to optimize this out also in non-inlined cases. In any case, I wonder if this must be optimized at all ... -- Cheers, David
© 2016 - 2026 Red Hat, Inc.