commit 59d9094df3d79 introduced ->pt_share_count dedicated to
hugetlb PMD share count tracking, but omitted fixing
copy_hugetlb_page_range(), leaving the function relying on
page_count() for tracking that no longer works.
When lazy page table copy for hugetlb is disabled (commit bcd51a3c679d),
fork()'ing with hugetlb PMD sharing quickly locks up -
[ 239.446559] watchdog: BUG: soft lockup - CPU#75 stuck for 27s!
[ 239.446611] RIP: 0010:native_queued_spin_lock_slowpath+0x7e/0x2e0
[ 239.446631] Call Trace:
[ 239.446633] <TASK>
[ 239.446636] _raw_spin_lock+0x3f/0x60
[ 239.446639] copy_hugetlb_page_range+0x258/0xb50
[ 239.446645] copy_page_range+0x22b/0x2c0
[ 239.446651] dup_mmap+0x3e2/0x770
[ 239.446654] dup_mm.constprop.0+0x5e/0x230
[ 239.446657] copy_process+0xd17/0x1760
[ 239.446660] kernel_clone+0xc0/0x3e0
[ 239.446661] __do_sys_clone+0x65/0xa0
[ 239.446664] do_syscall_64+0x82/0x930
[ 239.446668] ? count_memcg_events+0xd2/0x190
[ 239.446671] ? syscall_trace_enter+0x14e/0x1f0
[ 239.446676] ? syscall_exit_work+0x118/0x150
[ 239.446677] ? arch_exit_to_user_mode_prepare.constprop.0+0x9/0xb0
[ 239.446681] ? clear_bhb_loop+0x30/0x80
[ 239.446684] ? clear_bhb_loop+0x30/0x80
[ 239.446686] entry_SYSCALL_64_after_hwframe+0x76/0x7e
There are two options to resolve the potential latent issue:
1. remove the PMD sharing awareness from copy_hugetlb_page_range(),
2. fix it.
This patch opts for the second option.
Fixes: 59d9094df3d79 ("mm: hugetlb: independent PMD page table shared
count")
Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
mm/hugetlb.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 753f99b4c718..8ca5b4f7805f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5594,18 +5594,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				break;
 		}
 
-		/*
-		 * If the pagetables are shared don't copy or take references.
-		 *
-		 * dst_pte == src_pte is the common case of src/dest sharing.
-		 * However, src could have 'unshared' and dst shares with
-		 * another vma. So page_count of ptep page is checked instead
-		 * to reliably determine whether pte is shared.
-		 */
-		if (page_count(virt_to_page(dst_pte)) > 1) {
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+		/* If the pagetables are shared don't copy or take references. */
+		if (ptdesc_pmd_pts_count(virt_to_ptdesc(dst_pte)) > 0) {
 				addr |= last_addr_mask;
 				continue;
 		}
+#endif
 
 		dst_ptl = huge_pte_lock(h, dst, dst_pte);
 		src_ptl = huge_pte_lockptr(h, src, src_pte);
--
2.43.5
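For readers without the tree at hand: ptdesc_pmd_pts_count() comes from commit 59d9094df3d79, which added an atomic pt_share_count field to struct ptdesc so that hugetlb PMD sharing is tracked explicitly instead of being inferred from the page refcount. The following is a from-memory sketch of those helpers, not a verbatim copy of include/linux/mm_types.h; apart from ptdesc_pmd_pts_count(), which the patch above calls directly, treat the exact names and layout as assumptions.

#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
/* Sketch: dedicated share counter for a hugetlb PMD page table page. */
static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
{
        atomic_set(&ptdesc->pt_share_count, 0);
}

/* Called when another mm starts sharing this PMD page table. */
static inline void ptdesc_pmd_pts_inc(struct ptdesc *ptdesc)
{
        atomic_inc(&ptdesc->pt_share_count);
}

/* Called when a sharer drops out (e.g. huge_pmd_unshare()). */
static inline void ptdesc_pmd_pts_dec(struct ptdesc *ptdesc)
{
        atomic_dec(&ptdesc->pt_share_count);
}

/* Non-zero means the PMD page table is currently shared. */
static inline int ptdesc_pmd_pts_count(struct ptdesc *ptdesc)
{
        return atomic_read(&ptdesc->pt_share_count);
}
#endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */

This is why the patch replaces the old page_count(virt_to_page(dst_pte)) > 1 test: once sharing is accounted in pt_share_count, the page refcount no longer says whether dst_pte's PMD page table is shared.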
On 09.09.25 20:43, Jane Chu wrote:
[...]
> -		/*
> -		 * If the pagetables are shared don't copy or take references.
> -		 *
> -		 * dst_pte == src_pte is the common case of src/dest sharing.
> -		 * However, src could have 'unshared' and dst shares with
> -		 * another vma. So page_count of ptep page is checked instead
> -		 * to reliably determine whether pte is shared.
> -		 */
> -		if (page_count(virt_to_page(dst_pte)) > 1) {
> +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
> +		/* If the pagetables are shared don't copy or take references. */

Why remove so much of the original comment?

> +		if (ptdesc_pmd_pts_count(virt_to_ptdesc(dst_pte)) > 0) {
> 			addr |= last_addr_mask;
> 			continue;
> 		}

LGTM, thanks!

--
Cheers

David / dhildenb
On 9/9/2025 11:45 PM, David Hildenbrand wrote:
[..]
>> -		/*
>> -		 * If the pagetables are shared don't copy or take references.
>> -		 *
>> -		 * dst_pte == src_pte is the common case of src/dest sharing.
>> -		 * However, src could have 'unshared' and dst shares with
>> -		 * another vma. So page_count of ptep page is checked instead
>> -		 * to reliably determine whether pte is shared.
>> -		 */
>> -		if (page_count(virt_to_page(dst_pte)) > 1) {
>> +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
>> +		/* If the pagetables are shared don't copy or take references. */
>
> Why remove so much of the original comment?

Because this check has already advanced from "dst_pte == src_pte" to
"page_count() > 1" to "->pt_share_count > 0", it seems cleaner to just
keep a one-liner comment.
That said, if you feel the comments should be kept, I'd be happy to
restore them with a bit of revision.

>> +		if (ptdesc_pmd_pts_count(virt_to_ptdesc(dst_pte)) > 0) {
>> 			addr |= last_addr_mask;
>> 			continue;
>> 		}
>
> LGTM, thanks!

Thanks!
-jane
On 11.09.25 21:54, jane.chu@oracle.com wrote:
> On 9/9/2025 11:45 PM, David Hildenbrand wrote:
> [..]
>>> -		/*
>>> -		 * If the pagetables are shared don't copy or take references.
[...]
>>> +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
>>> +		/* If the pagetables are shared don't copy or take references. */
>>
>> Why remove so much of the original comment?
>
> Because this check has already advanced from "dst_pte == src_pte" to
> "page_count() > 1" to "->pt_share_count > 0", it seems cleaner to just
> keep a one-liner comment.
> That said, if you feel the comments should be kept, I'd be happy to
> restore them with a bit of revision.

Well, the comment explains why checking the pte pointers is insufficient
and why there is a corner case where the pointers differ but we still
want to unshare. :)

But yeah, I agree that reading the code it's clear: if dst is already
shared, just don't do anything.

I would probably rephrase the comment to something simpler like

        "/* If the pagetables are shared, there is nothing to do. */"

If you resend, please add a comment to the patch description like "While
at it, simplify the comment, the details are not actually relevant
anymore".

--
Cheers

David / dhildenb
On 9/12/2025 12:31 AM, David Hildenbrand wrote:
> On 11.09.25 21:54, jane.chu@oracle.com wrote:
[...]
> Well, the comment explains why checking the pte pointers is insufficient
> and why there is a corner case where the pointers differ but we still
> want to unshare. :)
>
> But yeah, I agree that reading the code it's clear: if dst is already
> shared, just don't do anything.
>
> I would probably rephrase the comment to something simpler like
>
>         "/* If the pagetables are shared, there is nothing to do. */"
>
> If you resend, please add a comment to the patch description like "While
> at it, simplify the comment, the details are not actually relevant
> anymore".

Will do, thanks!
-jane
On Tue, Sep 09, 2025 at 12:43:57PM -0600, Jane Chu wrote:
> commit 59d9094df3d79 introduced ->pt_share_count dedicated to

nit: the format should be:
commit <sha1> ("summary")
?

> hugetlb PMD share count tracking, but omitted fixing
> copy_hugetlb_page_range(), leaving the function relying on
> page_count() for tracking that no longer works.
>
> When lazy page table copy for hugetlb is disabled (commit bcd51a3c679d),

same here.

> fork()'ing with hugetlb PMD sharing quickly locks up -
>
> [ 239.446559] watchdog: BUG: soft lockup - CPU#75 stuck for 27s!
[...]
>
> There are two options to resolve the potential latent issue:
> 1. remove the PMD sharing awareness from copy_hugetlb_page_range(),
> 2. fix it.
> This patch opts for the second option.
>
> Fixes: 59d9094df3d79 ("mm: hugetlb: independent PMD page table shared
> count")

nit: we don't add a newline even when the Fixes: tag line is over 75
characters?

> Signed-off-by: Jane Chu <jane.chu@oracle.com>
> ---

The change in general looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

+Cc Jann Horn who backported the commit 59d9094df3d79.

Elaborating a little bit on why it doesn't need to be backported:
TL;DR: One needs to backport commit 3aa4ed8040e15 to pre-v6.0 kernels,
or revert commit bcd51a3c679d, to trigger this.

commit 59d9094df3d79 ("mm: hugetlb: independent PMD page table shared count")
was introduced in 6.13 and backported to 5.10.y, 5.15.y, 6.1.y, 6.6.y, 6.12.y.

As lazy page table copy is enabled in v6.0 by commit bcd51a3c679d
("hugetlb: lazy page table copies in fork()"), the bug is not triggered
because shared hugetlb VMAs either go through the lazy page table copy
logic (vma_needs_copy() returns false) or cannot share page tables
(uffd_disable_huge_pmd_share() returns false).

They shouldn't have an anon_vma, and VM_PFNMAP, VM_MIXEDMAP and UFFD-WP
VMAs cannot share page tables - so either vma_needs_copy() returns false,
or page tables cannot be shared.

And before commit 3aa4ed8040e15 ("mm/hugetlb: make detecting shared pte
more reliable"), introduced in v6.1, copy_hugetlb_page_range() didn't
check the refcount to determine whether the page table is shared.

[...]

--
Cheers,
Harry / Hyeonggon
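To make the lazy-copy reasoning above concrete: since commit bcd51a3c679d, copy_page_range() only walks a VMA at fork() time when vma_needs_copy() says the child could not rebuild the page tables from faults. The sketch below is a rough, from-memory rendering of that check in mm/memory.c; treat the exact set of conditions as an approximation rather than the authoritative source.

static bool vma_needs_copy(struct vm_area_struct *dst_vma,
                           struct vm_area_struct *src_vma)
{
        /* uffd-wp protection bits live in the page tables themselves. */
        if (userfaultfd_wp(dst_vma))
                return true;

        /* PFN/mixed mappings cannot be reconstructed by faulting. */
        if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
                return true;

        /* Anonymous pages must be copied and write-protected at fork(). */
        if (src_vma->anon_vma)
                return true;

        /* Everything else can be filled in lazily by page faults. */
        return false;
}

A MAP_SHARED hugetlb mapping normally has no anon_vma, is neither VM_PFNMAP nor VM_MIXEDMAP, and is not uffd-wp protected, so fork() skips copy_hugetlb_page_range() for it entirely - which is why the stale page_count() check stayed latent in upstream releases.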
On 9/9/2025 6:14 PM, Harry Yoo wrote:
> On Tue, Sep 09, 2025 at 12:43:57PM -0600, Jane Chu wrote:
>> commit 59d9094df3d79 introduced ->pt_share_count dedicated to
>
> nit: the format should be:
> commit <sha1> ("summary")
> ?
[...]
> The change in general looks good to me,
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>
> +Cc Jann Horn who backported the commit 59d9094df3d79.

Thanks Harry!  Will fix the commit message and send v2.

> Elaborating a little bit on why it doesn't need to be backported:
> TL;DR: One needs to backport commit 3aa4ed8040e15 to pre-v6.0 kernels,
> or revert commit bcd51a3c679d, to trigger this.
>
> commit 59d9094df3d79 ("mm: hugetlb: independent PMD page table shared count")
> was introduced in 6.13 and backported to 5.10.y, 5.15.y, 6.1.y, 6.6.y, 6.12.y.
>
> As lazy page table copy is enabled in v6.0 by commit bcd51a3c679d
> ("hugetlb: lazy page table copies in fork()"), the bug is not triggered
> because shared hugetlb VMAs either go through the lazy page table copy
> logic (vma_needs_copy() returns false) or cannot share page tables
> (uffd_disable_huge_pmd_share() returns false).
>
> They shouldn't have an anon_vma, and VM_PFNMAP, VM_MIXEDMAP and UFFD-WP
> VMAs cannot share page tables - so either vma_needs_copy() returns false,
> or page tables cannot be shared.
>
> And before commit 3aa4ed8040e15 ("mm/hugetlb: make detecting shared pte
> more reliable"), introduced in v6.1, copy_hugetlb_page_range() didn't
> check the refcount to determine whether the page table is shared.

Yes, it's the combination of

  v6.1:  3aa4ed8040e15  mm/hugetlb: make detecting shared pte more reliable
  v6.13: 59d9094df3d79  mm: hugetlb: independent PMD page table shared count

without

  v6.0:  bcd51a3c679d   hugetlb: lazy page table copies in fork()

that is problematic.  Since the problematic combination doesn't exist in
any upstream release, no need to backport.  The patch is for a potential
latent bug.

thanks,
-jane