From: Barry Song <v-songbaohua@oppo.com>
The check_pmd_still_valid() call during collapse is currently only
protected by the mmap_lock in write mode, which was sufficient when
pt_reclaim always ran under mmap_lock in read mode. However, since
madvise_dontneed can now execute under a per-VMA lock, this assumption
is no longer valid. As a result, a race condition can occur between
collapse and PT_RECLAIM, potentially leading to a kernel panic.
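The window is easiest to see from the pre-fix shape of collapse_huge_page();
below is a condensed sketch (names as in mm/khugepaged.c; arguments,
surrounding locking, and error handling elided), not the verbatim function:

mmap_write_lock(mm);
result = hugepage_vma_revalidate(...);		/* arguments elided */
if (result != SCAN_SUCCEED)
	goto out_up_write;
/* check if the pmd is still valid */
result = check_pmd_still_valid(mm, address, pmd);
if (result != SCAN_SUCCEED)
	goto out_up_write;
/*
 * Window: the mmap write lock alone no longer excludes
 * madvise(MADV_DONTNEED), which may run under the per-VMA lock and let
 * pt_reclaim clear the pmd and free the PTE table right here,
 * invalidating the check above.
 */
vma_start_write(vma);	/* per-VMA-lock readers excluded only from here on */
anon_vma_lock_write(vma->anon_vma);

The resulting panic: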
[ 38.151897] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] SMP KASI
[ 38.153519] KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
[ 38.154605] CPU: 0 UID: 0 PID: 721 Comm: repro Not tainted 6.16.0-next-20250801-next-2025080 #1 PREEMPT(voluntary)
[ 38.155929] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
[ 38.157418] RIP: 0010:kasan_byte_accessible+0x15/0x30
[ 38.158125] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
[ 38.160461] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
[ 38.161220] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
[ 38.162232] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: dffffc0000000003
[ 38.163176] RBP: ffff88800feef698 R08: 0000000000000001 R09: 0000000000000000
[ 38.164195] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 0000000000000018
[ 38.165189] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 0000000000000000
[ 38.166100] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
[ 38.167137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 38.167891] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
[ 38.168812] PKRU: 55555554
[ 38.169275] Call Trace:
[ 38.169647] <TASK>
[ 38.169975] ? __kasan_check_byte+0x19/0x50
[ 38.170581] lock_acquire+0xea/0x310
[ 38.171083] ? rcu_is_watching+0x19/0xc0
[ 38.171615] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 38.172343] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 38.173130] _raw_spin_lock+0x38/0x50
[ 38.173707] ? __pte_offset_map_lock+0x1a2/0x3c0
[ 38.174390] __pte_offset_map_lock+0x1a2/0x3c0
[ 38.174987] ? __pfx___pte_offset_map_lock+0x10/0x10
[ 38.175724] ? __pfx_pud_val+0x10/0x10
[ 38.176308] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 38.177183] unmap_page_range+0xb60/0x43e0
[ 38.177824] ? __pfx_unmap_page_range+0x10/0x10
[ 38.178485] ? mas_next_slot+0x133a/0x1a50
[ 38.179079] unmap_single_vma.constprop.0+0x15b/0x250
[ 38.179830] unmap_vmas+0x1fa/0x460
[ 38.180373] ? __pfx_unmap_vmas+0x10/0x10
[ 38.180994] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 38.181877] exit_mmap+0x1a2/0xb40
[ 38.182396] ? lock_release+0x14f/0x2c0
[ 38.182929] ? __pfx_exit_mmap+0x10/0x10
[ 38.183474] ? __pfx___mutex_unlock_slowpath+0x10/0x10
[ 38.184188] ? mutex_unlock+0x16/0x20
[ 38.184704] mmput+0x132/0x370
[ 38.185208] do_exit+0x7e7/0x28c0
[ 38.185682] ? __this_cpu_preempt_check+0x21/0x30
[ 38.186328] ? do_group_exit+0x1d8/0x2c0
[ 38.186873] ? __pfx_do_exit+0x10/0x10
[ 38.187401] ? __this_cpu_preempt_check+0x21/0x30
[ 38.188036] ? _raw_spin_unlock_irq+0x2c/0x60
[ 38.188634] ? lockdep_hardirqs_on+0x89/0x110
[ 38.189313] do_group_exit+0xe4/0x2c0
[ 38.189831] __x64_sys_exit_group+0x4d/0x60
[ 38.190413] x64_sys_call+0x2174/0x2180
[ 38.190935] do_syscall_64+0x6d/0x2e0
[ 38.191449] entry_SYSCALL_64_after_hwframe+0x76/0x7e
This patch moves the vma_start_write() call to precede
check_pmd_still_valid(), ensuring that the check is also properly
protected by the per-VMA lock.
Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
mm/khugepaged.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 374a6a5193a7..6b40bdfd224c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
+	vma_start_write(vma);
 	result = check_pmd_still_valid(mm, address, pmd);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
-	vma_start_write(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
--
2.39.3 (Apple Git-146)
Andrew - to be clear, we need this as a hotfix for 6.17, as this is a
known bug in rc1 right now.

On Tue, Aug 05, 2025 at 11:54:47AM +0800, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.
> [...]

Looks good to me so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
On 2025/8/5 11:54, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. [...]

LGTM.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Hi Barry,

On 8/5/25 11:54 AM, Barry Song wrote:
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.

There is indeed a race condition here. And after applying this patch, I
can no longer reproduce the problem locally (I was able to reproduce it
stably locally last night).

But I still can't figure out how this race condition causes the
following panic:

exit_mmap
--> mmap_read_lock()
    unmap_vmas()
    --> pte_offset_map_lock
        --> rcu_read_lock()
            check if the pmd entry is a PTE page
            ptl = pte_lockptr(mm, &pmdval) <-- ptl is NULL
            spin_lock(ptl) <-- PANIC!!

If this PTE page is freed by pt_reclaim (via RCU), then the ptl cannot
be NULL.

The collapse holds the mmap write lock, so it is impossible for it to
run concurrently with exit_mmap().

Confusing. :(
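For reference, the sequence Qi traces above is __pte_offset_map_lock(); a
condensed version (based on mm/pgtable-generic.c, slightly simplified, not
verbatim) looks like this:

pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
			     unsigned long addr, spinlock_t **ptlp)
{
	spinlock_t *ptl;
	pmd_t pmdval;
	pte_t *pte;
again:
	pte = __pte_offset_map(pmd, addr, &pmdval);
	if (unlikely(!pte))		/* pmd none/bad: bail out, no crash */
		return pte;
	ptl = pte_lockptr(mm, &pmdval);	/* NULL if the page the pmd points at
					   was never set up as a PTE table */
	spin_lock(ptl);			/* <-- the crash site in the report */
	if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		*ptlp = ptl;
		return pte;
	}
	pte_unmap_unlock(pte, ptl);
	goto again;
}

So the puzzle Qi raises is real: a crash here needs a pmd that looks valid
enough to get past __pte_offset_map() yet yields a bogus ptl, which the rest
of the thread pins down.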
On 2025/8/5 14:42, Qi Zheng wrote:
> There is indeed a race condition here. And after applying this patch, I
> can no longer reproduce the problem locally (I was able to reproduce it
> stably locally last night).
>
> But I still can't figure out how this race condition causes the
> following panic:
> [...]

IIUC, the issue is not caused by the concurrency between exit_mmap and
collapse, but rather by the concurrency between pt_reclaim and collapse.

Before this patch, khugepaged might incorrectly restore a PTE pagetable
that had already been freed.

pt_reclaim has cleared the pmd entry and freed the PTE page table.
However, due to the race condition, check_pmd_still_valid() still passes
and continues to attempt the collapse:

_pmd = pmdp_collapse_flush(vma, address, pmd); ---> returns a none pmd
entry (the original pmd entry has been cleared)

pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); ---> returns
pte == NULL

Then khugepaged will restore the old PTE pagetable with an invalid pmd
entry:

pmd_populate(mm, pmd, pmd_pgtable(_pmd));

So when the process exits and tries to free the mappings of the process,
traversing the invalid pmd table will lead to a crash.

Barry, please correct me if I have misunderstood something.
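A condensed sketch of that failing sequence (following the names in
mm/khugepaged.c; the locking around the abort path and the exact SCAN_*
result are elided, so treat this as illustration rather than the verbatim
error path):

_pmd = pmdp_collapse_flush(vma, address, pmd);	/* reads back as none:
						   pt_reclaim already
						   cleared the pmd */

pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (!pte) {
	/*
	 * Mapping the PTE table failed, and the abort path tries to
	 * restore the "old" table from _pmd.  But _pmd is none, so
	 * pmd_pgtable(_pmd) does not point at the freed PTE page -- it
	 * resolves to an unrelated page (pfn 0, as worked out later in
	 * this thread).
	 */
	pmd_populate(mm, pmd, pmd_pgtable(_pmd));
	goto out_up_write;	/* collapse aborted with an error result */
}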
Hi Baolin,

On 8/5/25 3:53 PM, Baolin Wang wrote:
> IIUC, the issue is not caused by the concurrency between exit_mmap and
> collapse, but rather by the concurrency between pt_reclaim and collapse.
> [...]
> Then khugepaged will restore the old PTE pagetable with an invalid pmd
> entry:
>
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>
> So when the process exits and tries to free the mappings of the
> process, traversing the invalid pmd table will lead to a crash.

CPU0                                    CPU1
====                                    ====

collapse
--> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
    mmap_write_unlock
                                        exit_mmap
                                        --> hold mmap lock
                                            __pte_offset_map_lock
                                            --> pte = __pte_offset_map(pmd, addr, &pmdval);
                                                if (unlikely(!pte))
                                                    return pte; <-- will return

IIUC, in this case, if we get an invalid pmd entry, we will return
directly instead of causing a crash?
On 2025/8/5 16:17, Qi Zheng wrote:
> CPU0                                    CPU1
> ====                                    ====
>
> collapse
> --> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>     mmap_write_unlock
>                                         exit_mmap
>                                         --> hold mmap lock
>                                             __pte_offset_map_lock
>                                             --> pte = __pte_offset_map(pmd, addr, &pmdval);
>                                                 if (unlikely(!pte))
>                                                     return pte; <-- will return

__pte_offset_map() might not return NULL? Because the 'pmd_populate(mm,
pmd, pmd_pgtable(_pmd))' could populate a valid page (although the
'_pmd' entry is NONE), but it is not the original pagetable page.

> IIUC, in this case, if we get an invalid pmd entry, we will return
> directly instead of causing a crash?
On 8/5/25 4:56 PM, Baolin Wang wrote:
> __pte_offset_map() might not return NULL? Because the 'pmd_populate(mm,
> pmd, pmd_pgtable(_pmd))' could populate a valid page (although the
> '_pmd' entry is NONE), but it is not the original pagetable page.

CPU0                                    CPU1
====                                    ====

collapse
--> check_pmd_still_valid
                                        vma read lock
                                        pt_reclaim clears the pmd entry
                                        and will free the PTE page (via RCU)
                                        vma read unlock

    vma write lock
    _pmd = pmdp_collapse_flush(vma, address, pmd) <-- pmd_none(_pmd)
    pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); <-- pte is NULL
    pmd_populate(mm, pmd, pmd_pgtable(_pmd)); <-- populate a valid page?
    vma write unlock

The above is the concurrent scenario you mentioned, right?

What types of this 'valid page' could be? If __pte_offset_map() returns
non-NULL, then it is a PTE page. Even if it is not the original one, it
should not cause a panic. Did I miss some key information? :(
On 05.08.25 11:30, Qi Zheng wrote:
> What types of this 'valid page' could be? If __pte_offset_map() returns
> non-NULL, then it is a PTE page. Even if it is not the original one, it
> should not cause a panic. Did I miss some key information? :(

Wasn't the original issue all about a NULL-pointer de-reference while
*locking*?

Note that in that kernel config [1] we have CONFIG_DEBUG_SPINLOCK=y, so
likely we will have ALLOC_SPLIT_PTLOCKS set.

[1] https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config

--
Cheers,

David / dhildenb
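To make the ALLOC_SPLIT_PTLOCKS point concrete: with split ptlocks allocated
out of line (the case when a debug-bloated spinlock_t no longer fits inline,
e.g. with CONFIG_DEBUG_SPINLOCK), the PTE-table lock is reached through a
pointer. A simplified sketch modeled on the include/linux/mm.h helpers, not
verbatim:

static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
{
	/* set only when ptlock_alloc() initialized this page as a PTE table */
	return ptdesc->ptl;
}

static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
}

For a page never constituted as a page table, ptdesc->ptl was never
allocated, so pte_lockptr() hands spin_lock() a NULL (or garbage) pointer --
consistent with the KASAN null-ptr-deref in the 0x18-0x1f range, i.e. a
field inside spinlock_t at a small offset from 0.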
On 2025/8/5 17:50, David Hildenbrand wrote:
> On 05.08.25 11:30, Qi Zheng wrote:
>> The above is the concurrent scenario you mentioned, right?

Yes.

>> What types of this 'valid page' could be? If __pte_offset_map() returns
>> non-NULL, then it is a PTE page. Even if it is not the original one, it
>> should not cause a panic. Did I miss some key information? :(

Sorry for not being clear. Let me try again.

In the race condition described above, the '_pmd' value is NONE, meaning
that when restoring the pmd entry with 'pmd_populate(mm, pmd,
pmd_pgtable(_pmd))', the 'pmd_pgtable(_pmd)' can return a struct page
corresponding to pfn == 0 (because the '_pmd' is NONE) to populate the
pmd entry. Clearly, this pfn == 0 page is not a pagetable page, meaning
the corresponding ptl lock of this page is not initialized.

Additionally, from the boot dmesg, I can see that the BIOS reports an
address range with pfn == 0, indicating that there is a struct page
initialized for pfn == 0 (possibly a reserved page):

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffdffff] usable
[    0.000000] BIOS-e820: [mem 0x000000007ffe0000-0x000000007fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved

Of course, this is my theoretical analysis from the code perspective. If
there are other race conditions, I would be very surprised :)

> Wasn't the original issue all about a NULL-pointer de-reference while
> *locking*?

Yes.

> Note that in that kernel config [1] we have CONFIG_DEBUG_SPINLOCK=y, so
> likely we will have ALLOC_SPLIT_PTLOCKS set.
>
> [1] https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config
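Putting that explanation in code form, a rough x86-flavoured sketch
(simplified; pmd_pgtable()/pmd_populate() details vary by architecture, and
this is not verbatim kernel source):

/*
 * On x86, pmd_pgtable(pmd) boils down to pfn_to_page(pmd_pfn(pmd)).
 * With _pmd == none (all zeroes), pmd_pfn(_pmd) == 0:
 */
pgtable_t pgtable = pmd_pgtable(_pmd);	/* struct page of pfn 0 */

pmd_populate(mm, pmd, pgtable);		/* pmd now reads back as a present
					   page table pointing at pfn 0 */

/*
 * A later walker (exit_mmap -> unmap_page_range -> pte_offset_map_lock)
 * therefore sees a present pmd: __pte_offset_map() does not bail out as
 * it would for a none/bad pmd, and pte_lockptr() is taken from a page
 * whose ptl was never initialized.
 */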
On 8/5/25 6:07 PM, Baolin Wang wrote:
> [...]
> In the race condition described above, the '_pmd' value is NONE, meaning
> that when restoring the pmd entry with 'pmd_populate(mm, pmd,
> pmd_pgtable(_pmd))', the 'pmd_pgtable(_pmd)' can return a struct page
> corresponding to pfn == 0 (because the '_pmd' is NONE) to populate the
> pmd entry. Clearly, this pfn == 0 page is not a pagetable page, meaning
> the corresponding ptl lock of this page is not initialized.
> [...]

Now I understand, thank you very much for your patient explanation! And
for this patch:

Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>

Thanks!
On 05.08.25 05:54, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.
> [...]
> @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
> +	vma_start_write(vma);
>  	result = check_pmd_still_valid(mm, address, pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
> 
> -	vma_start_write(vma);
>  	anon_vma_lock_write(vma->anon_vma);
> 
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,

LGTM, I was wondering whether we should just place it next to the
mmap_write_lock() with the assumption that hugepage_vma_revalidate()
will commonly not fail.

So personally, I would move it further up.

Acked-by: David Hildenbrand <david@redhat.com>

--
Cheers,

David / dhildenb