From: Barry Song <v-songbaohua@oppo.com>
The check_pmd_still_valid() call during collapse is currently only
protected by the mmap_lock in write mode, which was sufficient when
pt_reclaim always ran under mmap_lock in read mode. However, since
madvise_dontneed can now execute under a per-VMA lock, this assumption
is no longer valid. As a result, a race condition can occur between
collapse and PT_RECLAIM, potentially leading to a kernel panic.
[ 38.151897] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] SMP KASI
[ 38.153519] KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
[ 38.154605] CPU: 0 UID: 0 PID: 721 Comm: repro Not tainted 6.16.0-next-20250801-next-2025080 #1 PREEMPT(voluntary)
[ 38.155929] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
[ 38.157418] RIP: 0010:kasan_byte_accessible+0x15/0x30
[ 38.158125] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
[ 38.160461] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
[ 38.161220] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
[ 38.162232] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: dffffc0000000003
[ 38.163176] RBP: ffff88800feef698 R08: 0000000000000001 R09: 0000000000000000
[ 38.164195] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 0000000000000018
[ 38.165189] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 0000000000000000
[ 38.166100] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
[ 38.167137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 38.167891] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
[ 38.168812] PKRU: 55555554
[ 38.169275] Call Trace:
[ 38.169647] <TASK>
[ 38.169975] ? __kasan_check_byte+0x19/0x50
[ 38.170581] lock_acquire+0xea/0x310
[ 38.171083] ? rcu_is_watching+0x19/0xc0
[ 38.171615] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 38.172343] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 38.173130] _raw_spin_lock+0x38/0x50
[ 38.173707] ? __pte_offset_map_lock+0x1a2/0x3c0
[ 38.174390] __pte_offset_map_lock+0x1a2/0x3c0
[ 38.174987] ? __pfx___pte_offset_map_lock+0x10/0x10
[ 38.175724] ? __pfx_pud_val+0x10/0x10
[ 38.176308] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 38.177183] unmap_page_range+0xb60/0x43e0
[ 38.177824] ? __pfx_unmap_page_range+0x10/0x10
[ 38.178485] ? mas_next_slot+0x133a/0x1a50
[ 38.179079] unmap_single_vma.constprop.0+0x15b/0x250
[ 38.179830] unmap_vmas+0x1fa/0x460
[ 38.180373] ? __pfx_unmap_vmas+0x10/0x10
[ 38.180994] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 38.181877] exit_mmap+0x1a2/0xb40
[ 38.182396] ? lock_release+0x14f/0x2c0
[ 38.182929] ? __pfx_exit_mmap+0x10/0x10
[ 38.183474] ? __pfx___mutex_unlock_slowpath+0x10/0x10
[ 38.184188] ? mutex_unlock+0x16/0x20
[ 38.184704] mmput+0x132/0x370
[ 38.185208] do_exit+0x7e7/0x28c0
[ 38.185682] ? __this_cpu_preempt_check+0x21/0x30
[ 38.186328] ? do_group_exit+0x1d8/0x2c0
[ 38.186873] ? __pfx_do_exit+0x10/0x10
[ 38.187401] ? __this_cpu_preempt_check+0x21/0x30
[ 38.188036] ? _raw_spin_unlock_irq+0x2c/0x60
[ 38.188634] ? lockdep_hardirqs_on+0x89/0x110
[ 38.189313] do_group_exit+0xe4/0x2c0
[ 38.189831] __x64_sys_exit_group+0x4d/0x60
[ 38.190413] x64_sys_call+0x2174/0x2180
[ 38.190935] do_syscall_64+0x6d/0x2e0
[ 38.191449] entry_SYSCALL_64_after_hwframe+0x76/0x7e
This patch moves the vma_start_write() call to precede
check_pmd_still_valid(), ensuring that the check is also properly
protected by the per-VMA lock.
Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
mm/khugepaged.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 374a6a5193a7..6b40bdfd224c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
+	vma_start_write(vma);
 	result = check_pmd_still_valid(mm, address, pmd);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
-	vma_start_write(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
--
2.39.3 (Apple Git-146)
Andrew - to be clear, we need this as a hotfix for 6.17, as this is a known bug
in rc1 right now.
On Tue, Aug 05, 2025 at 11:54:47AM +0800, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>
> [ 38.151897] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] SMP KASI
> [ 38.153519] KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
> [ 38.154605] CPU: 0 UID: 0 PID: 721 Comm: repro Not tainted 6.16.0-next-20250801-next-2025080 #1 PREEMPT(voluntary)
> [ 38.155929] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
> [ 38.157418] RIP: 0010:kasan_byte_accessible+0x15/0x30
> [ 38.158125] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
> [ 38.160461] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
> [ 38.161220] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
> [ 38.162232] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: dffffc0000000003
> [ 38.163176] RBP: ffff88800feef698 R08: 0000000000000001 R09: 0000000000000000
> [ 38.164195] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 0000000000000018
> [ 38.165189] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 0000000000000000
> [ 38.166100] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
> [ 38.167137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 38.167891] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
> [ 38.168812] PKRU: 55555554
> [ 38.169275] Call Trace:
> [ 38.169647] <TASK>
> [ 38.169975] ? __kasan_check_byte+0x19/0x50
> [ 38.170581] lock_acquire+0xea/0x310
> [ 38.171083] ? rcu_is_watching+0x19/0xc0
> [ 38.171615] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 38.172343] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> [ 38.173130] _raw_spin_lock+0x38/0x50
> [ 38.173707] ? __pte_offset_map_lock+0x1a2/0x3c0
> [ 38.174390] __pte_offset_map_lock+0x1a2/0x3c0
> [ 38.174987] ? __pfx___pte_offset_map_lock+0x10/0x10
> [ 38.175724] ? __pfx_pud_val+0x10/0x10
> [ 38.176308] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
> [ 38.177183] unmap_page_range+0xb60/0x43e0
> [ 38.177824] ? __pfx_unmap_page_range+0x10/0x10
> [ 38.178485] ? mas_next_slot+0x133a/0x1a50
> [ 38.179079] unmap_single_vma.constprop.0+0x15b/0x250
> [ 38.179830] unmap_vmas+0x1fa/0x460
> [ 38.180373] ? __pfx_unmap_vmas+0x10/0x10
> [ 38.180994] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 38.181877] exit_mmap+0x1a2/0xb40
> [ 38.182396] ? lock_release+0x14f/0x2c0
> [ 38.182929] ? __pfx_exit_mmap+0x10/0x10
> [ 38.183474] ? __pfx___mutex_unlock_slowpath+0x10/0x10
> [ 38.184188] ? mutex_unlock+0x16/0x20
> [ 38.184704] mmput+0x132/0x370
> [ 38.185208] do_exit+0x7e7/0x28c0
> [ 38.185682] ? __this_cpu_preempt_check+0x21/0x30
> [ 38.186328] ? do_group_exit+0x1d8/0x2c0
> [ 38.186873] ? __pfx_do_exit+0x10/0x10
> [ 38.187401] ? __this_cpu_preempt_check+0x21/0x30
> [ 38.188036] ? _raw_spin_unlock_irq+0x2c/0x60
> [ 38.188634] ? lockdep_hardirqs_on+0x89/0x110
> [ 38.189313] do_group_exit+0xe4/0x2c0
> [ 38.189831] __x64_sys_exit_group+0x4d/0x60
> [ 38.190413] x64_sys_call+0x2174/0x2180
> [ 38.190935] do_syscall_64+0x6d/0x2e0
> [ 38.191449] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> This patch moves the vma_start_write() call to precede
> check_pmd_still_valid(), ensuring that the check is also properly
> protected by the per-VMA lock.
>
> Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Looks good to me so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/khugepaged.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 374a6a5193a7..6b40bdfd224c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> + vma_start_write(vma);
> result = check_pmd_still_valid(mm, address, pmd);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
>
> - vma_start_write(vma);
> anon_vma_lock_write(vma->anon_vma);
>
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> --
> 2.39.3 (Apple Git-146)
>
On 2025/8/5 11:54, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>
> [ 38.151897] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] SMP KASI
> [ 38.153519] KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
> [ 38.154605] CPU: 0 UID: 0 PID: 721 Comm: repro Not tainted 6.16.0-next-20250801-next-2025080 #1 PREEMPT(voluntary)
> [ 38.155929] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
> [ 38.157418] RIP: 0010:kasan_byte_accessible+0x15/0x30
> [ 38.158125] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
> [ 38.160461] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
> [ 38.161220] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
> [ 38.162232] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: dffffc0000000003
> [ 38.163176] RBP: ffff88800feef698 R08: 0000000000000001 R09: 0000000000000000
> [ 38.164195] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 0000000000000018
> [ 38.165189] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 0000000000000000
> [ 38.166100] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
> [ 38.167137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 38.167891] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
> [ 38.168812] PKRU: 55555554
> [ 38.169275] Call Trace:
> [ 38.169647] <TASK>
> [ 38.169975] ? __kasan_check_byte+0x19/0x50
> [ 38.170581] lock_acquire+0xea/0x310
> [ 38.171083] ? rcu_is_watching+0x19/0xc0
> [ 38.171615] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 38.172343] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> [ 38.173130] _raw_spin_lock+0x38/0x50
> [ 38.173707] ? __pte_offset_map_lock+0x1a2/0x3c0
> [ 38.174390] __pte_offset_map_lock+0x1a2/0x3c0
> [ 38.174987] ? __pfx___pte_offset_map_lock+0x10/0x10
> [ 38.175724] ? __pfx_pud_val+0x10/0x10
> [ 38.176308] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
> [ 38.177183] unmap_page_range+0xb60/0x43e0
> [ 38.177824] ? __pfx_unmap_page_range+0x10/0x10
> [ 38.178485] ? mas_next_slot+0x133a/0x1a50
> [ 38.179079] unmap_single_vma.constprop.0+0x15b/0x250
> [ 38.179830] unmap_vmas+0x1fa/0x460
> [ 38.180373] ? __pfx_unmap_vmas+0x10/0x10
> [ 38.180994] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 38.181877] exit_mmap+0x1a2/0xb40
> [ 38.182396] ? lock_release+0x14f/0x2c0
> [ 38.182929] ? __pfx_exit_mmap+0x10/0x10
> [ 38.183474] ? __pfx___mutex_unlock_slowpath+0x10/0x10
> [ 38.184188] ? mutex_unlock+0x16/0x20
> [ 38.184704] mmput+0x132/0x370
> [ 38.185208] do_exit+0x7e7/0x28c0
> [ 38.185682] ? __this_cpu_preempt_check+0x21/0x30
> [ 38.186328] ? do_group_exit+0x1d8/0x2c0
> [ 38.186873] ? __pfx_do_exit+0x10/0x10
> [ 38.187401] ? __this_cpu_preempt_check+0x21/0x30
> [ 38.188036] ? _raw_spin_unlock_irq+0x2c/0x60
> [ 38.188634] ? lockdep_hardirqs_on+0x89/0x110
> [ 38.189313] do_group_exit+0xe4/0x2c0
> [ 38.189831] __x64_sys_exit_group+0x4d/0x60
> [ 38.190413] x64_sys_call+0x2174/0x2180
> [ 38.190935] do_syscall_64+0x6d/0x2e0
> [ 38.191449] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> This patch moves the vma_start_write() call to precede
> check_pmd_still_valid(), ensuring that the check is also properly
> protected by the per-VMA lock.
>
> Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Hi Barry,
On 8/5/25 11:54 AM, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.
There is indeed a race condition here. And after applying this patch, I
can no longer reproduce the problem locally (I was able to reproduce it
stably locally last night).
But I still can't figure out how this race condition causes the
following panic:

exit_mmap
--> mmap_read_lock()
    unmap_vmas()
    --> pte_offset_map_lock
        --> rcu_read_lock()
            check if the pmd entry is a PTE page
            ptl = pte_lockptr(mm, &pmdval)   <-- ptl is NULL
            spin_lock(ptl)                   <-- PANIC!!

If this PTE page is freed by pt_reclaim (via RCU), then the ptl cannot
be NULL.
The collapse holds mmap write lock, so it is impossible to be concurrent
with exit_mmap().
Confusing. :(
>
> [ 38.151897] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] SMP KASI
> [ 38.153519] KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
> [ 38.154605] CPU: 0 UID: 0 PID: 721 Comm: repro Not tainted 6.16.0-next-20250801-next-2025080 #1 PREEMPT(voluntary)
> [ 38.155929] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
> [ 38.157418] RIP: 0010:kasan_byte_accessible+0x15/0x30
> [ 38.158125] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
> [ 38.160461] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
> [ 38.161220] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
> [ 38.162232] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: dffffc0000000003
> [ 38.163176] RBP: ffff88800feef698 R08: 0000000000000001 R09: 0000000000000000
> [ 38.164195] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 0000000000000018
> [ 38.165189] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 0000000000000000
> [ 38.166100] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
> [ 38.167137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 38.167891] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
> [ 38.168812] PKRU: 55555554
> [ 38.169275] Call Trace:
> [ 38.169647] <TASK>
> [ 38.169975] ? __kasan_check_byte+0x19/0x50
> [ 38.170581] lock_acquire+0xea/0x310
> [ 38.171083] ? rcu_is_watching+0x19/0xc0
> [ 38.171615] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 38.172343] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> [ 38.173130] _raw_spin_lock+0x38/0x50
> [ 38.173707] ? __pte_offset_map_lock+0x1a2/0x3c0
> [ 38.174390] __pte_offset_map_lock+0x1a2/0x3c0
> [ 38.174987] ? __pfx___pte_offset_map_lock+0x10/0x10
> [ 38.175724] ? __pfx_pud_val+0x10/0x10
> [ 38.176308] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
> [ 38.177183] unmap_page_range+0xb60/0x43e0
> [ 38.177824] ? __pfx_unmap_page_range+0x10/0x10
> [ 38.178485] ? mas_next_slot+0x133a/0x1a50
> [ 38.179079] unmap_single_vma.constprop.0+0x15b/0x250
> [ 38.179830] unmap_vmas+0x1fa/0x460
> [ 38.180373] ? __pfx_unmap_vmas+0x10/0x10
> [ 38.180994] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 38.181877] exit_mmap+0x1a2/0xb40
> [ 38.182396] ? lock_release+0x14f/0x2c0
> [ 38.182929] ? __pfx_exit_mmap+0x10/0x10
> [ 38.183474] ? __pfx___mutex_unlock_slowpath+0x10/0x10
> [ 38.184188] ? mutex_unlock+0x16/0x20
> [ 38.184704] mmput+0x132/0x370
> [ 38.185208] do_exit+0x7e7/0x28c0
> [ 38.185682] ? __this_cpu_preempt_check+0x21/0x30
> [ 38.186328] ? do_group_exit+0x1d8/0x2c0
> [ 38.186873] ? __pfx_do_exit+0x10/0x10
> [ 38.187401] ? __this_cpu_preempt_check+0x21/0x30
> [ 38.188036] ? _raw_spin_unlock_irq+0x2c/0x60
> [ 38.188634] ? lockdep_hardirqs_on+0x89/0x110
> [ 38.189313] do_group_exit+0xe4/0x2c0
> [ 38.189831] __x64_sys_exit_group+0x4d/0x60
> [ 38.190413] x64_sys_call+0x2174/0x2180
> [ 38.190935] do_syscall_64+0x6d/0x2e0
> [ 38.191449] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> This patch moves the vma_start_write() call to precede
> check_pmd_still_valid(), ensuring that the check is also properly
> protected by the per-VMA lock.
>
> Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> mm/khugepaged.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 374a6a5193a7..6b40bdfd224c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> + vma_start_write(vma);
> result = check_pmd_still_valid(mm, address, pmd);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
>
> - vma_start_write(vma);
> anon_vma_lock_write(vma->anon_vma);
>
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
On 2025/8/5 14:42, Qi Zheng wrote:
> Hi Barry,
>
> On 8/5/25 11:54 AM, Barry Song wrote:
>> From: Barry Song <v-songbaohua@oppo.com>
>>
>> The check_pmd_still_valid() call during collapse is currently only
>> protected by the mmap_lock in write mode, which was sufficient when
>> pt_reclaim always ran under mmap_lock in read mode. However, since
>> madvise_dontneed can now execute under a per-VMA lock, this assumption
>> is no longer valid. As a result, a race condition can occur between
>> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>
> There is indeed a race condition here. And after applying this patch, I
> can no longer reproduce the problem locally (I was able to reproduce it
> stably locally last night).
>
> But I still can't figure out how this race condtion causes the
> following panic:
>
> exit_mmap
> --> mmap_read_lock()
>     unmap_vmas()
>     --> pte_offset_map_lock
>         --> rcu_read_lock()
>             check if the pmd entry is a PTE page
>             ptl = pte_lockptr(mm, &pmdval)   <-- ptl is NULL
>             spin_lock(ptl)                   <-- PANIC!!
>
> If this PTE page is freed by pt_reclaim (via RCU), then the ptl can not
> be NULL.
>
> The collapse holds mmap write lock, so it is impossible to be concurrent
> with exit_mmap().
>
> Confusing. :(
IIUC, the issue is not caused by the concurrency between exit_mmap and
collapse, but rather by the concurrency between pt_reclaim and collapse.
Before this patch, khugepaged might incorrectly restore a PTE pagetable
that had already been freed.
pt_reclaim has cleared the pmd entry and freed the PTE page table.
However, due to the race condition, check_pmd_still_valid() still passes
and continues to attempt the collapse:
_pmd = pmdp_collapse_flush(vma, address, pmd); ---> returns a none pmd
entry (the original pmd entry has been cleared)
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); ---> returns
pte == NULL
Then khugepaged will restore the old PTE pagetable with an invalid pmd
entry:
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
So when the process exits and tries to free the mapping of the process,
traversing the invalid pmd table will lead to a crash.
Barry, please correct me if I have misunderstood something.
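
In code, the failing sequence looks roughly like this (a condensed
sketch of the error path in collapse_huge_page(), with unrelated steps
elided; not the verbatim source):

	_pmd = pmdp_collapse_flush(vma, address, pmd);	/* reads back a none pmd
							 * if pt_reclaim already
							 * cleared the entry */
	...
	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
	if (!pte)
		result = SCAN_PMD_NULL;		/* the collapse will be aborted */
	...
	if (unlikely(result != SCAN_SUCCEED)) {
		...
		/*
		 * "restore" the old page table: with a none _pmd, this
		 * populates the pmd from pfn 0 rather than with the freed
		 * PTE table.
		 */
		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
		...
	}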
>> [ 38.151897] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] SMP KASI
>> [ 38.153519] KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
>> [ 38.154605] CPU: 0 UID: 0 PID: 721 Comm: repro Not tainted 6.16.0-next-20250801-next-2025080 #1 PREEMPT(voluntary)
>> [ 38.155929] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
>> [ 38.157418] RIP: 0010:kasan_byte_accessible+0x15/0x30
>> [ 38.158125] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
>> [ 38.160461] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
>> [ 38.161220] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
>> [ 38.162232] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: dffffc0000000003
>> [ 38.163176] RBP: ffff88800feef698 R08: 0000000000000001 R09: 0000000000000000
>> [ 38.164195] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 0000000000000018
>> [ 38.165189] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 0000000000000000
>> [ 38.166100] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
>> [ 38.167137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 38.167891] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
>> [ 38.168812] PKRU: 55555554
>> [ 38.169275] Call Trace:
>> [ 38.169647] <TASK>
>> [ 38.169975] ? __kasan_check_byte+0x19/0x50
>> [ 38.170581] lock_acquire+0xea/0x310
>> [ 38.171083] ? rcu_is_watching+0x19/0xc0
>> [ 38.171615] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
>> [ 38.172343] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
>> [ 38.173130] _raw_spin_lock+0x38/0x50
>> [ 38.173707] ? __pte_offset_map_lock+0x1a2/0x3c0
>> [ 38.174390] __pte_offset_map_lock+0x1a2/0x3c0
>> [ 38.174987] ? __pfx___pte_offset_map_lock+0x10/0x10
>> [ 38.175724] ? __pfx_pud_val+0x10/0x10
>> [ 38.176308] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
>> [ 38.177183] unmap_page_range+0xb60/0x43e0
>> [ 38.177824] ? __pfx_unmap_page_range+0x10/0x10
>> [ 38.178485] ? mas_next_slot+0x133a/0x1a50
>> [ 38.179079] unmap_single_vma.constprop.0+0x15b/0x250
>> [ 38.179830] unmap_vmas+0x1fa/0x460
>> [ 38.180373] ? __pfx_unmap_vmas+0x10/0x10
>> [ 38.180994] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
>> [ 38.181877] exit_mmap+0x1a2/0xb40
>> [ 38.182396] ? lock_release+0x14f/0x2c0
>> [ 38.182929] ? __pfx_exit_mmap+0x10/0x10
>> [ 38.183474] ? __pfx___mutex_unlock_slowpath+0x10/0x10
>> [ 38.184188] ? mutex_unlock+0x16/0x20
>> [ 38.184704] mmput+0x132/0x370
>> [ 38.185208] do_exit+0x7e7/0x28c0
>> [ 38.185682] ? __this_cpu_preempt_check+0x21/0x30
>> [ 38.186328] ? do_group_exit+0x1d8/0x2c0
>> [ 38.186873] ? __pfx_do_exit+0x10/0x10
>> [ 38.187401] ? __this_cpu_preempt_check+0x21/0x30
>> [ 38.188036] ? _raw_spin_unlock_irq+0x2c/0x60
>> [ 38.188634] ? lockdep_hardirqs_on+0x89/0x110
>> [ 38.189313] do_group_exit+0xe4/0x2c0
>> [ 38.189831] __x64_sys_exit_group+0x4d/0x60
>> [ 38.190413] x64_sys_call+0x2174/0x2180
>> [ 38.190935] do_syscall_64+0x6d/0x2e0
>> [ 38.191449] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> This patch moves the vma_start_write() call to precede
>> check_pmd_still_valid(), ensuring that the check is also properly
>> protected by the per-VMA lock.
>>
>> Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
>> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
>> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
>> Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Jann Horn <jannh@google.com>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Lokesh Gidra <lokeshgidra@google.com>
>> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
>> Cc: Lance Yang <ioworker0@gmail.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> ---
>> mm/khugepaged.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 374a6a5193a7..6b40bdfd224c 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct
>> *mm, unsigned long address,
>> if (result != SCAN_SUCCEED)
>> goto out_up_write;
>> /* check if the pmd is still valid */
>> + vma_start_write(vma);
>> result = check_pmd_still_valid(mm, address, pmd);
>> if (result != SCAN_SUCCEED)
>> goto out_up_write;
>> - vma_start_write(vma);
>> anon_vma_lock_write(vma->anon_vma);
>> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
Hi Baolin,
On 8/5/25 3:53 PM, Baolin Wang wrote:
>
>
> On 2025/8/5 14:42, Qi Zheng wrote:
>> Hi Barry,
>>
>> On 8/5/25 11:54 AM, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> The check_pmd_still_valid() call during collapse is currently only
>>> protected by the mmap_lock in write mode, which was sufficient when
>>> pt_reclaim always ran under mmap_lock in read mode. However, since
>>> madvise_dontneed can now execute under a per-VMA lock, this assumption
>>> is no longer valid. As a result, a race condition can occur between
>>> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>>
>> There is indeed a race condition here. And after applying this patch, I
>> can no longer reproduce the problem locally (I was able to reproduce it
>> stably locally last night).
>>
>> But I still can't figure out how this race condtion causes the
>> following panic:
>>
>> exit_mmap
>> --> mmap_read_lock()
>>     unmap_vmas()
>>     --> pte_offset_map_lock
>>         --> rcu_read_lock()
>>             check if the pmd entry is a PTE page
>>             ptl = pte_lockptr(mm, &pmdval)   <-- ptl is NULL
>>             spin_lock(ptl)                   <-- PANIC!!
>>
>> If this PTE page is freed by pt_reclaim (via RCU), then the ptl can
>> not be NULL.
>>
>> The collapse holds mmap write lock, so it is impossible to be concurrent
>> with exit_mmap().
>>
>> Confusing. :(
>
> IIUC, the issue is not caused by the concurrency between exit_mmap and
> collapse, but rather by the concurrency between pt_reclaim and collapse.
>
> Before this patch, khugepaged might incorrectly restore a PTE pagetable
> that had already been freed.
>
> pt_reclaim has cleared the pmd entry and freed the PTE page table.
> However, due to the race condition, check_pmd_still_valid() still passes
> and continues to attempt the collapse:
>
> _pmd = pmdp_collapse_flush(vma, address, pmd); ---> returns a none pmd
> entry (the original pmd entry has been cleared)
>
> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); ---> returns
> pte == NULL
>
> Then khugepaged will restore the old PTE pagetable with an invalid pmd
> entry:
>
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>
> So when the process exits and trys to free the mapping of the process,
> traversing the invalid pmd table will lead to a crash.
CPU0                                  CPU1
====                                  ====

                                      collapse
                                      --> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
                                      mmap_write_unlock
exit_mmap
--> hold mmap lock
    __pte_offset_map_lock
    --> pte = __pte_offset_map(pmd, addr, &pmdval);
        if (unlikely(!pte))
            return pte;   <-- will return

IIUC, in this case, if we get an invalid pmd entry, we will return
directly instead of causing a crash?
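
For reference, the lookup path under discussion looks roughly like this
(paraphrased from __pte_offset_map_lock() in mm/pgtable-generic.c,
simplified):

	pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
				     unsigned long addr, spinlock_t **ptlp)
	{
		spinlock_t *ptl;
		pmd_t pmdval;
		pte_t *pte;
	again:
		/* NULL if the pmd entry does not look like a PTE page table */
		pte = __pte_offset_map(pmd, addr, &pmdval);
		if (unlikely(!pte))
			return pte;
		/* with split PTE locks, the ptl lives in the PTE page's ptdesc */
		ptl = pte_lockptr(mm, &pmdval);
		spin_lock(ptl);
		if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
			*ptlp = ptl;
			return pte;
		}
		pte_unmap_unlock(pte, ptl);
		goto again;
	}

So IIUC the early return is only taken when the pmd entry does not look
like a PTE table at all; per the stack trace above, the oops fired under
spin_lock() inside __pte_offset_map_lock(), i.e. after that check.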
>
> Barry, please correct me if I have misunderstood something.
>
On 2025/8/5 16:17, Qi Zheng wrote:
> Hi Baolin,
>
> On 8/5/25 3:53 PM, Baolin Wang wrote:
>>
>> [...]
>>
>> So when the process exits and trys to free the mapping of the process,
>> traversing the invalid pmd table will lead to a crash.
>
> CPU0                                  CPU1
> ====                                  ====
>
>                                       collapse
>                                       --> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>                                       mmap_write_unlock
> exit_mmap
> --> hold mmap lock
>     __pte_offset_map_lock
>     --> pte = __pte_offset_map(pmd, addr, &pmdval);
>         if (unlikely(!pte))
>             return pte;   <-- will return

__pte_offset_map() might not return NULL? Because the 'pmd_populate(mm,
pmd, pmd_pgtable(_pmd))' could populate a valid page (although the
'_pmd' entry is NONE), but it is not the original pagetable page.

> IIUC, in this case, if we get an invalid pmd entry, we will retrun
> directly instead of causing a crash?
>
>> Barry, please correct me if I have misunderstood something.
On 8/5/25 4:56 PM, Baolin Wang wrote:
>
>
> On 2025/8/5 16:17, Qi Zheng wrote:
>> Hi Baolin,
>>
>> On 8/5/25 3:53 PM, Baolin Wang wrote:
>>>
>>>
>>> On 2025/8/5 14:42, Qi Zheng wrote:
>>>> Hi Barry,
>>>>
>>>> On 8/5/25 11:54 AM, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> The check_pmd_still_valid() call during collapse is currently only
>>>>> protected by the mmap_lock in write mode, which was sufficient when
>>>>> pt_reclaim always ran under mmap_lock in read mode. However, since
>>>>> madvise_dontneed can now execute under a per-VMA lock, this assumption
>>>>> is no longer valid. As a result, a race condition can occur between
>>>>> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>>>>
>>>> There is indeed a race condition here. And after applying this patch, I
>>>> can no longer reproduce the problem locally (I was able to reproduce it
>>>> stably locally last night).
>>>>
>>>> But I still can't figure out how this race condtion causes the
>>>> following panic:
>>>>
>>>> exit_mmap
>>>> --> mmap_read_lock()
>>>>     unmap_vmas()
>>>>     --> pte_offset_map_lock
>>>>         --> rcu_read_lock()
>>>>             check if the pmd entry is a PTE page
>>>>             ptl = pte_lockptr(mm, &pmdval)   <-- ptl is NULL
>>>>             spin_lock(ptl)                   <-- PANIC!!
>>>>
>>>> If this PTE page is freed by pt_reclaim (via RCU), then the ptl can
>>>> not be NULL.
>>>>
>>>> The collapse holds mmap write lock, so it is impossible to be
>>>> concurrent
>>>> with exit_mmap().
>>>>
>>>> Confusing. :(
>>>
>>> IIUC, the issue is not caused by the concurrency between exit_mmap
>>> and collapse, but rather by the concurrency between pt_reclaim and
>>> collapse.
>>>
>>> Before this patch, khugepaged might incorrectly restore a PTE
>>> pagetable that had already been freed.
>>>
>>> pt_reclaim has cleared the pmd entry and freed the PTE page table.
>>> However, due to the race condition, check_pmd_still_valid() still
>>> passes and continues to attempt the collapse:
>>>
>>> _pmd = pmdp_collapse_flush(vma, address, pmd); ---> returns a none
>>> pmd entry (the original pmd entry has been cleared)
>>>
>>> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); ---> returns
>>> pte == NULL
>>>
>>> Then khugepaged will restore the old PTE pagetable with an invalid
>>> pmd entry:
>>>
>>> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>>
>>> So when the process exits and trys to free the mapping of the
>>> process, traversing the invalid pmd table will lead to a crash.
>>
>> CPU0                                  CPU1
>> ====                                  ====
>>
>>                                       collapse
>>                                       --> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>                                       mmap_write_unlock
>> exit_mmap
>> --> hold mmap lock
>>     __pte_offset_map_lock
>>     --> pte = __pte_offset_map(pmd, addr, &pmdval);
>>         if (unlikely(!pte))
>>             return pte;   <-- will return
>
> __pte_offset_map() might not return NULL? Because the 'pmd_populate(mm,
> pmd, pmd_pgtable(_pmd))' could populate a valid page (although the
> '_pmd' entry is NONE), but it is not the original pagetable page.
CPU0                                  CPU1
====                                  ====

collapse
--> check_pmd_still_valid
                                      vma read lock
                                      pt_reclaim clear the pmd entry and will
                                      free the PTE page (via RCU)
                                      vma read unlock

vma write lock
_pmd = pmdp_collapse_flush(vma, address, pmd) <-- pmd_none(_pmd)
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); <-- pte is NULL
pmd_populate(mm, pmd, pmd_pgtable(_pmd)); <-- populate a valid page?
vma write unlock
The above is the concurrent scenario you mentioned, right?
What types of this 'valid page' could be? If __pte_offset_map() returns
non-NULL, then it is a PTE page. Even if it is not the original one, it
should not cause panic. Did I miss some key information? :(
>
>> IIUC, in this case, if we get an invalid pmd entry, we will retrun
>> directly instead of causing a crash?
>>
>>>
>>> Barry, please correct me if I have misunderstood something.
>>>
>>
>
On 05.08.25 11:30, Qi Zheng wrote:
>
> [...]
>
> CPU0                                  CPU1
> ====                                  ====
>
> collapse
> --> check_pmd_still_valid
>                                       vma read lock
>                                       pt_reclaim clear the pmd entry and will
>                                       free the PTE page (via RCU)
>                                       vma read unlock
>
> vma write lock
> _pmd = pmdp_collapse_flush(vma, address, pmd) <-- pmd_none(_pmd)
> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); <-- pte is NULL
> pmd_populate(mm, pmd, pmd_pgtable(_pmd)); <-- populate a valid page?
> vma write unlock
>
> The above is the concurrent scenario you mentioned, right?
>
> What types of this 'valid page' could be? If __pte_offset_map() returns
> non-NULL, then it is a PTE page. Even if it is not the original one, it
> should not cause panic. Did I miss some key information? :(

Wasn't the original issue all about a NULL-pointer de-reference while
*locking*?

Note that in that kernel config [1] we have CONFIG_DEBUG_SPINLOCK=y, so
likely we will have ALLOC_SPLIT_PTLOCKS set.

[1] https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config

--
Cheers,

David / dhildenb
On 2025/8/5 17:50, David Hildenbrand wrote:
> On 05.08.25 11:30, Qi Zheng wrote:
>
> [...]
>
>> CPU0                                  CPU1
>> ====                                  ====
>>
>> collapse
>> --> check_pmd_still_valid
>>                                       vma read lock
>>                                       pt_reclaim clear the pmd entry and will
>>                                       free the PTE page (via RCU)
>>                                       vma read unlock
>>
>> vma write lock
>> _pmd = pmdp_collapse_flush(vma, address, pmd) <-- pmd_none(_pmd)
>> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); <-- pte is NULL
>> pmd_populate(mm, pmd, pmd_pgtable(_pmd)); <-- populate a valid page?
>> vma write unlock
>>
>> The above is the concurrent scenario you mentioned, right?

Yes.

>> What types of this 'valid page' could be? If __pte_offset_map() returns
>> non-NULL, then it is a PTE page. Even if it is not the original one, it
>> should not cause panic. Did I miss some key information? :(

Sorry for not being clear. Let me try again.

In the race condition described above, the '_pmd' value is NONE, meaning
that when restoring the pmd entry with 'pmd_populate(mm, pmd,
pmd_pgtable(_pmd))', the 'pmd_pgtable(_pmd)' can return a struct page
corresponding to pfn == 0 (because the '_pmd' is NONE) to populate the
pmd entry. Clearly, this pfn == 0 page is not a pagetable page, meaning
the corresponding ptl lock of this page is not initialized.

Additionally, from the boot dmesg, I can see that the BIOS reports an
address range with pfn == 0, indicating that there is a struct page
initialized for pfn == 0 (possibly a reserved page):

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffdffff] usable
[    0.000000] BIOS-e820: [mem 0x000000007ffe0000-0x000000007fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved

Of course, this is my theoretical analysis from the code perspective. If
there are other race conditions, I would be very surprised :)

> Wasn't the original issue all about a NULL-pointer de-reference while
> *locking*?

Yes.

> Note that in that kernel config [1] we have CONFIG_DEBUG_SPINLOCK=y, so
> likely we will have ALLOC_SPLIT_PTLOCKS set.
>
> [1] https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config
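
To spell out the pfn == 0 reasoning (a sketch assuming x86-64 with split
PTE locks; the exact macro spellings vary by architecture):

	/*
	 * After the race, pmdp_collapse_flush() read back a none pmd,
	 * so pmd_none(_pmd) is true and pmd_pfn(_pmd) is 0.
	 */
	pmd_populate(mm, pmd, pmd_pgtable(_pmd));
	/*
	 * pmd_pgtable(_pmd) boils down to pfn_to_page(pmd_pfn(_pmd)),
	 * i.e. the reserved pfn-0 page, which was never set up as a page
	 * table by pagetable_pte_ctor(). With ALLOC_SPLIT_PTLOCKS its ptl
	 * pointer is uninitialized/NULL, so a later pte_lockptr() +
	 * spin_lock() on this entry blows up, plausibly matching the
	 * null-ptr-deref in the 0x18-0x1f range reported above.
	 */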
On 8/5/25 6:07 PM, Baolin Wang wrote:
>
> [...]
>
> Sorry for not being clear. Let me try again.
>
> In the race condition described above, the '_pmd' value is NONE, meaning
> that when restoring the pmd entry with 'pmd_populate(mm, pmd,
> pmd_pgtable(_pmd))', the 'pmd_pgtable(_pmd)' can return a struct page
> corresponding to pfn == 0 (because the '_pmd' is NONE) to populate the
> pmd entry. Clearly, this pfn == 0 page is not a pagetable page, meaning
> the corresponding ptl lock of this page is not initialized.
>
> Additionally, from the boot dmesg, I can see that the BIOS reports an
> address range with pfn == 0, indicating that there is a struct page
> initialized for pfn == 0 (possibly a reserved page):
>
> [...]

Now I understand, thank you very much for your patient explanation!

And for this patch:

Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>

Thanks!

> Of course, this is my theoretical analysis from the code perspective.
> If there are other race conditions, I would be very surprised :)
>
>> Wasn't the original issue all about a NULL-pointer de-reference while
>> *locking*?
>
> Yes.
>
>> Note that in that kernel config [1] we have CONFIG_DEBUG_SPINLOCK=y,
>> so likely we will have ALLOC_SPLIT_PTLOCKS set.
>>
>> [1] https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config
On 05.08.25 05:54, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>
> [ 38.151897] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] SMP KASI
> [ 38.153519] KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
> [ 38.154605] CPU: 0 UID: 0 PID: 721 Comm: repro Not tainted 6.16.0-next-20250801-next-2025080 #1 PREEMPT(voluntary)
> [ 38.155929] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
> [ 38.157418] RIP: 0010:kasan_byte_accessible+0x15/0x30
> [ 38.158125] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
> [ 38.160461] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
> [ 38.161220] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
> [ 38.162232] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: dffffc0000000003
> [ 38.163176] RBP: ffff88800feef698 R08: 0000000000000001 R09: 0000000000000000
> [ 38.164195] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 0000000000000018
> [ 38.165189] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 0000000000000000
> [ 38.166100] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
> [ 38.167137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 38.167891] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
> [ 38.168812] PKRU: 55555554
> [ 38.169275] Call Trace:
> [ 38.169647] <TASK>
> [ 38.169975] ? __kasan_check_byte+0x19/0x50
> [ 38.170581] lock_acquire+0xea/0x310
> [ 38.171083] ? rcu_is_watching+0x19/0xc0
> [ 38.171615] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 38.172343] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> [ 38.173130] _raw_spin_lock+0x38/0x50
> [ 38.173707] ? __pte_offset_map_lock+0x1a2/0x3c0
> [ 38.174390] __pte_offset_map_lock+0x1a2/0x3c0
> [ 38.174987] ? __pfx___pte_offset_map_lock+0x10/0x10
> [ 38.175724] ? __pfx_pud_val+0x10/0x10
> [ 38.176308] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
> [ 38.177183] unmap_page_range+0xb60/0x43e0
> [ 38.177824] ? __pfx_unmap_page_range+0x10/0x10
> [ 38.178485] ? mas_next_slot+0x133a/0x1a50
> [ 38.179079] unmap_single_vma.constprop.0+0x15b/0x250
> [ 38.179830] unmap_vmas+0x1fa/0x460
> [ 38.180373] ? __pfx_unmap_vmas+0x10/0x10
> [ 38.180994] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 38.181877] exit_mmap+0x1a2/0xb40
> [ 38.182396] ? lock_release+0x14f/0x2c0
> [ 38.182929] ? __pfx_exit_mmap+0x10/0x10
> [ 38.183474] ? __pfx___mutex_unlock_slowpath+0x10/0x10
> [ 38.184188] ? mutex_unlock+0x16/0x20
> [ 38.184704] mmput+0x132/0x370
> [ 38.185208] do_exit+0x7e7/0x28c0
> [ 38.185682] ? __this_cpu_preempt_check+0x21/0x30
> [ 38.186328] ? do_group_exit+0x1d8/0x2c0
> [ 38.186873] ? __pfx_do_exit+0x10/0x10
> [ 38.187401] ? __this_cpu_preempt_check+0x21/0x30
> [ 38.188036] ? _raw_spin_unlock_irq+0x2c/0x60
> [ 38.188634] ? lockdep_hardirqs_on+0x89/0x110
> [ 38.189313] do_group_exit+0xe4/0x2c0
> [ 38.189831] __x64_sys_exit_group+0x4d/0x60
> [ 38.190413] x64_sys_call+0x2174/0x2180
> [ 38.190935] do_syscall_64+0x6d/0x2e0
> [ 38.191449] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> This patch moves the vma_start_write() call to precede
> check_pmd_still_valid(), ensuring that the check is also properly
> protected by the per-VMA lock.
>
> Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> mm/khugepaged.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 374a6a5193a7..6b40bdfd224c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> + vma_start_write(vma);
> result = check_pmd_still_valid(mm, address, pmd);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
>
> - vma_start_write(vma);
> anon_vma_lock_write(vma->anon_vma);
>
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
LGTM, I was wondering whether we should just place it next to the
mmap_write_lock() with the assumption that hugepage_vma_revalidate()
will commonly not fail.
So personally, I would move it further up.
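
I.e. something like the following (a rough sketch of the
collapse_huge_page() flow, assuming hugepage_vma_revalidate() is what
(re)produces vma here; not a tested patch):

	mmap_write_lock(mm);
	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
	if (result != SCAN_SUCCEED)
		goto out_up_write;
	/* take the VMA write lock as early as possible */
	vma_start_write(vma);

	result = find_pmd_or_thp_or_none(mm, address, &pmd);
	if (result != SCAN_SUCCEED)
		goto out_up_write;
	/* check if the pmd is still valid */
	result = check_pmd_still_valid(mm, address, pmd);
	...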
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb