From: Barry Song <v-songbaohua@oppo.com>
The check_pmd_still_valid() call during collapse is currently only
protected by the mmap_lock in write mode, which was sufficient when
pt_reclaim always ran under mmap_lock in read mode. However, since
madvise_dontneed can now execute under a per-VMA lock, this assumption
is no longer valid. As a result, a race condition can occur between
collapse and PT_RECLAIM, potentially leading to a kernel panic.
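The window is easiest to see from the pre-fix shape of collapse_huge_page();
below is a condensed sketch (names as in mm/khugepaged.c; arguments,
surrounding locking, and error handling elided), not the verbatim function:

mmap_write_lock(mm);
result = hugepage_vma_revalidate(...);		/* arguments elided */
if (result != SCAN_SUCCEED)
	goto out_up_write;
/* check if the pmd is still valid */
result = check_pmd_still_valid(mm, address, pmd);
if (result != SCAN_SUCCEED)
	goto out_up_write;
/*
 * Window: the mmap write lock alone no longer excludes
 * madvise(MADV_DONTNEED), which may run under the per-VMA lock and let
 * pt_reclaim clear the pmd and free the PTE table right here,
 * invalidating the check above.
 */
vma_start_write(vma);	/* per-VMA-lock readers excluded only from here on */
anon_vma_lock_write(vma->anon_vma);

The resulting panic: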
[ 38.151897] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] SMP KASI
[ 38.153519] KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
[ 38.154605] CPU: 0 UID: 0 PID: 721 Comm: repro Not tainted 6.16.0-next-20250801-next-2025080 #1 PREEMPT(voluntary)
[ 38.155929] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
[ 38.157418] RIP: 0010:kasan_byte_accessible+0x15/0x30
[ 38.158125] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
[ 38.160461] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
[ 38.161220] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
[ 38.162232] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: dffffc0000000003
[ 38.163176] RBP: ffff88800feef698 R08: 0000000000000001 R09: 0000000000000000
[ 38.164195] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 0000000000000018
[ 38.165189] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 0000000000000000
[ 38.166100] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
[ 38.167137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 38.167891] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
[ 38.168812] PKRU: 55555554
[ 38.169275] Call Trace:
[ 38.169647] <TASK>
[ 38.169975] ? __kasan_check_byte+0x19/0x50
[ 38.170581] lock_acquire+0xea/0x310
[ 38.171083] ? rcu_is_watching+0x19/0xc0
[ 38.171615] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 38.172343] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 38.173130] _raw_spin_lock+0x38/0x50
[ 38.173707] ? __pte_offset_map_lock+0x1a2/0x3c0
[ 38.174390] __pte_offset_map_lock+0x1a2/0x3c0
[ 38.174987] ? __pfx___pte_offset_map_lock+0x10/0x10
[ 38.175724] ? __pfx_pud_val+0x10/0x10
[ 38.176308] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 38.177183] unmap_page_range+0xb60/0x43e0
[ 38.177824] ? __pfx_unmap_page_range+0x10/0x10
[ 38.178485] ? mas_next_slot+0x133a/0x1a50
[ 38.179079] unmap_single_vma.constprop.0+0x15b/0x250
[ 38.179830] unmap_vmas+0x1fa/0x460
[ 38.180373] ? __pfx_unmap_vmas+0x10/0x10
[ 38.180994] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 38.181877] exit_mmap+0x1a2/0xb40
[ 38.182396] ? lock_release+0x14f/0x2c0
[ 38.182929] ? __pfx_exit_mmap+0x10/0x10
[ 38.183474] ? __pfx___mutex_unlock_slowpath+0x10/0x10
[ 38.184188] ? mutex_unlock+0x16/0x20
[ 38.184704] mmput+0x132/0x370
[ 38.185208] do_exit+0x7e7/0x28c0
[ 38.185682] ? __this_cpu_preempt_check+0x21/0x30
[ 38.186328] ? do_group_exit+0x1d8/0x2c0
[ 38.186873] ? __pfx_do_exit+0x10/0x10
[ 38.187401] ? __this_cpu_preempt_check+0x21/0x30
[ 38.188036] ? _raw_spin_unlock_irq+0x2c/0x60
[ 38.188634] ? lockdep_hardirqs_on+0x89/0x110
[ 38.189313] do_group_exit+0xe4/0x2c0
[ 38.189831] __x64_sys_exit_group+0x4d/0x60
[ 38.190413] x64_sys_call+0x2174/0x2180
[ 38.190935] do_syscall_64+0x6d/0x2e0
[ 38.191449] entry_SYSCALL_64_after_hwframe+0x76/0x7e
This patch moves the vma_start_write() call to precede
check_pmd_still_valid(), ensuring that the check is also properly
protected by the per-VMA lock.
Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
mm/khugepaged.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 374a6a5193a7..6b40bdfd224c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
+	vma_start_write(vma);
 	result = check_pmd_still_valid(mm, address, pmd);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
-	vma_start_write(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
--
2.39.3 (Apple Git-146)
Andrew - to be clear, we need this as a hotfix for 6.17, as this is a
known bug in rc1 right now.

On Tue, Aug 05, 2025 at 11:54:47AM +0800, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.
> [...]

Looks good to me so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
On 2025/8/5 11:54, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. [...]

LGTM.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Hi Barry,

On 8/5/25 11:54 AM, Barry Song wrote:
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.

There is indeed a race condition here. And after applying this patch, I
can no longer reproduce the problem locally (I was able to reproduce it
stably locally last night).

But I still can't figure out how this race condition causes the
following panic:

exit_mmap
--> mmap_read_lock()
    unmap_vmas()
    --> pte_offset_map_lock
        --> rcu_read_lock()
            check if the pmd entry is a PTE page
            ptl = pte_lockptr(mm, &pmdval) <-- ptl is NULL
            spin_lock(ptl) <-- PANIC!!

If this PTE page is freed by pt_reclaim (via RCU), then the ptl cannot
be NULL.

The collapse holds the mmap write lock, so it is impossible for it to
run concurrently with exit_mmap().

Confusing. :(
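For reference, the sequence Qi traces above is __pte_offset_map_lock(); a
condensed version (based on mm/pgtable-generic.c, slightly simplified, not
verbatim) looks like this:

pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
			     unsigned long addr, spinlock_t **ptlp)
{
	spinlock_t *ptl;
	pmd_t pmdval;
	pte_t *pte;
again:
	pte = __pte_offset_map(pmd, addr, &pmdval);
	if (unlikely(!pte))		/* pmd none/bad: bail out, no crash */
		return pte;
	ptl = pte_lockptr(mm, &pmdval);	/* NULL if the page the pmd points at
					   was never set up as a PTE table */
	spin_lock(ptl);			/* <-- the crash site in the report */
	if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		*ptlp = ptl;
		return pte;
	}
	pte_unmap_unlock(pte, ptl);
	goto again;
}

So the puzzle Qi raises is real: a crash here needs a pmd that looks valid
enough to get past __pte_offset_map() yet yields a bogus ptl, which the rest
of the thread pins down.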
On 2025/8/5 14:42, Qi Zheng wrote:
> There is indeed a race condition here. And after applying this patch, I
> can no longer reproduce the problem locally (I was able to reproduce it
> stably locally last night).
>
> But I still can't figure out how this race condition causes the
> following panic:
> [...]

IIUC, the issue is not caused by the concurrency between exit_mmap and
collapse, but rather by the concurrency between pt_reclaim and collapse.

Before this patch, khugepaged might incorrectly restore a PTE pagetable
that had already been freed.

pt_reclaim has cleared the pmd entry and freed the PTE page table.
However, due to the race condition, check_pmd_still_valid() still passes
and continues to attempt the collapse:

_pmd = pmdp_collapse_flush(vma, address, pmd); ---> returns a none pmd
entry (the original pmd entry has been cleared)

pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); ---> returns
pte == NULL

Then khugepaged will restore the old PTE pagetable with an invalid pmd
entry:

pmd_populate(mm, pmd, pmd_pgtable(_pmd));

So when the process exits and tries to free the mappings of the process,
traversing the invalid pmd table will lead to a crash.

Barry, please correct me if I have misunderstood something.
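A condensed sketch of that failing sequence (following the names in
mm/khugepaged.c; the locking around the abort path and the exact SCAN_*
result are elided, so treat this as illustration rather than the verbatim
error path):

_pmd = pmdp_collapse_flush(vma, address, pmd);	/* reads back as none:
						   pt_reclaim already
						   cleared the pmd */

pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (!pte) {
	/*
	 * Mapping the PTE table failed, and the abort path tries to
	 * restore the "old" table from _pmd.  But _pmd is none, so
	 * pmd_pgtable(_pmd) does not point at the freed PTE page -- it
	 * resolves to an unrelated page (pfn 0, as worked out later in
	 * this thread).
	 */
	pmd_populate(mm, pmd, pmd_pgtable(_pmd));
	goto out_up_write;	/* collapse aborted with an error result */
}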
Hi Baolin,

On 8/5/25 3:53 PM, Baolin Wang wrote:
> IIUC, the issue is not caused by the concurrency between exit_mmap and
> collapse, but rather by the concurrency between pt_reclaim and collapse.
> [...]
> Then khugepaged will restore the old PTE pagetable with an invalid pmd
> entry:
>
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>
> So when the process exits and tries to free the mappings of the
> process, traversing the invalid pmd table will lead to a crash.

CPU0                                    CPU1
====                                    ====

collapse
--> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
    mmap_write_unlock
                                        exit_mmap
                                        --> hold mmap lock
                                            __pte_offset_map_lock
                                            --> pte = __pte_offset_map(pmd, addr, &pmdval);
                                                if (unlikely(!pte))
                                                    return pte; <-- will return

IIUC, in this case, if we get an invalid pmd entry, we will return
directly instead of causing a crash?
On 2025/8/5 16:17, Qi Zheng wrote:
> CPU0                                    CPU1
> ====                                    ====
>
> collapse
> --> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>     mmap_write_unlock
>                                         exit_mmap
>                                         --> hold mmap lock
>                                             __pte_offset_map_lock
>                                             --> pte = __pte_offset_map(pmd, addr, &pmdval);
>                                                 if (unlikely(!pte))
>                                                     return pte; <-- will return

__pte_offset_map() might not return NULL? Because the 'pmd_populate(mm,
pmd, pmd_pgtable(_pmd))' could populate a valid page (although the
'_pmd' entry is NONE), but it is not the original pagetable page.

> IIUC, in this case, if we get an invalid pmd entry, we will return
> directly instead of causing a crash?
On 8/5/25 4:56 PM, Baolin Wang wrote:
> __pte_offset_map() might not return NULL? Because the 'pmd_populate(mm,
> pmd, pmd_pgtable(_pmd))' could populate a valid page (although the
> '_pmd' entry is NONE), but it is not the original pagetable page.

CPU0                                    CPU1
====                                    ====

collapse
--> check_pmd_still_valid
                                        vma read lock
                                        pt_reclaim clears the pmd entry
                                        and will free the PTE page (via RCU)
                                        vma read unlock

    vma write lock
    _pmd = pmdp_collapse_flush(vma, address, pmd) <-- pmd_none(_pmd)
    pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); <-- pte is NULL
    pmd_populate(mm, pmd, pmd_pgtable(_pmd)); <-- populate a valid page?
    vma write unlock

The above is the concurrent scenario you mentioned, right?

What types of this 'valid page' could be? If __pte_offset_map() returns
non-NULL, then it is a PTE page. Even if it is not the original one, it
should not cause a panic. Did I miss some key information? :(
On 05.08.25 11:30, Qi Zheng wrote:
> What types of this 'valid page' could be? If __pte_offset_map() returns
> non-NULL, then it is a PTE page. Even if it is not the original one, it
> should not cause a panic. Did I miss some key information? :(

Wasn't the original issue all about a NULL-pointer de-reference while
*locking*?

Note that in that kernel config [1] we have CONFIG_DEBUG_SPINLOCK=y, so
likely we will have ALLOC_SPLIT_PTLOCKS set.

[1] https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config

--
Cheers,

David / dhildenb
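To make the ALLOC_SPLIT_PTLOCKS point concrete: with split ptlocks allocated
out of line (the case when a debug-bloated spinlock_t no longer fits inline,
e.g. with CONFIG_DEBUG_SPINLOCK), the PTE-table lock is reached through a
pointer. A simplified sketch modeled on the include/linux/mm.h helpers, not
verbatim:

static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
{
	/* set only when ptlock_alloc() initialized this page as a PTE table */
	return ptdesc->ptl;
}

static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
}

For a page never constituted as a page table, ptdesc->ptl was never
allocated, so pte_lockptr() hands spin_lock() a NULL (or garbage) pointer --
consistent with the KASAN null-ptr-deref in the 0x18-0x1f range, i.e. a
field inside spinlock_t at a small offset from 0.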
On 2025/8/5 17:50, David Hildenbrand wrote:
> On 05.08.25 11:30, Qi Zheng wrote:
>> The above is the concurrent scenario you mentioned, right?

Yes.

>> What types of this 'valid page' could be? If __pte_offset_map() returns
>> non-NULL, then it is a PTE page. Even if it is not the original one, it
>> should not cause a panic. Did I miss some key information? :(

Sorry for not being clear. Let me try again.

In the race condition described above, the '_pmd' value is NONE, meaning
that when restoring the pmd entry with 'pmd_populate(mm, pmd,
pmd_pgtable(_pmd))', the 'pmd_pgtable(_pmd)' can return a struct page
corresponding to pfn == 0 (because the '_pmd' is NONE) to populate the
pmd entry. Clearly, this pfn == 0 page is not a pagetable page, meaning
the corresponding ptl lock of this page is not initialized.

Additionally, from the boot dmesg, I can see that the BIOS reports an
address range with pfn == 0, indicating that there is a struct page
initialized for pfn == 0 (possibly a reserved page):

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffdffff] usable
[    0.000000] BIOS-e820: [mem 0x000000007ffe0000-0x000000007fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved

Of course, this is my theoretical analysis from the code perspective. If
there are other race conditions, I would be very surprised :)

> Wasn't the original issue all about a NULL-pointer de-reference while
> *locking*?

Yes.

> Note that in that kernel config [1] we have CONFIG_DEBUG_SPINLOCK=y, so
> likely we will have ALLOC_SPLIT_PTLOCKS set.
>
> [1] https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config
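Putting that explanation in code form, a rough x86-flavoured sketch
(simplified; pmd_pgtable()/pmd_populate() details vary by architecture, and
this is not verbatim kernel source):

/*
 * On x86, pmd_pgtable(pmd) boils down to pfn_to_page(pmd_pfn(pmd)).
 * With _pmd == none (all zeroes), pmd_pfn(_pmd) == 0:
 */
pgtable_t pgtable = pmd_pgtable(_pmd);	/* struct page of pfn 0 */

pmd_populate(mm, pmd, pgtable);		/* pmd now reads back as a present
					   page table pointing at pfn 0 */

/*
 * A later walker (exit_mmap -> unmap_page_range -> pte_offset_map_lock)
 * therefore sees a present pmd: __pte_offset_map() does not bail out as
 * it would for a none/bad pmd, and pte_lockptr() is taken from a page
 * whose ptl was never initialized.
 */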
On 8/5/25 6:07 PM, Baolin Wang wrote:
> [...]
> In the race condition described above, the '_pmd' value is NONE, meaning
> that when restoring the pmd entry with 'pmd_populate(mm, pmd,
> pmd_pgtable(_pmd))', the 'pmd_pgtable(_pmd)' can return a struct page
> corresponding to pfn == 0 (because the '_pmd' is NONE) to populate the
> pmd entry. Clearly, this pfn == 0 page is not a pagetable page, meaning
> the corresponding ptl lock of this page is not initialized.
> [...]

Now I understand, thank you very much for your patient explanation! And
for this patch:

Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>

Thanks!
On 05.08.25 05:54, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.
> [...]
> @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
> +	vma_start_write(vma);
>  	result = check_pmd_still_valid(mm, address, pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
> 
> -	vma_start_write(vma);
>  	anon_vma_lock_write(vma->anon_vma);
> 
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,

LGTM, I was wondering whether we should just place it next to the
mmap_write_lock() with the assumption that hugepage_vma_revalidate()
will commonly not fail.

So personally, I would move it further up.

Acked-by: David Hildenbrand <david@redhat.com>

--
Cheers,

David / dhildenb