[v3 05/24] mm: thp: handle split failure in zap_pmd_range()

Usama Arif posted 24 patches 6 days, 16 hours ago
zap_pmd_range() splits a huge PMD when the zap range doesn't cover the
full PMD (partial unmap).  If the split fails, the PMD stays huge.
Falling through to zap_pte_range() would dereference the huge PMD entry
as a PTE page table pointer.

Skip the range covered by the PMD on split failure instead.

The skip is safe across all call paths into zap_pmd_range():

- exit_mmap() and OOM reaper: the zap range covers entire VMAs, so
  every PMD is fully covered (next - addr == HPAGE_PMD_SIZE).  The
  zap_huge_pmd() branch handles these without splitting.  The split
  failure path is unreachable.

- munmap / mmap overlay: vma_adjust_trans_huge() (called from
  __split_vma) splits any PMD straddling the VMA boundary before the
  VMA is split.  If that PMD split fails, __split_vma() returns
  -ENOMEM and the munmap is aborted before reaching zap_pmd_range().
  The split failure path is unreachable.

- MADV_DONTNEED: advisory hint, the kernel is allowed to ignore it.
  The pages remain valid and accessible.  A subsequent access returns
  existing data without faulting.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/memory.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e44469f9cf659..caf97c48cb166 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1985,9 +1985,18 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	do {
 		next = pmd_addr_end(addr, end);
 		if (pmd_is_huge(*pmd)) {
-			if (next - addr != HPAGE_PMD_SIZE)
-				__split_huge_pmd(vma, pmd, addr, false);
-			else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
+			if (next - addr != HPAGE_PMD_SIZE) {
+				/*
+				 * If split fails, the PMD stays huge.
+				 * Skip the range to avoid falling through
+				 * to zap_pte_range, which would treat the
+				 * huge PMD entry as a page table pointer.
+				 */
+				if (__split_huge_pmd(vma, pmd, addr, false)) {
+					addr = next;
+					continue;
+				}
+			} else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
 				addr = next;
 				continue;
 			}
-- 
2.52.0
Re: [v3 05/24] mm: thp: handle split failure in zap_pmd_range()
Posted by Kiryl Shutsemau 3 days, 4 hours ago
On Thu, Mar 26, 2026 at 07:08:47PM -0700, Usama Arif wrote:
> zap_pmd_range() splits a huge PMD when the zap range doesn't cover the
> full PMD (partial unmap).  If the split fails, the PMD stays huge.
> Falling through to zap_pte_range() would dereference the huge PMD entry
> as a PTE page table pointer.
> 
> Skip the range covered by the PMD on split failure instead.

Ughh... This is hacky as hell.

> The skip is safe across all call paths into zap_pmd_range():
> 
> - exit_mmap() and OOM reaper: the zap range covers entire VMAs, so
>   every PMD is fully covered (next - addr == HPAGE_PMD_SIZE).  The
>   zap_huge_pmd() branch handles these without splitting.  The split
>   failure path is unreachable.
> 
> - munmap / mmap overlay: vma_adjust_trans_huge() (called from
>   __split_vma) splits any PMD straddling the VMA boundary before the
>   VMA is split.  If that PMD split fails, __split_vma() returns
>   -ENOMEM and the munmap is aborted before reaching zap_pmd_range().
>   The split failure path is unreachable.
> 
> - MADV_DONTNEED: advisory hint, the kernel is allowed to ignore it.
>   The pages remain valid and accessible.  A subsequent access returns
>   existing data without faulting.

Em, no. MADV_DONTNEED users expect memory to be zeroed after the
"advise" is complete. At the very least you need to zero the skipped range.

And are you sure that the list of users is complete?

I am also worried about a possible new user that is not aware of this
skip-on-split-failure semantics.

I think it has to be opt-in. Maybe a ZAP_FLAG_WHATEVER?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [v3 05/24] mm: thp: handle split failure in zap_pmd_range()
Posted by David Hildenbrand (Arm) 3 days, 3 hours ago
On 3/30/26 16:13, Kiryl Shutsemau wrote:
> On Thu, Mar 26, 2026 at 07:08:47PM -0700, Usama Arif wrote:
>> zap_pmd_range() splits a huge PMD when the zap range doesn't cover the
>> full PMD (partial unmap).  If the split fails, the PMD stays huge.
>> Falling through to zap_pte_range() would dereference the huge PMD entry
>> as a PTE page table pointer.
>>
>> Skip the range covered by the PMD on split failure instead.
> 
> Ughh... This is hacky as hell.
> 
>> The skip is safe across all call paths into zap_pmd_range():
>>
>> - exit_mmap() and OOM reaper: the zap range covers entire VMAs, so
>>   every PMD is fully covered (next - addr == HPAGE_PMD_SIZE).  The
>>   zap_huge_pmd() branch handles these without splitting.  The split
>>   failure path is unreachable.
>>
>> - munmap / mmap overlay: vma_adjust_trans_huge() (called from
>>   __split_vma) splits any PMD straddling the VMA boundary before the
>>   VMA is split.  If that PMD split fails, __split_vma() returns
>>   -ENOMEM and the munmap is aborted before reaching zap_pmd_range().
>>   The split failure path is unreachable.
>>
>> - MADV_DONTNEED: advisory hint, the kernel is allowed to ignore it.
>>   The pages remain valid and accessible.  A subsequent access returns
>>   existing data without faulting.
> 
> Em, no. MADV_DONTNEED users expect memory to be zeroed after the
> "advise" is complete. At very least you need to zero the skipped range.

Fully agreed. This definitely needs more thought :)

-- 
Cheers,

David