Changelog:
rfcv2 -> rfcv3:
- Fix an out-of-bounds write
- Convert clear_refs to the new API
- Fix issue when reading cont-PMDs
rfc -> rfcv2:
- Add pte_hole functionality
- Fix pagemap issues
- Fix shmem in smap
- Testing with pagemap "testsuite"
[WARNING]
This is not yet fully complete, but before investing more time into it I would like
to know whether 1) this is heading into the right direction and 2) this is something
we are still interested in.
There are still things that need work:
- convert make_uffd_wp_huge_pte: Since hugetlb has been handled like a
pte, it inherited PTE_MARKERs when those came into play, and
AFAIK, those are used mostly for UFFD.
From here on we have two options: 1) find another way to deal with
UFFD without markers or 2) introduce markers for PMD and PUD level.
I am leaning towards option 1), because 2) seems a bit unfair.
I still need to put some thought into it and see how we can achieve
that.
- Teach the new API how to use other kinds of locks. E.g: pagemap scan
needs to take i_mmap_lock during the scanning, so we need to be able to
take that lock. I have some ideas on how to do that, but that is
something for the next version.
- Find corner-cases and fix them.
Kudos go to David, who suggested the interface and gave me some ideas
on where to begin, besides providing feedback in the early stages
(in case there is something stupid don't blame him, blame me).
Also, I would like to thank Vlastimil, who helped me run this
patchset quite a few times through Claude, to catch some fixes.
[/WARNING]
[TESTING]
Part of the testing has been to duplicate
/proc/$$/(pagemap,smaps,numa_maps,clear_refs) and have copies with a
_lab extension linked to the old API.
That way I could check whether the output of e.g: /proc/$$/smaps
and /proc/$$/smaps_lab was the same for any given program.
I did the same for pagemap and numa_maps.
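The comparison described above can be automated. What follows is only a
minimal userspace sketch: first_diff_line() is a hypothetical helper that
is not part of this series, and the _lab files only exist on a kernel
carrying the debug duplication. It assumes the caller has already opened
both files, and that the inspected program is idle between the two reads,
since counters can legitimately change otherwise.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 0 if both streams hold identical lines,
 * otherwise the 1-based number of the first line that differs or that
 * only one of the streams has.
 * Lines longer than the buffer are compared chunk by chunk, so the
 * returned number is only exact when lines fit in 4095 bytes.
 */
static long first_diff_line(FILE *a, FILE *b)
{
	char la[4096], lb[4096];
	long line = 0;

	assert(a && b);

	for (;;) {
		char *ra = fgets(la, sizeof(la), a);
		char *rb = fgets(lb, sizeof(lb), b);

		line++;
		if (!ra && !rb)
			return 0;	/* both ended together: identical */
		if (!ra || !rb || strcmp(la, lb) != 0)
			return line;	/* one ended early or the text differs */
	}
}
```

A caller would fopen() e.g. /proc/<pid>/smaps and /proc/<pid>/smaps_lab,
pass both streams in, and report the returned line number if it is not 0.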
Also, regarding pagemap:
So far, tools/mm/page-types.c reports the right outcome (compared to the old API),
and tools/testing/selftests/mm/pagemap_ioctl.c only reports 4 failing tests.
Although, to be honest, I do not know how much I should trust that one, because if I
add a few delays in the userspace code, some tests that were failing before no longer
fail, so yeah.
localhost:~/workspace # ./page-types -p 1168
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000800 1 0 ___________M_______________________________ mmap
0x0000000000000828 2 0 ___U_l_____M_______________________________ uptodate,lru,mmap
0x000000000000082c 1 0 __RU_l_____M_______________________________ referenced,uptodate,lru,mmap
0x0000000000004838 1 0 ___UDl_____M__b____________________________ uptodate,dirty,lru,mmap,swapbacked
0x000000000000086c 423 1 __RU_lA____M_______________________________ referenced,uptodate,lru,active,mmap
0x0000000000205828 29 0 ___U_l_____Ma_b______x_____________________ uptodate,lru,mmap,anonymous,swapbacked,ksm
0x000000000020586c 1 0 __RU_lA____Ma_b______x_____________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked,ksm
total 458 1
localhost:~/workspace # ./page-types_lab -p 1168
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000804 1 0 __R________M_______________________________ referenced,mmap
0x0000000000000828 2 0 ___U_l_____M_______________________________ uptodate,lru,mmap
0x000000000000082c 1 0 __RU_l_____M_______________________________ referenced,uptodate,lru,mmap
0x0000000000004838 1 0 ___UDl_____M__b____________________________ uptodate,dirty,lru,mmap,swapbacked
0x000000000000086c 423 1 __RU_lA____M_______________________________ referenced,uptodate,lru,active,mmap
0x0000000000205828 29 0 ___U_l_____Ma_b______x_____________________ uptodate,lru,mmap,anonymous,swapbacked,ksm
0x000000000020586c 1 0 __RU_lA____Ma_b______x_____________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked,ksm
total 458 1
page-types is using the new API and page-types_lab the old one.
# ./pagemap_ioctl
TAP version 13
1..117
ok 1 sanity_tests_sd Zero range size is valid
ok 2 sanity_tests_sd output bu
ok 35 Walk_end: 1 max page
ok 36 Page testing: all new pages must not be written (dirty)
ok 37 Page testing: all pages must be written (dirty)
ok 38 Page testing: all pages dirty other than first and the last one
ok 39 Page testing: PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC
ok 40 Page testing: only middle page dirty
ok 41 Page testing: only two middle pages dirty
ok 42 Large Page testing: all new pages must not be written (dirty)
ok 43 Large Page testing: all pages must be written (dirty)
ok 44 Large Page testing: all pages dirty other than first and the last one
ok 45 Large Page testing: PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC
ok 46 Large Page testing: only middle page dirty
ok 47 Large Page testing: only two middle pages dirty
ok 48 Huge page testing: all new pages must not be written (dirty)
ok 49 Huge page testing: all pages must be written (dirty)
ok 50 Huge page testing: all pages dirty other than first and the last one
ok 51 Huge page testing: PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC
ok 52 Huge page testing: only middle page dirty
ok 53 Huge page testing: only two middle pages dirty
ok 54 Hugetlb shmem testing: all new pages must not be written (dirty)
ok 55 Hugetlb shmem testing: all pages must be written (dirty)
ok 56 Hugetlb shmem testing: all pages dirty other than first and the last one
ok 57 Hugetlb shmem testing: PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC
ok 58 Hugetlb shmem testing: only middle page dirty
not ok 59 Hugetlb shmem testing: only two middle pages dirty
ok 60 Hugetlb mem testing: all new pages must not be written (dirty)
ok 61 Hugetlb mem testing: all pages must be written (dirty)
ok 62 Hugetlb mem testing: all pages dirty other than first and the last one
ok 63 Hugetlb mem testing: PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC
ok 64 Hugetlb mem testing: only middle page dirty
not ok 65 Hugetlb mem testing: only two middle pages dirty
ok 66 Hugetlb shmem testing: all new pages must not be written (dirty)
ok 67 Hugetlb shmem testing: all pages must be written (dirty)
ok 68 Hugetlb shmem testing: all pages dirty other than first and the last one
ok 69 Hugetlb shmem testing: PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC
ok 70 Hugetlb shmem testing: only middle page dirty
not ok 71 Hugetlb shmem testing: only two middle pages dirty
ok 72 File memory testing: all new pages must not be written (dirty)
ok 73 File memory testing: all p
# Totals: pass:113 fail:4 xfail:0 xpass:0 skip:0 error:0
[/TESTING]
At LSF/MM/BPF 2025, there was a general agreement that we 1) would like to have
a generic pagewalk API 2) that replaces the existing one with callbacks if possible
and 3) that HugeTLB can use without the need to special case it (e.g: not having to
depend on .hugetlb_entry callbacks). Special casing it means having a lot of duplicated
code and a lot of special cases just because of hugetlb lore.
pt_range_walk API tries to do that and replaces the old behaviour of "in
HugeTLB world everything reads as a PTE" and starts reading HugeTLB entries
the way they really are, that means interpreting them as PMD/PUD entries and
contiguous-PMD/PTE entries.
In order to achieve that, we need some infrastructure we did not really need until
now, in order to be able to read HugeTLB pages as PUD/PMD entries.
E.g: softleaf_from_pud had to be added and some other pud_* functions.
In a few words, this API goes through an address range and returns
whatever is in there (swap/hwpoison/migration/marker entries, folios,
pfn and device entries, or nothing).
These are the internal return types the API uses:
PT_TYPE_NONE
PT_TYPE_FOLIO
PT_TYPE_MARKER
PT_TYPE_PFN
PT_TYPE_SWAP
PT_TYPE_MIGRATION
PT_TYPE_DEVICE
PT_TYPE_HWPOISON
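To make the list above a bit more concrete, here is a userspace sketch of
how a caller might consume such types. Only the PT_TYPE_* names come from
this cover letter. The enum tag, the numeric values, PT_TYPE_MAX and both
helpers are assumptions made for illustration, so the real definitions in
patch #4 may well look different.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Type names as listed in the cover letter. The enum tag, the values
 * and the PT_TYPE_MAX sentinel are assumptions made for this sketch.
 */
enum pt_type {
	PT_TYPE_NONE,
	PT_TYPE_FOLIO,
	PT_TYPE_MARKER,
	PT_TYPE_PFN,
	PT_TYPE_SWAP,
	PT_TYPE_MIGRATION,
	PT_TYPE_DEVICE,
	PT_TYPE_HWPOISON,
	PT_TYPE_MAX,	/* hypothetical sentinel, not from the series */
};

/* Hypothetical helper: printable name for a walk result type. */
static const char *pt_type_name(enum pt_type type)
{
	static const char *const names[PT_TYPE_MAX] = {
		[PT_TYPE_NONE]		= "none",
		[PT_TYPE_FOLIO]		= "folio",
		[PT_TYPE_MARKER]	= "marker",
		[PT_TYPE_PFN]		= "pfn",
		[PT_TYPE_SWAP]		= "swap",
		[PT_TYPE_MIGRATION]	= "migration",
		[PT_TYPE_DEVICE]	= "device",
		[PT_TYPE_HWPOISON]	= "hwpoison",
	};

	assert((unsigned int)type < PT_TYPE_MAX);
	return names[type];
}

/* Hypothetical consumer: count how many results of each type were seen. */
static void pt_type_account(unsigned long counts[PT_TYPE_MAX],
			    const enum pt_type *results, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		assert((unsigned int)results[i] < PT_TYPE_MAX);
		counts[results[i]]++;
	}
}
```

The point of the sketch is that a single switch or table keyed on the type
can replace one callback per page table level.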
The API also handles locking and batching itself, so the caller
does not really need to bother with that.
In order to handle contiguous-PMD mapped hugetlb pages, folio_pmd_batch,
which is analogous to folio_pte_batch, has been implemented.
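The batching idea can be illustrated with a toy model. This is not the
code from patch #3: toy_pmd_batch() is a hypothetical function that works
on a plain array of PFNs, one per PMD entry, and only checks that
consecutive entries map physically consecutive ranges. The real helper
presumably also has to compare entry bits and stay within the folio.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of batching at PMD level (hypothetical, userspace only).
 *
 * pfns[i] is the first PFN mapped by the i-th PMD entry and
 * pages_per_pmd is how many base pages one PMD entry covers.
 * Returns how many entries, starting at pfns[0] and capped at max_nr,
 * map one physically contiguous range.
 */
static unsigned int toy_pmd_batch(const uint64_t *pfns, unsigned int max_nr,
				  uint64_t pages_per_pmd)
{
	unsigned int nr = 1;

	assert(pfns && max_nr > 0);

	/* Extend the batch while the next entry continues the range. */
	while (nr < max_nr && pfns[nr] == pfns[0] + nr * pages_per_pmd)
		nr++;

	return nr;
}
```

With 4K base pages and 2M PMDs, pages_per_pmd would be 512, so entries
starting at PFNs 512, 1024 and 1536 form one batch of three.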
More information about the API can be found in patch #4.
This was tested on x86_64 and arm64, but as I said, it is still
incomplete, therefore the RFC, to gather some initial feedback before
investing more time into this.
For now, all users of the old API from fs/proc/task_mmu.c have been
converted: /proc/pid/(smaps|numa_maps|pagemap|clear_refs).
Thanks in advance
Oscar Salvador (8):
mm: Add softleaf_from_pud
mm: Add {pmd,pud}_huge_lock helper
mm: Implement folio_pmd_batch
mm: Implement pt_range_walk
mm: Make /proc/pid/smaps use the new generic pagewalk API
mm: Make /proc/pid/numa_maps use the new generic pagewalk API
mm: Make /proc/pid/pagemap use the new generic pagewalk API
mm: Make /proc/pid/clear_refs use the new generic pagewalk API
arch/arm64/include/asm/pgtable.h | 41 +
arch/loongarch/include/asm/pgtable.h | 1 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 7 +
arch/s390/include/asm/pgtable.h | 38 +
arch/x86/include/asm/pgtable.h | 53 +
arch/x86/include/asm/pgtable_64.h | 2 +
arch/x86/mm/pgtable.c | 18 +-
fs/proc/task_mmu.c | 2295 ++++++++----------
include/asm-generic/pgtable_uffd.h | 15 +
include/linux/leafops.h | 46 +
include/linux/mm.h | 2 +
include/linux/mm_inline.h | 32 +
include/linux/pagewalk.h | 106 +
include/linux/pgtable.h | 95 +
mm/internal.h | 75 +-
mm/memory.c | 22 +
mm/pagewalk.c | 400 +++
mm/pgtable-generic.c | 21 +
18 files changed, 2039 insertions(+), 1230 deletions(-)
--
2.53.0
syzbot ci has tested the following series
[v3] Implement a new generic pagewalk API
https://lore.kernel.org/all/20260525165528.184397-1-osalvador@suse.de
* [RFC PATCH v3 1/8] mm: Add softleaf_from_pud
* [RFC PATCH v3 2/8] mm: Add {pmd,pud}_huge_lock helper
* [RFC PATCH v3 3/8] mm: Implement folio_pmd_batch
* [RFC PATCH v3 4/8] mm: Implement pt_range_walk
* [RFC PATCH v3 5/8] mm: Make /proc/pid/smaps use the new generic pagewalk API
* [RFC PATCH v3 6/8] mm: Make /proc/pid/numa_maps use the new generic pagewalk API
* [RFC PATCH v3 7/8] mm: Make /proc/pid/pagemap use the new generic pagewalk API
* [RFC PATCH v3 8/8] mm: Make /proc/pid/clear_refs use the new generic pagewalk API
and found the following issue:
general protection fault in clear_refs_write
Full report is available here:
https://ci.syzbot.org/series/0beab2d0-9d88-4288-ac81-b294e51c057d
***
general protection fault in clear_refs_write
tree: mm-new
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/8f20e88b-808c-4c60-9b8e-0af4c9fce5ff/config
syz repro: https://ci.syzbot.org/findings/634b1f0a-cf64-4607-928a-ef3ad40f765c/syz_repro
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 UID: 0 PID: 5863 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:clear_soft_dirty_pmd fs/proc/task_mmu.c:1545 [inline]
RIP: 0010:clear_refs_pmd_range fs/proc/task_mmu.c:1581 [inline]
RIP: 0010:clear_refs_write+0xaa7/0x1590 fs/proc/task_mmu.c:1721
Code: 44 24 10 4c 8b bc 24 20 02 00 00 83 7c 24 28 04 0f 85 aa 05 00 00 48 8b 9c 24 08 02 00 00 48 89 d8 48 c1 e8 03 48 89 44 24 58 <42> 80 3c 20 00 74 08 48 89 df e8 fa 2c c6 ff 4c 8b 23 4c 89 e6 48
RSP: 0018:ffffc900037bf960 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000005
RDX: ffffffff826b21ce RSI: ffffffff8eb17b50 RDI: 0000000000000001
RBP: ffffc900037bfc10 R08: ffff8881047c0000 R09: 0000000000000002
R10: 0000000000000002 R11: 0000000000000000 R12: dffffc0000000000
R13: fffff520006f7f40 R14: ffff8881141c8308 R15: ffff8881141c8300
FS: 00007ff8ffc1d6c0(0000) GS:ffff8882a928a000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b33e63fff CR3: 000000010476e000 CR4: 00000000000006f0
Call Trace:
<TASK>
do_loop_readv_writev fs/read_write.c:852 [inline]
vfs_writev+0x4bd/0x990 fs/read_write.c:1061
do_writev+0x154/0x2e0 fs/read_write.c:1105
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff8fed9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ff8ffc1d028 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 00007ff8ff015fa0 RCX: 00007ff8fed9ce59
RDX: 0000000000000001 RSI: 00002000000000c0 RDI: 0000000000000003
RBP: 00007ff8fee32d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ff8ff016038 R14: 00007ff8ff015fa0 R15: 00007fffd02796e8
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:clear_soft_dirty_pmd fs/proc/task_mmu.c:1545 [inline]
RIP: 0010:clear_refs_pmd_range fs/proc/task_mmu.c:1581 [inline]
RIP: 0010:clear_refs_write+0xaa7/0x1590 fs/proc/task_mmu.c:1721
Code: 44 24 10 4c 8b bc 24 20 02 00 00 83 7c 24 28 04 0f 85 aa 05 00 00 48 8b 9c 24 08 02 00 00 48 89 d8 48 c1 e8 03 48 89 44 24 58 <42> 80 3c 20 00 74 08 48 89 df e8 fa 2c c6 ff 4c 8b 23 4c 89 e6 48
RSP: 0018:ffffc900037bf960 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000005
RDX: ffffffff826b21ce RSI: ffffffff8eb17b50 RDI: 0000000000000001
RBP: ffffc900037bfc10 R08: ffff8881047c0000 R09: 0000000000000002
R10: 0000000000000002 R11: 0000000000000000 R12: dffffc0000000000
R13: fffff520006f7f40 R14: ffff8881141c8308 R15: ffff8881141c8300
FS: 00007ff8ffc1d6c0(0000) GS:ffff8882a928a000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b34063fff CR3: 000000010476e000 CR4: 00000000000006f0
----------------
Code disassembly (best guess):
0: 44 24 10 rex.R and $0x10,%al
3: 4c 8b bc 24 20 02 00 mov 0x220(%rsp),%r15
a: 00
b: 83 7c 24 28 04 cmpl $0x4,0x28(%rsp)
10: 0f 85 aa 05 00 00 jne 0x5c0
16: 48 8b 9c 24 08 02 00 mov 0x208(%rsp),%rbx
1d: 00
1e: 48 89 d8 mov %rbx,%rax
21: 48 c1 e8 03 shr $0x3,%rax
25: 48 89 44 24 58 mov %rax,0x58(%rsp)
* 2a: 42 80 3c 20 00 cmpb $0x0,(%rax,%r12,1) <-- trapping instruction
2f: 74 08 je 0x39
31: 48 89 df mov %rbx,%rdi
34: e8 fa 2c c6 ff call 0xffc62d33
39: 4c 8b 23 mov (%rbx),%r12
3c: 4c 89 e6 mov %r12,%rsi
3f: 48 rex.W
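As a reading aid for the trapping instruction above: the shr $0x3 followed
by an access at (%rax,%r12,1) with R12 = 0xdffffc0000000000 matches the
usual generic KASAN shadow lookup on x86_64. The sketch below redoes that
arithmetic in userspace C. The helper names are hypothetical, and it
assumes the common x86_64 shadow offset of 0xdffffc0000000000, a scale
shift of 3 and 48-bit virtual addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: generic KASAN on x86_64. */
#define SHADOW_OFFSET		0xdffffc0000000000ULL
#define SHADOW_SCALE_SHIFT	3

/* Shadow byte address that KASAN checks before touching addr. */
static uint64_t shadow_addr(uint64_t addr)
{
	return (addr >> SHADOW_SCALE_SHIFT) + SHADOW_OFFSET;
}

/*
 * With 48-bit virtual addresses, an address is canonical when bits
 * 63..47 are all zeros or all ones.
 */
static bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;

	return top == 0 || top == 0x1ffff;
}
```

For the NULL pointer held in RBX, shadow_addr(0) is 0xdffffc0000000000,
which is not canonical, hence the general protection fault at that address
rather than a page fault. One shadow byte covers 8 bytes, which is why
KASAN reports the range [0x0-0x7].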
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).
The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.