guest_memfd is planning to store huge pages in the filemap, and
guest_memfd's use of huge pages involves splitting of huge pages into
individual pages. Splitting of huge pages also involves splitting of
the filemap entries for the pages being split.
To summarize the context in which these patches will be used:
+ guest_memfd stores huge pages (up to 1G pages) in the filemap.
+ During folio splitting, guest_memfd needs to split the folios, and
  approaches that by first splitting the filemap (XArray) entries that
  the folio occupies, and then splitting the struct folios themselves
  (see the sketch after this list).
+ Splitting a 1G folio into 4K folios requires splitting an entry in a
  shift-18 XArray node down to shift-0 nodes, which goes beyond 2
  levels of XArray nodes and is currently not supported.
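As a rough illustration of that flow (this is not code from the series;
the helper name is made up, and GFP_KERNEL plus i_pages irq locking are
assumed), splitting the filemap entry ahead of the folio itself might
look like:

static int gmem_split_filemap_entry(struct address_space *mapping,
                                    struct folio *folio,
                                    unsigned int new_order)
{
        unsigned int old_order = folio_order(folio);
        XA_STATE_ORDER(xas, &mapping->i_pages, folio->index, new_order);

        /* Preallocate every node the split will need. */
        xas_split_alloc(&xas, folio, old_order, GFP_KERNEL);
        if (xas_error(&xas))
                return xas_error(&xas);

        xas_lock_irq(&xas);
        /*
         * Replace the single order-old_order entry with
         * 2^(old_order - new_order) entries of order new_order; the new
         * entries still point at the original folio until the folio
         * itself is split and each slot is re-stored.
         */
        xas_split(&xas, folio, old_order);
        xas_unlock_irq(&xas);

        return 0;
}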
This work-in-progress series at [1] shows the context of how these
patches for XArray entry splitting will be used.
[1] https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring
This patch series extends xas_split_alloc() to allocate enough nodes to
split an XArray entry across more than 2 levels of nodes, and extends
xas_split() to use those allocated nodes when performing such a split.
Merging of XArray entries can be performed with xa_store_order() at
the original order, and hence no change to the XArray library is
required.
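As a rough illustration of such a merge (again not from this series; the
helper name is made up and GFP_KERNEL is assumed), the pattern mirrors
the xa_store_order() helper in lib/test_xarray.c:

static void *gmem_merge_filemap_entries(struct address_space *mapping,
                                        struct folio *folio)
{
        XA_STATE_ORDER(xas, &mapping->i_pages, folio->index,
                       folio_order(folio));
        void *curr;

        do {
                xas_lock_irq(&xas);
                /* One multi-order store replaces all of the child entries. */
                curr = xas_store(&xas, folio);
                xas_unlock_irq(&xas);
        } while (xas_nomem(&xas, GFP_KERNEL));

        return curr;
}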
xas_destroy() cleans up any allocated and unused nodes in struct
xa_state, and so no further changes are necessary there.
Please let me know:
+ If this extension is welcome
+ Your thoughts on the approach: is it too many nodes to allocate at
  once? Would a recursive implementation be preferred?
+ If there are any bugs, particularly around how xas_split() interacts
  with LRU
Thank you!
Ackerley Tng (4):
XArray: Initialize nodes while splitting instead of while allocating
XArray: Update xas_split_alloc() to allocate enough nodes to split
large entries
XArray: Support splitting for arbitrarily large entries
XArray: test: Increase split order test range in check_split()
lib/test_xarray.c | 6 +-
lib/xarray.c | 210 ++++++++++++++++++++++++++++++++++------------
2 files changed, 162 insertions(+), 54 deletions(-)
--
2.52.0.rc1.455.g30608eb744-goog
On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
> guest_memfd is planning to store huge pages in the filemap, and
> guest_memfd's use of huge pages involves splitting of huge pages into
> individual pages. Splitting of huge pages also involves splitting of
> the filemap entries for the pages being split.

Hm, I'm not most concerned about the number of nodes you're allocating.
I'm most concerned that, once we have memdescs, splitting a 1GB page
into 512 * 512 4kB pages is going to involve allocating about 20MB of
memory (80 bytes * 512 * 512).  Is this necessary to do all at once?
Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
>> guest_memfd is planning to store huge pages in the filemap, and
>> guest_memfd's use of huge pages involves splitting of huge pages into
>> individual pages. Splitting of huge pages also involves splitting of
>> the filemap entries for the pages being split.
>
> Hm, I'm not most concerned about the number of nodes you're allocating.

Thanks for reminding me, I left this out of the original message.

Splitting the xarray entry for a 1G folio (in a shift-18 node for
order=18 on x86), assuming XA_CHUNK_SHIFT is 6, would involve

+ shift-18 node (the original node will be reused - no new allocations)
+ shift-12 node: 1 node allocated
+ shift-6 node : 64 nodes allocated
+ shift-0 node : 64 * 64 = 4096 nodes allocated

This brings the total number of allocated nodes to 4161 nodes. struct
xa_node is 576 bytes, so that's 2396736 bytes or 2.28 MB, so splitting a
1G folio to 4K pages costs ~2.5 MB just in filemap (XArray) entry
splitting. The other large memory cost would be from undoing HVO for the
HugeTLB folio.

> I'm most concerned that, once we have memdescs, splitting a 1GB page
> into 512 * 512 4kB pages is going to involve allocating about 20MB
> of memory (80 bytes * 512 * 512).

I definitely need to catch up on memdescs. What's the best place for me
to learn/get an overview of how memdescs will describe memory/replace
struct folios?

I think there might be a better way to solve the original problem of
usage tracking with memdesc support, but this was intended to make
progress before memdescs.

> Is this necessary to do all at once?

The plan for guest_memfd was to first split from 1G to 4K, then optimize
on that by splitting in stages, from 1G to 2M as much as possible, then
to 4K only for the page ranges that the guest shared with the host.
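(Purely as an illustrative cross-check of the arithmetic above, not part
of the thread, assuming XA_CHUNK_SHIFT == 6 and sizeof(struct xa_node)
== 576 as quoted:)

#include <stdio.h>

int main(void)
{
        unsigned long shift12 = 1;        /* one shift-12 node   */
        unsigned long shift6  = 64;       /* 64 shift-6 nodes    */
        unsigned long shift0  = 64 * 64;  /* 4096 shift-0 nodes  */
        unsigned long nodes   = shift12 + shift6 + shift0;

        /* Prints "4161 nodes, 2396736 bytes". */
        printf("%lu nodes, %lu bytes\n", nodes, nodes * 576);
        return 0;
}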
On 18.11.25 00:43, Ackerley Tng wrote:
> Matthew Wilcox <willy@infradead.org> writes:
>
>> On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
>>> guest_memfd is planning to store huge pages in the filemap, and
>>> guest_memfd's use of huge pages involves splitting of huge pages into
>>> individual pages. Splitting of huge pages also involves splitting of
>>> the filemap entries for the pages being split.
>>
>> Hm, I'm not most concerned about the number of nodes you're allocating.
>
> Thanks for reminding me, I left this out of the original message.
>
> Splitting the xarray entry for a 1G folio (in a shift-18 node for
> order=18 on x86), assuming XA_CHUNK_SHIFT is 6, would involve
>
> + shift-18 node (the original node will be reused - no new allocations)
> + shift-12 node: 1 node allocated
> + shift-6 node : 64 nodes allocated
> + shift-0 node : 64 * 64 = 4096 nodes allocated
>
> This brings the total number of allocated nodes to 4161 nodes. struct
> xa_node is 576 bytes, so that's 2396736 bytes or 2.28 MB, so splitting a
> 1G folio to 4K pages costs ~2.5 MB just in filemap (XArray) entry
> splitting. The other large memory cost would be from undoing HVO for the
> HugeTLB folio.
>
>> I'm most concerned that, once we have memdescs, splitting a 1GB page
>> into 512 * 512 4kB pages is going to involve allocating about 20MB
>> of memory (80 bytes * 512 * 512).
>
> I definitely need to catch up on memdescs. What's the best place for me
> to learn/get an overview of how memdescs will describe memory/replace
> struct folios?
>
> I think there might be a better way to solve the original problem of
> usage tracking with memdesc support, but this was intended to make
> progress before memdescs.
>
>> Is this necessary to do all at once?
>
> The plan for guest_memfd was to first split from 1G to 4K, then optimize
> on that by splitting in stages, from 1G to 2M as much as possible, then
> to 4K only for the page ranges that the guest shared with the host.

Right, we also discussed the non-uniform split as an optimization in the
future.

--
Cheers

David
syzbot ci has tested the following series

[v1] Extend xas_split* to support splitting arbitrarily large entries
https://lore.kernel.org/all/20251117224701.1279139-1-ackerleytng@google.com
* [RFC PATCH 1/4] XArray: Initialize nodes while splitting instead of while allocating
* [RFC PATCH 2/4] XArray: Update xas_split_alloc() to allocate enough nodes to split large entries
* [RFC PATCH 3/4] XArray: Support splitting for arbitrarily large entries
* [RFC PATCH 4/4] XArray: test: Increase split order test range in check_split()

and found the following issue:
WARNING: kmalloc bug in bpf_prog_alloc_no_stats

Full report is available here:
https://ci.syzbot.org/series/aa74d39d-0773-4398-bb90-0a6d21365c3d

***

WARNING: kmalloc bug in bpf_prog_alloc_no_stats

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      41218ede767f6b218185af65ce919d0cade75f6b
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/c26972f6-b81e-4d6f-bead-3d77003cf075/config

------------[ cut here ]------------
Unexpected gfp: 0x400000 (__GFP_ACCOUNT). Fixing up to gfp: 0xdc0 (GFP_KERNEL|__GFP_ZERO). Fix your code!
WARNING: CPU: 0 PID: 6465 at mm/vmalloc.c:3938 vmalloc_fix_flags+0x9c/0xe0
Modules linked in:
CPU: 0 UID: 0 PID: 6465 Comm: syz-executor Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:vmalloc_fix_flags+0x9c/0xe0
Code: 81 e6 1f 52 ee ff 89 74 24 30 81 e3 e0 ad 11 00 89 5c 24 20 90 48 c7 c7 c0 b9 76 8b 4c 89 fa 89 d9 4d 89 f0 e8 75 2b 6e ff 90 <0f> 0b 90 90 8b 44 24 20 48 c7 04 24 0e 36 e0 45 4b c7 04 2c 00 00
RSP: 0018:ffffc90005d7fb00 EFLAGS: 00010246
RAX: 6e85c22fb4362300 RBX: 0000000000000dc0 RCX: ffff888176898000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
RBP: ffffc90005d7fb98 R08: ffff888121224293 R09: 1ffff11024244852
R10: dffffc0000000000 R11: ffffed1024244853 R12: 1ffff92000baff60
R13: dffffc0000000000 R14: ffffc90005d7fb20 R15: ffffc90005d7fb30
FS:  000055555be14500(0000) GS:ffff88818eb36000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f653e85c470 CR3: 00000001139ec000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __vmalloc_noprof+0xf2/0x120
 bpf_prog_alloc_no_stats+0x4a/0x4d0
 bpf_prog_alloc+0x3c/0x1a0
 bpf_prog_create_from_user+0xa7/0x440
 do_seccomp+0x7b1/0xd90
 __se_sys_prctl+0xc3c/0x1830
 do_syscall_64+0xfa/0xfa0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f653e990b0d
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 18 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 9d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 1b 48 8b 54 24 18 64 48 2b 14 25 28 00 00 00
RSP: 002b:00007fffbd3687c0 EFLAGS: 00000246 ORIG_RAX: 000000000000009d
RAX: ffffffffffffffda RBX: 00007f653ea2cf80 RCX: 00007f653e990b0d
RDX: 00007fffbd368820 RSI: 0000000000000002 RDI: 0000000000000016
RBP: 00007fffbd368830 R08: 0000000000000006 R09: 0000000000000071
R10: 0000000000000071 R11: 0000000000000246 R12: 000000000000006d
R13: 00007fffbd368c58 R14: 00007fffbd368ed8 R15: 0000000000000000
 </TASK>

***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.