[RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
Posted by Ackerley Tng 2 weeks ago
guest_memfd is planning to store huge pages in the filemap, and its
use of huge pages involves splitting them into individual pages.
Splitting a huge page also requires splitting the filemap entries for
the pages being split.

To summarize the context of how these patches will be used,

+ guest_memfd stores huge pages (up to 1G pages) in the filemap.
+ During folio splitting, guest_memfd needs to split the folios, and
  approaches that by first splitting the filemap (XArray) entries that
  the folio occupies, then splitting the struct folios themselves.
+ Splitting a 1G folio to 4K folios requires splitting an entry in a
  shift-18 XArray node down to shift-0 nodes, which goes beyond 2
  levels of XArray nodes and is currently not supported (see the
  arithmetic below).
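
For concreteness, the level arithmetic (assuming XA_CHUNK_SHIFT == 6,
the usual value):

  1G / 4K = 2^18 slots to populate
  18 / XA_CHUNK_SHIFT = 3 levels of new nodes (shift 12, 6 and 0)

so a full 1G-to-4K split descends three node levels below the original
entry's node, one more than the two levels supported today.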

This work-in-progress series at [1] shows the context of how these
patches for XArray entry splitting will be used.

[1] https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring

This patch series extends xas_split_alloc() to allocate enough nodes
for splitting an XArray node beyond 2 levels, and extends xas_split()
to use those nodes when performing such a split; a usage sketch
follows below.
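
As a reminder of the calling convention (a minimal sketch, not taken
from the series; the target order travels in the xa_state while the
order argument is the entry's current order, and mapping and index
here are placeholders):

	XA_STATE(xas, &mapping->i_pages, index);	/* target order 0 */
	void *old = xa_load(&mapping->i_pages, index);

	/* Preallocate every node the split will need; this is the
	 * step the series extends beyond 2 levels. Sets -ENOMEM in
	 * the xa_state on failure. */
	xas_split_alloc(&xas, old, 18, GFP_KERNEL);
	if (!xas_error(&xas)) {
		xas_lock_irq(&xas);
		/* Consume the preallocated nodes; each new order-0
		 * slot initially points at the original entry. */
		xas_split(&xas, old, 18);
		xas_unlock_irq(&xas);
	}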

Merging of XArray entries can be performed with xa_store_order() at
the original order, and hence no change to the XArray library is
required.
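
For example (again a sketch with placeholder names; the merge is just
a multi-order store at the original order, shown open-coded with
xas_set_order()/xas_store(), and it assumes the shift-18 node already
exists so no allocation is needed under the lock):

	XA_STATE(xas, &mapping->i_pages, index);

	xas_lock_irq(&xas);
	xas_set_order(&xas, index, 18);
	/* One store at order 18 replaces all the split entries below
	 * it with a single entry and frees the child nodes. */
	xas_store(&xas, folio);
	xas_unlock_irq(&xas);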

xas_destroy() cleans up any allocated and unused nodes in struct
xa_state, and so no further changes are necessary there.

Please let me know:

+ If this extension is welcome
+ Your thoughts on the approach: is it too many nodes to allocate at
  once? Would a recursive implementation be preferred?
+ If there are any bugs, particularly around how xas_split() interacts
  with the LRU


Thank you!


Ackerley Tng (4):
  XArray: Initialize nodes while splitting instead of while allocating
  XArray: Update xas_split_alloc() to allocate enough nodes to split
    large entries
  XArray: Support splitting for arbitrarily large entries
  XArray: test: Increase split order test range in check_split()

 lib/test_xarray.c |   6 +-
 lib/xarray.c      | 210 ++++++++++++++++++++++++++++++++++------------
 2 files changed, 162 insertions(+), 54 deletions(-)

--
2.52.0.rc1.455.g30608eb744-goog
Re: [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
Posted by Matthew Wilcox 2 weeks ago
On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
> guest_memfd is planning to store huge pages in the filemap, and its
> use of huge pages involves splitting them into individual pages.
> Splitting a huge page also requires splitting the filemap entries for
> the pages being split.

Hm, I'm not most concerned about the number of nodes you're allocating.
I'm most concerned that, once we have memdescs, splitting a 1GB page
into 512 * 512 4kB pages is going to involve allocating about 20MB
of memory (80 bytes * 512 * 512).  Is this necessary to do all at once?
Re: [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
Posted by Ackerley Tng 2 weeks ago
Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
>> guest_memfd is planning to store huge pages in the filemap, and its
>> use of huge pages involves splitting them into individual pages.
>> Splitting a huge page also requires splitting the filemap entries for
>> the pages being split.

>
> Hm, I'm not most concerned about the number of nodes you're allocating.

Thanks for reminding me, I left this out of the original message.

Splitting the xarray entry for a 1G folio (in a shift-18 node for
order=18 on x86), assuming XA_CHUNK_SHIFT is 6, would involve

+ shift-18 node (the original node will be reused - no new allocations)
+ shift-12 node: 1 node allocated
+ shift-6 node : 64 nodes allocated
+ shift-0 node : 64 * 64 = 4096 nodes allocated

This brings the total to 4161 newly allocated nodes. struct xa_node is
576 bytes, so that's 2396736 bytes, or about 2.3 MB, just in filemap
(XArray) entry splitting when going from a 1G folio to 4K pages. The
other large memory cost would be from undoing HVO for the HugeTLB
folio.
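
As a back-of-the-envelope check (a throwaway helper, not from the
series, valid when both orders are multiples of XA_CHUNK_SHIFT):

	static unsigned long nodes_for_split(int old_order, int new_order)
	{
		unsigned long per_level = 1, total = 0;
		int shift;

		/* One iteration per new node level below the original
		 * entry's node. */
		for (shift = old_order - 6; shift >= new_order; shift -= 6) {
			total += per_level;	/* nodes added at this shift */
			per_level *= 64;	/* each node fans out 64 ways */
		}
		return total;
	}

nodes_for_split(18, 0) returns 1 + 64 + 4096 = 4161, matching the
total above.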

> I'm most concerned that, once we have memdescs, splitting a 1GB page
> into 512 * 512 4kB pages is going to involve allocating about 20MB
> of memory (80 bytes * 512 * 512).

I definitely need to catch up on memdescs. What's the best place for me
to learn/get an overview of how memdescs will describe memory/replace
struct folios?

I think there might be a better way to solve the original problem of
usage tracking with memdesc support, but this was intended to make
progress before memdescs.

> Is this necessary to do all at once?

The plan for guest_memfd was to first split from 1G to 4K, then optimize
on that by splitting in stages, from 1G to 2M as much as possible, then
to 4K only for the page ranges that the guest shared with the host.
Re: [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
Posted by David Hildenbrand (Red Hat) 1 week, 6 days ago
On 18.11.25 00:43, Ackerley Tng wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
>> On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
>>> guest_memfd is planning to store huge pages in the filemap, and its
>>> use of huge pages involves splitting them into individual pages.
>>> Splitting a huge page also requires splitting the filemap entries for
>>> the pages being split.
> 
>>
>> Hm, I'm not most concerned about the number of nodes you're allocating.
> 
> Thanks for reminding me, I left this out of the original message.
> 
> Splitting the xarray entry for a 1G folio (in a shift-18 node for
> order=18 on x86), assuming XA_CHUNK_SHIFT is 6, would involve
> 
> + shift-18 node (the original node will be reused - no new allocations)
> + shift-12 node: 1 node allocated
> + shift-6 node : 64 nodes allocated
> + shift-0 node : 64 * 64 = 4096 nodes allocated
> 
> This brings the total to 4161 newly allocated nodes. struct xa_node is
> 576 bytes, so that's 2396736 bytes, or about 2.3 MB, just in filemap
> (XArray) entry splitting when going from a 1G folio to 4K pages. The
> other large memory cost would be from undoing HVO for the HugeTLB
> folio.
> 
>> I'm most concerned that, once we have memdescs, splitting a 1GB page
>> into 512 * 512 4kB pages is going to involve allocating about 20MB
>> of memory (80 bytes * 512 * 512).
> 
> I definitely need to catch up on memdescs. What's the best place for me
> to learn/get an overview of how memdescs will describe memory/replace
> struct folios?
> 
> I think there might be a better way to solve the original problem of
> usage tracking with memdesc support, but this was intended to make
> progress before memdescs.
> 
>> Is this necessary to do all at once?
> 
> The plan for guest_memfd was to first split from 1G to 4K, then optimize
> on that by splitting in stages, from 1G to 2M as much as possible, then
> to 4K only for the page ranges that the guest shared with the host.

Right, we also discussed the non-uniform split as a future
optimization.

-- 
Cheers

David
[syzbot ci] Re: Extend xas_split* to support splitting arbitrarily large entries
Posted by syzbot ci 1 week, 6 days ago
syzbot ci has tested the following series

[v1] Extend xas_split* to support splitting arbitrarily large entries
https://lore.kernel.org/all/20251117224701.1279139-1-ackerleytng@google.com
* [RFC PATCH 1/4] XArray: Initialize nodes while splitting instead of while allocating
* [RFC PATCH 2/4] XArray: Update xas_split_alloc() to allocate enough nodes to split large entries
* [RFC PATCH 3/4] XArray: Support splitting for arbitrarily large entries
* [RFC PATCH 4/4] XArray: test: Increase split order test range in check_split()

and found the following issue:
WARNING: kmalloc bug in bpf_prog_alloc_no_stats

Full report is available here:
https://ci.syzbot.org/series/aa74d39d-0773-4398-bb90-0a6d21365c3d

***

WARNING: kmalloc bug in bpf_prog_alloc_no_stats

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      41218ede767f6b218185af65ce919d0cade75f6b
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/c26972f6-b81e-4d6f-bead-3d77003cf075/config

------------[ cut here ]------------
Unexpected gfp: 0x400000 (__GFP_ACCOUNT). Fixing up to gfp: 0xdc0 (GFP_KERNEL|__GFP_ZERO). Fix your code!
WARNING: CPU: 0 PID: 6465 at mm/vmalloc.c:3938 vmalloc_fix_flags+0x9c/0xe0
Modules linked in:
CPU: 0 UID: 0 PID: 6465 Comm: syz-executor Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:vmalloc_fix_flags+0x9c/0xe0
Code: 81 e6 1f 52 ee ff 89 74 24 30 81 e3 e0 ad 11 00 89 5c 24 20 90 48 c7 c7 c0 b9 76 8b 4c 89 fa 89 d9 4d 89 f0 e8 75 2b 6e ff 90 <0f> 0b 90 90 8b 44 24 20 48 c7 04 24 0e 36 e0 45 4b c7 04 2c 00 00
RSP: 0018:ffffc90005d7fb00 EFLAGS: 00010246
RAX: 6e85c22fb4362300 RBX: 0000000000000dc0 RCX: ffff888176898000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
RBP: ffffc90005d7fb98 R08: ffff888121224293 R09: 1ffff11024244852
R10: dffffc0000000000 R11: ffffed1024244853 R12: 1ffff92000baff60
R13: dffffc0000000000 R14: ffffc90005d7fb20 R15: ffffc90005d7fb30
FS:  000055555be14500(0000) GS:ffff88818eb36000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f653e85c470 CR3: 00000001139ec000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __vmalloc_noprof+0xf2/0x120
 bpf_prog_alloc_no_stats+0x4a/0x4d0
 bpf_prog_alloc+0x3c/0x1a0
 bpf_prog_create_from_user+0xa7/0x440
 do_seccomp+0x7b1/0xd90
 __se_sys_prctl+0xc3c/0x1830
 do_syscall_64+0xfa/0xfa0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f653e990b0d
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 18 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 9d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 1b 48 8b 54 24 18 64 48 2b 14 25 28 00 00 00
RSP: 002b:00007fffbd3687c0 EFLAGS: 00000246 ORIG_RAX: 000000000000009d
RAX: ffffffffffffffda RBX: 00007f653ea2cf80 RCX: 00007f653e990b0d
RDX: 00007fffbd368820 RSI: 0000000000000002 RDI: 0000000000000016
RBP: 00007fffbd368830 R08: 0000000000000006 R09: 0000000000000071
R10: 0000000000000071 R11: 0000000000000246 R12: 000000000000006d
R13: 00007fffbd368c58 R14: 00007fffbd368ed8 R15: 0000000000000000
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.