[RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
Posted by Ackerley Tng 2 weeks ago
guest_memfd is planning to store huge pages in the filemap, and its
use of huge pages involves splitting them into individual pages.
Splitting a huge page also requires splitting the filemap entries for
the pages being split.

To summarize the context of how these patches will be used,

+ guest_memfd stores huge pages (up to 1G pages) in the filemap.
+ During folio splitting, guest_memfd needs to split the folios, and
  approaches that by first splitting the filemap (XArray) entries that
  the folio occupies, then splitting the struct folios themselves.
+ Splitting a 1G folio to 4K folios requires splitting an entry in a
  shift-18 XArray node down to shift-0 nodes, which goes beyond 2
  levels of XArray nodes and is currently not supported (see the
  arithmetic below).
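
For concreteness, the level arithmetic (assuming XA_CHUNK_SHIFT == 6,
the usual value):

  1G / 4K = 2^18 slots to populate
  18 / XA_CHUNK_SHIFT = 3 levels of new nodes (shift 12, 6 and 0)

so a full 1G-to-4K split descends three node levels below the original
entry's node, one more than the two levels supported today.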

This work-in-progress series at [1] shows the context of how these
patches for XArray entry splitting will be used.

[1] https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring

This patch series extends xas_split_alloc() to allocate enough nodes
for splitting an XArray node beyond 2 levels, and extends xas_split()
to use those nodes when performing such a split; a usage sketch
follows below.
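
As a reminder of the calling convention (a minimal sketch, not taken
from the series; the target order travels in the xa_state while the
order argument is the entry's current order, and mapping and index
here are placeholders):

	XA_STATE(xas, &mapping->i_pages, index);	/* target order 0 */
	void *old = xa_load(&mapping->i_pages, index);

	/* Preallocate every node the split will need; this is the
	 * step the series extends beyond 2 levels. Sets -ENOMEM in
	 * the xa_state on failure. */
	xas_split_alloc(&xas, old, 18, GFP_KERNEL);
	if (!xas_error(&xas)) {
		xas_lock_irq(&xas);
		/* Consume the preallocated nodes; each new order-0
		 * slot initially points at the original entry. */
		xas_split(&xas, old, 18);
		xas_unlock_irq(&xas);
	}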

Merging of XArray entries can be performed with xa_store_order() at
the original order, and hence no change to the XArray library is
required.
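
For example (again a sketch with placeholder names; the merge is just
a multi-order store at the original order, shown open-coded with
xas_set_order()/xas_store(), and it assumes the shift-18 node already
exists so no allocation is needed under the lock):

	XA_STATE(xas, &mapping->i_pages, index);

	xas_lock_irq(&xas);
	xas_set_order(&xas, index, 18);
	/* One store at order 18 replaces all the split entries below
	 * it with a single entry and frees the child nodes. */
	xas_store(&xas, folio);
	xas_unlock_irq(&xas);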

xas_destroy() cleans up any allocated and unused nodes in struct
xa_state, and so no further changes are necessary there.

Please let me know:

+ If this extension is welcome
+ Your thoughts on the approach: is it too many nodes to allocate at
  once? Would a recursive implementation be preferred?
+ If there are any bugs, particularly around how xas_split() interacts
  with the LRU


Thank you!


Ackerley Tng (4):
  XArray: Initialize nodes while splitting instead of while allocating
  XArray: Update xas_split_alloc() to allocate enough nodes to split
    large entries
  XArray: Support splitting for arbitrarily large entries
  XArray: test: Increase split order test range in check_split()

 lib/test_xarray.c |   6 +-
 lib/xarray.c      | 210 ++++++++++++++++++++++++++++++++++------------
 2 files changed, 162 insertions(+), 54 deletions(-)

--
2.52.0.rc1.455.g30608eb744-goog
Re: [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
Posted by Matthew Wilcox 2 weeks ago
On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
> guest_memfd is planning to store huge pages in the filemap, and its
> use of huge pages involves splitting them into individual pages.
> Splitting a huge page also requires splitting the filemap entries for
> the pages being split.

Hm, I'm not most concerned about the number of nodes you're allocating.
I'm most concerned that, once we have memdescs, splitting a 1GB page
into 512 * 512 4kB pages is going to involve allocating about 20MB
of memory (80 bytes * 512 * 512).  Is this necessary to do all at once?
Re: [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
Posted by Ackerley Tng 2 weeks ago
Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
>> guest_memfd is planning to store huge pages in the filemap, and its
>> use of huge pages involves splitting them into individual pages.
>> Splitting a huge page also requires splitting the filemap entries for
>> the pages being split.

>
> Hm, I'm not most concerned about the number of nodes you're allocating.

Thanks for reminding me, I left this out of the original message.

Splitting the xarray entry for a 1G folio (in a shift-18 node for
order=18 on x86), assuming XA_CHUNK_SHIFT is 6, would involve

+ shift-18 node (the original node will be reused - no new allocations)
+ shift-12 node: 1 node allocated
+ shift-6 node : 64 nodes allocated
+ shift-0 node : 64 * 64 = 4096 nodes allocated

This brings the total to 4161 newly allocated nodes. struct xa_node is
576 bytes, so that's 2396736 bytes, or about 2.3 MB, just in filemap
(XArray) entry splitting when going from a 1G folio to 4K pages. The
other large memory cost would be from undoing HVO for the HugeTLB
folio.
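
As a back-of-the-envelope check (a throwaway helper, not from the
series, valid when both orders are multiples of XA_CHUNK_SHIFT):

	static unsigned long nodes_for_split(int old_order, int new_order)
	{
		unsigned long per_level = 1, total = 0;
		int shift;

		/* One iteration per new node level below the original
		 * entry's node. */
		for (shift = old_order - 6; shift >= new_order; shift -= 6) {
			total += per_level;	/* nodes added at this shift */
			per_level *= 64;	/* each node fans out 64 ways */
		}
		return total;
	}

nodes_for_split(18, 0) returns 1 + 64 + 4096 = 4161, matching the
total above.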

> I'm most concerned that, once we have memdescs, splitting a 1GB page
> into 512 * 512 4kB pages is going to involve allocating about 20MB
> of memory (80 bytes * 512 * 512).

I definitely need to catch up on memdescs. What's the best place for me
to learn/get an overview of how memdescs will describe memory/replace
struct folios?

I think there might be a better way to solve the original problem of
usage tracking with memdesc support, but this was intended to make
progress before memdescs.

> Is this necessary to do all at once?

The plan for guest_memfd was to first split from 1G to 4K, then optimize
on that by splitting in stages, from 1G to 2M as much as possible, then
to 4K only for the page ranges that the guest shared with the host.
Re: [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
Posted by David Hildenbrand (Red Hat) 1 week, 6 days ago
On 18.11.25 00:43, Ackerley Tng wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
>> On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
>>> guest_memfd is planning to store huge pages in the filemap, and its
>>> use of huge pages involves splitting them into individual pages.
>>> Splitting a huge page also requires splitting the filemap entries for
>>> the pages being split.
> 
>>
>> Hm, I'm not most concerned about the number of nodes you're allocating.
> 
> Thanks for reminding me, I left this out of the original message.
> 
> Splitting the xarray entry for a 1G folio (in a shift-18 node for
> order=18 on x86), assuming XA_CHUNK_SHIFT is 6, would involve
> 
> + shift-18 node (the original node will be reused - no new allocations)
> + shift-12 node: 1 node allocated
> + shift-6 node : 64 nodes allocated
> + shift-0 node : 64 * 64 = 4096 nodes allocated
> 
> This brings the total to 4161 newly allocated nodes. struct xa_node is
> 576 bytes, so that's 2396736 bytes, or about 2.3 MB, just in filemap
> (XArray) entry splitting when going from a 1G folio to 4K pages. The
> other large memory cost would be from undoing HVO for the HugeTLB
> folio.
> 
>> I'm most concerned that, once we have memdescs, splitting a 1GB page
>> into 512 * 512 4kB pages is going to involve allocating about 20MB
>> of memory (80 bytes * 512 * 512).
> 
> I definitely need to catch up on memdescs. What's the best place for me
> to learn/get an overview of how memdescs will describe memory/replace
> struct folios?
> 
> I think there might be a better way to solve the original problem of
> usage tracking with memdesc support, but this was intended to make
> progress before memdescs.
> 
>> Is this necessary to do all at once?
> 
> The plan for guest_memfd was to first split from 1G to 4K, then optimize
> on that by splitting in stages, from 1G to 2M as much as possible, then
> to 4K only for the page ranges that the guest shared with the host.

Right, we also discussed the non-uniform split as a future
optimization.

-- 
Cheers

David
[syzbot ci] Re: Extend xas_split* to support splitting arbitrarily large entries
Posted by syzbot ci 1 week, 6 days ago
syzbot ci has tested the following series

[v1] Extend xas_split* to support splitting arbitrarily large entries
https://lore.kernel.org/all/20251117224701.1279139-1-ackerleytng@google.com
* [RFC PATCH 1/4] XArray: Initialize nodes while splitting instead of while allocating
* [RFC PATCH 2/4] XArray: Update xas_split_alloc() to allocate enough nodes to split large entries
* [RFC PATCH 3/4] XArray: Support splitting for arbitrarily large entries
* [RFC PATCH 4/4] XArray: test: Increase split order test range in check_split()

and found the following issue:
WARNING: kmalloc bug in bpf_prog_alloc_no_stats

Full report is available here:
https://ci.syzbot.org/series/aa74d39d-0773-4398-bb90-0a6d21365c3d

***

WARNING: kmalloc bug in bpf_prog_alloc_no_stats

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      41218ede767f6b218185af65ce919d0cade75f6b
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/c26972f6-b81e-4d6f-bead-3d77003cf075/config

------------[ cut here ]------------
Unexpected gfp: 0x400000 (__GFP_ACCOUNT). Fixing up to gfp: 0xdc0 (GFP_KERNEL|__GFP_ZERO). Fix your code!
WARNING: CPU: 0 PID: 6465 at mm/vmalloc.c:3938 vmalloc_fix_flags+0x9c/0xe0
Modules linked in:
CPU: 0 UID: 0 PID: 6465 Comm: syz-executor Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:vmalloc_fix_flags+0x9c/0xe0
Code: 81 e6 1f 52 ee ff 89 74 24 30 81 e3 e0 ad 11 00 89 5c 24 20 90 48 c7 c7 c0 b9 76 8b 4c 89 fa 89 d9 4d 89 f0 e8 75 2b 6e ff 90 <0f> 0b 90 90 8b 44 24 20 48 c7 04 24 0e 36 e0 45 4b c7 04 2c 00 00
RSP: 0018:ffffc90005d7fb00 EFLAGS: 00010246
RAX: 6e85c22fb4362300 RBX: 0000000000000dc0 RCX: ffff888176898000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
RBP: ffffc90005d7fb98 R08: ffff888121224293 R09: 1ffff11024244852
R10: dffffc0000000000 R11: ffffed1024244853 R12: 1ffff92000baff60
R13: dffffc0000000000 R14: ffffc90005d7fb20 R15: ffffc90005d7fb30
FS:  000055555be14500(0000) GS:ffff88818eb36000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f653e85c470 CR3: 00000001139ec000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __vmalloc_noprof+0xf2/0x120
 bpf_prog_alloc_no_stats+0x4a/0x4d0
 bpf_prog_alloc+0x3c/0x1a0
 bpf_prog_create_from_user+0xa7/0x440
 do_seccomp+0x7b1/0xd90
 __se_sys_prctl+0xc3c/0x1830
 do_syscall_64+0xfa/0xfa0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f653e990b0d
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 18 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 9d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 1b 48 8b 54 24 18 64 48 2b 14 25 28 00 00 00
RSP: 002b:00007fffbd3687c0 EFLAGS: 00000246 ORIG_RAX: 000000000000009d
RAX: ffffffffffffffda RBX: 00007f653ea2cf80 RCX: 00007f653e990b0d
RDX: 00007fffbd368820 RSI: 0000000000000002 RDI: 0000000000000016
RBP: 00007fffbd368830 R08: 0000000000000006 R09: 0000000000000071
R10: 0000000000000071 R11: 0000000000000246 R12: 000000000000006d
R13: 00007fffbd368c58 R14: 00007fffbd368ed8 R15: 0000000000000000
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.