On a NUMA system there may be many NUMA nodes and many CPUs.
For example, a Hygon server has 12 NUMA nodes and 384 CPUs.
The UnixBench suite includes an "execl" test, which exercises
the execve system call.
When we test our server with "./Run -c 384 execl",
the result is poor. The i_mmap locks are heavily contended on
"libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can hold
over 6000 VMAs, and those VMAs can be on different NUMA nodes.
The insert/remove operations do not run quickly enough.
Patches 1 and 2 hide direct access to i_mmap.
Patch 3 splits the i_mmap tree into sibling trees, each with its own lock.
With this patch set we get better performance on our NUMA server:
an improvement of over 400%.
I did not test the non-NUMA case, since I do not have such a server.
v1 --> v2:
Split not only the i_mmap tree, but also the lock.
v1 : https://lkml.org/lkml/2026/4/13/199
Huang Shijie (4):
mm: use mapping_mapped to simplify the code
mm: use get_i_mmap_root to access the file's i_mmap
mm/fs: split the file's i_mmap tree
docs/mm: update document for split i_mmap tree
Documentation/mm/process_addrs.rst | 63 +++++++---
arch/arm/mm/fault-armv.c | 3 +-
arch/arm/mm/flush.c | 3 +-
arch/nios2/mm/cacheflush.c | 3 +-
arch/parisc/kernel/cache.c | 4 +-
fs/Kconfig | 8 ++
fs/dax.c | 3 +-
fs/hugetlbfs/inode.c | 30 +++--
fs/inode.c | 75 +++++++++++-
include/linux/fs.h | 179 ++++++++++++++++++++++++++++-
include/linux/mm.h | 81 +++++++++++++
include/linux/mm_types.h | 3 +
kernel/events/uprobes.c | 3 +-
mm/hugetlb.c | 7 +-
mm/internal.h | 3 +-
mm/khugepaged.c | 6 +-
mm/memory-failure.c | 8 +-
mm/memory.c | 8 +-
mm/mmap.c | 11 +-
mm/nommu.c | 28 +++--
mm/pagewalk.c | 4 +-
mm/rmap.c | 2 +-
mm/vma.c | 74 +++++++++---
mm/vma_init.c | 3 +
24 files changed, 534 insertions(+), 78 deletions(-)
--
2.53.0
Hi Huang,

You seem to be replacing the file rmap altogether here, so you really ought
to have sent this as an RFC so we could discuss it as a community first.

Especially so as Pedro had publicly mentioned his plans to implement
something similar here, so coordination would have been appreciated.

Anyway, as Pedro has pointed out, the code is overly complicated, it's far
too configurable (not always a good thing), and the locking implementation
is questionable.

You seem to be adding a whole bunch of open-coded complexity too, which is
not something we want. Abstraction is key for the rmap.

You're also not adding any kdoc comments or really many comments at all,
and you've not added any tests (though perhaps it's difficult given how
core this is).

So I would suggest that perhaps any respin should be sent as an RFC so we
can engage in that conversation and ensure we're all on the same page?

Especially since Pedro plans to send an alternative, simpler, solution I
believe.

It's also not helpful that you haven't examined the non-NUMA case :)
perhaps your particular server behaves a certain way that this approach
aids, but regresses other NUMA configurations?

We'd really need to be sure of this before accepting invasive changes like
this.

Thanks, Lorenzo

On Thu, Jun 11, 2026 at 02:18:56PM +0800, Huang Shijie wrote:
> In NUMA, there are maybe many NUMA nodes and many CPUs.
> For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> In the UnixBench tests, there is a test "execl" which tests
> the execve system call.
>
> When we test our server with "./Run -c 384 execl",
> the test result is not good enough. The i_mmap locks contended heavily on
> "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> over 6000 VMAs, all the VMAs can be in different NUMA mode.
> The insert/remove operations do not run quickly enough.

You really need to send detailed, statistically valid numbers across
different NUMA configurations for changes like this to be considered.

>
> patch 1 & patch 2 are try to hide the direct access of i_mmap.
> patch 3 splits the i_mmap into sibling trees, each tree has separate lock,
> and we can get better performance with this patch set in our NUMA server:
> we can get over 400% performance improvement.
>
> I did not test the non-NUMA case, since I do not have such server.

Yeah this isn't a great thing to hear :) you need to demonstrate this
doesn't regress non-NUMA machines or NUMA machines of a different
configuration.

>
> v1 --> v2:
> Not only split the immap tree, but also split the lock.
> v1 : https://lkml.org/lkml/2026/4/13/199
>
> Huang Shijie (4):
> mm: use mapping_mapped to simplify the code
> mm: use get_i_mmap_root to access the file's i_mmap
> mm/fs: split the file's i_mmap tree
> docs/mm: update document for split i_mmap tree
>
> Documentation/mm/process_addrs.rst | 63 +++++++---
> arch/arm/mm/fault-armv.c | 3 +-
> arch/arm/mm/flush.c | 3 +-
> arch/nios2/mm/cacheflush.c | 3 +-
> arch/parisc/kernel/cache.c | 4 +-
> fs/Kconfig | 8 ++
> fs/dax.c | 3 +-
> fs/hugetlbfs/inode.c | 30 +++--
> fs/inode.c | 75 +++++++++++-
> include/linux/fs.h | 179 ++++++++++++++++++++++++++++-
> include/linux/mm.h | 81 +++++++++++++
> include/linux/mm_types.h | 3 +
> kernel/events/uprobes.c | 3 +-
> mm/hugetlb.c | 7 +-
> mm/internal.h | 3 +-
> mm/khugepaged.c | 6 +-
> mm/memory-failure.c | 8 +-
> mm/memory.c | 8 +-
> mm/mmap.c | 11 +-
> mm/nommu.c | 28 +++--
> mm/pagewalk.c | 4 +-
> mm/rmap.c | 2 +-
> mm/vma.c | 74 +++++++++---
> mm/vma_init.c | 3 +
> 24 files changed, 534 insertions(+), 78 deletions(-)

This is a _lot_ of changes you're making here. It therefore feels like the
abstraction is broken somewhat?

>
> --
> 2.53.0
>

Thanks, Lorenzo
On Thu, Jun 11, 2026 at 05:00:49PM +0100, Lorenzo Stoakes wrote:
> Hi Huang,
>
> You seem to be replacing the file rmap altogether here, so you really ought
> to have sent this as an RFC so we could discuss it as a community first.
No problem.

>
> Especially so as Pedro had publicly mentioned his plans to implement
> something similar here, so coordination would have been appreciated.
Yes. I am very happy to work with Pedro.

>
> Anyway, as Pedro has pointed out, the code is overly complicated, it's far
> too configurable (not always a good thing), and the locking implementation
> is questionable.
I can make the code more simple. :)

>
> You seem to be adding a whole bunch of open-coded complexity too, which is
> not something we want. Abstraction is key for the rmap.
>
> You're also not adding any kdoc comments or really many comments at all,
> and you've not added any tests (though perhaps it's difficult given how
> core this is).
Got it.

>
> So I would suggest that perhaps any respin should be sent as an RFC so we
> can engage in that conversation and ensure we're all on the same page?
>
> Especially since Pedro plans to send an alternative, simpler, solution I
> believe.
>
> It's also not helpful that you haven't examined the non-NUMA case :)
> perhaps your particular server behaves a certain way that this approach
> aids, but regresses other NUMA configurations?
emm. I ever hoped someone can help me to test this patch set on the
non-NUMA server.
It seems I should find some non-NUMA server before I send out the patch
set. :)

>
> We'd really need to be sure of this before accepting invasive changes like
> this.
Okay.

Thanks
Huang Shijie
syzbot ci has tested the following series

[v2] mm: split the file's i_mmap tree for NUMA
https://lore.kernel.org/all/20260611061915.2354307-1-huangsj@hygon.cn
* [PATCH v2 1/4] mm: use mapping_mapped to simplify the code
* [PATCH v2 2/4] mm: use get_i_mmap_root to access the file's i_mmap
* [PATCH v2 3/4] mm/fs: split the file's i_mmap tree
* [PATCH v2 4/4] docs/mm: update document for split i_mmap tree

and found the following issue:
INFO: trying to register non-static key in do_one_initcall

Full report is available here:
https://ci.syzbot.org/series/a9bada61-06e7-40d5-b423-5f2d69a60209

***

INFO: trying to register non-static key in do_one_initcall

tree: linux-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next
base: 14546c7bef6c1036fc82e36c1a200b0caccd339a
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/2f92f704-660a-4108-9172-7e620e10ce46/config

acpi PNP0A08:00: _OSC: platform does not support [PCIeHotplug LTR]
acpi PNP0A08:00: _OSC: OS now controls [PME AER PCIeCapability]
PCI host bridge to bus 0000:00
pci_bus 0000:00: Unknown NUMA node; performance will be reduced
pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7 window]
pci_bus 0000:00: root bus resource [io 0x0d00-0xffff window]
pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
pci_bus 0000:00: root bus resource [mem 0x80000000-0xafffffff window]
pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]
pci_bus 0000:00: root bus resource [mem 0x240000000-0xa3fffffff window]
pci_bus 0000:00: root bus resource [bus 00-ff]
pci 0000:00:00.0: [8086:29c0] type 00 class 0x060000 conventional PCI endpoint
pci 0000:00:01.0: [1234:1111] type 00 class 0x030000 conventional PCI endpoint
pci 0000:00:01.0: BAR 0 [mem 0xfd000000-0xfdffffff pref]
pci 0000:00:01.0: BAR 2 [mem 0xfebf0000-0xfebf0fff]
pci 0000:00:01.0: ROM [mem 0xfebe0000-0xfebeffff pref]
pci 0000:00:01.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
pci 0000:00:02.0: [1af4:1005] type 00 class 0x00ff00 conventional PCI endpoint
pci 0000:00:02.0: BAR 0 [io 0xc080-0xc09f]
pci 0000:00:02.0: BAR 1 [mem 0xfebf1000-0xfebf1fff]
pci 0000:00:02.0: BAR 4 [mem 0xfe000000-0xfe003fff 64bit pref]
pci 0000:00:03.0: [8086:100e] type 00 class 0x020000 conventional PCI endpoint
pci 0000:00:03.0: BAR 0 [mem 0xfebc0000-0xfebdffff]
pci 0000:00:03.0: BAR 1 [io 0xc000-0xc03f]
pci 0000:00:03.0: ROM [mem 0xfeb80000-0xfebbffff pref]
pci 0000:00:1f.0: [8086:2918] type 00 class 0x060100 conventional PCI endpoint
pci 0000:00:1f.0: quirk: [io 0x0600-0x067f] claimed by ICH6 ACPI/GPIO/TCO
pci 0000:00:1f.2: [8086:2922] type 00 class 0x010601 conventional PCI endpoint
pci 0000:00:1f.2: BAR 4 [io 0xc0a0-0xc0bf]
pci 0000:00:1f.2: BAR 5 [mem 0xfebf2000-0xfebf2fff]
pci 0000:00:1f.3: [8086:2930] type 00 class 0x0c0500 conventional PCI endpoint
pci 0000:00:1f.3: BAR 4 [io 0x0700-0x073f]
ACPI: PCI: Interrupt link LNKA configured for IRQ 10
ACPI: PCI: Interrupt link LNKB configured for IRQ 10
ACPI: PCI: Interrupt link LNKC configured for IRQ 11
ACPI: PCI: Interrupt link LNKD configured for IRQ 11
ACPI: PCI: Interrupt link LNKE configured for IRQ 10
ACPI: PCI: Interrupt link LNKF configured for IRQ 10
ACPI: PCI: Interrupt link LNKG configured for IRQ 11
ACPI: PCI: Interrupt link LNKH configured for IRQ 11
ACPI: PCI: Interrupt link GSIA configured for IRQ 16
ACPI: PCI: Interrupt link GSIB configured for IRQ 17
ACPI: PCI: Interrupt link GSIC configured for IRQ 18
ACPI: PCI: Interrupt link GSID configured for IRQ 19
ACPI: PCI: Interrupt link GSIE configured for IRQ 20
ACPI: PCI: Interrupt link GSIF configured for IRQ 21
ACPI: PCI: Interrupt link GSIG configured for IRQ 22
ACPI: PCI: Interrupt link GSIH configured for IRQ 23
iommu: Default domain type: Translated
iommu: DMA domain TLB invalidation policy: lazy mode
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150
 assign_lock_key+0x133/0x150
 register_lock_class+0xcc/0x2e0
 __lock_acquire+0xad/0x2cf0
 lock_acquire+0x106/0x350
 down_write+0x96/0x200
 dma_resv_lockdep+0x39c/0x660
 do_one_initcall+0x250/0x870
 do_initcall_level+0x104/0x190
 do_initcalls+0x59/0xa0
 kernel_init_freeable+0x2a6/0x3e0
 kernel_init+0x1d/0x1d0
 ret_from_fork+0x514/0xb70
 ret_from_fork_asm+0x1a/0x30
 </TASK>
------------[ cut here ]------------
DEBUG_RWSEMS_WARN_ON(sem->magic != sem): count = 0x1, magic = 0x0, owner = 0xffff888102a95940, curr 0xffff888102a95940, list not empty
WARNING: kernel/locking/rwsem.c:1405 at up_write+0x1e2/0x410, CPU#0: swapper/0/1
Modules linked in:
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:up_write+0x2b1/0x410
Code: c0 c0 e6 cc 8b 49 c7 c2 a0 e6 cc 8b 4c 0f 44 d0 48 8b 7c 24 10 48 c7 c6 40 e8 cc 8b 48 8b 54 24 08 48 8b 0c 24 4d 89 f9 41 52 <67> 48 0f b9 3a 48 83 c4 08 e8 21 1f 0d 03 e9 b2 fd ff ff 90 0f 0b
RSP: 0000:ffffc90000067480 EFLAGS: 00010246
RAX: ffffffff8bcce6c0 RBX: ffffc900000677d0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff8bcce840 RDI: ffffffff90338290
RBP: ffffc90000067830 R08: ffff888102a95940 R09: ffff888102a95940
R10: ffffffff8bcce6c0 R11: fffff5200000cefc R12: ffffc90000067828
R13: dffffc0000000000 R14: 1ffff9200000cf06 R15: ffff888102a95940
FS: 0000000000000000(0000) GS:ffff88818dc9e000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88823ffff000 CR3: 000000000e74a000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 dma_resv_lockdep+0x3a4/0x660
 do_one_initcall+0x250/0x870
 do_initcall_level+0x104/0x190
 do_initcalls+0x59/0xa0
 kernel_init_freeable+0x2a6/0x3e0
 kernel_init+0x1d/0x1d0
 ret_from_fork+0x514/0xb70
 ret_from_fork_asm+0x1a/0x30
 </TASK>

***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.