Documentation/core-api/index.rst | 1 + Documentation/core-api/maple_tree.rst | 218 + MAINTAINERS | 12 + arch/arm64/kernel/vdso.c | 3 +- arch/parisc/kernel/cache.c | 9 +- arch/powerpc/kernel/vdso.c | 6 +- arch/powerpc/mm/book3s32/tlb.c | 11 +- arch/powerpc/mm/book3s64/subpage_prot.c | 13 +- arch/riscv/kernel/vdso.c | 3 +- arch/s390/kernel/vdso.c | 3 +- arch/s390/mm/gmap.c | 6 +- arch/um/kernel/tlb.c | 14 +- arch/x86/entry/vdso/vma.c | 9 +- arch/x86/kernel/tboot.c | 2 +- arch/xtensa/kernel/syscall.c | 18 +- drivers/firmware/efi/efi.c | 2 +- drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 14 +- drivers/misc/cxl/fault.c | 45 +- drivers/tee/optee/call.c | 18 +- drivers/xen/privcmd.c | 2 +- fs/binfmt_elf.c | 6 +- fs/coredump.c | 33 +- fs/exec.c | 12 +- fs/proc/base.c | 5 +- fs/proc/internal.h | 2 +- fs/proc/task_mmu.c | 74 +- fs/proc/task_nommu.c | 45 +- fs/userfaultfd.c | 49 +- include/linux/maple_tree.h | 683 + include/linux/mm.h | 77 +- include/linux/mm_types.h | 43 +- include/linux/mm_types_task.h | 12 - include/linux/sched.h | 1 - include/linux/userfaultfd_k.h | 7 +- include/linux/vm_event_item.h | 4 - include/linux/vmacache.h | 28 - include/linux/vmstat.h | 6 - include/linux/xarray.h | 1 + include/trace/events/maple_tree.h | 123 + include/trace/events/mmap.h | 71 + init/main.c | 2 + ipc/shm.c | 21 +- kernel/acct.c | 11 +- kernel/bpf/task_iter.c | 10 +- kernel/debug/debug_core.c | 12 - kernel/events/core.c | 3 +- kernel/events/uprobes.c | 9 +- kernel/fork.c | 58 +- kernel/sched/fair.c | 10 +- lib/Kconfig.debug | 18 +- lib/Makefile | 3 +- lib/maple_tree.c | 6967 +++ lib/test_maple_tree.c | 37398 ++++++++++++++++ mm/Makefile | 2 +- mm/damon/vaddr-test.h | 37 +- mm/damon/vaddr.c | 53 +- mm/debug.c | 14 +- mm/gup.c | 9 +- mm/huge_memory.c | 4 +- mm/init-mm.c | 4 +- mm/internal.h | 78 +- mm/khugepaged.c | 13 +- mm/ksm.c | 18 +- mm/madvise.c | 2 +- mm/memcontrol.c | 6 +- mm/memory.c | 33 +- mm/mempolicy.c | 58 +- mm/mlock.c | 34 +- mm/mmap.c | 2086 +- mm/mprotect.c | 7 +- mm/mremap.c | 22 +- mm/msync.c | 2 +- mm/nommu.c | 127 +- mm/oom_kill.c | 3 +- mm/pagewalk.c | 2 +- mm/swapfile.c | 4 +- mm/util.c | 32 - mm/vmacache.c | 117 - mm/vmstat.c | 4 - tools/testing/radix-tree/.gitignore | 2 + tools/testing/radix-tree/Makefile | 9 +- tools/testing/radix-tree/generated/autoconf.h | 1 + tools/testing/radix-tree/linux.c | 160 +- tools/testing/radix-tree/linux/kernel.h | 1 + tools/testing/radix-tree/linux/lockdep.h | 2 + tools/testing/radix-tree/linux/maple_tree.h | 7 + tools/testing/radix-tree/linux/slab.h | 4 + tools/testing/radix-tree/maple.c | 59 + .../radix-tree/trace/events/maple_tree.h | 3 + 89 files changed, 47390 insertions(+), 1842 deletions(-) create mode 100644 Documentation/core-api/maple_tree.rst create mode 100644 include/linux/maple_tree.h delete mode 100644 include/linux/vmacache.h create mode 100644 include/trace/events/maple_tree.h create mode 100644 lib/maple_tree.c create mode 100644 lib/test_maple_tree.c delete mode 100644 mm/vmacache.c create mode 100644 tools/testing/radix-tree/linux/maple_tree.h create mode 100644 tools/testing/radix-tree/maple.c create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. The first user that is covered in this
patch set is the vm_area_struct, where three data structures are
replaced by the maple tree: the augmented rbtree, the vma cache, and the
linked list of VMAs in the mm_struct. The long term goal is to reduce
or remove the mmap_sem contention.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter than
the rbtree so it has fewer cache misses. The removal of the linked list
between subsequent entries also reduces the cache misses and the need to pull
in the previous and next VMA during many tree alterations.
This patch is based on v5.17-rc4
git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214
v6 changes:
- Added patch for xarray testcode which should be dropped for upstream
fix. - Thanks Matthew Wilcox
- Fixed issue with maple state index/last setting when the tree is just
a pointer - Thanks David Howells
- Changed internal RCU handling to not check flags more than once
- Fixed mas_prev() underflow issue - Thanks David Howells
- Fixed returns on mas_prev()/mas_next() when there is only a value at
0 index - Thanks David Howells
- Fixed mas_find_rev() running past minimum value - Thanks David Howells
- Fixed testing code function rename - Thanks Mark Hemment
- Reworked brk() and vm_brk_flags() as suggested - Thanks Vlastimil
Babka
- Documentation fixes - Thanks Mike Rapoport
- Separated test code from tree code - Thanks Vlastimil Babka
- Moved maple_tree_init() call into the maple tree patch for other users
- Thanks David Howells
- Fixed copyright date - Thanks Mike Rapoport
- Whitespace fixes in comments, reduced changes in other locations.
- Added missing kdocs - Thanks Mike Rapoport
- Fixed exit_mmap comment - Thanks Mark Hemment
- Removed RCU tracking as there is an issue with atomic increments of
mmget() - Thanks Mark Hemment for initial issue report that allowed me
to discover this. RCU was not being used, so it is disabled for VMA
tracking for now.
- Removed unnecessary assignment in mtree_range_walk() - Thanks JaeJoon
Jung
- Dropped ma_xa_benchmark from testing Makefile - Thanks JaeJoon Jung
- Fixed leaf gap calculation in rare case of underflow
- Fixed mmap_region() bug on merging of prev
- Fixed mmap_region() bug on khugepaged_enter_vma_merge()
- Added test cases for mas_find_rev(), mas_prev(), mas_next, and
mas_root_expand() to test suite.
- Updated config options and ifdefs to allow other maple tree users to
debug maple tree without debugging the VM maple tree.
v5: https://lore.kernel.org/linux-mm/20220203172051.i2jnhnkudzssdsxg@revolver/T/
v4: https://lore.kernel.org/linux-mm/20211201142918.921493-30-Liam.Howlett@oracle.com/t/
v3: https://lore.kernel.org/linux-mm/20211005012959.1110504-1-Liam.Howlett@oracle.com/
v2: https://lore.kernel.org/linux-mm/20210817154651.1570984-1-Liam.Howlett@oracle.com/
v1: https://lore.kernel.org/linux-mm/20210428153542.2814175-1-Liam.Howlett@Oracle.com/
Performance on a 144 core x86:
It is important to note that the code is still using the mmap_sem, the
performance seems fairly similar on real-world workloads, while there
are variations in micro-benchmarks.
kernbench, increased system time, less user time:
Amean user-2 885.34 ( 0.00%) 886.07 * -0.08%*
Amean syst-2 161.95 ( 0.00%) 168.19 * -3.85%*
Amean elsp-2 530.06 ( 0.00%) 532.96 * -0.55%*
Amean user-4 908.58 ( 0.00%) 908.88 * -0.03%*
Amean syst-4 167.56 ( 0.00%) 173.72 * -3.68%*
Amean elsp-4 277.21 ( 0.00%) 277.38 * -0.06%*
Amean user-8 961.84 ( 0.00%) 962.45 * -0.06%*
Amean syst-8 176.40 ( 0.00%) 183.43 * -3.99%*
Amean elsp-8 150.59 ( 0.00%) 151.21 * -0.41%*
Amean user-16 1040.15 ( 0.00%) 1039.89 * 0.02%*
Amean syst-16 188.19 ( 0.00%) 193.81 * -2.99%*
Amean elsp-16 86.85 ( 0.00%) 86.32 * 0.61%*
Amean user-32 1240.46 ( 0.00%) 1233.93 * 0.53%*
Amean syst-32 217.15 ( 0.00%) 222.99 * -2.69%*
Amean elsp-32 55.16 ( 0.00%) 55.09 * 0.12%*
Amean user-64 1241.17 ( 0.00%) 1234.26 * 0.56%*
Amean syst-64 215.11 ( 0.00%) 220.76 * -2.63%*
Amean elsp-64 32.88 ( 0.00%) 33.72 * -2.57%*
Amean user-128 1613.09 ( 0.00%) 1609.63 * 0.21%*
Amean syst-128 267.10 ( 0.00%) 276.72 * -3.60%*
Amean elsp-128 25.80 ( 0.00%) 26.09 * -1.10%*
Mixed Hmean results:
- freqmine-medium -3.50% to +12.82%
- malloc1-processes: -14.50% to +6.53%
- signal1-processes -1.87% to +14.02%
- page_fault3-threads -6.46% to +26.15%
- pthread_mutex1-threads -16.81% to +28.63%
Decrease in performance in the following micro-benchmarks in Hmean:
- brk1-processes -37.42% to -44.16%
- malloc1-threads -18.27% to -23.08%
Modifications are more expensive so the micro-benchmarks that write but
do not use the data will be negatively affected.
Patch organization:
Patch 1 is to add a missing lock to avoid an assert issue when using a vma iterator.
Patch 2 is an xarray fix due to bitmap header changes which will be
dropped for a pending upstream fix.
Patches 3 to 7 are radix tree test suite additions for maple tree
support.
Patch 8 adds the maple tree.
Patch 9 adds the maple tree test code.
Patches 10-19 are the removal of the rbtree from the mm_struct. This
now includes the introduction of the vma iterator.
Patch 20 optimizes __vma_adjust() for the maple tree.
Patches 21-27 are the removal of the vmacache from the kernel.
Patches 28-31 are internal mm changes for efficiencies.
Patches 32-69 are the removal of the linked list
Patches 70 and 71 are a small cleanup from the removal of the vma linked list.
Liam R. Howlett (61):
xarray: Fix bitmap breakage
radix tree test suite: Add pr_err define
radix tree test suite: Add kmem_cache_set_non_kernel()
radix tree test suite: Add allocation counts and size to kmem_cache
radix tree test suite: Add support for slab bulk APIs
radix tree test suite: Add lockdep_is_held to header
Maple Tree: Add new data structure
lib/test_maple_tree: Add testing for maple tree
mm: Start tracking VMAs with maple tree
mm/mmap: Use the maple tree in find_vma() instead of the rbtree.
mm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree
mm/mmap: Use maple tree for unmapped_area{_topdown}
kernel/fork: Use maple tree for dup_mmap() during forking
mm: Remove rb tree.
mmap: Change zeroing of maple tree in __vma_adjust()
xen: Use vma_lookup() in privcmd_ioctl_mmap()
mm: Optimize find_exact_vma() to use vma_lookup()
mm/khugepaged: Optimize collapse_pte_mapped_thp() by using
vma_lookup()
mm/mmap: Change do_brk_flags() to expand existing VMA and add
do_brk_munmap()
mm: Use maple tree operations for find_vma_intersection()
mm/mmap: Use advanced maple tree API for mmap_region()
mm: Remove vmacache
mm/mmap: Move mmap_region() below do_munmap()
mm/mmap: Reorganize munmap to use maple states
mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()
arm64: Remove mmap linked list from vdso
parisc: Remove mmap linked list from cache handling
powerpc: Remove mmap linked list walks
s390: Remove vma linked list walks
x86: Remove vma linked list walks
xtensa: Remove vma linked list walks
cxl: Remove vma linked list walk
optee: Remove vma linked list walk
um: Remove vma linked list walk
binfmt_elf: Remove vma linked list walk
exec: Use VMA iterator instead of linked list
fs/proc/base: Use maple tree iterators in place of linked list
userfaultfd: Use maple tree iterator to iterate VMAs
ipc/shm: Use VMA iterator instead of linked list
acct: Use VMA iterator instead of linked list
perf: Use VMA iterator
sched: Use maple tree iterator to walk VMAs
fork: Use VMA iterator
bpf: Remove VMA linked list
mm/gup: Use maple tree navigation instead of linked list
mm/khugepaged: Stop using vma linked list
mm/ksm: Use vma iterators instead of vma linked list
mm/madvise: Use vma_find() instead of vma linked list
mm/memcontrol: Stop using mm->highest_vm_end
mm/mempolicy: Use vma iterator & maple state instead of vma linked
list
mm/mlock: Use vma iterator and instead of vma linked list
mm/mprotect: Use maple tree navigation instead of vma linked list
mm/mremap: Use vma_find_intersection() instead of vma linked list
mm/msync: Use vma_find() instead of vma linked list
mm/oom_kill: Use maple tree iterators instead of vma linked list
mm/pagewalk: Use vma_find() instead of vma linked list
mm/swapfile: Use vma iterator instead of vma linked list
riscv: Use vma iterator for vdso
mm: Remove the vma linked list
mm/mmap: Drop range_has_overlap() function
mm/mmap.c: Pass in mapping to __vma_link_file()
Matthew Wilcox (Oracle) (10):
binfmt_elf: Take the mmap lock when walking the VMA list
mm: Add VMA iterator
mmap: Use the VMA iterator in count_vma_pages_range()
damon: Convert __damon_va_three_regions to use the VMA iterator
proc: Remove VMA rbtree use from nommu
mm: Convert vma_lookup() to use mtree_load()
coredump: Remove vma linked list walk
fs/proc/task_mmu: Stop using linked list and highest_vm_end
i915: Use the VMA iterator
nommu: Remove uses of VMA linked list
Documentation/core-api/index.rst | 1 +
Documentation/core-api/maple_tree.rst | 218 +
MAINTAINERS | 12 +
arch/arm64/kernel/vdso.c | 3 +-
arch/parisc/kernel/cache.c | 9 +-
arch/powerpc/kernel/vdso.c | 6 +-
arch/powerpc/mm/book3s32/tlb.c | 11 +-
arch/powerpc/mm/book3s64/subpage_prot.c | 13 +-
arch/riscv/kernel/vdso.c | 3 +-
arch/s390/kernel/vdso.c | 3 +-
arch/s390/mm/gmap.c | 6 +-
arch/um/kernel/tlb.c | 14 +-
arch/x86/entry/vdso/vma.c | 9 +-
arch/x86/kernel/tboot.c | 2 +-
arch/xtensa/kernel/syscall.c | 18 +-
drivers/firmware/efi/efi.c | 2 +-
drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 14 +-
drivers/misc/cxl/fault.c | 45 +-
drivers/tee/optee/call.c | 18 +-
drivers/xen/privcmd.c | 2 +-
fs/binfmt_elf.c | 6 +-
fs/coredump.c | 33 +-
fs/exec.c | 12 +-
fs/proc/base.c | 5 +-
fs/proc/internal.h | 2 +-
fs/proc/task_mmu.c | 74 +-
fs/proc/task_nommu.c | 45 +-
fs/userfaultfd.c | 49 +-
include/linux/maple_tree.h | 683 +
include/linux/mm.h | 77 +-
include/linux/mm_types.h | 43 +-
include/linux/mm_types_task.h | 12 -
include/linux/sched.h | 1 -
include/linux/userfaultfd_k.h | 7 +-
include/linux/vm_event_item.h | 4 -
include/linux/vmacache.h | 28 -
include/linux/vmstat.h | 6 -
include/linux/xarray.h | 1 +
include/trace/events/maple_tree.h | 123 +
include/trace/events/mmap.h | 71 +
init/main.c | 2 +
ipc/shm.c | 21 +-
kernel/acct.c | 11 +-
kernel/bpf/task_iter.c | 10 +-
kernel/debug/debug_core.c | 12 -
kernel/events/core.c | 3 +-
kernel/events/uprobes.c | 9 +-
kernel/fork.c | 58 +-
kernel/sched/fair.c | 10 +-
lib/Kconfig.debug | 18 +-
lib/Makefile | 3 +-
lib/maple_tree.c | 6967 +++
lib/test_maple_tree.c | 37398 ++++++++++++++++
mm/Makefile | 2 +-
mm/damon/vaddr-test.h | 37 +-
mm/damon/vaddr.c | 53 +-
mm/debug.c | 14 +-
mm/gup.c | 9 +-
mm/huge_memory.c | 4 +-
mm/init-mm.c | 4 +-
mm/internal.h | 78 +-
mm/khugepaged.c | 13 +-
mm/ksm.c | 18 +-
mm/madvise.c | 2 +-
mm/memcontrol.c | 6 +-
mm/memory.c | 33 +-
mm/mempolicy.c | 58 +-
mm/mlock.c | 34 +-
mm/mmap.c | 2086 +-
mm/mprotect.c | 7 +-
mm/mremap.c | 22 +-
mm/msync.c | 2 +-
mm/nommu.c | 127 +-
mm/oom_kill.c | 3 +-
mm/pagewalk.c | 2 +-
mm/swapfile.c | 4 +-
mm/util.c | 32 -
mm/vmacache.c | 117 -
mm/vmstat.c | 4 -
tools/testing/radix-tree/.gitignore | 2 +
tools/testing/radix-tree/Makefile | 9 +-
tools/testing/radix-tree/generated/autoconf.h | 1 +
tools/testing/radix-tree/linux.c | 160 +-
tools/testing/radix-tree/linux/kernel.h | 1 +
tools/testing/radix-tree/linux/lockdep.h | 2 +
tools/testing/radix-tree/linux/maple_tree.h | 7 +
tools/testing/radix-tree/linux/slab.h | 4 +
tools/testing/radix-tree/maple.c | 59 +
.../radix-tree/trace/events/maple_tree.h | 3 +
89 files changed, 47390 insertions(+), 1842 deletions(-)
create mode 100644 Documentation/core-api/maple_tree.rst
create mode 100644 include/linux/maple_tree.h
delete mode 100644 include/linux/vmacache.h
create mode 100644 include/trace/events/maple_tree.h
create mode 100644 lib/maple_tree.c
create mode 100644 lib/test_maple_tree.c
delete mode 100644 mm/vmacache.c
create mode 100644 tools/testing/radix-tree/linux/maple_tree.h
create mode 100644 tools/testing/radix-tree/maple.c
create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h
--
2.34.1
On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote: > The maple tree is an RCU-safe range based B-tree designed to use modern > processor cache efficiently. There are a number of places in the kernel > that a non-overlapping range-based tree would be beneficial, especially > one with a simple interface. The first user that is covered in this > patch set is the vm_area_struct, where three data structures are > replaced by the maple tree: the augmented rbtree, the vma cache, and the > linked list of VMAs in the mm_struct. The long term goal is to reduce > or remove the mmap_sem contention. Has a path been chosen which gets us from here to significant reduction in mmap_lock overhead? If so, what's the plan and what must be done? Thanks.
On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote: > On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote: > > > The maple tree is an RCU-safe range based B-tree designed to use modern > > processor cache efficiently. There are a number of places in the kernel > > that a non-overlapping range-based tree would be beneficial, especially > > one with a simple interface. The first user that is covered in this > > patch set is the vm_area_struct, where three data structures are > > replaced by the maple tree: the augmented rbtree, the vma cache, and the > > linked list of VMAs in the mm_struct. The long term goal is to reduce > > or remove the mmap_sem contention. > > Has a path been chosen which gets us from here to significant reduction > in mmap_lock overhead? > > If so, what's the plan and what must be done? I would say there are still competing ideas for how that is to be done. First, the Maple Tree is independent (to a certain extent) of all the approaches to reducing mmap_lock overhead. It's a better data structure for VMA lookup than the rbtree. Set against that, it has higher overhead for modifications. That means that benchmarks which primarily measure modification overhead see worse performance. We believe this is not representative of real workloads, and indeed we see ~parity on workloads like git and make -jN. The advantage to this patch set is that we get rid of the vmacache entirely, we get rid of the doubly-linked list chaining VMAs together and we get rid of the rbtree usage. And we add a new generic data structure that hopefully I can point people towards instead of using the XArray inappropriately. Both of the approaches to avoiding taking the mmap_lock in the fault path will benefit from using the maple tree for VMAs. Fewer cache misses and RCU safety are things we all want. The maple tree will also benefit if/when either approach is merged; because modifications to the maple tree are more heavyweight than the rbtree, they block readers for longer. The SPF approach is further along. It has been tested in the past with the maple tree, but not in the most recent iteration of either. It doesn't rely on the maple tree, but it does benefit. The RCU-freeing-of-VMAs approach that I favour has not received much attention. I've been busy with other things ;-) It offers much more opportunity for performance improvement, but will take longer to arrive.
On Wed, Feb 16, 2022 at 08:24:34PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> > On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > > The maple tree is an RCU-safe range based B-tree designed to use modern
> > > processor cache efficiently. There are a number of places in the kernel
> > > that a non-overlapping range-based tree would be beneficial, especially
> > > one with a simple interface. The first user that is covered in this
> > > patch set is the vm_area_struct, where three data structures are
> > > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > > linked list of VMAs in the mm_struct. The long term goal is to reduce
> > > or remove the mmap_sem contention.
> >
> > Has a path been chosen which gets us from here to significant reduction
> > in mmap_lock overhead?
> >
> > If so, what's the plan and what must be done?
>
> I would say there are still competing ideas for how that is to be done.
>
> First, the Maple Tree is independent (to a certain extent) of all the
> approaches to reducing mmap_lock overhead. It's a better data structure
> for VMA lookup than the rbtree. Set against that, it has higher overhead
> for modifications. That means that benchmarks which primarily measure
> modification overhead see worse performance. We believe this is not
> representative of real workloads, and indeed we see ~parity on workloads
> like git and make -jN.
>
I'm way behind and only got around to looking at SPF properly today. Maple
is working its way up my list and I need to gather new data but I did
have old data queued from a time when I thought I would get around to
maple tree soon. The big result that stood was was brk performance from
will-it-scale but for processes specifically
wis-malloc
5.17.0-rc3 5.17.0-rc3
vanilla mm-maplevma-v5r1
Hmean brk1-processes-2 5265361.84 ( 0.00%) 3489097.63 * -33.73%*
Hmean brk1-processes-5 12270382.36 ( 0.00%) 8136542.13 * -33.69%*
Hmean brk1-processes-8 19560585.95 ( 0.00%) 12941094.22 * -33.84%*
Hmean brk1-processes-12 27328267.18 ( 0.00%) 18148145.64 * -33.59%*
Hmean brk1-processes-21 39599378.04 ( 0.00%) 26188063.96 * -33.87%*
Hmean brk1-processes-30 59880571.20 ( 0.00%) 39653658.00 * -33.78%*
Hmean brk1-processes-48 72900335.59 ( 0.00%) 49044880.84 * -32.72%*
Hmean brk1-processes-79 72388042.42 ( 0.00%) 48416196.93 * -33.12%*
Hmean brk1-processes-110 72333206.18 ( 0.00%) 48437222.12 * -33.04%*
Hmean brk1-processes-141 72443841.68 ( 0.00%) 48490158.48 * -33.07%*
Hmean brk1-processes-172 72187451.54 ( 0.00%) 48480696.77 * -32.84%*
Hmean brk1-processes-203 72575945.95 ( 0.00%) 48503896.03 * -33.17%*
Hmean brk1-processes-234 67509812.97 ( 0.00%) 48466347.95 * -28.21%*
Hmean brk1-processes-265 72379447.48 ( 0.00%) 48449330.06 * -33.06%*
Hmean brk1-processes-296 72241861.59 ( 0.00%) 48303854.85 * -33.14%*
Hmean brk1-processes-320 72183320.07 ( 0.00%) 48253708.21 * -33.15%*
Hmean brk1-threads-2 1352297.67 ( 0.00%) 2963406.42 * 119.14%*
Hmean brk1-threads-5 814358.55 ( 0.00%) 1233604.84 * 51.48%*
Hmean brk1-threads-8 849900.52 ( 0.00%) 1241170.76 * 46.04%*
Hmean brk1-threads-12 804143.24 ( 0.00%) 1232451.60 * 53.26%*
Hmean brk1-threads-21 672680.28 ( 0.00%) 1001257.38 * 48.85%*
Hmean brk1-threads-30 443221.73 ( 0.00%) 713592.66 * 61.00%*
Hmean brk1-threads-48 397283.95 ( 0.00%) 591349.83 * 48.85%*
Hmean brk1-threads-79 167479.04 ( 0.00%) 548681.03 * 227.61%*
Hmean brk1-threads-110 154108.98 ( 0.00%) 207349.67 * 34.55%*
Hmean brk1-threads-141 166712.91 ( 0.00%) 217314.50 * 30.35%*
Hmean brk1-threads-172 155023.61 ( 0.00%) 210318.59 * 35.67%*
Hmean brk1-threads-203 166943.60 ( 0.00%) 227892.26 * 36.51%*
The mmtests configuration this was based on no longer exists but the
equivalent should be
configs/config-workload-will-it-scale-pf-processes
configs/config-workload-will-it-scale-pf-threads
It shouldn't be too hard to pull out the brk1 test on its own and see
what it's doing in profiles.
brk_test from aim9 also suffers so may be brk-specific.
Other workloads looked ok, kernel building took a small -1.5%
performance hit on some machines but not enough that I would scream.
Others looked mostly neutral but it was only a provisional sniff test
before I even considered looking at 70 patches :O
--
Mel Gorman
SUSE Labs
On Wed, Feb 23, 2022 at 04:35:09PM +0000, Mel Gorman wrote:
> On Wed, Feb 16, 2022 at 08:24:34PM +0000, Matthew Wilcox wrote:
> > On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> > > On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > >
> > > > The maple tree is an RCU-safe range based B-tree designed to use modern
> > > > processor cache efficiently. There are a number of places in the kernel
> > > > that a non-overlapping range-based tree would be beneficial, especially
> > > > one with a simple interface. The first user that is covered in this
> > > > patch set is the vm_area_struct, where three data structures are
> > > > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > > > linked list of VMAs in the mm_struct. The long term goal is to reduce
> > > > or remove the mmap_sem contention.
> > >
> > > Has a path been chosen which gets us from here to significant reduction
> > > in mmap_lock overhead?
> > >
> > > If so, what's the plan and what must be done?
> >
> > I would say there are still competing ideas for how that is to be done.
> >
> > First, the Maple Tree is independent (to a certain extent) of all the
> > approaches to reducing mmap_lock overhead. It's a better data structure
> > for VMA lookup than the rbtree. Set against that, it has higher overhead
> > for modifications. That means that benchmarks which primarily measure
> > modification overhead see worse performance. We believe this is not
> > representative of real workloads, and indeed we see ~parity on workloads
> > like git and make -jN.
>
> I'm way behind and only got around to looking at SPF properly today. Maple
> is working its way up my list and I need to gather new data but I did
> have old data queued from a time when I thought I would get around to
> maple tree soon. The big result that stood was was brk performance from
> will-it-scale but for processes specifically
Yup, we know about the brk1 regression. It's a really unrealistic
benchmark, which is why we added brk2. To quote the commit message:
Linux has this horrendously complicated anon_vma structure that you don't
care about, but the upshot is that after calling fork(), each process
that calls brk() gets a _new_ VMA created. That is, after calling brk()
the first time, the process address space looks like this:
557777fab000-557777ff0000 rw-p 00000000 00:00 0 [heap]
557777ff0000-557777ff1000 rw-p 00000000 00:00 0 [heap]
so what brk1 is actually testing is how long it takes to create & destroy
a new VMA. This does not match what most programs do -- most will call
exec() which resets the anon_vma structures and starts each program off
with its own heap. And if you do have a multi-process program which
uses brk(), chances are it doesn't just oscillate betwee zero and one
extra pages of heap compared to its parent.
A better test starts out by allocating one page on the heap and then
throbs between one and two pages instead of throbbing between zero and
one page. That means we're actually testing expanding and contracting
the heap instead of creating and destroying a new heap.
For realism, I wanted to add actually accessing the memory in the new
heap, but that doesn't work for the threaded case -- another thread
might remove the memory you just allocated while you're allocating it.
Threaded programs give each thread its own heap anyway, so this is
kind of a pointless syscall to ask about its threaded scalability.
Anyway, here's brk2.c. It is not very different from brk1.c, but the
performance results are quite different (actually worse by about 10-15%).
On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently. There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface. The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct. The long term goal is to reduce
> or remove the mmap_sem contention.
>
> The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> nodes. With the increased branching factor, it is significantly shorter than
> the rbtree so it has fewer cache misses. The removal of the linked list
> between subsequent entries also reduces the cache misses and the need to pull
> in the previous and next VMA during many tree alterations.
>
> This patch is based on v5.17-rc4
>
> git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214
Just a heads-up. I noticed an use-after-free in today's linux-next below. I
am running out of time to fully triage this, but I noticed this is the only
series (not sure which revision) heavily touched mm/mmap.c recently there.
BUG: KASAN: use-after-free in move_vma.isra.0
Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280
CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242
Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
Call trace:
dump_backtrace
show_stack
dump_stack_lvl
print_address_description.constprop.0
kasan_report
__asan_report_load8_noabort
move_vma.isra.0
move_vma at /usr/src/linux-next/mm/mremap.c:714
__do_sys_mremap
__arm64_sys_mremap
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync
Allocated by task 1280:
kasan_save_stack
__kasan_slab_alloc
slab_post_alloc_hook
kmem_cache_alloc
vm_area_alloc
vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455
mmap_region
mmap_region at /usr/src/linux-next/mm/mmap.c:2585
do_mmap
vm_mmap_pgoff
ksys_mmap_pgoff
__arm64_sys_mmap
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync
Freed by task 1280:
kasan_save_stack
kasan_set_track
kasan_set_free_info
__kasan_slab_free
kmem_cache_free
vm_area_free
remove_vma
remove_vma at /usr/src/linux-next/mm/mmap.c:187
do_mas_align_munmap.isra.0
remove_mt at /usr/src/linux-next/mm/mmap.c:2176
(inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437
do_mas_munmap
do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483
do_munmap
move_vma.isra.0
__do_sys_mremap
__arm64_sys_mremap
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync
The buggy address belongs to the object at ffff0009ce752aa8
which belongs to the cache vm_area_struct of size 144
The buggy address is located 32 bytes inside of
144-byte region [ffff0009ce752aa8, ffff0009ce752b38)
The buggy address belongs to the page:
page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750
head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0
memcg:ffff0009ce456e01
flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380
raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb
^
ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
* Qian Cai <quic_qiancai@quicinc.com> [220224 22:49]: > On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote: > > The maple tree is an RCU-safe range based B-tree designed to use modern > > processor cache efficiently. There are a number of places in the kernel > > that a non-overlapping range-based tree would be beneficial, especially > > one with a simple interface. The first user that is covered in this > > patch set is the vm_area_struct, where three data structures are > > replaced by the maple tree: the augmented rbtree, the vma cache, and the > > linked list of VMAs in the mm_struct. The long term goal is to reduce > > or remove the mmap_sem contention. > > > > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf > > nodes. With the increased branching factor, it is significantly shorter than > > the rbtree so it has fewer cache misses. The removal of the linked list > > between subsequent entries also reduces the cache misses and the need to pull > > in the previous and next VMA during many tree alterations. > > > > This patch is based on v5.17-rc4 > > > > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214 > > Just a heads-up. I noticed an use-after-free in today's linux-next below. I > am running out of time to fully triage this, but I noticed this is the only > series (not sure which revision) heavily touched mm/mmap.c recently there. > > BUG: KASAN: use-after-free in move_vma.isra.0 > Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280 > > CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242 > Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020 > Call trace: > dump_backtrace > show_stack > dump_stack_lvl > print_address_description.constprop.0 > kasan_report > __asan_report_load8_noabort > move_vma.isra.0 > move_vma at /usr/src/linux-next/mm/mremap.c:714 > __do_sys_mremap > __arm64_sys_mremap > invoke_syscall > el0_svc_common.constprop.0 > do_el0_svc > el0_svc > el0t_64_sync_handler > el0t_64_sync > > Allocated by task 1280: > kasan_save_stack > __kasan_slab_alloc > slab_post_alloc_hook > kmem_cache_alloc > vm_area_alloc > vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455 > mmap_region > mmap_region at /usr/src/linux-next/mm/mmap.c:2585 > do_mmap > vm_mmap_pgoff > ksys_mmap_pgoff > __arm64_sys_mmap > invoke_syscall > el0_svc_common.constprop.0 > do_el0_svc > el0_svc > el0t_64_sync_handler > el0t_64_sync > > Freed by task 1280: > kasan_save_stack > kasan_set_track > kasan_set_free_info > __kasan_slab_free > kmem_cache_free > vm_area_free > remove_vma > remove_vma at /usr/src/linux-next/mm/mmap.c:187 > do_mas_align_munmap.isra.0 > remove_mt at /usr/src/linux-next/mm/mmap.c:2176 > (inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437 > do_mas_munmap > do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483 > do_munmap > move_vma.isra.0 > __do_sys_mremap > __arm64_sys_mremap > invoke_syscall > el0_svc_common.constprop.0 > do_el0_svc > el0_svc > el0t_64_sync_handler > el0t_64_sync > > The buggy address belongs to the object at ffff0009ce752aa8 > which belongs to the cache vm_area_struct of size 144 > The buggy address is located 32 bytes inside of > 144-byte region [ffff0009ce752aa8, ffff0009ce752b38) > The buggy address belongs to the page: > page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750 > head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0 > memcg:ffff0009ce456e01 > flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff) > raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380 > raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01 > page dumped because: kasan: bad access detected > > Memory state around the buggy address: > ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc > ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc > >ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb > ^ > ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc > ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc Thank you for the report. I will try to reproduce it with a kvm here. Cheers, Liam
* Liam Howlett <liam.howlett@oracle.com> [220225 14:09]: > * Qian Cai <quic_qiancai@quicinc.com> [220224 22:49]: > > On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote: > > > The maple tree is an RCU-safe range based B-tree designed to use modern > > > processor cache efficiently. There are a number of places in the kernel > > > that a non-overlapping range-based tree would be beneficial, especially > > > one with a simple interface. The first user that is covered in this > > > patch set is the vm_area_struct, where three data structures are > > > replaced by the maple tree: the augmented rbtree, the vma cache, and the > > > linked list of VMAs in the mm_struct. The long term goal is to reduce > > > or remove the mmap_sem contention. > > > > > > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf > > > nodes. With the increased branching factor, it is significantly shorter than > > > the rbtree so it has fewer cache misses. The removal of the linked list > > > between subsequent entries also reduces the cache misses and the need to pull > > > in the previous and next VMA during many tree alterations. > > > > > > This patch is based on v5.17-rc4 > > > > > > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214 > > > > Just a heads-up. I noticed an use-after-free in today's linux-next below. I > > am running out of time to fully triage this, but I noticed this is the only > > series (not sure which revision) heavily touched mm/mmap.c recently there. > > > > BUG: KASAN: use-after-free in move_vma.isra.0 > > Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280 > > > > CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242 > > Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020 > > Call trace: > > dump_backtrace > > show_stack > > dump_stack_lvl > > print_address_description.constprop.0 > > kasan_report > > __asan_report_load8_noabort > > move_vma.isra.0 > > move_vma at /usr/src/linux-next/mm/mremap.c:714 > > __do_sys_mremap > > __arm64_sys_mremap > > invoke_syscall > > el0_svc_common.constprop.0 > > do_el0_svc > > el0_svc > > el0t_64_sync_handler > > el0t_64_sync > > > > Allocated by task 1280: > > kasan_save_stack > > __kasan_slab_alloc > > slab_post_alloc_hook > > kmem_cache_alloc > > vm_area_alloc > > vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455 > > mmap_region > > mmap_region at /usr/src/linux-next/mm/mmap.c:2585 > > do_mmap > > vm_mmap_pgoff > > ksys_mmap_pgoff > > __arm64_sys_mmap > > invoke_syscall > > el0_svc_common.constprop.0 > > do_el0_svc > > el0_svc > > el0t_64_sync_handler > > el0t_64_sync > > > > Freed by task 1280: > > kasan_save_stack > > kasan_set_track > > kasan_set_free_info > > __kasan_slab_free > > kmem_cache_free > > vm_area_free > > remove_vma > > remove_vma at /usr/src/linux-next/mm/mmap.c:187 > > do_mas_align_munmap.isra.0 > > remove_mt at /usr/src/linux-next/mm/mmap.c:2176 > > (inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437 > > do_mas_munmap > > do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483 > > do_munmap > > move_vma.isra.0 > > __do_sys_mremap > > __arm64_sys_mremap > > invoke_syscall > > el0_svc_common.constprop.0 > > do_el0_svc > > el0_svc > > el0t_64_sync_handler > > el0t_64_sync > > > > The buggy address belongs to the object at ffff0009ce752aa8 > > which belongs to the cache vm_area_struct of size 144 > > The buggy address is located 32 bytes inside of > > 144-byte region [ffff0009ce752aa8, ffff0009ce752b38) > > The buggy address belongs to the page: > > page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750 > > head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0 > > memcg:ffff0009ce456e01 > > flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff) > > raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380 > > raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01 > > page dumped because: kasan: bad access detected > > > > Memory state around the buggy address: > > ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc > > ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc > > >ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb > > ^ > > ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc > > ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc > > Thank you for the report. I will try to reproduce it with a kvm here. > I just booted an arm64 VM with my build and kasan enabled with no issue. Could you please send me your config file for the build? Thanks, Liam
On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote: > I just booted an arm64 VM with my build and kasan enabled with no issue. > Could you please send me your config file for the build? On linux-next, I just do: $ make arch=arm64 defconfig debug.config [1] Then, I just generate some memory pressume into swapping/OOM Killer to trigger it. [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config
On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently. There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface. The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct. The long term goal is to reduce
> or remove the mmap_sem contention.
>
> The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> nodes. With the increased branching factor, it is significantly shorter than
> the rbtree so it has fewer cache misses. The removal of the linked list
> between subsequent entries also reduces the cache misses and the need to pull
> in the previous and next VMA during many tree alterations.
>
> This patch is based on v5.17-rc4
>
> git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214
Hi Liam,
this patch series completely breaks s390. Besides endianess issue
in maple trees reported here:
https://lore.kernel.org/all/your-ad-here.call-01645924312-ext-0398@work.hours/
and with this endianess issue fixed (fixup from here ^^^) we still get numerous
KASAN reports starting with
[PATCH v6 10/71] mm: Start tracking VMAs with maple tree
mostly looking like this:
BUG: KASAN: use-after-free in mas_descend_adopt+0x113a/0x1408
Read of size 8 at addr 00000000b1d4e900 by task cat/610
CPU: 34 PID: 610 Comm: cat Tainted: G W 5.17.0-rc4-91313-g592c3b299aad #1
Hardware name: IBM 3906 M04 704 (LPAR)
Call Trace:
[<000000000234647a>] dump_stack_lvl+0xfa/0x150
[<0000000002328154>] print_address_description.constprop.0+0x64/0x340
[<00000000008f5ba6>] kasan_report+0x13e/0x1a8
[<00000000017f0c12>] mas_descend_adopt+0x113a/0x1408
[<00000000018076ec>] mas_spanning_rebalance.isra.0+0x5164/0x6a20
[<000000000180a71e>] mas_wr_spanning_store.isra.0+0x476/0xbc8
[<00000000018189bc>] mas_store_gfp+0xd4/0x188
[<0000000000827bae>] vma_mt_szero+0x146/0x368
[<000000000082e990>] __do_munmap+0x340/0xe20
[<000000000082f580>] __vm_munmap+0x110/0x1e8
[<000000000082f76e>] __s390x_sys_munmap+0x6e/0x90
[<000000000010dc6c>] do_syscall+0x22c/0x328
[<000000000234d3d2>] __do_syscall+0x9a/0xf8
[<0000000002374ed2>] system_call+0x82/0xb0
INFO: lockdep is turned off.
Allocated by task 610:
kasan_save_stack+0x34/0x58
__kasan_slab_alloc+0x84/0xa8
kmem_cache_alloc+0x20c/0x520
mas_alloc_nodes+0x26a/0x4c8
mas_split.isra.0+0x2aa/0x1418
mas_wr_modify+0x3fa/0xd28
mas_store_gfp+0xd4/0x188
vma_store+0x17a/0x3d8
vma_link+0xac/0x798
mmap_region+0xa5a/0x10b8
do_mmap+0x7c2/0xa90
vm_mmap_pgoff+0x186/0x250
ksys_mmap_pgoff+0x334/0x400
__s390x_sys_old_mmap+0xf4/0x130
do_syscall+0x22c/0x328
__do_syscall+0x9a/0xf8
system_call+0x82/0xb0
Freed by task 610:
kasan_save_stack+0x34/0x58
kasan_set_track+0x36/0x48
kasan_set_free_info+0x34/0x58
____kasan_slab_free+0x11c/0x188
__kasan_slab_free+0x24/0x30
kmem_cache_free_bulk.part.0+0xec/0x538
mas_destroy+0x2e4/0x710
mas_store_gfp+0xf4/0x188
vma_mt_szero+0x146/0x368
__vma_adjust+0x155a/0x2188
__split_vma+0x228/0x450
mprotect_fixup+0x4f2/0x618
do_mprotect_pkey.constprop.0+0x328/0x600
__s390x_sys_mprotect+0x8e/0xb8
do_syscall+0x22c/0x328
__do_syscall+0x9a/0xf8
system_call+0x82/0xb0
The buggy address belongs to the object at 00000000b1d4e900
which belongs to the cache maple_node of size 256
The buggy address is located 0 bytes inside of
256-byte region [00000000b1d4e900, 00000000b1d4ea00)
The buggy address belongs to the page:
page:0000400002c75200 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xb1d48
head:0000400002c75200 order:3 compound_mapcount:0 compound_pincount:0
flags: 0x3ffff00000010200(slab|head|node=0|zone=1|lastcpupid=0x1ffff)
raw: 3ffff00000010200 0000400002c91e08 000040000263b608 000000008009ed00
raw: 0000000000000000 0020004000000000 ffffffff00000001 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
00000000b1d4e800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
00000000b1d4e880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>00000000b1d4e900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
00000000b1d4e980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
00000000b1d4ea00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
with eventual crash.
How come none of s390 architecture maintainers nor s390 mailing list were
on CC for the changes which affect s390 to that magnitude? None of this
has apparently been tested on s390 or any other big endian system. And
a shoulder tap to give it try would be helpful.
Now we are just starting looking at the problems. And until issues
are resolved this patch series has to be dropped from linux-next.
* Vasily Gorbik <gor@linux.ibm.com> [220226 21:23]: > On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote: > > The maple tree is an RCU-safe range based B-tree designed to use modern > > processor cache efficiently. There are a number of places in the kernel > > that a non-overlapping range-based tree would be beneficial, especially > > one with a simple interface. The first user that is covered in this > > patch set is the vm_area_struct, where three data structures are > > replaced by the maple tree: the augmented rbtree, the vma cache, and the > > linked list of VMAs in the mm_struct. The long term goal is to reduce > > or remove the mmap_sem contention. > > > > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf > > nodes. With the increased branching factor, it is significantly shorter than > > the rbtree so it has fewer cache misses. The removal of the linked list > > between subsequent entries also reduces the cache misses and the need to pull > > in the previous and next VMA during many tree alterations. > > > > This patch is based on v5.17-rc4 > > > > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214 > > Hi Liam, > > this patch series completely breaks s390. Besides endianess issue > in maple trees reported here: > > https://lore.kernel.org/all/your-ad-here.call-01645924312-ext-0398@work.hours/ > > and with this endianess issue fixed (fixup from here ^^^) we still get numerous > KASAN reports starting with > > [PATCH v6 10/71] mm: Start tracking VMAs with maple tree Thanks for looking at this. That commit had two issues which I was debugging on parisc for most of last week. The first is that I was not actually writing the stack expansion to the maple tree in that commit - it was added in a later commit. The second issue which was most likely causing your crashes was the calculation for the gap in unmapped_area_topdown(). With the maple tree (3dea5db7fbb5), I am able to boot parisc and s390 virtual machines. Although, the issue which Hugh discusses in detail still exists [1]. The latest maple tree can be found on github [2]. I am still investigating why the heap changed locations on parisc to ensure it's not of issue. 1: https://lore.kernel.org/all/5f8f4f-ad63-eb-fd73-d48748af8a76@google.com/ 2: https://github.com/oracle/linux-uek/tree/maple/mainline Regards, Liam
© 2016 - 2026 Red Hat, Inc.