[PATCH v6 00/71] Introducing the Maple Tree

Liam Howlett posted 71 patches 4 years, 4 months ago
Only 1 patches received!
There is a newer version of this series
Documentation/core-api/index.rst              |     1 +
Documentation/core-api/maple_tree.rst         |   218 +
MAINTAINERS                                   |    12 +
arch/arm64/kernel/vdso.c                      |     3 +-
arch/parisc/kernel/cache.c                    |     9 +-
arch/powerpc/kernel/vdso.c                    |     6 +-
arch/powerpc/mm/book3s32/tlb.c                |    11 +-
arch/powerpc/mm/book3s64/subpage_prot.c       |    13 +-
arch/riscv/kernel/vdso.c                      |     3 +-
arch/s390/kernel/vdso.c                       |     3 +-
arch/s390/mm/gmap.c                           |     6 +-
arch/um/kernel/tlb.c                          |    14 +-
arch/x86/entry/vdso/vma.c                     |     9 +-
arch/x86/kernel/tboot.c                       |     2 +-
arch/xtensa/kernel/syscall.c                  |    18 +-
drivers/firmware/efi/efi.c                    |     2 +-
drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |    14 +-
drivers/misc/cxl/fault.c                      |    45 +-
drivers/tee/optee/call.c                      |    18 +-
drivers/xen/privcmd.c                         |     2 +-
fs/binfmt_elf.c                               |     6 +-
fs/coredump.c                                 |    33 +-
fs/exec.c                                     |    12 +-
fs/proc/base.c                                |     5 +-
fs/proc/internal.h                            |     2 +-
fs/proc/task_mmu.c                            |    74 +-
fs/proc/task_nommu.c                          |    45 +-
fs/userfaultfd.c                              |    49 +-
include/linux/maple_tree.h                    |   683 +
include/linux/mm.h                            |    77 +-
include/linux/mm_types.h                      |    43 +-
include/linux/mm_types_task.h                 |    12 -
include/linux/sched.h                         |     1 -
include/linux/userfaultfd_k.h                 |     7 +-
include/linux/vm_event_item.h                 |     4 -
include/linux/vmacache.h                      |    28 -
include/linux/vmstat.h                        |     6 -
include/linux/xarray.h                        |     1 +
include/trace/events/maple_tree.h             |   123 +
include/trace/events/mmap.h                   |    71 +
init/main.c                                   |     2 +
ipc/shm.c                                     |    21 +-
kernel/acct.c                                 |    11 +-
kernel/bpf/task_iter.c                        |    10 +-
kernel/debug/debug_core.c                     |    12 -
kernel/events/core.c                          |     3 +-
kernel/events/uprobes.c                       |     9 +-
kernel/fork.c                                 |    58 +-
kernel/sched/fair.c                           |    10 +-
lib/Kconfig.debug                             |    18 +-
lib/Makefile                                  |     3 +-
lib/maple_tree.c                              |  6967 +++
lib/test_maple_tree.c                         | 37398 ++++++++++++++++
mm/Makefile                                   |     2 +-
mm/damon/vaddr-test.h                         |    37 +-
mm/damon/vaddr.c                              |    53 +-
mm/debug.c                                    |    14 +-
mm/gup.c                                      |     9 +-
mm/huge_memory.c                              |     4 +-
mm/init-mm.c                                  |     4 +-
mm/internal.h                                 |    78 +-
mm/khugepaged.c                               |    13 +-
mm/ksm.c                                      |    18 +-
mm/madvise.c                                  |     2 +-
mm/memcontrol.c                               |     6 +-
mm/memory.c                                   |    33 +-
mm/mempolicy.c                                |    58 +-
mm/mlock.c                                    |    34 +-
mm/mmap.c                                     |  2086 +-
mm/mprotect.c                                 |     7 +-
mm/mremap.c                                   |    22 +-
mm/msync.c                                    |     2 +-
mm/nommu.c                                    |   127 +-
mm/oom_kill.c                                 |     3 +-
mm/pagewalk.c                                 |     2 +-
mm/swapfile.c                                 |     4 +-
mm/util.c                                     |    32 -
mm/vmacache.c                                 |   117 -
mm/vmstat.c                                   |     4 -
tools/testing/radix-tree/.gitignore           |     2 +
tools/testing/radix-tree/Makefile             |     9 +-
tools/testing/radix-tree/generated/autoconf.h |     1 +
tools/testing/radix-tree/linux.c              |   160 +-
tools/testing/radix-tree/linux/kernel.h       |     1 +
tools/testing/radix-tree/linux/lockdep.h      |     2 +
tools/testing/radix-tree/linux/maple_tree.h   |     7 +
tools/testing/radix-tree/linux/slab.h         |     4 +
tools/testing/radix-tree/maple.c              |    59 +
.../radix-tree/trace/events/maple_tree.h      |     3 +
89 files changed, 47390 insertions(+), 1842 deletions(-)
create mode 100644 Documentation/core-api/maple_tree.rst
create mode 100644 include/linux/maple_tree.h
delete mode 100644 include/linux/vmacache.h
create mode 100644 include/trace/events/maple_tree.h
create mode 100644 lib/maple_tree.c
create mode 100644 lib/test_maple_tree.c
delete mode 100644 mm/vmacache.c
create mode 100644 tools/testing/radix-tree/linux/maple_tree.h
create mode 100644 tools/testing/radix-tree/maple.c
create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h
[PATCH v6 00/71] Introducing the Maple Tree
Posted by Liam Howlett 4 years, 4 months ago
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently.  There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface.  The first user that is covered in this
patch set is the vm_area_struct, where three data structures are
replaced by the maple tree: the augmented rbtree, the vma cache, and the
linked list of VMAs in the mm_struct.  The long term goal is to reduce
or remove the mmap_sem contention.

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes.  With the increased branching factor, it is significantly shorter than
the rbtree so it has fewer cache misses.  The removal of the linked list
between subsequent entries also reduces the cache misses and the need to pull
in the previous and next VMA during many tree alterations.

This patch is based on v5.17-rc4

git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214

v6 changes:
- Added patch for xarray testcode which should be dropped for upstream
  fix. - Thanks Matthew Wilcox
- Fixed issue with maple state index/last setting when the tree is just
  a pointer - Thanks David Howells
- Changed internal RCU handling to not check flags more than once
- Fixed mas_prev() underflow issue - Thanks David Howells
- Fixed returns on mas_prev()/mas_next() when there is only a value at
  0 index - Thanks David Howells
- Fixed mas_find_rev() running past minimum value - Thanks David Howells
- Fixed testing code function rename - Thanks Mark Hemment
- Reworked brk() and vm_brk_flags() as suggested - Thanks Vlastimil
  Babka
- Documentation fixes - Thanks Mike Rapoport
- Separated test code from tree code - Thanks Vlastimil Babka
- Moved maple_tree_init() call into the maple tree patch for other users
  - Thanks David Howells
- Fixed copyright date - Thanks Mike Rapoport
- Whitespace fixes in comments, reduced changes in other locations.
- Added missing kdocs - Thanks Mike Rapoport
- Fixed exit_mmap comment - Thanks Mark Hemment
- Removed RCU tracking as there is an issue with atomic increments of
  mmget() - Thanks Mark Hemment for initial issue report that allowed me
  to discover this.  RCU was not being used, so it is disabled for VMA
  tracking for now.
- Removed unnecessary assignment in mtree_range_walk() - Thanks JaeJoon
  Jung
- Dropped ma_xa_benchmark from testing Makefile - Thanks JaeJoon Jung
- Fixed leaf gap calculation in rare case of underflow
- Fixed mmap_region() bug on merging of prev
- Fixed mmap_region() bug on khugepaged_enter_vma_merge()
- Added test cases for mas_find_rev(), mas_prev(), mas_next, and
  mas_root_expand() to test suite.
- Updated config options and ifdefs to allow other maple tree users to
  debug maple tree without debugging the VM maple tree.

v5: https://lore.kernel.org/linux-mm/20220203172051.i2jnhnkudzssdsxg@revolver/T/
v4: https://lore.kernel.org/linux-mm/20211201142918.921493-30-Liam.Howlett@oracle.com/t/
v3: https://lore.kernel.org/linux-mm/20211005012959.1110504-1-Liam.Howlett@oracle.com/
v2: https://lore.kernel.org/linux-mm/20210817154651.1570984-1-Liam.Howlett@oracle.com/
v1: https://lore.kernel.org/linux-mm/20210428153542.2814175-1-Liam.Howlett@Oracle.com/


Performance on a 144 core x86:

It is important to note that the code is still using the mmap_sem, the
performance seems fairly similar on real-world workloads, while there
are variations in micro-benchmarks.

kernbench, increased system time, less user time:
Amean     user-2        885.34 (   0.00%)      886.07 *  -0.08%*
Amean     syst-2        161.95 (   0.00%)      168.19 *  -3.85%*
Amean     elsp-2        530.06 (   0.00%)      532.96 *  -0.55%*
Amean     user-4        908.58 (   0.00%)      908.88 *  -0.03%*
Amean     syst-4        167.56 (   0.00%)      173.72 *  -3.68%*
Amean     elsp-4        277.21 (   0.00%)      277.38 *  -0.06%*
Amean     user-8        961.84 (   0.00%)      962.45 *  -0.06%*
Amean     syst-8        176.40 (   0.00%)      183.43 *  -3.99%*
Amean     elsp-8        150.59 (   0.00%)      151.21 *  -0.41%*
Amean     user-16      1040.15 (   0.00%)     1039.89 *   0.02%*
Amean     syst-16       188.19 (   0.00%)      193.81 *  -2.99%*
Amean     elsp-16        86.85 (   0.00%)       86.32 *   0.61%*
Amean     user-32      1240.46 (   0.00%)     1233.93 *   0.53%*
Amean     syst-32       217.15 (   0.00%)      222.99 *  -2.69%*
Amean     elsp-32        55.16 (   0.00%)       55.09 *   0.12%*
Amean     user-64      1241.17 (   0.00%)     1234.26 *   0.56%*
Amean     syst-64       215.11 (   0.00%)      220.76 *  -2.63%*
Amean     elsp-64        32.88 (   0.00%)       33.72 *  -2.57%*
Amean     user-128     1613.09 (   0.00%)     1609.63 *   0.21%*
Amean     syst-128      267.10 (   0.00%)      276.72 *  -3.60%*
Amean     elsp-128       25.80 (   0.00%)       26.09 *  -1.10%*

Mixed Hmean results:
- freqmine-medium -3.50% to +12.82%
- malloc1-processes: -14.50% to +6.53%
- signal1-processes -1.87% to +14.02%
- page_fault3-threads -6.46% to +26.15%
- pthread_mutex1-threads -16.81% to +28.63%

Decrease in performance in the following micro-benchmarks in Hmean:
- brk1-processes -37.42% to -44.16%
- malloc1-threads -18.27% to -23.08%

Modifications are more expensive so the micro-benchmarks that write but
do not use the data will be negatively affected.

Patch organization:
Patch 1 is to add a missing lock to avoid an assert issue when using a vma iterator.

Patch 2 is an xarray fix due to bitmap header changes which will be
dropped for a pending upstream fix.

Patches 3 to 7 are radix tree test suite additions for maple tree
support.

Patch 8 adds the maple tree.
Patch 9 adds the maple tree test code.

Patches 10-19 are the removal of the rbtree from the mm_struct.  This
now includes the introduction of the vma iterator.

Patch 20 optimizes __vma_adjust() for the maple tree.

Patches 21-27 are the removal of the vmacache from the kernel.

Patches 28-31 are internal mm changes for efficiencies.

Patches 32-69 are the removal of the linked list

Patches 70 and 71 are a small cleanup from the removal of the vma linked list.

Liam R. Howlett (61):
  xarray: Fix bitmap breakage
  radix tree test suite: Add pr_err define
  radix tree test suite: Add kmem_cache_set_non_kernel()
  radix tree test suite: Add allocation counts and size to kmem_cache
  radix tree test suite: Add support for slab bulk APIs
  radix tree test suite: Add lockdep_is_held to header
  Maple Tree: Add new data structure
  lib/test_maple_tree: Add testing for maple tree
  mm: Start tracking VMAs with maple tree
  mm/mmap: Use the maple tree in find_vma() instead of the rbtree.
  mm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree
  mm/mmap: Use maple tree for unmapped_area{_topdown}
  kernel/fork: Use maple tree for dup_mmap() during forking
  mm: Remove rb tree.
  mmap: Change zeroing of maple tree in __vma_adjust()
  xen: Use vma_lookup() in privcmd_ioctl_mmap()
  mm: Optimize find_exact_vma() to use vma_lookup()
  mm/khugepaged: Optimize collapse_pte_mapped_thp() by using
    vma_lookup()
  mm/mmap: Change do_brk_flags() to expand existing VMA and add
    do_brk_munmap()
  mm: Use maple tree operations for find_vma_intersection()
  mm/mmap: Use advanced maple tree API for mmap_region()
  mm: Remove vmacache
  mm/mmap: Move mmap_region() below do_munmap()
  mm/mmap: Reorganize munmap to use maple states
  mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()
  arm64: Remove mmap linked list from vdso
  parisc: Remove mmap linked list from cache handling
  powerpc: Remove mmap linked list walks
  s390: Remove vma linked list walks
  x86: Remove vma linked list walks
  xtensa: Remove vma linked list walks
  cxl: Remove vma linked list walk
  optee: Remove vma linked list walk
  um: Remove vma linked list walk
  binfmt_elf: Remove vma linked list walk
  exec: Use VMA iterator instead of linked list
  fs/proc/base: Use maple tree iterators in place of linked list
  userfaultfd: Use maple tree iterator to iterate VMAs
  ipc/shm: Use VMA iterator instead of linked list
  acct: Use VMA iterator instead of linked list
  perf: Use VMA iterator
  sched: Use maple tree iterator to walk VMAs
  fork: Use VMA iterator
  bpf: Remove VMA linked list
  mm/gup: Use maple tree navigation instead of linked list
  mm/khugepaged: Stop using vma linked list
  mm/ksm: Use vma iterators instead of vma linked list
  mm/madvise: Use vma_find() instead of vma linked list
  mm/memcontrol: Stop using mm->highest_vm_end
  mm/mempolicy: Use vma iterator & maple state instead of vma linked
    list
  mm/mlock: Use vma iterator and instead of vma linked list
  mm/mprotect: Use maple tree navigation instead of vma linked list
  mm/mremap: Use vma_find_intersection() instead of vma linked list
  mm/msync: Use vma_find() instead of vma linked list
  mm/oom_kill: Use maple tree iterators instead of vma linked list
  mm/pagewalk: Use vma_find() instead of vma linked list
  mm/swapfile: Use vma iterator instead of vma linked list
  riscv: Use vma iterator for vdso
  mm: Remove the vma linked list
  mm/mmap: Drop range_has_overlap() function
  mm/mmap.c: Pass in mapping to __vma_link_file()

Matthew Wilcox (Oracle) (10):
  binfmt_elf: Take the mmap lock when walking the VMA list
  mm: Add VMA iterator
  mmap: Use the VMA iterator in count_vma_pages_range()
  damon: Convert __damon_va_three_regions to use the VMA iterator
  proc: Remove VMA rbtree use from nommu
  mm: Convert vma_lookup() to use mtree_load()
  coredump: Remove vma linked list walk
  fs/proc/task_mmu: Stop using linked list and highest_vm_end
  i915: Use the VMA iterator
  nommu: Remove uses of VMA linked list

 Documentation/core-api/index.rst              |     1 +
 Documentation/core-api/maple_tree.rst         |   218 +
 MAINTAINERS                                   |    12 +
 arch/arm64/kernel/vdso.c                      |     3 +-
 arch/parisc/kernel/cache.c                    |     9 +-
 arch/powerpc/kernel/vdso.c                    |     6 +-
 arch/powerpc/mm/book3s32/tlb.c                |    11 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       |    13 +-
 arch/riscv/kernel/vdso.c                      |     3 +-
 arch/s390/kernel/vdso.c                       |     3 +-
 arch/s390/mm/gmap.c                           |     6 +-
 arch/um/kernel/tlb.c                          |    14 +-
 arch/x86/entry/vdso/vma.c                     |     9 +-
 arch/x86/kernel/tboot.c                       |     2 +-
 arch/xtensa/kernel/syscall.c                  |    18 +-
 drivers/firmware/efi/efi.c                    |     2 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |    14 +-
 drivers/misc/cxl/fault.c                      |    45 +-
 drivers/tee/optee/call.c                      |    18 +-
 drivers/xen/privcmd.c                         |     2 +-
 fs/binfmt_elf.c                               |     6 +-
 fs/coredump.c                                 |    33 +-
 fs/exec.c                                     |    12 +-
 fs/proc/base.c                                |     5 +-
 fs/proc/internal.h                            |     2 +-
 fs/proc/task_mmu.c                            |    74 +-
 fs/proc/task_nommu.c                          |    45 +-
 fs/userfaultfd.c                              |    49 +-
 include/linux/maple_tree.h                    |   683 +
 include/linux/mm.h                            |    77 +-
 include/linux/mm_types.h                      |    43 +-
 include/linux/mm_types_task.h                 |    12 -
 include/linux/sched.h                         |     1 -
 include/linux/userfaultfd_k.h                 |     7 +-
 include/linux/vm_event_item.h                 |     4 -
 include/linux/vmacache.h                      |    28 -
 include/linux/vmstat.h                        |     6 -
 include/linux/xarray.h                        |     1 +
 include/trace/events/maple_tree.h             |   123 +
 include/trace/events/mmap.h                   |    71 +
 init/main.c                                   |     2 +
 ipc/shm.c                                     |    21 +-
 kernel/acct.c                                 |    11 +-
 kernel/bpf/task_iter.c                        |    10 +-
 kernel/debug/debug_core.c                     |    12 -
 kernel/events/core.c                          |     3 +-
 kernel/events/uprobes.c                       |     9 +-
 kernel/fork.c                                 |    58 +-
 kernel/sched/fair.c                           |    10 +-
 lib/Kconfig.debug                             |    18 +-
 lib/Makefile                                  |     3 +-
 lib/maple_tree.c                              |  6967 +++
 lib/test_maple_tree.c                         | 37398 ++++++++++++++++
 mm/Makefile                                   |     2 +-
 mm/damon/vaddr-test.h                         |    37 +-
 mm/damon/vaddr.c                              |    53 +-
 mm/debug.c                                    |    14 +-
 mm/gup.c                                      |     9 +-
 mm/huge_memory.c                              |     4 +-
 mm/init-mm.c                                  |     4 +-
 mm/internal.h                                 |    78 +-
 mm/khugepaged.c                               |    13 +-
 mm/ksm.c                                      |    18 +-
 mm/madvise.c                                  |     2 +-
 mm/memcontrol.c                               |     6 +-
 mm/memory.c                                   |    33 +-
 mm/mempolicy.c                                |    58 +-
 mm/mlock.c                                    |    34 +-
 mm/mmap.c                                     |  2086 +-
 mm/mprotect.c                                 |     7 +-
 mm/mremap.c                                   |    22 +-
 mm/msync.c                                    |     2 +-
 mm/nommu.c                                    |   127 +-
 mm/oom_kill.c                                 |     3 +-
 mm/pagewalk.c                                 |     2 +-
 mm/swapfile.c                                 |     4 +-
 mm/util.c                                     |    32 -
 mm/vmacache.c                                 |   117 -
 mm/vmstat.c                                   |     4 -
 tools/testing/radix-tree/.gitignore           |     2 +
 tools/testing/radix-tree/Makefile             |     9 +-
 tools/testing/radix-tree/generated/autoconf.h |     1 +
 tools/testing/radix-tree/linux.c              |   160 +-
 tools/testing/radix-tree/linux/kernel.h       |     1 +
 tools/testing/radix-tree/linux/lockdep.h      |     2 +
 tools/testing/radix-tree/linux/maple_tree.h   |     7 +
 tools/testing/radix-tree/linux/slab.h         |     4 +
 tools/testing/radix-tree/maple.c              |    59 +
 .../radix-tree/trace/events/maple_tree.h      |     3 +
 89 files changed, 47390 insertions(+), 1842 deletions(-)
 create mode 100644 Documentation/core-api/maple_tree.rst
 create mode 100644 include/linux/maple_tree.h
 delete mode 100644 include/linux/vmacache.h
 create mode 100644 include/trace/events/maple_tree.h
 create mode 100644 lib/maple_tree.c
 create mode 100644 lib/test_maple_tree.c
 delete mode 100644 mm/vmacache.c
 create mode 100644 tools/testing/radix-tree/linux/maple_tree.h
 create mode 100644 tools/testing/radix-tree/maple.c
 create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h

-- 
2.34.1
Re: [PATCH v6 00/71] Introducing the Maple Tree
Posted by Andrew Morton 4 years, 4 months ago
On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:

> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently.  There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface.  The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct.  The long term goal is to reduce
> or remove the mmap_sem contention.

Has a path been chosen which gets us from here to significant reduction
in mmap_lock overhead?

If so, what's the plan and what must be done?

Thanks.
Re: [PATCH v6 00/71] Introducing the Maple Tree
Posted by Matthew Wilcox 4 years, 4 months ago
On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> 
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently.  There are a number of places in the kernel
> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface.  The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct.  The long term goal is to reduce
> > or remove the mmap_sem contention.
> 
> Has a path been chosen which gets us from here to significant reduction
> in mmap_lock overhead?
> 
> If so, what's the plan and what must be done?

I would say there are still competing ideas for how that is to be done.

First, the Maple Tree is independent (to a certain extent) of all the
approaches to reducing mmap_lock overhead.  It's a better data structure
for VMA lookup than the rbtree.  Set against that, it has higher overhead
for modifications.  That means that benchmarks which primarily measure
modification overhead see worse performance.  We believe this is not
representative of real workloads, and indeed we see ~parity on workloads
like git and make -jN.

The advantage to this patch set is that we get rid of the vmacache
entirely, we get rid of the doubly-linked list chaining VMAs together
and we get rid of the rbtree usage.  And we add a new generic data
structure that hopefully I can point people towards instead of using
the XArray inappropriately.

Both of the approaches to avoiding taking the mmap_lock in the fault path
will benefit from using the maple tree for VMAs.  Fewer cache misses and
RCU safety are things we all want.  The maple tree will also benefit
if/when either approach is merged; because modifications to the maple
tree are more heavyweight than the rbtree, they block readers for longer.


The SPF approach is further along.  It has been tested in the past with
the maple tree, but not in the most recent iteration of either.  It
doesn't rely on the maple tree, but it does benefit.

The RCU-freeing-of-VMAs approach that I favour has not received much
attention.  I've been busy with other things ;-)  It offers much more
opportunity for performance improvement, but will take longer to arrive.

Re: [PATCH v6 00/71] Introducing the Maple Tree
Posted by Mel Gorman 4 years, 4 months ago
On Wed, Feb 16, 2022 at 08:24:34PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> > On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > 
> > > The maple tree is an RCU-safe range based B-tree designed to use modern
> > > processor cache efficiently.  There are a number of places in the kernel
> > > that a non-overlapping range-based tree would be beneficial, especially
> > > one with a simple interface.  The first user that is covered in this
> > > patch set is the vm_area_struct, where three data structures are
> > > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > > linked list of VMAs in the mm_struct.  The long term goal is to reduce
> > > or remove the mmap_sem contention.
> > 
> > Has a path been chosen which gets us from here to significant reduction
> > in mmap_lock overhead?
> > 
> > If so, what's the plan and what must be done?
> 
> I would say there are still competing ideas for how that is to be done.
> 
> First, the Maple Tree is independent (to a certain extent) of all the
> approaches to reducing mmap_lock overhead.  It's a better data structure
> for VMA lookup than the rbtree.  Set against that, it has higher overhead
> for modifications.  That means that benchmarks which primarily measure
> modification overhead see worse performance.  We believe this is not
> representative of real workloads, and indeed we see ~parity on workloads
> like git and make -jN.
> 

I'm way behind and only got around to looking at SPF properly today. Maple
is working its way up my list and I need to gather new data but I did
have old data queued from a time when I thought I would get around to
maple tree soon.  The big result that stood was was brk performance from
will-it-scale but for processes specifically

wis-malloc
                                                5.17.0-rc3                 5.17.0-rc3
                                                   vanilla           mm-maplevma-v5r1
Hmean     brk1-processes-2           5265361.84 (   0.00%)      3489097.63 * -33.73%*
Hmean     brk1-processes-5          12270382.36 (   0.00%)      8136542.13 * -33.69%*
Hmean     brk1-processes-8          19560585.95 (   0.00%)     12941094.22 * -33.84%*
Hmean     brk1-processes-12         27328267.18 (   0.00%)     18148145.64 * -33.59%*
Hmean     brk1-processes-21         39599378.04 (   0.00%)     26188063.96 * -33.87%*
Hmean     brk1-processes-30         59880571.20 (   0.00%)     39653658.00 * -33.78%*
Hmean     brk1-processes-48         72900335.59 (   0.00%)     49044880.84 * -32.72%*
Hmean     brk1-processes-79         72388042.42 (   0.00%)     48416196.93 * -33.12%*
Hmean     brk1-processes-110        72333206.18 (   0.00%)     48437222.12 * -33.04%*
Hmean     brk1-processes-141        72443841.68 (   0.00%)     48490158.48 * -33.07%*
Hmean     brk1-processes-172        72187451.54 (   0.00%)     48480696.77 * -32.84%*
Hmean     brk1-processes-203        72575945.95 (   0.00%)     48503896.03 * -33.17%*
Hmean     brk1-processes-234        67509812.97 (   0.00%)     48466347.95 * -28.21%*
Hmean     brk1-processes-265        72379447.48 (   0.00%)     48449330.06 * -33.06%*
Hmean     brk1-processes-296        72241861.59 (   0.00%)     48303854.85 * -33.14%*
Hmean     brk1-processes-320        72183320.07 (   0.00%)     48253708.21 * -33.15%*
Hmean     brk1-threads-2             1352297.67 (   0.00%)      2963406.42 * 119.14%*
Hmean     brk1-threads-5              814358.55 (   0.00%)      1233604.84 *  51.48%*
Hmean     brk1-threads-8              849900.52 (   0.00%)      1241170.76 *  46.04%*
Hmean     brk1-threads-12             804143.24 (   0.00%)      1232451.60 *  53.26%*
Hmean     brk1-threads-21             672680.28 (   0.00%)      1001257.38 *  48.85%*
Hmean     brk1-threads-30             443221.73 (   0.00%)       713592.66 *  61.00%*
Hmean     brk1-threads-48             397283.95 (   0.00%)       591349.83 *  48.85%*
Hmean     brk1-threads-79             167479.04 (   0.00%)       548681.03 * 227.61%*
Hmean     brk1-threads-110            154108.98 (   0.00%)       207349.67 *  34.55%*
Hmean     brk1-threads-141            166712.91 (   0.00%)       217314.50 *  30.35%*
Hmean     brk1-threads-172            155023.61 (   0.00%)       210318.59 *  35.67%*
Hmean     brk1-threads-203            166943.60 (   0.00%)       227892.26 *  36.51%*

The mmtests configuration this was based on no longer exists but the
equivalent should be

configs/config-workload-will-it-scale-pf-processes
configs/config-workload-will-it-scale-pf-threads

It shouldn't be too hard to pull out the brk1 test on its own and see
what it's doing in profiles.

brk_test from aim9 also suffers so may be brk-specific.

Other workloads looked ok, kernel building took a small -1.5%
performance hit on some machines but not enough that I would scream.
Others looked mostly neutral but it was only a provisional sniff test
before I even considered looking at 70 patches :O

-- 
Mel Gorman
SUSE Labs
Re: [PATCH v6 00/71] Introducing the Maple Tree
Posted by Matthew Wilcox 4 years, 4 months ago
On Wed, Feb 23, 2022 at 04:35:09PM +0000, Mel Gorman wrote:
> On Wed, Feb 16, 2022 at 08:24:34PM +0000, Matthew Wilcox wrote:
> > On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> > > On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > > 
> > > > The maple tree is an RCU-safe range based B-tree designed to use modern
> > > > processor cache efficiently.  There are a number of places in the kernel
> > > > that a non-overlapping range-based tree would be beneficial, especially
> > > > one with a simple interface.  The first user that is covered in this
> > > > patch set is the vm_area_struct, where three data structures are
> > > > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > > > linked list of VMAs in the mm_struct.  The long term goal is to reduce
> > > > or remove the mmap_sem contention.
> > > 
> > > Has a path been chosen which gets us from here to significant reduction
> > > in mmap_lock overhead?
> > > 
> > > If so, what's the plan and what must be done?
> > 
> > I would say there are still competing ideas for how that is to be done.
> > 
> > First, the Maple Tree is independent (to a certain extent) of all the
> > approaches to reducing mmap_lock overhead.  It's a better data structure
> > for VMA lookup than the rbtree.  Set against that, it has higher overhead
> > for modifications.  That means that benchmarks which primarily measure
> > modification overhead see worse performance.  We believe this is not
> > representative of real workloads, and indeed we see ~parity on workloads
> > like git and make -jN.
> 
> I'm way behind and only got around to looking at SPF properly today. Maple
> is working its way up my list and I need to gather new data but I did
> have old data queued from a time when I thought I would get around to
> maple tree soon.  The big result that stood was was brk performance from
> will-it-scale but for processes specifically

Yup, we know about the brk1 regression.  It's a really unrealistic
benchmark, which is why we added brk2.  To quote the commit message:

    Linux has this horrendously complicated anon_vma structure that you don't
    care about, but the upshot is that after calling fork(), each process
    that calls brk() gets a _new_ VMA created.  That is, after calling brk()
    the first time, the process address space looks like this:

    557777fab000-557777ff0000 rw-p 00000000 00:00 0                          [heap]
    557777ff0000-557777ff1000 rw-p 00000000 00:00 0                          [heap]

    so what brk1 is actually testing is how long it takes to create & destroy
    a new VMA.  This does not match what most programs do -- most will call
    exec() which resets the anon_vma structures and starts each program off
    with its own heap.  And if you do have a multi-process program which
    uses brk(), chances are it doesn't just oscillate betwee zero and one
    extra pages of heap compared to its parent.

    A better test starts out by allocating one page on the heap and then
    throbs between one and two pages instead of throbbing between zero and
    one page.  That means we're actually testing expanding and contracting
    the heap instead of creating and destroying a new heap.

    For realism, I wanted to add actually accessing the memory in the new
    heap, but that doesn't work for the threaded case -- another thread
    might remove the memory you just allocated while you're allocating it.
    Threaded programs give each thread its own heap anyway, so this is
    kind of a pointless syscall to ask about its threaded scalability.
    
    Anyway, here's brk2.c.  It is not very different from brk1.c, but the
    performance results are quite different (actually worse by about 10-15%).

Re: [PATCH v6 00/71] Introducing the Maple Tree
Posted by Qian Cai 4 years, 4 months ago
On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently.  There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface.  The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct.  The long term goal is to reduce
> or remove the mmap_sem contention.
> 
> The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> nodes.  With the increased branching factor, it is significantly shorter than
> the rbtree so it has fewer cache misses.  The removal of the linked list
> between subsequent entries also reduces the cache misses and the need to pull
> in the previous and next VMA during many tree alterations.
> 
> This patch is based on v5.17-rc4
> 
> git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214

Just a heads-up. I noticed an use-after-free in today's linux-next below. I
am running out of time to fully triage this, but I noticed this is the only
series (not sure which revision) heavily touched mm/mmap.c recently there.

 BUG: KASAN: use-after-free in move_vma.isra.0
 Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280

 CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242
 Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
 Call trace:
  dump_backtrace
  show_stack
  dump_stack_lvl
  print_address_description.constprop.0
  kasan_report
  __asan_report_load8_noabort
  move_vma.isra.0
  move_vma at /usr/src/linux-next/mm/mremap.c:714
  __do_sys_mremap
  __arm64_sys_mremap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

 Allocated by task 1280:
  kasan_save_stack
  __kasan_slab_alloc
  slab_post_alloc_hook
  kmem_cache_alloc
  vm_area_alloc
  vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455
  mmap_region
  mmap_region at /usr/src/linux-next/mm/mmap.c:2585
  do_mmap
  vm_mmap_pgoff
  ksys_mmap_pgoff
  __arm64_sys_mmap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

 Freed by task 1280:
  kasan_save_stack
  kasan_set_track
  kasan_set_free_info
  __kasan_slab_free
  kmem_cache_free
  vm_area_free
  remove_vma
  remove_vma at /usr/src/linux-next/mm/mmap.c:187
  do_mas_align_munmap.isra.0
  remove_mt at /usr/src/linux-next/mm/mmap.c:2176
  (inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437
  do_mas_munmap
  do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483
  do_munmap
  move_vma.isra.0
  __do_sys_mremap
  __arm64_sys_mremap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

 The buggy address belongs to the object at ffff0009ce752aa8
  which belongs to the cache vm_area_struct of size 144
 The buggy address is located 32 bytes inside of
  144-byte region [ffff0009ce752aa8, ffff0009ce752b38)
 The buggy address belongs to the page:
 page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750
 head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0
 memcg:ffff0009ce456e01
 flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
 raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380
 raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb
                                               ^
  ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
  ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
Re: [PATCH v6 00/71] Introducing the Maple Tree
Posted by Liam Howlett 4 years, 4 months ago
* Qian Cai <quic_qiancai@quicinc.com> [220224 22:49]:
> On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently.  There are a number of places in the kernel
> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface.  The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct.  The long term goal is to reduce
> > or remove the mmap_sem contention.
> > 
> > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > nodes.  With the increased branching factor, it is significantly shorter than
> > the rbtree so it has fewer cache misses.  The removal of the linked list
> > between subsequent entries also reduces the cache misses and the need to pull
> > in the previous and next VMA during many tree alterations.
> > 
> > This patch is based on v5.17-rc4
> > 
> > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214
> 
> Just a heads-up. I noticed an use-after-free in today's linux-next below. I
> am running out of time to fully triage this, but I noticed this is the only
> series (not sure which revision) heavily touched mm/mmap.c recently there.
> 
>  BUG: KASAN: use-after-free in move_vma.isra.0
>  Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280
> 
>  CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242
>  Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
>  Call trace:
>   dump_backtrace
>   show_stack
>   dump_stack_lvl
>   print_address_description.constprop.0
>   kasan_report
>   __asan_report_load8_noabort
>   move_vma.isra.0
>   move_vma at /usr/src/linux-next/mm/mremap.c:714
>   __do_sys_mremap
>   __arm64_sys_mremap
>   invoke_syscall
>   el0_svc_common.constprop.0
>   do_el0_svc
>   el0_svc
>   el0t_64_sync_handler
>   el0t_64_sync
> 
>  Allocated by task 1280:
>   kasan_save_stack
>   __kasan_slab_alloc
>   slab_post_alloc_hook
>   kmem_cache_alloc
>   vm_area_alloc
>   vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455
>   mmap_region
>   mmap_region at /usr/src/linux-next/mm/mmap.c:2585
>   do_mmap
>   vm_mmap_pgoff
>   ksys_mmap_pgoff
>   __arm64_sys_mmap
>   invoke_syscall
>   el0_svc_common.constprop.0
>   do_el0_svc
>   el0_svc
>   el0t_64_sync_handler
>   el0t_64_sync
> 
>  Freed by task 1280:
>   kasan_save_stack
>   kasan_set_track
>   kasan_set_free_info
>   __kasan_slab_free
>   kmem_cache_free
>   vm_area_free
>   remove_vma
>   remove_vma at /usr/src/linux-next/mm/mmap.c:187
>   do_mas_align_munmap.isra.0
>   remove_mt at /usr/src/linux-next/mm/mmap.c:2176
>   (inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437
>   do_mas_munmap
>   do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483
>   do_munmap
>   move_vma.isra.0
>   __do_sys_mremap
>   __arm64_sys_mremap
>   invoke_syscall
>   el0_svc_common.constprop.0
>   do_el0_svc
>   el0_svc
>   el0t_64_sync_handler
>   el0t_64_sync
> 
>  The buggy address belongs to the object at ffff0009ce752aa8
>   which belongs to the cache vm_area_struct of size 144
>  The buggy address is located 32 bytes inside of
>   144-byte region [ffff0009ce752aa8, ffff0009ce752b38)
>  The buggy address belongs to the page:
>  page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750
>  head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0
>  memcg:ffff0009ce456e01
>  flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
>  raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380
>  raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01
>  page dumped because: kasan: bad access detected
> 
>  Memory state around the buggy address:
>   ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>   ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  >ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb
>                                                ^
>   ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
>   ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

Thank you for the report.  I will try to reproduce it with a kvm here.

Cheers,
Liam
Re: [PATCH v6 00/71] Introducing the Maple Tree
Posted by Liam Howlett 4 years, 4 months ago
* Liam Howlett <liam.howlett@oracle.com> [220225 14:09]:
> * Qian Cai <quic_qiancai@quicinc.com> [220224 22:49]:
> > On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> > > The maple tree is an RCU-safe range based B-tree designed to use modern
> > > processor cache efficiently.  There are a number of places in the kernel
> > > that a non-overlapping range-based tree would be beneficial, especially
> > > one with a simple interface.  The first user that is covered in this
> > > patch set is the vm_area_struct, where three data structures are
> > > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > > linked list of VMAs in the mm_struct.  The long term goal is to reduce
> > > or remove the mmap_sem contention.
> > > 
> > > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > > nodes.  With the increased branching factor, it is significantly shorter than
> > > the rbtree so it has fewer cache misses.  The removal of the linked list
> > > between subsequent entries also reduces the cache misses and the need to pull
> > > in the previous and next VMA during many tree alterations.
> > > 
> > > This patch is based on v5.17-rc4
> > > 
> > > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214
> > 
> > Just a heads-up. I noticed an use-after-free in today's linux-next below. I
> > am running out of time to fully triage this, but I noticed this is the only
> > series (not sure which revision) heavily touched mm/mmap.c recently there.
> > 
> >  BUG: KASAN: use-after-free in move_vma.isra.0
> >  Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280
> > 
> >  CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242
> >  Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> >  Call trace:
> >   dump_backtrace
> >   show_stack
> >   dump_stack_lvl
> >   print_address_description.constprop.0
> >   kasan_report
> >   __asan_report_load8_noabort
> >   move_vma.isra.0
> >   move_vma at /usr/src/linux-next/mm/mremap.c:714
> >   __do_sys_mremap
> >   __arm64_sys_mremap
> >   invoke_syscall
> >   el0_svc_common.constprop.0
> >   do_el0_svc
> >   el0_svc
> >   el0t_64_sync_handler
> >   el0t_64_sync
> > 
> >  Allocated by task 1280:
> >   kasan_save_stack
> >   __kasan_slab_alloc
> >   slab_post_alloc_hook
> >   kmem_cache_alloc
> >   vm_area_alloc
> >   vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455
> >   mmap_region
> >   mmap_region at /usr/src/linux-next/mm/mmap.c:2585
> >   do_mmap
> >   vm_mmap_pgoff
> >   ksys_mmap_pgoff
> >   __arm64_sys_mmap
> >   invoke_syscall
> >   el0_svc_common.constprop.0
> >   do_el0_svc
> >   el0_svc
> >   el0t_64_sync_handler
> >   el0t_64_sync
> > 
> >  Freed by task 1280:
> >   kasan_save_stack
> >   kasan_set_track
> >   kasan_set_free_info
> >   __kasan_slab_free
> >   kmem_cache_free
> >   vm_area_free
> >   remove_vma
> >   remove_vma at /usr/src/linux-next/mm/mmap.c:187
> >   do_mas_align_munmap.isra.0
> >   remove_mt at /usr/src/linux-next/mm/mmap.c:2176
> >   (inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437
> >   do_mas_munmap
> >   do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483
> >   do_munmap
> >   move_vma.isra.0
> >   __do_sys_mremap
> >   __arm64_sys_mremap
> >   invoke_syscall
> >   el0_svc_common.constprop.0
> >   do_el0_svc
> >   el0_svc
> >   el0t_64_sync_handler
> >   el0t_64_sync
> > 
> >  The buggy address belongs to the object at ffff0009ce752aa8
> >   which belongs to the cache vm_area_struct of size 144
> >  The buggy address is located 32 bytes inside of
> >   144-byte region [ffff0009ce752aa8, ffff0009ce752b38)
> >  The buggy address belongs to the page:
> >  page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750
> >  head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0
> >  memcg:ffff0009ce456e01
> >  flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
> >  raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380
> >  raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01
> >  page dumped because: kasan: bad access detected
> > 
> >  Memory state around the buggy address:
> >   ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> >   ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> >  >ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb
> >                                                ^
> >   ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
> >   ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> 
> Thank you for the report.  I will try to reproduce it with a kvm here.
> 

I just booted an arm64 VM with my build and kasan enabled with no issue.
Could you please send me your config file for the build?

Thanks,
Liam


Re: [PATCH v6 00/71] Introducing the Maple Tree
Posted by Qian Cai 4 years, 4 months ago
On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> I just booted an arm64 VM with my build and kasan enabled with no issue.
> Could you please send me your config file for the build?

On linux-next, I just do:

$ make arch=arm64 defconfig debug.config [1]

Then, I just generate some memory pressume into swapping/OOM Killer to
trigger it.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config
Re: [PATCH v6 00/71] Introducing the Maple Tree
Posted by Vasily Gorbik 4 years, 4 months ago
On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently.  There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface.  The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct.  The long term goal is to reduce
> or remove the mmap_sem contention.
> 
> The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> nodes.  With the increased branching factor, it is significantly shorter than
> the rbtree so it has fewer cache misses.  The removal of the linked list
> between subsequent entries also reduces the cache misses and the need to pull
> in the previous and next VMA during many tree alterations.
> 
> This patch is based on v5.17-rc4
> 
> git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214

Hi Liam,

this patch series completely breaks s390. Besides endianess issue
in maple trees reported here:

https://lore.kernel.org/all/your-ad-here.call-01645924312-ext-0398@work.hours/

and with this endianess issue fixed (fixup from here ^^^) we still get numerous
KASAN reports starting with

[PATCH v6 10/71] mm: Start tracking VMAs with maple tree

mostly looking like this:

 BUG: KASAN: use-after-free in mas_descend_adopt+0x113a/0x1408
 Read of size 8 at addr 00000000b1d4e900 by task cat/610

 CPU: 34 PID: 610 Comm: cat Tainted: G        W         5.17.0-rc4-91313-g592c3b299aad #1
 Hardware name: IBM 3906 M04 704 (LPAR)
 Call Trace:
  [<000000000234647a>] dump_stack_lvl+0xfa/0x150
  [<0000000002328154>] print_address_description.constprop.0+0x64/0x340
  [<00000000008f5ba6>] kasan_report+0x13e/0x1a8
  [<00000000017f0c12>] mas_descend_adopt+0x113a/0x1408
  [<00000000018076ec>] mas_spanning_rebalance.isra.0+0x5164/0x6a20
  [<000000000180a71e>] mas_wr_spanning_store.isra.0+0x476/0xbc8
  [<00000000018189bc>] mas_store_gfp+0xd4/0x188
  [<0000000000827bae>] vma_mt_szero+0x146/0x368
  [<000000000082e990>] __do_munmap+0x340/0xe20
  [<000000000082f580>] __vm_munmap+0x110/0x1e8
  [<000000000082f76e>] __s390x_sys_munmap+0x6e/0x90
  [<000000000010dc6c>] do_syscall+0x22c/0x328
  [<000000000234d3d2>] __do_syscall+0x9a/0xf8
  [<0000000002374ed2>] system_call+0x82/0xb0
 INFO: lockdep is turned off.

 Allocated by task 610:
  kasan_save_stack+0x34/0x58
  __kasan_slab_alloc+0x84/0xa8
  kmem_cache_alloc+0x20c/0x520
  mas_alloc_nodes+0x26a/0x4c8
  mas_split.isra.0+0x2aa/0x1418
  mas_wr_modify+0x3fa/0xd28
  mas_store_gfp+0xd4/0x188
  vma_store+0x17a/0x3d8
  vma_link+0xac/0x798
  mmap_region+0xa5a/0x10b8
  do_mmap+0x7c2/0xa90
  vm_mmap_pgoff+0x186/0x250
  ksys_mmap_pgoff+0x334/0x400
  __s390x_sys_old_mmap+0xf4/0x130
  do_syscall+0x22c/0x328
  __do_syscall+0x9a/0xf8
  system_call+0x82/0xb0
 Freed by task 610:
  kasan_save_stack+0x34/0x58
  kasan_set_track+0x36/0x48
  kasan_set_free_info+0x34/0x58
  ____kasan_slab_free+0x11c/0x188
  __kasan_slab_free+0x24/0x30
  kmem_cache_free_bulk.part.0+0xec/0x538
  mas_destroy+0x2e4/0x710
  mas_store_gfp+0xf4/0x188
  vma_mt_szero+0x146/0x368
  __vma_adjust+0x155a/0x2188
  __split_vma+0x228/0x450
  mprotect_fixup+0x4f2/0x618
  do_mprotect_pkey.constprop.0+0x328/0x600
  __s390x_sys_mprotect+0x8e/0xb8
  do_syscall+0x22c/0x328
  __do_syscall+0x9a/0xf8
  system_call+0x82/0xb0

 The buggy address belongs to the object at 00000000b1d4e900
  which belongs to the cache maple_node of size 256
 The buggy address is located 0 bytes inside of
  256-byte region [00000000b1d4e900, 00000000b1d4ea00)
 The buggy address belongs to the page:
 page:0000400002c75200 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xb1d48
 head:0000400002c75200 order:3 compound_mapcount:0 compound_pincount:0
 flags: 0x3ffff00000010200(slab|head|node=0|zone=1|lastcpupid=0x1ffff)
 raw: 3ffff00000010200 0000400002c91e08 000040000263b608 000000008009ed00
 raw: 0000000000000000 0020004000000000 ffffffff00000001 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  00000000b1d4e800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  00000000b1d4e880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >00000000b1d4e900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                    ^
  00000000b1d4e980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  00000000b1d4ea00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

with eventual crash.

How come none of s390 architecture maintainers nor s390 mailing list were
on CC for the changes which affect s390 to that magnitude? None of this
has apparently been tested on s390 or any other big endian system. And
a shoulder tap to give it try would be helpful.

Now we are just starting looking at the problems. And until issues
are resolved this patch series has to be dropped from linux-next.
Re: [PATCH v6 00/71] Introducing the Maple Tree
Posted by Liam Howlett 4 years, 3 months ago
* Vasily Gorbik <gor@linux.ibm.com> [220226 21:23]:
> On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently.  There are a number of places in the kernel
> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface.  The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct.  The long term goal is to reduce
> > or remove the mmap_sem contention.
> > 
> > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > nodes.  With the increased branching factor, it is significantly shorter than
> > the rbtree so it has fewer cache misses.  The removal of the linked list
> > between subsequent entries also reduces the cache misses and the need to pull
> > in the previous and next VMA during many tree alterations.
> > 
> > This patch is based on v5.17-rc4
> > 
> > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214
> 
> Hi Liam,
> 
> this patch series completely breaks s390. Besides endianess issue
> in maple trees reported here:
> 
> https://lore.kernel.org/all/your-ad-here.call-01645924312-ext-0398@work.hours/
> 
> and with this endianess issue fixed (fixup from here ^^^) we still get numerous
> KASAN reports starting with
> 
> [PATCH v6 10/71] mm: Start tracking VMAs with maple tree

Thanks for looking at this.  That commit had two issues which I was
debugging on parisc for most of last week.  The first is that I was not
actually writing the stack expansion to the maple tree in that commit -
it was added in a later commit.  The second issue which was most likely
causing your crashes was the calculation for the gap in
unmapped_area_topdown().

With the maple tree (3dea5db7fbb5), I am able to boot parisc and s390
virtual machines.  Although, the issue which Hugh discusses in detail
still exists [1].  The latest maple tree can be found on github [2].  I
am still investigating why the heap changed locations on parisc to
ensure it's not of issue.

1: https://lore.kernel.org/all/5f8f4f-ad63-eb-fd73-d48748af8a76@google.com/
2: https://github.com/oracle/linux-uek/tree/maple/mainline

Regards,
Liam