mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree

[PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
Posted by Pranjal Arya 4 hours ago
vmalloc's free/busy/lazy area tracking is one of the last remaining
augmented-rb_tree consumers in the core mm allocators. The rest of
mm/ has been gradually consolidating range-keyed indexing around
maple_tree (notably the per-process VMA tree in mm/mmap.c), and
the underlying reason is a structural mismatch between rb_tree and
range tracking:

- rb_tree is a binary tree with a single entry per node. A lookup
  walks log2(N) nodes via a pointer chase that touches log2(N)
  cache lines, with one comparison per node. Range queries are not
  native to rb_tree; vmalloc has historically maintained them by
  augmenting every node with a subtree_max_size field whose value
  has to be recomputed on every insert, erase, and rotation via a
  custom callback. That callback machinery is vmalloc-specific
  code that exists only to coax range semantics out of a
  binary-keyed tree.
- maple_tree stores up to 16 ranges per node (fanout f=16), so a
  lookup walks ~log16(N) nodes in tightly-packed pivot/slot arrays
  that are far more cache-friendly. Range queries are first-class
  (mas_empty_area, mas_find, mas_erase), with no augmentation
  callback to maintain. RCU traversal and sentinel range storage
  (XA_ZERO_ENTRY) are part of the data structure's contract, not
  bolted on by the consumer.

For the vmalloc allocator's hot paths this means shallower walks
under the same N, fewer cache lines touched per lookup, and no
custom augmented-callback machinery to maintain. Aligning vmalloc
with the same range-tree direction the rest of mm/ has taken
collapses the augmented gap walker to a single mas_empty_area()
descent, retires the augmented rb_node from struct vmap_area
(-24 bytes per object on 64-bit), and exposes the range,
sentinel, and RCU primitives needed for a per-CPU range cache
that the augmented rb_tree could not host cleanly.

This series completes the vmalloc internal migration from rb-tree
indexed tracking to maple-tree indexed tracking for free, busy, and
lazy vmap_area range management.

The series removes vmalloc's internal rb-tree dependencies and moves
address indexing and range lookups/scans to maple-tree-backed paths,
while keeping ordered list neighbour traversal where coalescing
semantics require stable predecessor/successor behaviour.

In addition to the direct rb -> maple migration, the series includes
robustness and scalability refinements in the same code path:

- deferred/lazy maple bring-up to avoid early-boot allocator hazards
- maple-assisted ordered-list insertion for busy/lazy tracking
- mas_preallocate / mas_store_prealloc fast path for common-case
  publish work, with a non-indexed retry queue that absorbs the
  rare publish-under-pressure case without leaking or panicking
- single mas_store(NULL, ...) sub-range trim in va_clip() in place
  of an erase-and-restore pair when narrowing a free-area entry
- single mas_erase() for the busy-tree find-then-unlink pair on the
  free path
- consolidation of in-use ranges as a single authoritative index on
  the steady-state allocation hot path
- list_head representation of the lazy-purge queue, since that queue
  is bulk-drained and has no address-keyed query
- per-CPU bump-allocator overlay layered on top of the migrated
  indices for short-lived, page-aligned, common-case allocations
  (design and chunk-size derivation in the 0010-0012 commit
  messages)
- explicit lock/serialisation behaviour preserved (no lock removed)

Primary advantages
==================

- struct vmap_area shrinks by 24 bytes on 64-bit layouts (72 -> 48),
  removing the embedded augmented rb_node and the subtree_max_size
  field that the rb-tree gap walk depended on. Verified with pahole
  on arm64.
- maple_tree's per-node fanout (multiple pivots/slots per node)
  replaces a binary rb-tree descent for indexed lookups, giving a
  shallower walk for the same allocation count.
- alloc-side gap finding moves from a recursive augmented-rb walk to
  mas_empty_area() over the in-use range index, returning the lowest
  matching gap in a single descent.
- vfree of a chunk-resident allocation through the per-CPU overlay
  resolves addr -> vmap_area in O(1) via the chunk's back-pointer
  array, with a bounded fast-reject for addresses outside any
  reserved chunk; the maple-tree busy lookup remains the fallback.
- correctness behaviour preserved: ordered list neighbour traversal
  for coalescing remains; the locking/serialisation model is
  unchanged; lockdep is silent across the test_vmalloc subtest sweep.
- robustness in bring-up and high churn: deferred/lazy maple
  initialisation avoids early-boot allocator hazards; the retry
  queue keeps publish failures under GFP_NOWAIT pressure correct
  without leaking or panicking.

Real-silicon validation
=======================

The series was tested on Qualcomm Snapdragon X1E80100.
The patched kernel was booted on the device against an RB baseline
image of the same kernel revision, and exercised through:

- GFXBench, run for several hours of sustained graphics workload;
  the patched kernel ran clean throughout, with throughput matching
  the RB baseline within run-to-run noise.
- boot-time module loading via the finit_module / kernel_read_file
  path that exercises the bump-allocator's indexed write loop;
  this path drove the patch 0012 hardening, and the patched kernel
  is UBSAN-clean here.
- repeated insmod / rmmod cycles to soak the chunk install / drain
  paths under live workload.

No kernel WARN, BUG, or UBSAN report was observed across the
multi-hour soak.

Quantified validated performance improvements
=============================================

Numbers below compare the current upstream rb-tree-based vmalloc
allocator (baseline) against the final state after applying all 12
patches in this series (patched). Each row reports the median of N
reps under sequential QEMU execution (PARALLEL=1, taskset-pinned
CPUs, nice -n 5) on a production kernel configuration
(CONFIG_DEBUG_VM=n, CONFIG_KASAN=n, CONFIG_LOCKDEP=n,
CONFIG_UBSAN=n). The baseline label is RB; the patched label is MT.

Aggregate observation:

- x86_64 KVM contention sweep (smp=8, 30 reps): improvement on every
  measured nt x mask configuration, range +5.1 % to +43.6 %.
- x86_64 KVM saturation sweep (smp=16, 15 reps): improvement on every
  measured nt x mask configuration, range +28.2 % to +64.9 %.
- arm64 cortex-a72 TCG contention sweep (smp=8, 15 reps):
  improvement on every measured nt x mask configuration, range
  +19.5 % to +32.6 %.
- functional sweep (1 rep, all 12 test_vmalloc masks): all 11
  passing subtests pass on RB and MT; the 12th (mask=64,
  align_shift_alloc_test) is xfail by design and remains xfail on MT.

x86_64 observed faster cases (sorted by gain):

- full_fit_alloc_test, smp=16 nt=16:                  +64.9 %
- long_busy_list_alloc_test, smp=16 nt=16:            +62.9 %
- full_fit_alloc_test, smp=16 nt=8:                   +51.7 %
- long_busy_list_alloc_test, smp=16 nt=8:             +49.3 %
- long_busy_list_alloc_test, smp=8 nt=8:              +43.6 %
- full_fit_alloc_test, smp=8 nt=4:                    +40.4 %
- full_fit_alloc_test, smp=8 nt=8:                    +40.0 %
- full_fit_alloc_test, smp=16 nt=4:                   +38.2 %
- full_fit_alloc_test, smp=8 nt=2:                    +31.2 %
- fix_align_alloc_test, functional sweep:             +30.2 %
- long_busy_list_alloc_test, smp=8 nt=4:              +29.3 %
- long_busy_list_alloc_test, smp=16 nt=4:             +28.2 %
- full_fit_alloc_test, smp=4 nt=2:                    +26.5 %
- full_fit_alloc_test, smp=8 nt=1:                    +20.4 %
- long_busy_list_alloc_test, smp=8 nt=2:              +17.7 %
- long_busy_list_alloc_test, smp=4 nt=2:              +15.3 %

arm64 observed faster cases (sorted by gain):

- long_busy_list_alloc_test, smp=8 nt=8:              +32.6 %
- long_busy_list_alloc_test, smp=8 nt=4:              +26.7 %
- fix_size_alloc_test, functional sweep:              +26.0 %
- full_fit_alloc_test, smp=8 nt=4:                    +25.5 %
- full_fit_alloc_test, smp=8 nt=2:                    +23.4 %
- full_fit_alloc_test, smp=8 nt=1:                    +23.2 %
- full_fit_alloc_test, smp=8 nt=8:                    +22.3 %
- full_fit_alloc_test, functional sweep:              +21.9 %
- long_busy_list_alloc_test, smp=8 nt=2:              +21.2 %
- random_size_align_alloc_test, functional sweep:     +19.9 %
- long_busy_list_alloc_test, smp=8 nt=1:              +19.5 %
- no_block_alloc_test, functional sweep:              +18.0 %
- full_fit_alloc_test, smp=4 nt=2:                    +17.3 %
- kvfree_rcu_1_arg_vmalloc_test, functional sweep:    +16.9 %
- long_busy_list_alloc_test, smp=4 nt=2:              +16.2 %
- random_size_align_alloc_test, smp=4 nt=2:           +14.3 %
- long_busy_list_alloc_test, functional sweep:        +12.6 %
- fix_align_alloc_test, functional sweep:             +12.4 %
- kvfree_rcu_2_arg_vmalloc_test, functional sweep:    +11.0 %
- pcpu_alloc_test, smp=4 nt=2:                        +6.9 %

Memory impact quantification
============================

struct vmap_area shrinks by 24 bytes per object on 64-bit layouts
(72 -> 48; verified with pahole on the patched build vs the RB
baseline). The shrink comes from removing the embedded augmented
rb_node (24 bytes) and the subtree_max_size union member (8 bytes,
formerly union with vm). list_head (16 bytes) and vm (8 bytes) are
retained.

Aggregate savings scale linearly with the number of concurrent live
vmap_area objects on the system.

mm/vmalloc.o text grows by ~47 KB on x86_64 / ~95 KB on arm64
(absorbing the maple-tree call sites and the per-CPU overlay), but
the linker-page-aligned bzImage / Image size is unchanged within
rounding.

QEMU validation and synthetic benchmarks
========================================

In addition to the real-silicon soak, the series was exercised
under QEMU (KVM on x86_64, TCG on arm64):

11 of 12 test_vmalloc subtests PASS on x86_64 and arm64 at series
HEAD. mask=64 (align_shift_alloc_test) reports xfailed=1 by design;
the same xfail is observed on the unpatched RB baseline.

KASAN, lockdep, CONFIG_DEBUG_VM=y, PROVE_LOCKING and CONFIG_UBSAN
boots are clean on both architectures with
test_vmalloc.run_test_mask=0xFFF (full subtest sweep).

Each commit individually compiles mm/vmalloc.o on x86_64 and arm64.
checkpatch.pl --strict reports no errors across the series.

Methodology limitations and reproducibility caveats:

- arm64 numbers in the validated wins above come from cortex-a72
  TCG emulation, so production arm64 silicon
  would be expected to track somewhere between these TCG numbers
  and the x86_64 KVM numbers.
- Independent replay can change the ranking or sign of small
  sub-millisecond deltas in test workloads whose absolute runtime
  is below ~1 ms at the configured test_loop_count; the headline
  workloads (full_fit_alloc_test, long_busy_list_alloc_test) run
  in the tens-to-hundreds-of-ms range and are not affected.

Reproduce
=========

Apply the series with git am, then build with TEST_VMALLOC=y in a
production-config kernel and boot under QEMU:

  $ git am 00*.patch
  $ make defconfig
  $ scripts/config -e TEST_VMALLOC -d DEBUG_VM -d KASAN
  $ make olddefconfig
  $ make -j$(nproc) bzImage
  $ # boot in QEMU with test_vmalloc.run_test_mask=...

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
Pranjal Arya (12):
      mm/vmalloc: introduce maple_tree-based indexing for vmap_area
      mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree
      mm/vmalloc: convert free, purge, and pcpu paths to maple_tree
      mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area
      mm/vmalloc: tighten failure handling under memory pressure
      mm/vmalloc: tighten alloc/free hot paths
      mm/vmalloc: consolidate occupied tree as authoritative index on hot path
      mm/vmalloc: track lazy-purge queue as a list_head
      mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase
      mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator
      mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject
      mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds

 include/linux/vmalloc.h |   16 +-
 lib/maple_tree.c        |    7 +
 mm/vmalloc.c            | 2378 ++++++++++++++++++++++++++++++++++-------------
 3 files changed, 1748 insertions(+), 653 deletions(-)
---
base-commit: c425609d6ac4012c8bbf01ec2e10e801b1923a7b
change-id: 20260613-vmalloc_maple-9c96fee8eace

Best regards,
--  
Pranjal Arya <pranjal.arya@oss.qualcomm.com>