include/linux/vmalloc.h | 16 +- lib/maple_tree.c | 7 + mm/vmalloc.c | 2378 ++++++++++++++++++++++++++++++++++------------- 3 files changed, 1748 insertions(+), 653 deletions(-)
vmalloc's free/busy/lazy area tracking is one of the last remaining
augmented-rb_tree consumers in the core mm allocators. The rest of
mm/ has been gradually consolidating range-keyed indexing around
maple_tree (notably the per-process VMA tree in mm/mmap.c), and
the underlying reason is a structural mismatch between rb_tree and
range tracking:
- rb_tree is a binary tree with a single entry per node. A lookup
walks log2(N) nodes via a pointer chase that touches log2(N)
cache lines, with one comparison per node. Range queries are not
native to rb_tree; vmalloc has historically maintained them by
augmenting every node with a subtree_max_size field whose value
has to be recomputed on every insert, erase, and rotation via a
custom callback. That callback machinery is vmalloc-specific
code that exists only to coax range semantics out of a
binary-keyed tree.
- maple_tree stores up to 16 ranges per node (fanout f=16), so a
lookup walks ~log16(N) nodes in tightly-packed pivot/slot arrays
that are far more cache-friendly. Range queries are first-class
(mas_empty_area, mas_find, mas_erase), with no augmentation
callback to maintain. RCU traversal and sentinel range storage
(XA_ZERO_ENTRY) are part of the data structure's contract, not
bolted on by the consumer.
For the vmalloc allocator's hot paths this means shallower walks
under the same N, fewer cache lines touched per lookup, and no
custom augmented-callback machinery to maintain. Aligning vmalloc
with the same range-tree direction the rest of mm/ has taken
collapses the augmented gap walker to a single mas_empty_area()
descent, retires the augmented rb_node from struct vmap_area
(-24 bytes per object on 64-bit), and exposes the range,
sentinel, and RCU primitives needed for a per-CPU range cache
that the augmented rb_tree could not host cleanly.
This series completes the vmalloc internal migration from rb-tree
indexed tracking to maple-tree indexed tracking for free, busy, and
lazy vmap_area range management.
The series removes vmalloc's internal rb-tree dependencies and moves
address indexing and range lookups/scans to maple-tree-backed paths,
while keeping ordered list neighbour traversal where coalescing
semantics require stable predecessor/successor behaviour.
In addition to the direct rb -> maple migration, the series includes
robustness and scalability refinements in the same code path:
- deferred/lazy maple bring-up to avoid early-boot allocator hazards
- maple-assisted ordered-list insertion for busy/lazy tracking
- mas_preallocate / mas_store_prealloc fast path for common-case
publish work, with a non-indexed retry queue that absorbs the
rare publish-under-pressure case without leaking or panicking
- single mas_store(NULL, ...) sub-range trim in va_clip() in place
of an erase-and-restore pair when narrowing a free-area entry
- single mas_erase() for the busy-tree find-then-unlink pair on the
free path
- consolidation of in-use ranges as a single authoritative index on
the steady-state allocation hot path
- list_head representation of the lazy-purge queue, since that queue
is bulk-drained and has no address-keyed query
- per-CPU bump-allocator overlay layered on top of the migrated
indices for short-lived, page-aligned, common-case allocations
(design and chunk-size derivation in the 0010-0012 commit
messages)
- explicit lock/serialisation behaviour preserved (no lock removed)
Primary advantages
==================
- struct vmap_area shrinks by 24 bytes on 64-bit layouts (72 -> 48),
removing the embedded augmented rb_node and the subtree_max_size
field that the rb-tree gap walk depended on. Verified with pahole
on arm64.
- maple_tree's per-node fanout (multiple pivots/slots per node)
replaces a binary rb-tree descent for indexed lookups, giving a
shallower walk for the same allocation count.
- alloc-side gap finding moves from a recursive augmented-rb walk to
mas_empty_area() over the in-use range index, returning the lowest
matching gap in a single descent.
- vfree of a chunk-resident allocation through the per-CPU overlay
resolves addr -> vmap_area in O(1) via the chunk's back-pointer
array, with a bounded fast-reject for addresses outside any
reserved chunk; the maple-tree busy lookup remains the fallback.
- correctness behaviour preserved: ordered list neighbour traversal
for coalescing remains; the locking/serialisation model is
unchanged; lockdep is silent across the test_vmalloc subtest sweep.
- robustness in bring-up and high churn: deferred/lazy maple
initialisation avoids early-boot allocator hazards; the retry
queue keeps publish failures under GFP_NOWAIT pressure correct
without leaking or panicking.
Real-silicon validation
=======================
The series was tested on Qualcomm Snapdragon X1E80100.
The patched kernel was booted on the device against an RB baseline
image of the same kernel revision, and exercised through:
- GFXBench, run for several hours of sustained graphics workload;
the patched kernel ran clean throughout, with throughput matching
the RB baseline within run-to-run noise.
- boot-time module loading via the finit_module / kernel_read_file
path that exercises the bump-allocator's indexed write loop;
this path drove the patch 0012 hardening, and the patched kernel
is UBSAN-clean here.
- repeated insmod / rmmod cycles to soak the chunk install / drain
paths under live workload.
No kernel WARN, BUG, or UBSAN report was observed across the
multi-hour soak.
Quantified validated performance improvements
=============================================
Numbers below compare the current upstream rb-tree-based vmalloc
allocator (baseline) against the final state after applying all 12
patches in this series (patched). Each row reports the median of N
reps under sequential QEMU execution (PARALLEL=1, taskset-pinned
CPUs, nice -n 5) on a production kernel configuration
(CONFIG_DEBUG_VM=n, CONFIG_KASAN=n, CONFIG_LOCKDEP=n,
CONFIG_UBSAN=n). The baseline label is RB; the patched label is MT.
Aggregate observation:
- x86_64 KVM contention sweep (smp=8, 30 reps): improvement on every
measured nt x mask configuration, range +5.1 % to +43.6 %.
- x86_64 KVM saturation sweep (smp=16, 15 reps): improvement on every
measured nt x mask configuration, range +28.2 % to +64.9 %.
- arm64 cortex-a72 TCG contention sweep (smp=8, 15 reps):
improvement on every measured nt x mask configuration, range
+19.5 % to +32.6 %.
- functional sweep (1 rep, all 12 test_vmalloc masks): all 11
passing subtests pass on RB and MT; the 12th (mask=64,
align_shift_alloc_test) is xfail by design and remains xfail on MT.
x86_64 observed faster cases (sorted by gain):
- full_fit_alloc_test, smp=16 nt=16: +64.9 %
- long_busy_list_alloc_test, smp=16 nt=16: +62.9 %
- full_fit_alloc_test, smp=16 nt=8: +51.7 %
- long_busy_list_alloc_test, smp=16 nt=8: +49.3 %
- long_busy_list_alloc_test, smp=8 nt=8: +43.6 %
- full_fit_alloc_test, smp=8 nt=4: +40.4 %
- full_fit_alloc_test, smp=8 nt=8: +40.0 %
- full_fit_alloc_test, smp=16 nt=4: +38.2 %
- full_fit_alloc_test, smp=8 nt=2: +31.2 %
- fix_align_alloc_test, functional sweep: +30.2 %
- long_busy_list_alloc_test, smp=8 nt=4: +29.3 %
- long_busy_list_alloc_test, smp=16 nt=4: +28.2 %
- full_fit_alloc_test, smp=4 nt=2: +26.5 %
- full_fit_alloc_test, smp=8 nt=1: +20.4 %
- long_busy_list_alloc_test, smp=8 nt=2: +17.7 %
- long_busy_list_alloc_test, smp=4 nt=2: +15.3 %
arm64 observed faster cases (sorted by gain):
- long_busy_list_alloc_test, smp=8 nt=8: +32.6 %
- long_busy_list_alloc_test, smp=8 nt=4: +26.7 %
- fix_size_alloc_test, functional sweep: +26.0 %
- full_fit_alloc_test, smp=8 nt=4: +25.5 %
- full_fit_alloc_test, smp=8 nt=2: +23.4 %
- full_fit_alloc_test, smp=8 nt=1: +23.2 %
- full_fit_alloc_test, smp=8 nt=8: +22.3 %
- full_fit_alloc_test, functional sweep: +21.9 %
- long_busy_list_alloc_test, smp=8 nt=2: +21.2 %
- random_size_align_alloc_test, functional sweep: +19.9 %
- long_busy_list_alloc_test, smp=8 nt=1: +19.5 %
- no_block_alloc_test, functional sweep: +18.0 %
- full_fit_alloc_test, smp=4 nt=2: +17.3 %
- kvfree_rcu_1_arg_vmalloc_test, functional sweep: +16.9 %
- long_busy_list_alloc_test, smp=4 nt=2: +16.2 %
- random_size_align_alloc_test, smp=4 nt=2: +14.3 %
- long_busy_list_alloc_test, functional sweep: +12.6 %
- fix_align_alloc_test, functional sweep: +12.4 %
- kvfree_rcu_2_arg_vmalloc_test, functional sweep: +11.0 %
- pcpu_alloc_test, smp=4 nt=2: +6.9 %
Memory impact quantification
============================
struct vmap_area shrinks by 24 bytes per object on 64-bit layouts
(72 -> 48; verified with pahole on the patched build vs the RB
baseline). The shrink comes from removing the embedded augmented
rb_node (24 bytes) and the subtree_max_size union member (8 bytes,
formerly union with vm). list_head (16 bytes) and vm (8 bytes) are
retained.
Aggregate savings scale linearly with the number of concurrent live
vmap_area objects on the system.
mm/vmalloc.o text grows by ~47 KB on x86_64 / ~95 KB on arm64
(absorbing the maple-tree call sites and the per-CPU overlay), but
the linker-page-aligned bzImage / Image size is unchanged within
rounding.
QEMU validation and synthetic benchmarks
========================================
In addition to the real-silicon soak, the series was exercised
under QEMU (KVM on x86_64, TCG on arm64):
11 of 12 test_vmalloc subtests PASS on x86_64 and arm64 at series
HEAD. mask=64 (align_shift_alloc_test) reports xfailed=1 by design;
the same xfail is observed on the unpatched RB baseline.
KASAN, lockdep, CONFIG_DEBUG_VM=y, PROVE_LOCKING and CONFIG_UBSAN
boots are clean on both architectures with
test_vmalloc.run_test_mask=0xFFF (full subtest sweep).
Each commit individually compiles mm/vmalloc.o on x86_64 and arm64.
checkpatch.pl --strict reports no errors across the series.
Methodology limitations and reproducibility caveats:
- arm64 numbers in the validated wins above come from cortex-a72
TCG emulation, so production arm64 silicon
would be expected to track somewhere between these TCG numbers
and the x86_64 KVM numbers.
- Independent replay can change the ranking or sign of small
sub-millisecond deltas in test workloads whose absolute runtime
is below ~1 ms at the configured test_loop_count; the headline
workloads (full_fit_alloc_test, long_busy_list_alloc_test) run
in the tens-to-hundreds-of-ms range and are not affected.
Reproduce
=========
Apply the series with git am, then build with TEST_VMALLOC=y in a
production-config kernel and boot under QEMU:
$ git am 00*.patch
$ make defconfig
$ scripts/config -e TEST_VMALLOC -d DEBUG_VM -d KASAN
$ make olddefconfig
$ make -j$(nproc) bzImage
$ # boot in QEMU with test_vmalloc.run_test_mask=...
Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
Pranjal Arya (12):
mm/vmalloc: introduce maple_tree-based indexing for vmap_area
mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree
mm/vmalloc: convert free, purge, and pcpu paths to maple_tree
mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area
mm/vmalloc: tighten failure handling under memory pressure
mm/vmalloc: tighten alloc/free hot paths
mm/vmalloc: consolidate occupied tree as authoritative index on hot path
mm/vmalloc: track lazy-purge queue as a list_head
mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase
mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator
mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject
mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds
include/linux/vmalloc.h | 16 +-
lib/maple_tree.c | 7 +
mm/vmalloc.c | 2378 ++++++++++++++++++++++++++++++++++-------------
3 files changed, 1748 insertions(+), 653 deletions(-)
---
base-commit: c425609d6ac4012c8bbf01ec2e10e801b1923a7b
change-id: 20260613-vmalloc_maple-9c96fee8eace
Best regards,
--
Pranjal Arya <pranjal.arya@oss.qualcomm.com>
© 2016 - 2026 Red Hat, Inc.