A memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).
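To make the two levels concrete, here is a minimal C sketch of the
concept. It is an illustration only, not code from this series; the
names list_node, folio_lru and memcg_lru are made up:

  /* Illustrative model only, not the kernel implementation. */
  struct list_node { struct list_node *prev, *next; };

  /* One per node and memcg combination: an LRU of folios
     (cf. mem_cgroup_lruvec()). */
  struct folio_lru {
      struct list_node folios;    /* the folios owned by this lruvec */
      struct list_node lru_link;  /* links this lruvec into the memcg LRU */
  };

  /* One per node: the memcg LRU, i.e., an LRU of the LRUs above. */
  struct memcg_lru {
      struct list_node gen[2];    /* gen[0]: old memcgs, gen[1]: young memcgs */
  };

Global reclaim then walks memcg_lru, while each folio_lru keeps its
own per-folio generations as before.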
The goal of the memcg LRU is to improve the scalability of global
reclaim, which is critical to system-wide memory overcommit in data
centers. Note that memcg reclaim is currently out of scope.
Its memory overhead is one pointer per lruvec, which is negligible
for each node. In terms of traversing memcgs during global reclaim,
it improves the best-case complexity from O(n) to O(1) and does not
affect the worst-case complexity, which remains O(n). Therefore, on
average, it has sublinear complexity, in contrast to the current
linear complexity.
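As a rough sketch of why the best case becomes O(1), using the
illustrative types above: a reclaimer starts at the head of the old
generation and stops as soon as it has reclaimed enough, visiting all
n memcgs only when every one of them falls short. reclaim_from() is a
placeholder, not a function added by this series:

  /* Hypothetical traversal over the illustrative memcg_lru above. */
  extern unsigned long reclaim_from(struct list_node *lruvec, unsigned long nr);

  static unsigned long reclaim_old_memcgs(struct memcg_lru *lru,
                                          unsigned long nr_to_reclaim)
  {
      unsigned long reclaimed = 0;
      struct list_node *pos;

      /* start at the head of the old generation, i.e., the coldest memcgs */
      for (pos = lru->gen[0].next; pos != &lru->gen[0]; pos = pos->next) {
          reclaimed += reclaim_from(pos, nr_to_reclaim - reclaimed);
          if (reclaimed >= nr_to_reclaim)
              break;            /* best case: one memcg visited, O(1) */
      }
      return reclaimed;         /* worst case: all n memcgs visited, O(n) */
  }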
The basic structure of a memcg LRU can be understood by an analogy
to the active/inactive LRU (of folios), as sketched after this list:
1. It has the young and the old (generations);
2. Its linked lists have a head and a tail;
3. The increment of max_seq triggers promotion;
4. Other events, e.g., offlining a memcg, trigger similar
   operations.
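A minimal sketch of item 3, again using the illustrative types above
and the simplifying assumption that promotion moves an lruvec from
the old list to the head of the young list; the series encodes
generations with sequence counters rather than two fixed lists, so
this only spells out the analogy:

  /* Analogy only: promotion modeled as moving between two fixed lists. */
  static void memcg_lru_promote(struct memcg_lru *lru, struct folio_lru *lruvec)
  {
      struct list_node *link = &lruvec->lru_link;

      /* unlink from wherever it sits in the old generation... */
      link->prev->next = link->next;
      link->next->prev = link->prev;

      /* ...and insert it at the head of the young generation */
      link->next = lru->gen[1].next;
      link->prev = &lru->gen[1];
      lru->gen[1].next->prev = link;
      lru->gen[1].next = link;
  }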
In terms of global reclaim, it has two distinct features (see the
sketch after this list):
1. Sharding, which allows each thread to start at a random memcg (in
   the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out and
   reduces latency without affecting fairness over a longer time
   window.
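A rough sketch of both features, assuming the old generation is
split into a few bins and using rand() purely for illustration;
NR_BINS, reclaim_bin() and the exact bail-out condition are all made
up:

  #include <stdlib.h>

  #define NR_BINS 8  /* illustrative bin count, not taken from the series */

  extern unsigned long reclaim_bin(int bin, unsigned long nr);  /* placeholder */

  /* Each reclaimer starts at a random bin, so concurrent reclaimers mostly
     work on different memcgs (sharding); a direct reclaimer bails out once
     it has made some progress, and fairness evens out over time. */
  static unsigned long shard_and_reclaim(unsigned long nr_to_reclaim, int direct)
  {
      int i, first = rand() % NR_BINS;
      unsigned long reclaimed = 0;

      for (i = 0; i < NR_BINS; i++) {
          reclaimed += reclaim_bin((first + i) % NR_BINS,
                                   nr_to_reclaim - reclaimed);
          if (reclaimed >= nr_to_reclaim)
              break;
          if (direct && reclaimed)
              break;    /* bail out early to reduce direct reclaim latency */
      }
      return reclaimed;
  }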
The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221201223923.873696-7-yuzhao@google.com/
The following is a simple test to quickly verify its effectiveness.
More benchmarks are coming soon.
Test design:
1. Create multiple memcgs.
2. Each memcg contains a job (fio).
3. All jobs access the same amount of memory randomly.
4. The system does not experience global memory pressure.
5. Periodically write to the root memory.reclaim.
Desired outcome:
1. All memcgs have similar pgsteal, i.e.,
stddev(pgsteal)/mean(pgsteal) is close to 0%.
2. The total pgsteal is close to the total requested through
memory.reclaim, i.e., sum(pgsteal)/sum(requested) is close to
100%.
Actual outcome [1]:
          stddev(pgsteal)/mean(pgsteal)  sum(pgsteal)/sum(requested)
MGLRU off                           75%                         425%
MGLRU on                            20%                          95%
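For scale: the script below requests 600 * 256 MiB = 150 GiB of
reclaim in total from the root cgroup, so with perfect fairness each
of the 128 memcgs would account for roughly 150 GiB / 128 ~= 1.2 GiB
of pgsteal, and sum(pgsteal)/sum(requested) would be 100%.
(stddev/mean is the coefficient of variation, so 0% would mean every
memcg saw identical pgsteal.)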
####################################################################
MEMCGS=128

# create one memcg per fio job
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    mkdir /sys/fs/cgroup/memcg$memcg
done

start() {
    echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

    fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
        --filename=/dev/zero --size=1920M --rw=randrw \
        --rate=64m,64m --random_distribution=random \
        --fadvise_hint=0 --time_based --runtime=10h \
        --group_reporting --minimal
}

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    start &
done

sleep 600

# periodically request proactive reclaim from the root cgroup
for ((i = 0; i < 600; i++)); do
    echo 256m >/sys/fs/cgroup/memory.reclaim
    sleep 6
done

# report per-memcg reclaim activity
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
done
####################################################################
[1]: This was obtained from running the above script (which touches
     less than 256GB of memory) on an EPYC 7B13 with 512GB DRAM for
     over an hour.
Yu Zhao (8):
mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
mm: multi-gen LRU: remove eviction fairness safeguard
mm: multi-gen LRU: remove aging fairness safeguard
mm: multi-gen LRU: shuffle should_run_aging()
mm: multi-gen LRU: per-node lru_gen_folio lists
mm: multi-gen LRU: clarify scan_control flags
mm: multi-gen LRU: simplify arch_has_hw_pte_young() check
Documentation/mm/multigen_lru.rst | 8 +-
include/linux/memcontrol.h | 10 +
include/linux/mm_inline.h | 25 +-
include/linux/mmzone.h | 127 ++++-
mm/memcontrol.c | 16 +
mm/page_alloc.c | 1 +
mm/vmscan.c | 765 ++++++++++++++++++++----------
mm/workingset.c | 4 +-
8 files changed, 687 insertions(+), 269 deletions(-)
--
2.39.0.rc0.267.gcb52ba06e7-goog
[This is a resend. The original message was lost:
https://lore.kernel.org/r/20221219193613.998597-1-yuzhao@google.com/]
TLDR
====
SPECjbb2015 groups [1]    Critical jOPS (95% CI)    Max jOPS (95% CI)
---------------------------------------------------------------------
20                        NS                        NS
30                        +[4, 7]%                  NS
40                        OOM killed                OOM killed
Abbreviations
=============
CI: confidence interval
NS: no statistically significant difference
DUT: device under test
ATE: automatic test equipment
Rationale
=========
1. Java has been the most popular programming language for most of
the last two decades, according to the TIOBE Programming Community
Index [2].
2. Power ISA is the longest-lasting alternative to x86 for the server
segment [3].
3. SPECjbb2015 is the industry-standard benchmark for Java.
Hardware
========
DUT $ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 184
On-line CPU(s) list: 0-183
Model name: POWER9 (raw), altivec supported
Model: 2.2 (pvr 004e 1202)
Thread(s) per core: 4
Core(s) per socket: 23
Socket(s): 2
CPU max MHz: 3000.0000
CPU min MHz: 2300.0000
Caches (sum of all):
L1d: 1.4 MiB (46 instances)
L1i: 1.4 MiB (46 instances)
L2: 12 MiB (24 instances)
L3: 240 MiB (24 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-91
NUMA node1 CPU(s): 92-183
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Mitigation; RFI Flush, L1D private per thread
Mds: Not affected
Meltdown: Mitigation; RFI Flush, L1D private per thread
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Spectre v1: Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
Spectre v2: Mitigation; Indirect branch serialisation (kernel only), Indirect branch cache disabled, Software link stack flush
Srbds: Not affected
Tsx async abort: Not affected
DUT $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
node 0 size: 261659 MB
node 0 free: 259051 MB
node 1 cpus: 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183
node 1 size: 261713 MB
node 1 free: 257499 MB
node distances:
node 0 1
0: 10 40
1: 40 10
DUT $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB
DUT $ cat /sys/class/nvme/nvme0/numa_node
0
DUT $ cat /sys/class/nvme/nvme1/model
INTEL SSDPF21Q800GB
DUT $ cat /sys/class/nvme/nvme1/numa_node
1
Software
========
DUT $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"
DUT $ uname -a
Linux ppc 6.1.0-rc8-mglru #1 SMP Tue Dec 6 06:18:48 UTC 2022 ppc64le ppc64le ppc64le GNU/Linux
DUT $ cat /proc/swaps
Filename Type Size Used Priority
/dev/nvme0n1 partition 268435392 0 -2
/dev/nvme1n1 partition 268435392 0 -3
DUT $ java --version
openjdk 11.0.16 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu122.04, mixed mode)
DUT $ cat specjbb2015/version.txt
SPECjbb2015 1.03 (11/14/2019)
Procedure
=========
DUT $ cat run_specjbb2015.sh
# disable automatic NUMA balancing for the duration of the run
echo 0 >/proc/sys/kernel/numa_balancing

nodes=2
memcgs=$1

run() {
    memcg=$1
    path=/sys/fs/cgroup/memcg$memcg

    mkdir $path
    echo $BASHPID >$path/cgroup.procs

    # one backend and one txinjector per node, pinned to that node
    for ((node = 0; node < $nodes; node++)); do
        group=$((nodes * memcg + node))

        numactl -N $node -m $node java -jar specjbb2015.jar \
            -m backend -G GRP$group -J JVM0 &

        numactl -N $node -m $node java -jar specjbb2015.jar \
            -m txinjector -G GRP$group -J JVM1 &
    done

    wait
}

numactl -N 0 -m 0 java -Dspecjbb.group.count=$((nodes * memcgs)) \
    -Dspecjbb.controller.rtcurve.warmup.step=0.8 \
    -jar specjbb2015.jar -m multicontroller &

for ((memcg = 0; memcg < $memcgs; memcg++)); do
    run $memcg &
done

wait
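Note: the script takes the number of memcgs as its only argument,
and the group count passed to the controller is nodes * memcgs with
nodes=2, so the 20-, 30- and 40-group configurations in the TLDR
table presumably correspond to passing 10, 15 and 20 respectively.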
Results
=======
Critical jOPS (30 groups)
-------------------------
$ R
> a <- c(33786, 34903, 34254, 34608, 33149, 34530, 33867, 33691, 33284, 34490)
> b <- c(35192, 36691, 35771, 36399, 36321, 35177, 35792, 36145, 36594, 36207)
> t.test(a, b)
Welch Two Sample t-test
data: a and b
t = -7.8327, df = 17.828, p-value = 3.529e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2502.195 -1443.205
sample estimates:
mean of x mean of y
34056.2 36028.9
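Relative to mean(a) = 34056.2, the 95% confidence interval on the
difference, [1443.205, 2502.195], is roughly [4.2%, 7.3%], which is
where the +[4, 7]% entry for 30 groups in the TLDR table comes from
(assuming a is the baseline kernel and b the patched one).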
Max jOPS (30 groups)
--------------------
$ R
> a <- c(61310, 60640, 60515, 59820, 60239, 60140, 60074, 60761, 59099, 59843)
> b <- c(60338, 60515, 60338, 58305, 59660, 62372, 59820, 61499, 60338, 60338)
> t.test(a, b)
Welch Two Sample t-test
data: a and b
t = -0.27732, df = 14.231, p-value = 0.7855
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-943.7491 727.3491
sample estimates:
mean of x mean of y
60244.1 60352.3
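Here p = 0.7855 and the confidence interval straddles zero, so the
difference in Max jOPS at 30 groups is not statistically significant,
matching the NS entry in the TLDR table.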
References
==========
[1] https://www.spec.org/jbb2015/docs/userguide.pdf
[2] https://www.tiobe.com/tiobe-index/
[3] https://cloud.google.com/blog/products/gcp/introducing-zaius-google-and-rackspaces-open-server-running-ibm-power9