A memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).

Its goal is to improve the scalability of global reclaim, which is
critical to system-wide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.

Its memory overhead is one pointer per LRU vector, which is negligible
per node. In terms of traversing memcgs during global reclaim, it
improves the best-case complexity from O(n) to O(1) and does not affect
the worst-case complexity O(n). Therefore, on average, it has sublinear
complexity, in contrast to the current linear complexity.

The basic structure of a memcg LRU can be understood by an analogy to
the active/inactive LRU (of folios):
1. It has the young and the old (generations);
2. Its linked lists have the head and the tail;
3. The increment of max_seq triggers promotion;
4. Other events, e.g., offlining a memcg, trigger similar operations.

(An illustrative sketch of this structure is appended after the test
results below.)

In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
   the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out and
   reduces latency without affecting fairness over some time.

The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221201223923.873696-7-yuzhao@google.com/

The following is a simple test to quickly verify its effectiveness;
more benchmarks are coming soon.

Test design:
1. Create multiple memcgs.
2. Each memcg contains a job (fio).
3. All jobs access the same amount of memory randomly.
4. The system does not experience global memory pressure.
5. Periodically write to the root memory.reclaim.

Desired outcome:
1. All memcgs have similar pgsteal, i.e.,
   stddev(pgsteal)/mean(pgsteal) is close to 0%.
2. The total pgsteal is close to the total requested through
   memory.reclaim, i.e., sum(pgsteal)/sum(requested) is close to 100%.

Actual outcome [1]:
           stddev(pgsteal)/mean(pgsteal)  sum(pgsteal)/sum(requested)
MGLRU off  75%                            425%
MGLRU on   20%                            95%

####################################################################
MEMCGS=128

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    mkdir /sys/fs/cgroup/memcg$memcg
done

start() {
    echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

    fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
        --filename=/dev/zero --size=1920M --rw=randrw \
        --rate=64m,64m --random_distribution=random \
        --fadvise_hint=0 --time_based --runtime=10h \
        --group_reporting --minimal
}

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    start &
done

sleep 600

for ((i = 0; i < 600; i++)); do
    echo 256m >/sys/fs/cgroup/memory.reclaim
    sleep 6
done

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
done
####################################################################

[1]: This was obtained from running the above script (which touches
     less than 256GB of memory) on an EPYC 7B13 with 512GB of DRAM for
     over an hour.
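For illustration only, below is a minimal user-space sketch of the
structure described above, under the simplifying assumption of exactly
two generations of memcgs per node. All identifiers here are
hypothetical and do not match the kernel's actual declarations (see
include/linux/mmzone.h in patch 6 for those); the sketch only models
the young/old split and the role of the sequence counter, not sharding
or the fairness heuristics.

    #include <stdio.h>
    #include <stdlib.h>

    /* One entry per memcg; the kernel links memcgs per node instead. */
    struct memcg_entry {
            int id;
            struct memcg_entry *next;
    };

    /* Two generations of memcgs for one node. */
    struct memcg_lru {
            unsigned long seq;              /* gen[seq % 2] is young, the other is old */
            struct memcg_entry *gen[2];
    };

    /* A newly onlined memcg starts in the young generation. */
    static void memcg_lru_add(struct memcg_lru *lru, int id)
    {
            struct memcg_entry *entry = malloc(sizeof(*entry));

            if (!entry)
                    exit(1);
            entry->id = id;
            entry->next = lru->gen[lru->seq % 2];
            lru->gen[lru->seq % 2] = entry;
    }

    /*
     * Global reclaim picks memcgs from the old generation. When the old
     * generation is empty, incrementing seq turns the young generation
     * into the new old one, so every memcg is eventually visited.
     */
    static struct memcg_entry *memcg_lru_pop_old(struct memcg_lru *lru)
    {
            struct memcg_entry *entry = lru->gen[(lru->seq + 1) % 2];

            if (!entry) {
                    lru->seq++;
                    entry = lru->gen[(lru->seq + 1) % 2];
                    if (!entry)
                            return NULL;
            }
            lru->gen[(lru->seq + 1) % 2] = entry->next;
            return entry;
    }

    int main(void)
    {
            struct memcg_lru lru = { 0 };
            struct memcg_entry *entry;

            for (int id = 0; id < 4; id++)
                    memcg_lru_add(&lru, id);

            while ((entry = memcg_lru_pop_old(&lru))) {
                    printf("reclaim memcg %d\n", entry->id);
                    free(entry);
            }
            return 0;
    }

In the real series, each thread additionally starts at a random memcg
in the old generation (sharding), and memcgs whose lruvec max_seq was
incremented are promoted back to the young generation rather than
reclaimed again.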
Yu Zhao (8):
  mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
  mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
  mm: multi-gen LRU: remove eviction fairness safeguard
  mm: multi-gen LRU: remove aging fairness safeguard
  mm: multi-gen LRU: shuffle should_run_aging()
  mm: multi-gen LRU: per-node lru_gen_folio lists
  mm: multi-gen LRU: clarify scan_control flags
  mm: multi-gen LRU: simplify arch_has_hw_pte_young() check

 Documentation/mm/multigen_lru.rst |   8 +-
 include/linux/memcontrol.h        |  10 +
 include/linux/mm_inline.h         |  25 +-
 include/linux/mmzone.h            | 127 ++++-
 mm/memcontrol.c                   |  16 +
 mm/page_alloc.c                   |   1 +
 mm/vmscan.c                       | 765 ++++++++++++++++++++----------
 mm/workingset.c                   |   4 +-
 8 files changed, 687 insertions(+), 269 deletions(-)

--
2.39.0.rc0.267.gcb52ba06e7-goog
[This is a resend. The original message was lost:
https://lore.kernel.org/r/20221219193613.998597-1-yuzhao@google.com/]

TLDR
====
SPECjbb2015 groups [1]  Critical jOPS (95% CI)  Max jOPS (95% CI)
---------------------------------------------------------------------
20                      NS                      NS
30                      +[4, 7]%                NS
40                      OOM killed              OOM killed

Abbreviations
=============
CI:  confidence interval
NS:  no statistically significant difference
DUT: device under test
ATE: automatic test equipment

Rationale
=========
1. Java has been the most popular programming language for most of the
   last two decades, according to the TIOBE Programming Community
   Index [2].
2. Power ISA is the longest-lasting alternative to x86 for the server
   segment [3].
3. SPECjbb2015 is the industry-standard benchmark for Java.

Hardware
========
DUT $ lscpu
Architecture:            ppc64le
  Byte Order:            Little Endian
CPU(s):                  184
  On-line CPU(s) list:   0-183
Model name:              POWER9 (raw), altivec supported
  Model:                 2.2 (pvr 004e 1202)
  Thread(s) per core:    4
  Core(s) per socket:    23
  Socket(s):             2
  CPU max MHz:           3000.0000
  CPU min MHz:           2300.0000
Caches (sum of all):
  L1d:                   1.4 MiB (46 instances)
  L1i:                   1.4 MiB (46 instances)
  L2:                    12 MiB (24 instances)
  L3:                    240 MiB (24 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-91
  NUMA node1 CPU(s):     92-183
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Mitigation; RFI Flush, L1D private per thread
  Mds:                   Not affected
  Meltdown:              Mitigation; RFI Flush, L1D private per thread
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Kernel entry/exit barrier (eieio)
  Spectre v1:            Mitigation; __user pointer sanitization, ori31
                         speculation barrier enabled
  Spectre v2:            Mitigation; Indirect branch serialisation
                         (kernel only), Indirect branch cache disabled,
                         Software link stack flush
  Srbds:                 Not affected
  Tsx async abort:       Not affected

DUT $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
node 0 size: 261659 MB
node 0 free: 259051 MB
node 1 cpus: 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183
node 1 size: 261713 MB
node 1 free: 257499 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

DUT $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB
DUT $ cat /sys/class/nvme/nvme0/numa_node
0
DUT $ cat /sys/class/nvme/nvme1/model
INTEL SSDPF21Q800GB
DUT $ cat /sys/class/nvme/nvme1/numa_node
1

Software
========
DUT $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"

DUT $ uname -a
Linux ppc 6.1.0-rc8-mglru #1 SMP Tue Dec 6 06:18:48 UTC 2022 ppc64le ppc64le ppc64le GNU/Linux

DUT $ cat /proc/swaps
Filename        Type            Size            Used    Priority
/dev/nvme0n1    partition       268435392       0       -2
/dev/nvme1n1    partition       268435392       0       -3

DUT $ java --version
openjdk 11.0.16 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu122.04, mixed mode)

DUT $ cat specjbb2015/version.txt
SPECjbb2015 1.03 (11/14/2019)

Procedure
=========
DUT $ cat run_specjbb2015.sh
echo 0 >/proc/sys/kernel/numa_balancing

nodes=2
memcgs=$1

run() {
    memcg=$1
    path=/sys/fs/cgroup/memcg$memcg

    mkdir $path
    echo $BASHPID >$path/cgroup.procs

    for ((node = 0; node < $nodes; node++)); do
        group=$((nodes * memcg + node))

        numactl -N $node -m $node java -jar specjbb2015.jar \
            -m backend -G GRP$group -J JVM0 &

        numactl -N $node -m $node java -jar specjbb2015.jar \
            -m txinjector -G GRP$group -J JVM1 &
    done

    wait
}

numactl -N 0 -m 0 java -Dspecjbb.group.count=$((nodes * memcgs)) \
    -Dspecjbb.controller.rtcurve.warmup.step=0.8 \
    -jar specjbb2015.jar -m multicontroller &

for ((memcg = 0; memcg < $memcgs; memcg++)); do
    run $memcg &
done

wait

Results
=======
Critical jOPS (30 groups)
-------------------------
$ R
> a <- c(33786, 34903, 34254, 34608, 33149, 34530, 33867, 33691, 33284, 34490)
> b <- c(35192, 36691, 35771, 36399, 36321, 35177, 35792, 36145, 36594, 36207)
> t.test(a, b)

        Welch Two Sample t-test

data:  a and b
t = -7.8327, df = 17.828, p-value = 3.529e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2502.195 -1443.205
sample estimates:
mean of x mean of y
  34056.2   36028.9

Max jOPS (30 groups)
--------------------
$ R
> a <- c(61310, 60640, 60515, 59820, 60239, 60140, 60074, 60761, 59099, 59843)
> b <- c(60338, 60515, 60338, 58305, 59660, 62372, 59820, 61499, 60338, 60338)
> t.test(a, b)

        Welch Two Sample t-test

data:  a and b
t = -0.27732, df = 14.231, p-value = 0.7855
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -943.7491  727.3491
sample estimates:
mean of x mean of y
  60244.1   60352.3

References
==========
[1] https://www.spec.org/jbb2015/docs/userguide.pdf
[2] https://www.tiobe.com/tiobe-index/
[3] https://cloud.google.com/blog/products/gcp/introducing-zaius-google-and-rackspaces-open-server-running-ibm-power9
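A note on how the TLDR table relates to the t.test output above:
assuming (my reading, not stated explicitly above) that "+[4, 7]%" is
the 95% confidence interval of the Critical jOPS difference expressed
relative to the baseline mean, the figure can be reproduced with a
trivial calculation:

    #include <stdio.h>

    int main(void)
    {
            /* Values copied from the Critical jOPS t.test output above. */
            double baseline_mean = 34056.2;  /* mean of x (MGLRU off) */
            double ci_low = 1443.205;        /* smaller |bound| of the CI */
            double ci_high = 2502.195;       /* larger |bound| of the CI */

            /* Prints roughly "+[4.2, 7.3]%", i.e. the "+[4, 7]%" in the TLDR. */
            printf("+[%.1f, %.1f]%%\n", 100.0 * ci_low / baseline_mean,
                   100.0 * ci_high / baseline_mean);
            return 0;
    }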