[PATCH mm-unstable v1 0/8] mm: multi-gen LRU: memcg LRU
Posted by Yu Zhao 2 years, 9 months ago
A memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).

Its goal is to improve the scalability of global reclaim, which is
critical to systemwide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.

Its memory overhead is one pointer added to each lruvec and a
negligible amount added to each node. In terms of traversing memcgs
during global reclaim, it improves the best-case complexity from O(n)
to O(1) and does not affect the worst-case complexity O(n). Therefore,
on average, it has sublinear complexity, in contrast to the current
linear complexity.

The basic structure of a memcg LRU can be understood by analogy to
the active/inactive LRU (of folios):
1. It has the young and the old (generations);
2. Its linked lists have the head and the tail;
3. The increment of max_seq triggers promotion;
4. Other events, e.g., offlining a memcg, trigger similar
   operations.

In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
   the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out,
   reducing latency without compromising fairness over time.

The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221201223923.873696-7-yuzhao@google.com/

The following is a simple test to quickly verify its effectiveness.
More benchmarks are coming soon.

  Test design:
  1. Create multiple memcgs.
  2. Each memcg contains a job (fio).
  3. All jobs access the same amount of memory randomly.
  4. The system does not experience global memory pressure.
  5. Periodically write to the root memory.reclaim.

  Desired outcome:
  1. All memcgs have similar pgsteal, i.e.,
     stddev(pgsteal)/mean(pgsteal) is close to 0%.
  2. The total pgsteal is close to the total requested through
     memory.reclaim, i.e., sum(pgsteal)/sum(requested) is close to
     100%.

  Actual outcome [1]:
             stddev(pgsteal)/mean(pgsteal) sum(pgsteal)/sum(requested)
  MGLRU off  75%                           425%
  MGLRU on   20%                           95%

  ####################################################################
  MEMCGS=128

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      mkdir /sys/fs/cgroup/memcg$memcg
  done

  start() {
      echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

      fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
          --filename=/dev/zero --size=1920M --rw=randrw \
          --rate=64m,64m --random_distribution=random \
          --fadvise_hint=0 --time_based --runtime=10h \
          --group_reporting --minimal
  }

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      start &
  done

  sleep 600

  for ((i = 0; i < 600; i++)); do
      echo 256m >/sys/fs/cgroup/memory.reclaim
      sleep 6
  done

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
  done
  ####################################################################
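
The two ratios above can be derived from the collected counters with
a quick awk pass over the same memory.stat output, e.g. (assuming
4 KiB pages, so the requested total of 600 x 256m corresponds to
600 * 256 * 256 pages):

  # requested pages = 600 iterations x 256 MiB, in 4 KiB pages (assumed)
  grep "pgsteal " /sys/fs/cgroup/memcg*/memory.stat |
  awk -v requested=$((600 * 256 * 256)) '
  {
      sum += $2; sumsq += $2 * $2; n++
  }
  END {
      mean = sum / n
      stddev = sqrt(sumsq / n - mean * mean)
      printf "stddev(pgsteal)/mean(pgsteal): %.0f%%\n", 100 * stddev / mean
      printf "sum(pgsteal)/sum(requested):   %.0f%%\n", 100 * sum / requested
  }'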

[1]: This was obtained from running the above script (touches less
     than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
     hour.

Yu Zhao (8):
  mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
  mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
  mm: multi-gen LRU: remove eviction fairness safeguard
  mm: multi-gen LRU: remove aging fairness safeguard
  mm: multi-gen LRU: shuffle should_run_aging()
  mm: multi-gen LRU: per-node lru_gen_folio lists
  mm: multi-gen LRU: clarify scan_control flags
  mm: multi-gen LRU: simplify arch_has_hw_pte_young() check

 Documentation/mm/multigen_lru.rst |   8 +-
 include/linux/memcontrol.h        |  10 +
 include/linux/mm_inline.h         |  25 +-
 include/linux/mmzone.h            | 127 ++++-
 mm/memcontrol.c                   |  16 +
 mm/page_alloc.c                   |   1 +
 mm/vmscan.c                       | 765 ++++++++++++++++++++----------
 mm/workingset.c                   |   4 +-
 8 files changed, 687 insertions(+), 269 deletions(-)

-- 
2.39.0.rc0.267.gcb52ba06e7-goog
Java / POWER9 benchmark with MGLRU
Posted by Yu Zhao 2 years, 9 months ago
[This is a resend. The original message was lost:
https://lore.kernel.org/r/20221219193613.998597-1-yuzhao@google.com/]

TLDR
====
SPECjbb2015 groups [1]    Critical jOPS (95% CI)    Max jOPS (95% CI)
---------------------------------------------------------------------
20                        NS                        NS
30                        +[4, 7]%                  NS
40                        OOM killed                OOM killed

Abbreviations
=============
CI:   confidence interval
NS:   no statistically significant difference
DUT:  device under test
ATE:  automatic test equipment

Rationale
========
1. Java has been the most popular programming language for most of
   the last two decades, according to the TIOBE Programming Community
   Index [2].
2. Power ISA is the longest-lasting alternative to x86 for the server
   segment [3].
3. SPECjbb2015 is the industry-standard benchmark for Java.

Hardware
========
DUT $ lscpu
Architecture:          ppc64le
  Byte Order:          Little Endian
CPU(s):                184
  On-line CPU(s) list: 0-183
Model name:            POWER9 (raw), altivec supported
  Model:               2.2 (pvr 004e 1202)
  Thread(s) per core:  4
  Core(s) per socket:  23
  Socket(s):           2
  CPU max MHz:         3000.0000
  CPU min MHz:         2300.0000
Caches (sum of all):
  L1d:                 1.4 MiB (46 instances)
  L1i:                 1.4 MiB (46 instances)
  L2:                  12 MiB (24 instances)
  L3:                  240 MiB (24 instances)
NUMA:
  NUMA node(s):        2
  NUMA node0 CPU(s):   0-91
  NUMA node1 CPU(s):   92-183
Vulnerabilities:
  Itlb multihit:       Not affected
  L1tf:                Mitigation; RFI Flush, L1D private per thread
  Mds:                 Not affected
  Meltdown:            Mitigation; RFI Flush, L1D private per thread
  Mmio stale data:     Not affected
  Retbleed:            Not affected
  Spec store bypass:   Mitigation; Kernel entry/exit barrier (eieio)
  Spectre v1:          Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
  Spectre v2:          Mitigation; Indirect branch serialisation (kernel only), Indirect branch cache disabled, Software link stack flush
  Srbds:               Not affected
  Tsx async abort:     Not affected

DUT $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
node 0 size: 261659 MB
node 0 free: 259051 MB
node 1 cpus: 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183
node 1 size: 261713 MB
node 1 free: 257499 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

DUT $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

DUT $ cat /sys/class/nvme/nvme0/numa_node
0

DUT $ cat /sys/class/nvme/nvme1/model
INTEL SSDPF21Q800GB

DUT $ cat /sys/class/nvme/nvme1/numa_node
1

Software
========
DUT $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"

DUT $ uname -a
Linux ppc 6.1.0-rc8-mglru #1 SMP Tue Dec  6 06:18:48 UTC 2022 ppc64le ppc64le ppc64le GNU/Linux

DUT $ cat /proc/swaps
Filename        Type         Size         Used  Priority
/dev/nvme0n1    partition    268435392    0     -2
/dev/nvme1n1    partition    268435392    0     -3

DUT $ java --version
openjdk 11.0.16 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu122.04, mixed mode)

DUT $ cat specjbb2015/version.txt
SPECjbb2015 1.03 (11/14/2019)

Procedure
=========
DUT $ cat run_specjbb2015.sh
echo 0 >/proc/sys/kernel/numa_balancing

nodes=2
memcgs=$1

run() {
    memcg=$1
    path=/sys/fs/cgroup/memcg$memcg

    mkdir $path
    echo $BASHPID >$path/cgroup.procs

    for ((node = 0; node < $nodes; node++)); do
        group=$((nodes * memcg + node))

        numactl -N $node -m $node java -jar specjbb2015.jar \
            -m backend -G GRP$group -J JVM0 &
        numactl -N $node -m $node java -jar specjbb2015.jar \
            -m txinjector -G GRP$group -J JVM1 &
    done

    wait
}

numactl -N 0 -m 0 java -Dspecjbb.group.count=$((nodes * memcgs)) \
        -Dspecjbb.controller.rtcurve.warmup.step=0.8 \
        -jar specjbb2015.jar -m multicontroller &

for ((memcg = 0; memcg < $memcgs; memcg++)); do
    run $memcg &
done

wait
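
The script takes the number of memcgs as its only argument; with
nodes=2, each memcg contributes two SPECjbb groups, so the group
counts in the TLDR presumably correspond to invocations like:

DUT $ bash run_specjbb2015.sh 10    # e.g. 2 nodes x 10 memcgs = 20 groups
DUT $ bash run_specjbb2015.sh 15    # e.g. 2 nodes x 15 memcgs = 30 groups
DUT $ bash run_specjbb2015.sh 20    # e.g. 2 nodes x 20 memcgs = 40 groups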

Results
=======
Critical jOPS (30 groups)
-------------------------
$ R
> a <- c(33786, 34903, 34254, 34608, 33149, 34530, 33867, 33691, 33284, 34490)
> b <- c(35192, 36691, 35771, 36399, 36321, 35177, 35792, 36145, 36594, 36207)
> t.test(a, b)

        Welch Two Sample t-test

data:  a and b
t = -7.8327, df = 17.828, p-value = 3.529e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2502.195 -1443.205
sample estimates:
mean of x mean of y
  34056.2   36028.9
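
The +[4, 7]% entry in the TLDR is presumably the confidence interval
above expressed relative to the baseline mean (mean of x); as a quick
sanity check:

$ awk 'BEGIN { m = 34056.2;   # 95% CI of the difference, relative to mean(a)
      printf "+[%.1f, %.1f]%%\n", 1443.205 / m * 100, 2502.195 / m * 100 }'
+[4.2, 7.3]%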

Max jOPS (30 groups)
--------------------
$ R
> a <- c(61310, 60640, 60515, 59820, 60239, 60140, 60074, 60761, 59099, 59843)
> b <- c(60338, 60515, 60338, 58305, 59660, 62372, 59820, 61499, 60338, 60338)
> t.test(a, b)

        Welch Two Sample t-test

data:  a and b
t = -0.27732, df = 14.231, p-value = 0.7855
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -943.7491  727.3491
sample estimates:
mean of x mean of y
  60244.1   60352.3

References
==========
[1] https://www.spec.org/jbb2015/docs/userguide.pdf
[2] https://www.tiobe.com/tiobe-index/
[3] https://cloud.google.com/blog/products/gcp/introducing-zaius-google-and-rackspaces-open-server-running-ibm-power9