TLDR
====
This patchset adds a fast path to clear the accessed bit without
taking kvm->mmu_lock. It can significantly improve the performance of
guests when the host is under heavy memory pressure.
ChromeOS has been using a similar approach [1] since mid-2021, and it
has proven successful on tens of millions of devices.
This v2 addresses previous requests [2] by refactoring code, removing
inaccurate/redundant text, etc.
[1] https://crrev.com/c/2987928
[2] https://lore.kernel.org/r/20230217041230.2417228-1-yuzhao@google.com/
Overview
========
The goal of this patchset is to optimize the performance of guests
when the host memory is overcommitted. It focuses on a simple yet
common case where hardware sets the accessed bit in KVM PTEs and VMs
are not nested. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.
The fast path relies on two techniques to safely clear the accessed
bit: RCU and CAS. The former protects KVM page tables from being
freed while the latter clears the accessed bit atomically against
both the hardware and other software page table walkers.
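As a rough illustration of the CAS half, here is a minimal userspace
sketch (not kernel code; the PTE layout, bit position and helper name
are made up): clearing the accessed bit is a compare-and-swap loop that
only succeeds if neither the hardware nor another walker changed the
entry in the meantime. In the kernel, the equivalent loop runs under
rcu_read_lock() so the page table cannot be freed underneath it.

/* cc -O2 -o cas_demo cas_demo.c && ./cas_demo */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_ACCESSED (1ULL << 8) /* illustrative bit position only */

/* Atomically clear the accessed bit; return true if it was set. */
static bool test_clear_accessed(uint64_t *pte)
{
	uint64_t old = __atomic_load_n(pte, __ATOMIC_RELAXED);

	do {
		if (!(old & PTE_ACCESSED))
			return false; /* already clear, nothing to do */
	} while (!__atomic_compare_exchange_n(pte, &old, old & ~PTE_ACCESSED,
					      false, __ATOMIC_RELAXED,
					      __ATOMIC_RELAXED));
	return true;
}

int main(void)
{
	uint64_t pte = 0xabc000 | PTE_ACCESSED;

	printf("young: %d, pte after: %#llx\n",
	       test_clear_accessed(&pte), (unsigned long long)pte);
	return 0;
}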
A new mmu_notifier_ops member, test_clear_young(), supersedes the
existing clear_young() and test_young(). This extended callback can
operate on a range of KVM PTEs individually according to a bitmap, if
the caller provides it.
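To give a feel for the bitmap semantics, here is a standalone sketch
(invented names, with a plain array standing in for KVM PTEs; this is
not the callback signature or exact semantics from the patches): the
caller passes a bitmap covering the range, only the entries whose bits
are set are touched, and the bitmap is updated to report which of those
entries were young.

/* cc -O2 -o bitmap_demo bitmap_demo.c && ./bitmap_demo */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_ACCESSED (1ULL << 8) /* illustrative bit position only */
#define NR_PTES      8

/*
 * For each bit set in @bitmap, test and clear the accessed bit of the
 * corresponding entry (the per-entry clear would be the CAS loop from
 * the previous sketch); clear the bitmap bit for entries that were not
 * young. Returns true if any selected entry was young.
 */
static bool test_clear_young_range(uint64_t *ptes, unsigned long *bitmap)
{
	bool young = false;
	int i;

	for (i = 0; i < NR_PTES; i++) {
		if (!(*bitmap & (1UL << i)))
			continue; /* the caller skipped this entry */

		if (ptes[i] & PTE_ACCESSED) {
			ptes[i] &= ~PTE_ACCESSED;
			young = true;
		} else {
			*bitmap &= ~(1UL << i); /* report "not young" */
		}
	}
	return young;
}

int main(void)
{
	uint64_t ptes[NR_PTES] = { [1] = PTE_ACCESSED, [5] = PTE_ACCESSED };
	unsigned long bitmap = 0xffUL; /* ask about all eight entries */

	printf("any young: %d, bitmap after: %#lx\n",
	       test_clear_young_range(ptes, &bitmap), bitmap);
	return 0;
}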
Evaluation
==========
An existing selftest can quickly demonstrate the effectiveness of
this patchset. On a generic workstation equipped with 128 CPUs and
256GB DRAM:
$ sudo max_guest_memory_test -c 64 -m 250 -s 250
MGLRU        run2
------------------
Before [1]   ~64s
After        ~51s
kswapd (MGLRU before)
100.00% balance_pgdat
100.00% shrink_node
100.00% shrink_one
99.99% try_to_shrink_lruvec
99.71% evict_folios
97.29% shrink_folio_list
==>> 13.05% folio_referenced
12.83% rmap_walk_file
12.31% folio_referenced_one
7.90% __mmu_notifier_clear_young
7.72% kvm_mmu_notifier_clear_young
7.34% _raw_write_lock
kswapd (MGLRU after)
100.00% balance_pgdat
100.00% shrink_node
100.00% shrink_one
99.99% try_to_shrink_lruvec
99.59% evict_folios
80.37% shrink_folio_list
==>> 3.74% folio_referenced
3.59% rmap_walk_file
3.19% folio_referenced_one
2.53% lru_gen_look_around
1.06% __mmu_notifier_test_clear_young
Comprehensive benchmarks are coming soon.
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/
Yu Zhao (10):
mm/kvm: add mmu_notifier_ops->test_clear_young()
mm/kvm: use mmu_notifier_ops->test_clear_young()
kvm/arm64: export stage2_try_set_pte() and macros
kvm/arm64: make stage2 page tables RCU safe
kvm/arm64: add kvm_arch_test_clear_young()
kvm/powerpc: make radix page tables RCU safe
kvm/powerpc: add kvm_arch_test_clear_young()
kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask
kvm/x86: add kvm_arch_test_clear_young()
mm: multi-gen LRU: use mmu_notifier_test_clear_young()
Documentation/admin-guide/mm/multigen_lru.rst | 6 +-
arch/arm64/include/asm/kvm_host.h | 6 +
arch/arm64/include/asm/kvm_pgtable.h | 55 +++++++
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/hyp/pgtable.c | 61 +-------
arch/arm64/kvm/mmu.c | 53 ++++++-
arch/powerpc/include/asm/kvm_host.h | 8 +
arch/powerpc/include/asm/kvm_ppc.h | 1 +
arch/powerpc/kvm/book3s.c | 6 +
arch/powerpc/kvm/book3s.h | 1 +
arch/powerpc/kvm/book3s_64_mmu_radix.c | 65 +++++++-
arch/powerpc/kvm/book3s_hv.c | 5 +
arch/x86/include/asm/kvm_host.h | 13 ++
arch/x86/kvm/mmu.h | 6 -
arch/x86/kvm/mmu/spte.h | 1 -
arch/x86/kvm/mmu/tdp_mmu.c | 34 +++++
include/linux/kvm_host.h | 22 +++
include/linux/mmu_notifier.h | 79 ++++++----
include/linux/mmzone.h | 6 +-
include/trace/events/kvm.h | 15 --
mm/mmu_notifier.c | 48 ++----
mm/rmap.c | 8 +-
mm/vmscan.c | 139 ++++++++++++++++--
virt/kvm/kvm_main.c | 114 ++++++++------
24 files changed, 546 insertions(+), 207 deletions(-)
--
2.41.0.rc0.172.g3f132b7071-goog
On 5/27/23 01:44, Yu Zhao wrote:
> TLDR
> ====
> This patchset adds a fast path to clear the accessed bit without
> taking kvm->mmu_lock. It can significantly improve the performance of
> guests when the host is under heavy memory pressure.
>
> ChromeOS has been using a similar approach [1] since mid 2021 and it
> was proven successful on tens of millions devices.
>
> This v2 addressed previous requests [2] on refactoring code, removing
> inaccurate/redundant texts, etc.
>
> [1] https://crrev.com/c/2987928
> [2] https://lore.kernel.org/r/20230217041230.2417228-1-yuzhao@google.com/

From the KVM point of view the patches look good (though I wouldn't
mind if Nicholas took a look at the ppc part). Jason's comments on the
MMU notifier side are promising as well. Can you send v3 with Oliver's
comments addressed?

Thanks,

Paolo
On Fri, Jun 9, 2023 at 3:08 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 5/27/23 01:44, Yu Zhao wrote:
> > TLDR
> > ====
> > This patchset adds a fast path to clear the accessed bit without
> > taking kvm->mmu_lock. It can significantly improve the performance of
> > guests when the host is under heavy memory pressure.
> >
> > ChromeOS has been using a similar approach [1] since mid 2021 and it
> > was proven successful on tens of millions devices.
> >
> > This v2 addressed previous requests [2] on refactoring code, removing
> > inaccurate/redundant texts, etc.
> >
> > [1] https://crrev.com/c/2987928
> > [2] https://lore.kernel.org/r/20230217041230.2417228-1-yuzhao@google.com/
>
> From the KVM point of view the patches look good (though I wouldn't
> mind if Nicholas took a look at the ppc part). Jason's comments on the
> MMU notifier side are promising as well. Can you send v3 with Oliver's
> comments addressed?

Thanks. I'll address all the comments in v3 and post it asap.

Meanwhile, some updates on the recent progress from my side:

1. I've asked some downstream kernels to pick up v2 for testing; the
   Archlinux Zen kernel did. I don't really expect its enthusiastic
   testers to find this series relevant to their use cases. But who
   knows.

2. I've also asked openbenchmarking.org to run their popular highmem
   benchmark suites with v2. Hopefully they'll have some independent
   results soon.

3. I've backported v2 to v5.15 and v6.1 and started an A/B experiment
   involving ~1 million devices, as I mentioned in another email in
   this thread. I should have some results to share when posting v3.
TLDR
====
Apache Spark spent 12% less time sorting four billion random integers twenty times (in ~4 hours) after this patchset [1].
Hardware
========
HOST $ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
Stepping: r3p1
Frequency boost: disabled
CPU max MHz: 2800.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
L1d: 8 MiB (128 instances)
L1i: 8 MiB (128 instances)
L2: 128 MiB (128 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Mitigation; CSV2, BHB
Srbds: Not affected
Tsx async abort: Not affected
HOST $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-63
node 0 size: 257730 MB
node 0 free: 1447 MB
node 1 cpus: 64-127
node 1 size: 256877 MB
node 1 free: 256093 MB
node distances:
node 0 1
0: 10 20
1: 20 10
HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB
HOST $ cat /sys/class/nvme/nvme0/numa_node
0
Software
========
HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
HOST $ uname -a
Linux arm 6.4.0-rc4 #1 SMP Sat Jun 3 05:30:06 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
HOST $ cat /proc/swaps
Filename Type Size Used Priority
/dev/nvme0n1p2 partition 466838356 116922112 -2
HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x000b
HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]
HOST $ qemu-system-aarch64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
GUEST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"
GUEST $ java --version
openjdk 17.0.7 2023-04-18
OpenJDK Runtime Environment (build 17.0.7+7-Ubuntu-0ubuntu122.04.2)
OpenJDK 64-Bit Server VM (build 17.0.7+7-Ubuntu-0ubuntu122.04.2, mixed mode, sharing)
GUEST $ spark-shell --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.4.0
/_/
Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 17.0.7
Branch HEAD
Compiled by user xinrong.meng on 2023-04-07T02:18:01Z
Revision 87a5442f7ed96b11051d8a9333476d080054e5a0
Url https://github.com/apache/spark
Type --help for more information.
Procedure
=========
HOST $ sudo numactl -N 0 -m 0 qemu-system-aarch64 \
-M virt,accel=kvm -cpu host -smp 64 -m 300g -nographic -nic user \
-bios /usr/share/qemu-efi-aarch64/QEMU_EFI.fd \
-drive if=virtio,format=raw,file=/dev/nvme0n1p1
GUEST $ cat gen.scala
import java.io._
import scala.collection.mutable.ArrayBuffer
object GenData {
def main(args: Array[String]): Unit = {
val file = new File("/dev/shm/dataset.txt")
val writer = new BufferedWriter(new FileWriter(file))
val buf = ArrayBuffer(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)
for(_ <- 0 until 400000000) {
for (i <- 0 until 10) {
buf.update(i, scala.util.Random.nextLong())
}
writer.write(s"${buf.mkString(",")}\n")
}
writer.close()
}
}
GenData.main(Array())
GUEST $ cat sort.scala
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.SparkSession
object SparkSort {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().getOrCreate()
val file = sc.textFile("/dev/shm/dataset.txt", 64)
val results = file.flatMap(_.split(",")).map(x => (x, 1)).sortByKey().takeOrdered(10)
results.foreach(println)
spark.stop()
}
}
SparkSort.main(Array())
GUEST $ cat run_spark.sh
export SPARK_LOCAL_DIRS=/dev/shm/
spark-shell <gen.scala
start=$SECONDS
for ((i=0; i<20; i++))
do
spark-3.4.0-bin-hadoop3/bin/spark-shell --master "local[64]" --driver-memory 160g <sort.scala
done
echo "wall time: $((SECONDS - start))"
Results
=======
                     Before [1]    After     Change
----------------------------------------------------
Wall time (seconds)  14455         12865     -12%
Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/
On Fri, 09 Jun 2023 01:59:35 +0100,
Yu Zhao <yuzhao@google.com> wrote:
>
> TLDR
> ====
> Apache Spark spent 12% less time sorting four billion random integers twenty times (in ~4 hours) after this patchset [1].

Why are the 3 architectures you have considered being evaluated with 3
different benchmarks? I am not suspecting you to have cherry-picked the
best results, but I'd really like to see a variety of benchmarks that
exercise this stuff differently.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.
On Fri, Jun 9, 2023 at 7:04 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 09 Jun 2023 01:59:35 +0100,
> Yu Zhao <yuzhao@google.com> wrote:
> >
> > TLDR
> > ====
> > Apache Spark spent 12% less time sorting four billion random integers twenty times (in ~4 hours) after this patchset [1].
>
> Why are the 3 architectures you have considered being evaluated with 3
> different benchmarks?
I was hoping people with a special interest in different archs might
try to reproduce the benchmarks that I didn't report (but did cover)
and see what happens.
> I am not suspecting you to have cherry-picked
> the best results
I'm generally very conservative when reporting *synthetic* results.
For example, the same memcached benchmark used on powerpc yielded a >50%
improvement on aarch64, because the default Ubuntu Kconfig uses a 64KB
base page size for powerpc but 4KB for aarch64. (Before the series,
the reclaim (swap) path takes kvm->mmu_lock for *write* O(nr of all
pages considered) times; after the series, that becomes O(actual nr of
pages swapped), which is <10% of the former given how the benchmark was
set up.)
         Ops/sec     Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency
--------------------------------------------------------------------------
Before   639511.40   0.09940       0.04700      0.27100      22.52700
After    974184.60   0.06471       0.04700      0.15900      3.75900
> but I'd really like to see a variety of benchmarks
> that exercise this stuff differently.
I'd be happy to try other synthetic workloads that people think are
relatively representative. Also, I've backported the series and
started an A/B experiment involving ~1 million devices (real-world
workloads). We should have the preliminary results by the time I post
the next version.
TLDR
====
Memcached achieved 10% more operations per second (in ~4 hours) after this patchset [1].
Hardware
========
HOST $ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 184
On-line CPU(s) list: 0-183
Model name: POWER9 (raw), altivec supported
Model: 2.2 (pvr 004e 1202)
Thread(s) per core: 4
Core(s) per socket: 23
Socket(s): 2
CPU max MHz: 3000.0000
CPU min MHz: 2300.0000
Caches (sum of all):
L1d: 1.4 MiB (46 instances)
L1i: 1.4 MiB (46 instances)
L2: 12 MiB (24 instances)
L3: 240 MiB (24 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-91
NUMA node1 CPU(s): 92-183
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Mitigation; RFI Flush, L1D private per thread
Mds: Not affected
Meltdown: Mitigation; RFI Flush, L1D private per thread
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Spectre v1: Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
Spectre v2: Mitigation; Indirect branch serialisation (kernel only), Indirect branch cache disabled, Software link stack flush
Srbds: Not affected
Tsx async abort: Not affected
HOST $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-91
node 0 size: 261659 MB
node 0 free: 259152 MB
node 1 cpus: 92-183
node 1 size: 261713 MB
node 1 free: 261076 MB
node distances:
node 0 1
0: 10 40
1: 40 10
HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB
HOST $ cat /sys/class/nvme/nvme0/numa_node
0
Software
========
HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"
HOST $ uname -a
Linux ppc 6.3.0 #1 SMP Sun Jun 4 18:26:37 UTC 2023 ppc64le ppc64le ppc64le GNU/Linux
HOST $ cat /proc/swaps
Filename Type Size Used Priority
/dev/nvme0n1p2 partition 466838272 0 -2
HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x0009
HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]
HOST $ qemu-system-ppc64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
GUEST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
GUEST $ cat /etc/memcached.conf
...
-t 92
-m 262144
-B binary
-s /var/run/memcached/memcached.sock
-a 0766
GUEST $ memtier_benchmark -v
memtier_benchmark 1.4.0
Copyright (C) 2011-2022 Redis Ltd.
This is free software. You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.
Procedure
=========
HOST $ sudo numactl -N 0 -m 0 qemu-system-ppc64 \
-M pseries,accel=kvm,kvm-type=HV -cpu host -smp 92 -m 270g \
-nographic -nic user \
-drive if=virtio,format=raw,file=/dev/nvme0n1p1
GUEST $ memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -c 1 -t 92 --pipeline 1 --ratio 1:0 \
--key-minimum=1 --key-maximum=120000000 --key-pattern=P:P \
-n allkeys -d 2000
GUEST $ memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -c 1 -t 92 --pipeline 1 --ratio 0:1 \
--key-minimum=1 --key-maximum=120000000 --key-pattern=R:R \
-n allkeys --randomize --distinct-client-seed
Results
=======
               Before [1]   After       Change
-------------------------------------------------
Ops/sec        721586.10    800210.12   +10%
Avg. Latency   0.12546      0.11260     -10%
p50 Latency    0.08700      0.08700     N/C
p99 Latency    0.28700      0.24700     -13%
Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/
TLDR
====
Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours) after this patchset [1].
Hardware
========
HOST $ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper PRO 3995WX 64-Cores
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
Stepping: 0
Frequency boost: disabled
CPU max MHz: 4308.3979
CPU min MHz: 2200.0000
BogoMIPS: 5390.20
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2
...
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 2 MiB (64 instances)
L1i: 2 MiB (64 instances)
L2: 32 MiB (64 instances)
L3: 256 MiB (16 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-127
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
HOST $ numactl -H
available: 1 nodes (0)
node 0 cpus: 0-127
node 0 size: 257542 MB
node 0 free: 224855 MB
node distances:
node 0
0: 10
HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB
HOST $ cat /sys/class/nvme/nvme0/numa_node
0
Software
========
HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
HOST $ uname -a
Linux x86 6.4.0-rc5+ #1 SMP PREEMPT_DYNAMIC Wed Jun 7 22:17:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
HOST $ cat /proc/swaps
Filename Type Size Used Priority
/dev/nvme0n1p2 partition 466838356 0 -2
HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x000f
HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]
Procedure
=========
HOST $ git clone https://github.com/google/multichase
HOST $ <Build multichase>
HOST $ <Unpack /boot/initrd.img into ./initrd/>
HOST $ cp multichase/multichase ./initrd/bin/
HOST $ sed -i \
"/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \
./initrd/init
HOST $ <Pack ./initrd/ into ./initrd.img>
HOST $ cat run_microvms.sh
memcgs=64
run() {
path=/sys/fs/cgroup/memcg$1
mkdir $path
echo $BASHPID >$path/cgroup.procs
qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
-nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
-append "console=ttyS0 loglevel=0"
}
for ((memcg = 0; memcg < $memcgs; memcg++)); do
run $memcg &
done
wait
Results
=======
                Before [1]   After   Change
----------------------------------------------
Total samples   6824         7237    +6%
Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/
On Thu, Jun 8, 2023 at 6:59 PM Yu Zhao <yuzhao@google.com> wrote:
>
> TLDR
> ====
> Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours) after this patchset [1].
>
> Hardware
> ========
> HOST $ lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Address sizes: 43 bits physical, 48 bits virtual
> Byte Order: Little Endian
> CPU(s): 128
> On-line CPU(s) list: 0-127
> Vendor ID: AuthenticAMD
> Model name: AMD Ryzen Threadripper PRO 3995WX 64-Cores
> CPU family: 23
> Model: 49
> Thread(s) per core: 2
> Core(s) per socket: 64
> Socket(s): 1
> Stepping: 0
> Frequency boost: disabled
> CPU max MHz: 4308.3979
> CPU min MHz: 2200.0000
> BogoMIPS: 5390.20
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2
> ...
> Virtualization features:
> Virtualization: AMD-V
> Caches (sum of all):
> L1d: 2 MiB (64 instances)
> L1i: 2 MiB (64 instances)
> L2: 32 MiB (64 instances)
> L3: 256 MiB (16 instances)
> NUMA:
> NUMA node(s): 1
> NUMA node0 CPU(s): 0-127
> Vulnerabilities:
> Itlb multihit: Not affected
> L1tf: Not affected
> Mds: Not affected
> Meltdown: Not affected
> Mmio stale data: Not affected
> Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
> Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
> Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
> Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
> Srbds: Not affected
> Tsx async abort: Not affected
>
> HOST $ numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0-127
> node 0 size: 257542 MB
> node 0 free: 224855 MB
> node distances:
> node 0
> 0: 10
>
> HOST $ cat /sys/class/nvme/nvme0/model
> INTEL SSDPF21Q800GB
>
> HOST $ cat /sys/class/nvme/nvme0/numa_node
> 0
>
> Software
> ========
> HOST $ cat /etc/lsb-release
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=22.04
> DISTRIB_CODENAME=jammy
> DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
>
> HOST $ uname -a
> Linux x86 6.4.0-rc5+ #1 SMP PREEMPT_DYNAMIC Wed Jun 7 22:17:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>
> HOST $ cat /proc/swaps
> Filename Type Size Used Priority
> /dev/nvme0n1p2 partition 466838356 0 -2
>
> HOST $ cat /sys/kernel/mm/lru_gen/enabled
> 0x000f
>
> HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
> always defer defer+madvise madvise [never]
>
> Procedure
> =========
> HOST $ git clone https://github.com/google/multichase
>
> HOST $ <Build multichase>
> HOST $ <Unpack /boot/initrd.img into ./initrd/>
>
> HOST $ cp multichase/multichase ./initrd/bin/
> HOST $ sed -i \
> "/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \
I was reminded that I missed one parameter above, i.e.,
"/^maybe_break top$/i multichase -N -t 2 -m 4g -n 28800; poweroff" \
^^
> ./initrd/init
>
> HOST $ <Pack ./initrd/ into ./initrd.img>
>
> HOST $ cat run_microvms.sh
> memcgs=64
>
> run() {
> path=/sys/fs/cgroup/memcg$1
>
> mkdir $path
> echo $BASHPID >$path/cgroup.procs
And one line here:
echo 4000m >$path/memory.min # or the largest size that doesn't cause OOM kills
> qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
> -nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
> -append "console=ttyS0 loglevel=0"
> }
>
> for ((memcg = 0; memcg < $memcgs; memcg++)); do
> run $memcg &
> done
>
> wait
>
> Results
> =======
> Before [1] After Change
> ----------------------------------------------
> Total samples 6824 7237 +6%
>
> Notes
> =====
> [1] "mm: rmap: Don't flush TLB after checking PTE young for page
> reference" was included so that the comparison is apples to
> apples.
> https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/