[v14] Multi-Gen LRU Framework

[PATCH v14 00/14] Multi-Gen LRU Framework

Posted by Yu Zhao 3 years, 7 months ago

What's new
==========
Retested on v6.0-rc1; rebased to the latest mm-unstable.

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/r/20220815071332.627393-15-yuzhao@google.com/

01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.

03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
    its sole caller"
Minor refactors to improve readability for the following patches.

05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages to
and remove pages from the multi-gen LRU (MGLRU) lists.

06. mm: multi-gen LRU: minimal implementation
A minimal implementation without optimizations.

07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.

08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.

09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.

10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.

11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.

13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
      Apache Cassandra      Memcached
      Apache Hadoop         MongoDB
      Apache Spark          PostgreSQL
      MariaDB (MySQL)       Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
   less wall time to sort three billion random integers, respectively,
   under the medium- and the high-concurrency conditions, when
   overcommitting memory. There were no statistically significant
   changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
   more transactions per minute (TPM), respectively, under the medium-
   and the high-concurrency conditions, when overcommitting memory.
   There were no statistically significant changes in TPM for the rest
   of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
   and [21.59, 30.02]% more operations per second (OPS), respectively,
   for sequential access, random access and Gaussian (distribution)
   access, when THP=always; 95% CIs [13.85, 15.97]% and
   [23.94, 29.92]% more OPS, respectively, for random access and
   Gaussian access, when THP=never. There were no statistically
   significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
   [2.16, 3.55]% more operations per second (OPS), respectively, for
   exponential (distribution) access, random access and Zipfian
   (distribution) access, when underutilizing memory; 95% CIs
   [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
   respectively, for exponential access, random access and Zipfian
   access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
   and [4.11, 7.50]% more operations per second (OPS), respectively,
   for exponential (distribution) access, random access and Zipfian
   (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
   [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
   exponential access, random access and Zipfian access, when swap was
   on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
   less average wall time to finish twelve parallel TeraSort jobs,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in average wall time for the rest of the
   benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
   minute (TPM) under the high-concurrency condition, when swap was
   off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
   [11.47, 19.36]% more total operations per second (OPS),
   respectively, for sequential access, random access and Gaussian
   (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
   [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
   for sequential access, random access and Gaussian access, when
   THP=never.

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10].
      fs_fio_bench_hdd_mq      pft
      fs_lmbench               pgsql-hammerdb
      fs_parallelio            redis
      fs_postmark              stream
      hackbench                sysbenchthread
      kernbench                tpcc_spark
      memcached                unixbench
      multichase               vm-scalability
      mutilate                 will-it-scale
      nginx

[01] https://trends.google.com
[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/

Read-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
   I have Archlinux with 8G RAM + zswap + swap. While developing, I
   have lots of apps opened such as multiple LSP-servers for different
   langs, chats, two browsers, etc... Usually, my system gets quickly
   to a point of SWAP-storms, where I have to kill LSP-servers,
   restart browsers to free memory, etc, otherwise the system lags
   heavily and is barely usable.
   
   1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
   patchset, and I started up by opening lots of apps to create memory
   pressure, and worked for a day like this. Till now I had not a
   single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
   getting to the point of 3G in SWAP before without a single
   SWAP-storm.

Vaibhav from IBM reported [12]:
   In a synthetic MongoDB Benchmark, seeing an average of ~19%
   throughput improvement on POWER10(Radix MMU + 64K Page Size) with
   MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
   three different request distributions, namely, Exponential, Uniform
   and Zipfan.

Shuang from U of Rochester reported [13]:
   With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
   and [9.26, 10.36]% higher throughput, respectively, for random
   access, Zipfian (distribution) access and Gaussian (distribution)
   access, when the average number of jobs per CPU is 1; 95% CIs
   [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
   throughput, respectively, for random access, Zipfian access and
   Gaussian access, when the average number of jobs per CPU is 2.

Daniel from Michigan Tech reported [14]:
   With Memcached allocating ~100GB of byte-addressable Optante,
   performance improvement in terms of throughput (measured as queries
   per second) was about 10% for a series of workloads.

Large-scale deployments
-----------------------
We've rolled out MGLRU to tens of millions of Chrome OS users and
about a million Android users. Google's fleetwide profiling [15] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
app launch time at the 50th percentile.

The downstream kernels that have been using MGLRU include:
1. Android [16]
2. Arch Linux Zen [17]
3. Armbian [18]
4. Chrome OS [19]
5. Liquorix [20]
6. post-factum [21]
7. XanMod [22]

[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://dl.acm.org/doi/10.1145/2749469.2750392
[16] https://android.com
[17] https://archlinux.org
[18] https://armbian.com
[19] https://chromium.org
[20] https://liquorix.net
[21] https://codeberg.org/pf-kernel
[22] https://xanmod.org

Summery
=======
The facts are:
1. The independent lab results and the real-world applications
   indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
   work out of the box; there are no equivalent solutions.
3. There is a lot of new code; no smaller changes have been
   demonstrated similar effects.

Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
   materialize for a wide range of workloads.
2. Gauging the interest from the past discussions, the new features
   will likely be put to use for both personal computers and data
   centers.
3. Based on Google's track record, the new code will likely be well
   maintained in the long term. It'd be more difficult if not
   impossible to achieve similar effects with other approaches.

Yu Zhao (14):
  mm: x86, arm64: add arch_has_hw_pte_young()
  mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  mm/vmscan.c: refactor shrink_node()
  Revert "include/linux/mm_inline.h: fold __update_lru_size() into its
    sole caller"
  mm: multi-gen LRU: groundwork
  mm: multi-gen LRU: minimal implementation
  mm: multi-gen LRU: exploit locality in rmap
  mm: multi-gen LRU: support page table walks
  mm: multi-gen LRU: optimize multiple memcgs
  mm: multi-gen LRU: kill switch
  mm: multi-gen LRU: thrashing prevention
  mm: multi-gen LRU: debugfs interface
  mm: multi-gen LRU: admin guide
  mm: multi-gen LRU: design doc

 Documentation/admin-guide/mm/index.rst        |    1 +
 Documentation/admin-guide/mm/multigen_lru.rst |  156 +
 Documentation/mm/index.rst                    |    1 +
 Documentation/mm/multigen_lru.rst             |  159 +
 arch/Kconfig                                  |    8 +
 arch/arm64/include/asm/pgtable.h              |   15 +-
 arch/x86/Kconfig                              |    1 +
 arch/x86/include/asm/pgtable.h                |    9 +-
 arch/x86/mm/pgtable.c                         |    5 +-
 fs/exec.c                                     |    2 +
 fs/fuse/dev.c                                 |    3 +-
 include/linux/cgroup.h                        |   15 +-
 include/linux/memcontrol.h                    |   36 +
 include/linux/mm.h                            |    5 +
 include/linux/mm_inline.h                     |  231 +-
 include/linux/mm_types.h                      |   77 +
 include/linux/mmzone.h                        |  214 ++
 include/linux/nodemask.h                      |    1 +
 include/linux/page-flags-layout.h             |   16 +-
 include/linux/page-flags.h                    |    4 +-
 include/linux/pgtable.h                       |   17 +-
 include/linux/sched.h                         |    4 +
 include/linux/swap.h                          |    4 +
 kernel/bounds.c                               |    7 +
 kernel/cgroup/cgroup-internal.h               |    1 -
 kernel/exit.c                                 |    1 +
 kernel/fork.c                                 |    9 +
 kernel/sched/core.c                           |    1 +
 mm/Kconfig                                    |   26 +
 mm/huge_memory.c                              |    3 +-
 mm/internal.h                                 |    1 +
 mm/memcontrol.c                               |   28 +
 mm/memory.c                                   |   39 +-
 mm/mm_init.c                                  |    6 +-
 mm/mmzone.c                                   |    2 +
 mm/rmap.c                                     |    6 +
 mm/swap.c                                     |   54 +-
 mm/vmscan.c                                   | 2972 ++++++++++++++++-
 mm/workingset.c                               |  110 +-
 39 files changed, 4095 insertions(+), 155 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
 create mode 100644 Documentation/mm/multigen_lru.rst


base-commit: d2af7b221349ff6241e25fa8c67bcfae2b360700
-- 
2.37.1.595.g718a3a8f04-goog

Re: [PATCH v14 00/14] Multi-Gen LRU Framework

Posted by Andrew Morton 3 years, 7 months ago

I'd like to move mglru into the mm-stable branch late this week.

I'm not terribly happy about the level of review nor the carefulness of
the code commenting (these things are related) and I have a note here
that "mm: multi-gen LRU: admin guide" is due for an update and everyone
is at conference anyway.  But let's please try to push things along anyway.

Re: [PATCH v14 00/14] Multi-Gen LRU Framework

Posted by Yu Zhao 3 years, 6 months ago

On Sun, Sep 11, 2022 at 6:08 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> I'd like to move mglru into the mm-stable branch late this week.
>
> I'm not terribly happy about the level of review nor the carefulness of
> the code commenting (these things are related) and I have a note here
> that "mm: multi-gen LRU: admin guide" is due for an update and everyone
> is at conference anyway.  But let's please try to push things along anyway.

Thanks for the heads-up. Will add as many comments as I can and wrap
it up by the end of tomorrow.

Re: [PATCH v14 00/14] Multi-Gen LRU Framework

Posted by Yu Zhao 3 years, 6 months ago

On Thu, Sep 15, 2022 at 11:56 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Sun, Sep 11, 2022 at 6:08 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > I'd like to move mglru into the mm-stable branch late this week.
> >
> > I'm not terribly happy about the level of review nor the carefulness of
> > the code commenting (these things are related) and I have a note here
> > that "mm: multi-gen LRU: admin guide" is due for an update and everyone
> > is at conference anyway.  But let's please try to push things along anyway.
>
> Thanks for the heads-up. Will add as many comments as I can and wrap
> it up by the end of tomorrow.

I've posted v15 which can replace what mm-unstable currently has.
Apologies for the delay: an unexpected lockdep warning from the maple
tree forced me to restart all the tests [1].

Let me also post the incremental patches after this email, in case you
strongly prefer to add them on top of v14.

[1] https://lore.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com/

Re: [PATCH v14 00/14] Multi-Gen LRU Framework

Posted by Andrew Morton 3 years, 6 months ago

On Sun, 18 Sep 2022 14:40:01 -0600 Yu Zhao <yuzhao@google.com> wrote:

> Let me also post the incremental patches after this email, in case you
> strongly prefer to add them on top of v14.

Thanks, helpful.

I have one question regarding 03/11.

The final two updates look pretty substantial.  I guess I'll do a
series replacement and let this and mapletree sit another week.

OpenWrt / MIPS benchmark with MGLRU

Posted by Yu Zhao 3 years, 7 months ago

TLDR
====
RAM utilization  Throughput (95% CI)  P99 Latency (95% CI)
----------------------------------------------------------
~90%             NS                   NS
~110%            +[12, 16]%           -[20, 22]%

Abbreviations
=============
CI:   confidence interval
NS:   no statistically significant difference
DUT:  device under test
ATE:  automatic test equipment

Rational
========
1. OpenWrt is the most popular distro for WiFi routers; many of its
   targets use big endianness [1].
2. 4 out of the top 5 bestselling WiFi routers in the US use MIPS [2];
   MIPS uses software-managed TLB.
3. Memcached is the best available memory benchmark on OpenWrt;
   admittedly such a use case is very limited in the real world.

Hardware
========
DUT: Ubiquiti EdgeRouter (ER-8) [3]

DUT # cat /proc/cpuinfo
system type             : UBNT_E200 (CN6120p1.1-800-NSP)
machine                 : Unknown
processor               : 0
cpu model               : Cavium Octeon II V0.1
BogoMIPS                : 1600.00
wait instruction        : yes
microsecond timers      : yes
tlb_entries             : 128
extra interrupt vector  : yes
hardware watchpoint     : yes, count: 2, address/irw mask: [0x0ffc, 0x0ffb]
isa                     : mips1 mips2 mips3 mips4 mips5 mips32r1 mips32r2 mips64r1 mips64r2
ASEs implemented        :
Options implemented     : tlb rixiex 4kex octeon_cache 32fpr prefetch mcheck ejtag llsc rixi lpa vtag_icache userlocal perf_cntr_intr_bit perf
shadow register sets    : 1
kscratch registers      : 3
package                 : 0
core                    : 0
VCED exceptions         : not available
VCEI exceptions         : not available

processor               : 1
cpu model               : Cavium Octeon II V0.1
BogoMIPS                : 1600.00
wait instruction        : yes
microsecond timers      : yes
tlb_entries             : 128
extra interrupt vector  : yes
hardware watchpoint     : yes, count: 2, address/irw mask: [0x0ffc, 0x0ffb]
isa                     : mips1 mips2 mips3 mips4 mips5 mips32r1 mips32r2 mips64r1 mips64r2
ASEs implemented        :
Options implemented     : tlb rixiex 4kex octeon_cache 32fpr prefetch mcheck ejtag llsc rixi lpa vtag_icache userlocal perf_cntr_intr_bit perf
shadow register sets    : 1
kscratch registers      : 3
package                 : 0
core                    : 1
VCED exceptions         : not available
VCEI exceptions         : not available

DUT # cat /proc/meminfo
MemTotal:        1991964 kB
MemFree:         1917304 kB
MemAvailable:    1896856 kB
Buffers:               4 kB
Cached:            33464 kB
SwapCached:            0 kB
Active:             1316 kB
Inactive:          33500 kB
Active(anon):       1316 kB
Inactive(anon):    33496 kB
Active(file):          0 kB
Inactive(file):        4 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:        995324 kB
SwapFree:         995324 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:          1360 kB
Mapped:             2688 kB
Shmem:             33464 kB
KReclaimable:       8244 kB
Slab:              19772 kB
SReclaimable:       8244 kB
SUnreclaim:        11528 kB
KernelStack:        1056 kB
PageTables:          336 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1991304 kB
Committed_AS:      38916 kB
VmallocTotal: 1069547512 kB
VmallocUsed:        4856 kB
VmallocChunk:          0 kB
Percpu:              272 kB

Software
========
DUT # cat /etc/openwrt_release
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='22.03.0-rc6'
DISTRIB_REVISION='r19590-042d558536'
DISTRIB_TARGET='octeon/generic'
DISTRIB_ARCH='mips64_octeonplus'
DISTRIB_DESCRIPTION='OpenWrt 22.03.0-rc6 r19590-042d558536'
DISTRIB_TAINTS='no-all no-ipv6'

DUT # uname -a
Linux OpenWrt 6.0.0-rc3+ #0 SMP Sun Jul 31 15:12:47 2022 mips64 GNU/Linux

DUT # cat /proc/swaps
Filename    Type       Size    Used  Priority
/dev/zram0  partition  995324  0     100

DUT # memcached -V
memcached 1.6.9

DUT # cat /etc/config/memcached
config memcached
        option user 'memcached'
        option maxconn '1024'
        option listen '0.0.0.0'
        option port '11211'
        option memory '6400'

ATE $ memtier_benchmark -v
memtier_benchmark 1.3.0
Copyright (C) 2011-2022 Redis Ltd.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Procedure
=========
ATE $ cat run_benchmark_matrix.sh
run_memtier_benchmark()
{
    # boot to kernel $3

    # populate dataset
    memtier_benchmark/memtier_benchmark -s $DUT_IP -p 11211 \
        -P memcache_binary -n allkeys -c 1 --ratio 1:0 --pipeline 8 \
        --key-minimum=1 --key-maximum=$2 --key-pattern=P:P \
        -d 1000

    # access dataset using Guassian pattern
    memtier_benchmark/memtier_benchmark -s $DUT_IP -p 11211 \
        -P memcache_binary --test-time $1 -c 1 --ratio 0:1 \
        --pipeline 8 --key-minimum=1 --key-maximum=$2 \
        --key-pattern=G:G --randomize --distinct-client-seed

    # collect results
}

run_duration_secs=1200
mem_utils_90_110=(1600000 2000000)
kernels=("baseline" "patched")

for mem_util in ${mem_utils_90_110[@]}; do
    for kernel in ${kernels[@]}; do
        run_memtier_benchmark $run_duration_secs $mem_util $kernel
    done
done

Results
=======
Baseline                                 90% RAM utilization
------------------------------------------------------------
Ops/sec   Avg. Lat.  p50 Lat.  p99 Lat.  p99.9 Lat.  KB/sec
------------------------------------------------------------
48550.71  0.65687    0.48700   2.84700   5.56700     1812.25
48600.55  0.65629    0.48700   2.86300   5.59900     1814.11
48562.37  0.65674    0.48700   2.84700   5.50300     1812.68
48556.66  0.65688    0.48700   2.84700   5.53500     1812.47
48619.50  0.65600    0.48700   2.87900   5.63100     1814.82
48579.74  0.65654    0.48700   2.84700   5.56700     1813.33
48593.25  0.65764    0.48700   2.86300   5.56700     1814.10
48535.52  0.65716    0.48700   2.86300   5.56700     1811.68
48587.24  0.65645    0.48700   2.83100   5.50300     1813.61
48541.92  0.65704    0.48700   2.81500   5.47100     1811.92

MGLRU                                    90% RAM utilization
------------------------------------------------------------
Ops/sec   Avg. Lat.  p50 Lat.  p99 Lat.  p99.9 Lat.  KB/sec
------------------------------------------------------------
48622.38  0.65594    0.48700   2.81500   5.47100     1814.92
48537.74  0.65715    0.48700   2.84700   5.53500     1811.76
48586.82  0.65646    0.48700   2.84700   5.50300     1813.59
48552.44  0.65695    0.48700   2.83100   5.43900     1812.31
48557.35  0.65680    0.49500   2.83100   5.53500     1812.49
48625.48  0.65593    0.48700   2.81500   5.43900     1815.04
48655.75  0.65557    0.48700   2.84700   5.53500     1816.17
48625.67  0.65595    0.48700   2.84700   5.53500     1815.04
48622.22  0.65600    0.48700   2.84700   5.47100     1814.91
48617.10  0.65610    0.48700   2.84700   5.56700     1814.73

Baseline                                110% RAM utilization
------------------------------------------------------------
Ops/sec   Avg. Lat.  p50 Lat.  p99 Lat.  p99.9 Lat.  KB/sec
------------------------------------------------------------
19813.79  1.61245    0.63100   17.79100  31.74300    744.91
20328.29  1.57158    0.62300   17.27900  31.10300    764.25
20104.12  1.58913    0.62300   17.40700  31.10300    755.82
20342.03  1.57053    0.61500   17.27900  30.84700    764.77
19688.05  1.62268    0.62300   17.91900  31.35900    740.18
19607.31  1.62943    0.63900   17.91900  31.23100    737.15
19250.96  1.65963    0.65500   17.91900  31.10300    723.75
20182.79  1.58290    0.63100   17.40700  30.84700    758.78
20181.88  1.58299    0.63100   17.40700  30.84700    758.75
20615.90  1.54963    0.62300   17.02300  30.84700    775.06

MGLRU                                   110% RAM utilization
------------------------------------------------------------
Ops/sec   Avg. Lat.  p50 Lat.  p99 Lat.  p99.9 Lat.  KB/sec
------------------------------------------------------------
22911.33  1.39405    0.61500   13.69500  28.79900    861.36
22339.08  1.42989    0.61500   14.07900  30.07900    839.85
23394.22  1.36521    0.59900   13.56700  29.05500    879.51
22521.48  1.41830    0.61500   13.88700  29.82300    846.70
22678.10  1.40818    0.61500   13.82300  29.69500    852.59
22344.50  1.42952    0.61500   14.07900  29.95100    840.05
23245.65  1.37406    0.60700   13.50300  28.92700    873.93
23140.17  1.38032    0.59900   13.69500  29.18300    869.96
23003.34  1.38856    0.61500   13.63100  29.05500    864.82
22937.52  1.39253    0.61500   13.69500  29.43900    862.35

Flame graphs
------------
Baseline: https://drive.google.com/file/d/1-Ac4HMPAyZIqxtvKerUTqNNAgBLhpX9R
MGLRU: https://drive.google.com/file/d/1-9x0W2yIYeiRvXWiYRzL6niTqW7zCVPX

References
==========
[1] https://openwrt.org/docs/platforms/start
[2] https://www.amazon.com/bestsellers/pc/300189
[3] https://openwrt.org/toh/ubiquiti/edgerouter

Re: OpenWrt / MIPS benchmark with MGLRU

Posted by Yu Zhao 3 years, 7 months ago

On Tue, Aug 30, 2022 at 10:17 PM Yu Zhao <yuzhao@google.com> wrote:
>
> TLDR
> ====
> RAM utilization  Throughput (95% CI)  P99 Latency (95% CI)
> ----------------------------------------------------------
> ~90%             NS                   NS
> ~110%            +[12, 16]%           -[20, 22]%
>
> Abbreviations
> =============
> CI:   confidence interval
> NS:   no statistically significant difference
> DUT:  device under test
> ATE:  automatic test equipment
>
> Rational
> ========
> 1. OpenWrt is the most popular distro for WiFi routers; many of its
>    targets use big endianness [1].
> 2. 4 out of the top 5 bestselling WiFi routers in the US use MIPS [2];
>    MIPS uses software-managed TLB.
> 3. Memcached is the best available memory benchmark on OpenWrt;
>    admittedly such a use case is very limited in the real world.

Thanks.

My goal is to encourage MM people to extend their test coverage to
some commonly used but less tested configurations. I carefully
constructed this benchmark with the balance between its
representativeness and the effort to reproduce.

When I wear my MM hat, I see ER-8 as the ideal choice because it comes
with a serial port, a replaceable memory DIMM and one of the two cores
that can be disabled. The same SoC is also what the Debian MIPS port
mainly uses for their testing [1]. So if I need help, I might be able
to get it from them.

From OpenWrt's / MIPS OEMs' POVs, I do see ER-8 as an uninteresting
platform. Currently the best selling WiFi router on Amazon US is
Archer A7, a knockoff of Archer C7. The latter comes with not only the
serial port header but also the JTAG header, and that's what I use.
But I seriously doubt showing how I work on C7 would encourage MM
people to try it. I snapped a pictures of it during lunch:
https://drive.google.com/file/d/1rYBwLOyMqBSr6WKUZd7Gbf9RfwA641X5/
And other boards I routinely test the MM performance on:
https://drive.google.com/file/d/1yBMx9OPWw-5czvz3maNUy6WBFwPvAqG5/
All the way dates back to this vintage:
https://drive.google.com/file/d/12N21qiWSoyJgZwVkwAhY8_5Fj4dKftqD/

[1] https://wiki.debian.org/MIPSPort

Re: OpenWrt / MIPS benchmark with MGLRU

Posted by Dave Hansen 3 years, 7 months ago

On 8/30/22 21:17, Yu Zhao wrote:
> TLDR
> ====
> RAM utilization  Throughput (95% CI)  P99 Latency (95% CI)
> ----------------------------------------------------------
> ~90%             NS                   NS
> ~110%            +[12, 16]%           -[20, 22]%

I'll give you points for thinking out of the box on this one.  This is a
piece of hardware where both latency and bandwidth theoretically matter.
 I've got a slightly older but similar piece of Ubiquiti hardware with
512MB of RAM.  It doesn't run OpenWRT, fwiw.  Maybe my firmware is a bit
outdated.

*But*, most of the heavy lifting for packet flow on these systems is
done in hardware.  They have some hardware acceleration to be able to
_route_ at gigabit speeds, so they're probably not quite as sensitive to
software hiccups as lower-end routers.

That said, my system at least does not typically have *any* memory
pressure.  Right now, it hasn't even filled free memory with page cache
and it's been up for over a month:

# cat /proc/meminfo
MemTotal:         491552 kB
MemFree:          160188 kB
MemAvailable:     373088 kB
Cached:           151004 kB

I think a better tl;dr would be:

	MGLRU doesn't help much or cause any regressions on this
	hardware.  Under (atypical) synthetic memory pressure, MGLRU did
	show some modest but measurable throughput and latency benefits.

In other words, this provides more of a data point that MGLRU doesn't
hurt medium-ish sized embedded systems.  I think you could make an even
stronger case with even smaller hardware or something that actually sees
memory pressure on a regular basis in the wild.

Re: OpenWrt / MIPS benchmark with MGLRU

Posted by Arnd Bergmann 3 years, 7 months ago

On Wed, Aug 31, 2022, at 6:17 AM, Yu Zhao wrote:
>
> Rational
> ========
> 1. OpenWrt is the most popular distro for WiFi routers; many of its
>    targets use big endianness [1].
> 2. 4 out of the top 5 bestselling WiFi routers in the US use MIPS [2];
>    MIPS uses software-managed TLB.
> 3. Memcached is the best available memory benchmark on OpenWrt;
>    admittedly such a use case is very limited in the real world.
>
> Hardware
> ========
> DUT: Ubiquiti EdgeRouter (ER-8) [3]

I don't know if it makes any difference to your findings, but
I would point out the test hardware is neither representative
of most devices supported by OpenWRT, nor those on the amazon
best-seller list that I see looking from Germany:

Five of the top-10 devices on that list are arm64 (little-endian,
hardware TLB walker, typically 512MB of RAM), the others are
mips32 (typically only 128MB, mostly single-core) and only
the oldest one (Archer C7) of them is big-endian. I would not
expect endianness to make any difference, but the 16x smaller
memory of typical mips devices (ath79, mt76) might.

       Arnd