[RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
Posted by Yang Shi 1 year, 2 months ago
When rodata=full, the kernel linear mapping is mapped at PTE level due to
Arm's break-before-make rule.

This results in a couple of problems:
  - performance degradation
  - more TLB pressure
  - memory waste for kernel page table

There are workarounds to mitigate these problems, for example using
rodata=on, but this compromises the security protection.

With FEAT_BBM level 2 support, splitting a large block mapping into
smaller ones no longer requires making the page table entry invalid
first.  This allows the kernel to split large block mappings on the fly.

Add kernel page table split support and use large block mappings by
default for rodata=full when FEAT_BBM level 2 is supported.  When
changing permissions for the kernel linear mapping, the page table will
be split down to PTE level.
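
For illustration, a simplified sketch of splitting one PMD block mapping
into a table of PTEs under BBML2 (this is not the exact code in patch 2;
addr is assumed PMD-aligned, the block is assumed to carry the default
PAGE_KERNEL attributes, and locking/error details are omitted):

static int split_pmd_block(pmd_t *pmdp, unsigned long addr)
{
        pmd_t pmd = READ_ONCE(*pmdp);
        unsigned long pfn = pmd_pfn(pmd);
        pte_t *ptep;
        int i;

        if (!pmd_leaf(pmd))
                return 0;       /* already table-mapped */

        ptep = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
        if (!ptep)
                return -ENOMEM;

        /* Assume the block used the default kernel attributes */
        for (i = 0; i < PTRS_PER_PTE; i++)
                set_pte(ptep + i, pfn_pte(pfn + i, PAGE_KERNEL));

        /*
         * Make sure the new table is visible before publishing it.  With
         * BBML2 the live block entry can be replaced by the table entry
         * directly; without it, an invalid entry plus TLB invalidation
         * would be required in between (break-before-make).
         */
        dsb(ishst);
        set_pmd(pmdp, __pmd(__pa(ptep) | PMD_TYPE_TABLE));
        flush_tlb_kernel_range(addr, addr + PMD_SIZE);

        return 0;
}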

Machines without FEAT_BBM level 2 fall back to a PTE-mapped kernel
linear mapping when rodata=full.

With this we saw a significant performance boost in some benchmarks
while keeping the rodata=full security protection.

The tests were done on an AmpereOne machine (192 cores, 1P) with 256GB
memory, 4K page size and 48-bit VA.

Functional tests (4K/16K/64K page size)
  - Kernel boot.  The kernel needs to change linear mapping permissions at
    boot; if the patch did not work, the kernel typically failed to boot.
  - Module stress test from stress-ng.  Loading a kernel module changes
    permissions for the module sections.
  - A test kernel module which allocates 80% of total memory via vmalloc(),
    changes the vmalloc area permission to RO, then changes it back before
    vfree() (see the sketch after this list).  Then launch a VM which
    consumes almost all physical memory.
  - Run a VM with the patchset applied in the guest kernel too.
  - Kernel build in a VM with the patched guest kernel.
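
A minimal sketch of what that vmalloc permission-flip test does (purely
illustrative; the real test module sizes the allocation at 80% of total
memory and has fuller error handling):

#include <linux/module.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/set_memory.h>

static void *buf;
static unsigned long size = 64UL << 20;        /* illustrative size only */

static int __init vmalloc_perm_test_init(void)
{
        buf = vmalloc(size);
        if (!buf)
                return -ENOMEM;

        /*
         * Flip the vmalloc range to RO and back.  With rodata=full the
         * linear map aliases of the backing pages are updated as well,
         * which exercises the block-splitting path.
         */
        set_memory_ro((unsigned long)buf, size >> PAGE_SHIFT);
        set_memory_rw((unsigned long)buf, size >> PAGE_SHIFT);

        return 0;
}

static void __exit vmalloc_perm_test_exit(void)
{
        vfree(buf);
}

module_init(vmalloc_perm_test_init);
module_exit(vmalloc_perm_test_exit);
MODULE_LICENSE("GPL");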

Memory consumption
Before:
MemTotal:       258988984 kB
MemFree:        254821700 kB

After:
MemTotal:       259505132 kB
MemFree:        255410264 kB

Around 500MB more memory is free to use.  The larger the machine, the
more memory is saved.

Performance benchmarking
* Memcached
We saw performance degradation when running the Memcached benchmark with
rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
With this patchset, ops/sec increased by around 3.5% and P99 latency
dropped by around 9.6%.
The gain mainly came from reduced kernel TLB misses; the kernel TLB
MPKI dropped by 28.5%.

The benchmark data is now on par with rodata=on too.

* Disk encryption (dm-crypt) benchmark
Ran the fio benchmark with the command below on a 128G ramdisk (ext4)
with disk encryption (dm-crypt).
fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
    --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
    --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
    --name=iops-test-job --eta-newline=1 --size 100G

IOPS increased by 90% - 150% (the variance is high, but the worst result
with the patches is around 90% higher than the best result without them).
Bandwidth increased and the average completion latency (clat) dropped
proportionally.

* Sequential file read
Read a 100G file sequentially on XFS (xfs_io read with the page cache
populated).  Bandwidth increased by 150%.
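An illustrative xfs_io invocation for this kind of read (not necessarily
the exact command used):
xfs_io -c 'pread -b 2m 0 100g' /path/to/testfile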


Yang Shi (3):
      arm64: cpufeature: detect FEAT_BBM level 2
      arm64: mm: support large block mapping when rodata=full
      arm64: cpufeature: workaround AmpereOne FEAT_BBM level 2

 arch/arm64/include/asm/cpufeature.h |  24 ++++++++++++++++++
 arch/arm64/include/asm/pgtable.h    |   7 +++++-
 arch/arm64/kernel/cpufeature.c      |  11 ++++++++
 arch/arm64/mm/mmu.c                 |  31 +++++++++++++++++++++--
 arch/arm64/mm/pageattr.c            | 173 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 arch/arm64/tools/cpucaps            |   1 +
 6 files changed, 238 insertions(+), 9 deletions(-)
Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
Posted by Will Deacon 1 year, 1 month ago
On Mon, Nov 18, 2024 at 10:16:07AM -0800, Yang Shi wrote:
> 
> When rodata=full kernel linear mapping is mapped by PTE due to arm's
> break-before-make rule.
> 
> This resulted in a couple of problems:
>   - performance degradation
>   - more TLB pressure
>   - memory waste for kernel page table
> 
> There are some workarounds to mitigate the problems, for example, using
> rodata=on, but this compromises the security measurement.
> 
> With FEAT_BBM level 2 support, splitting large block page table to
> smaller ones doesn't need to make the page table entry invalid anymore.
> This allows kernel split large block mapping on the fly.

I think you can still get TLB conflict aborts in this case, so this
doesn't work. Hopefully the architecture can strengthen this in the
future to give you what you need.

Will
Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
Posted by Jonathan Cameron 1 year, 1 month ago
On Tue, 10 Dec 2024 11:31:52 +0000
Will Deacon <will@kernel.org> wrote:

> On Mon, Nov 18, 2024 at 10:16:07AM -0800, Yang Shi wrote:
> > 
> > When rodata=full kernel linear mapping is mapped by PTE due to arm's
> > break-before-make rule.
> > 
> > This resulted in a couple of problems:
> >   - performance degradation
> >   - more TLB pressure
> >   - memory waste for kernel page table
> > 
> > There are some workarounds to mitigate the problems, for example, using
> > rodata=on, but this compromises the security measurement.
> > 
> > With FEAT_BBM level 2 support, splitting large block page table to
> > smaller ones doesn't need to make the page table entry invalid anymore.
> > This allows kernel split large block mapping on the fly.  
> 
> I think you can still get TLB conflict aborts in this case, so this
> doesn't work. Hopefully the architecture can strengthen this in the
> future to give you what you need.
> 
> Will
> 

Hi All,

Given we have two threads on this topic, replying here as well...

Huawei has implementations that support BBML2 and may report a TLB conflict
abort after the block size is changed directly, until an appropriate TLB
invalidation instruction completes; this Implementation Choice is
architecturally compliant.

I'm not trying to restrict potential solutions, just making the point that
we will be interested in solutions that handle the conflict abort.

Jonathan
Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
Posted by Christoph Lameter (Ampere) 1 year, 1 month ago
On Tue, 10 Dec 2024, Will Deacon wrote:

> > With FEAT_BBM level 2 support, splitting large block page table to
> > smaller ones doesn't need to make the page table entry invalid anymore.
> > This allows kernel split large block mapping on the fly.
>
> I think you can still get TLB conflict aborts in this case, so this
> doesn't work. Hopefully the architecture can strengthen this in the

Which platforms get TLB conflicts? Ours does not. Is this an erratum on
some platforms?
Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
Posted by Yang Shi 1 year, 1 month ago

On 12/10/24 3:31 AM, Will Deacon wrote:
> On Mon, Nov 18, 2024 at 10:16:07AM -0800, Yang Shi wrote:
>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>> break-before-make rule.
>>
>> This resulted in a couple of problems:
>>    - performance degradation
>>    - more TLB pressure
>>    - memory waste for kernel page table
>>
>> There are some workarounds to mitigate the problems, for example, using
>> rodata=on, but this compromises the security measurement.
>>
>> With FEAT_BBM level 2 support, splitting large block page table to
>> smaller ones doesn't need to make the page table entry invalid anymore.
>> This allows kernel split large block mapping on the fly.
> I think you can still get TLB conflict aborts in this case, so this
> doesn't work. Hopefully the architecture can strengthen this in the
> future to give you what you need.

Hi Will,

Thanks for responding. This is a little bit surprising. I thought
FEAT_BBM level 2 could handle the TLB conflict gracefully; at least its
description made me assume so. And Catalin also mentioned that FEAT_BBM
level 2 could be used to split the vmemmap page table in the HVO patch
discussion (https://lore.kernel.org/all/Zo68DP6siXfb6ZBR@arm.com/).

It sounds a little contradictory if the TLB conflict can still happen
with FEAT_BBM level 2. It makes the benefit of FEAT_BBM level 2 much
smaller than expected.

Is it out of the question to handle the TLB conflict aborts? IIUC we
should just need to flush the TLB and then resume, and it doesn't require
holding any locks either.
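
Something along these lines (purely illustrative and untested; hooking it
up to the fault handling path is omitted):

static int do_tlb_conflict_abort(unsigned long far, unsigned long esr,
                                 struct pt_regs *regs)
{
        /* Drop the stale entries that caused the conflict, then retry */
        local_flush_tlb_all();
        return 0;
}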

I chatted with our architects and was told the TLB conflict abort does
not happen on AmpereOne. Maybe this is why I didn't see the problem when
I tested the patches.

Thanks,
Yang


>
> Will
Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
Posted by Will Deacon 1 year, 1 month ago
Hey,

On Tue, Dec 10, 2024 at 11:33:16AM -0800, Yang Shi wrote:
> On 12/10/24 3:31 AM, Will Deacon wrote:
> > On Mon, Nov 18, 2024 at 10:16:07AM -0800, Yang Shi wrote:
> > > When rodata=full kernel linear mapping is mapped by PTE due to arm's
> > > break-before-make rule.
> > > 
> > > This resulted in a couple of problems:
> > >    - performance degradation
> > >    - more TLB pressure
> > >    - memory waste for kernel page table
> > > 
> > > There are some workarounds to mitigate the problems, for example, using
> > > rodata=on, but this compromises the security measurement.
> > > 
> > > With FEAT_BBM level 2 support, splitting large block page table to
> > > smaller ones doesn't need to make the page table entry invalid anymore.
> > > This allows kernel split large block mapping on the fly.
> > I think you can still get TLB conflict aborts in this case, so this
> > doesn't work. Hopefully the architecture can strengthen this in the
> > future to give you what you need.
> 
> Thanks for responding. This is a little bit surprising. I thought FEAT_BBM
> level 2 can handle the TLB conflict gracefully. At least its description
> made me assume so. And Catalin also mentioned FEAT_BBM level 2 can be used
> to split vmemmap page table in HVO patch discussion
> (https://lore.kernel.org/all/Zo68DP6siXfb6ZBR@arm.com/).
> 
> It sounds a little bit contradicting if the TLB conflict still can happen
> with FEAT_BBM level 2. It makes the benefit of FEAT_BBM level 2 much less
> than expected.

You can read the Arm ARM just as badly as I can :)

 | I_HYQMB
 |
 | If any level is supported and the TLB entries are not invalidated after
 | the writes that modified the translation table entries are completed,
 | then a TLB conflict abort can be generated because in a TLB there might
 | be multiple translation table entries that all translate the same IA.

Note *any level*.

Furthermore:

 | R_FWRMB
 |
 | If all of the following apply, then a TLB conflict abort is reported
 | to EL2:
 | * Level 1 or level 2 is supported.
 | * Stage 2 translations are enabled in the current translation regime.
 | * A TLB conflict abort is generated due to changing the block size or
 |   Contiguous bit.

I think this series is trying to handle some of this:

https://lore.kernel.org/r/20241211154611.40395-1-miko.lenczewski@arm.com

> Is it out of question to handle the TLB conflict aborts? IIUC we should just
> need flush TLB then resume, and it doesn't require to hold any locks as
> well.

See my reply here:

https://lore.kernel.org/r/20241211210243.GA17155@willie-the-truck

> And I chatted with our architects, I was told the TLB conflict abort doesn't
> happen on AmpereOne. Maybe this is why I didn't see the problem when I
> tested the patches.

I'm actually open to having an MIDR-based lookup for this if it's your own
micro-architecture.

Will
Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
Posted by Yang Shi 1 year, 1 month ago

On 12/11/24 2:30 PM, Will Deacon wrote:
> Hey,
>
> On Tue, Dec 10, 2024 at 11:33:16AM -0800, Yang Shi wrote:
>> On 12/10/24 3:31 AM, Will Deacon wrote:
>>> On Mon, Nov 18, 2024 at 10:16:07AM -0800, Yang Shi wrote:
>>>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>>>> break-before-make rule.
>>>>
>>>> This resulted in a couple of problems:
>>>>     - performance degradation
>>>>     - more TLB pressure
>>>>     - memory waste for kernel page table
>>>>
>>>> There are some workarounds to mitigate the problems, for example, using
>>>> rodata=on, but this compromises the security measurement.
>>>>
>>>> With FEAT_BBM level 2 support, splitting large block page table to
>>>> smaller ones doesn't need to make the page table entry invalid anymore.
>>>> This allows kernel split large block mapping on the fly.
>>> I think you can still get TLB conflict aborts in this case, so this
>>> doesn't work. Hopefully the architecture can strengthen this in the
>>> future to give you what you need.
>> Thanks for responding. This is a little bit surprising. I thought FEAT_BBM
>> level 2 can handle the TLB conflict gracefully. At least its description
>> made me assume so. And Catalin also mentioned FEAT_BBM level 2 can be used
>> to split vmemmap page table in HVO patch discussion
>> (https://lore.kernel.org/all/Zo68DP6siXfb6ZBR@arm.com/).
>>
>> It sounds a little bit contradicting if the TLB conflict still can happen
>> with FEAT_BBM level 2. It makes the benefit of FEAT_BBM level 2 much less
>> than expected.
> You can read the Arm ARM just as badly as I can :)
>
>   | I_HYQMB
>   |
>   | If any level is supported and the TLB entries are not invalidated after
>   | the writes that modified the translation table entries are completed,
>   | then a TLB conflict abort can be generated because in a TLB there might
>   | be multiple translation table entries that all translate the same IA.
>
> Note *any level*.
>
> Furthermore:
>
>   | R_FWRMB
>   |
>   | If all of the following apply, then a TLB conflict abort is reported
>   | to EL2:
>   | * Level 1 or level 2 is supported.
>   | * Stage 2 translations are enabled in the current translation regime.
>   | * A TLB conflict abort is generated due to changing the block size or
>   |   Contiguous bit.

Thank you so much for pinpointing the document.

>
> I think this series is trying to handle some of this:
>
> https://lore.kernel.org/r/20241211154611.40395-1-miko.lenczewski@arm.com

Thanks for sharing the series; it is new. Yes, both are trying to add
BBML2 support to optimize some use cases.

>
>> Is it out of question to handle the TLB conflict aborts? IIUC we should just
>> need flush TLB then resume, and it doesn't require to hold any locks as
>> well.
> See my reply here:
>
> https://lore.kernel.org/r/20241211210243.GA17155@willie-the-truck

Yeah, it is hard to guarantee that a recursive TLB conflict abort never happens.

>
>> And I chatted with our architects, I was told the TLB conflict abort doesn't
>> happen on AmpereOne. Maybe this is why I didn't see the problem when I
>> tested the patches.
> I'm actually open to having an MIDR-based lookup for this if its your own
> micro-architecture.

I think it actually makes our life easier. We can just enable BBML2 for
the CPUs which can handle the TLB conflict gracefully, so we don't need
to worry about handling the TLB conflict abort at all. I can implement
this in v2.
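
Roughly something like this (rough sketch only; the names are made up and
the real check would sit alongside the existing cpufeature detection):

static const struct midr_range bbml2_noabort_cpus[] = {
        MIDR_ALL_VERSIONS(MIDR_AMPERE1),
        {},
};

static bool has_bbml2_noabort(const struct arm64_cpu_capabilities *entry,
                              int scope)
{
        /*
         * Only treat BBML2 as usable on cores known not to raise TLB
         * conflict aborts when the block size changes.
         */
        if (!is_midr_in_range_list(read_cpuid_id(), bbml2_noabort_cpus))
                return false;

        return has_cpuid_feature(entry, scope);
}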

> Will
Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
Posted by Yang Shi 1 year, 2 months ago
Gently ping...


Any comments on this RFC? I look forward to discussing them.

Thanks,
Yang

