While playing with dirty-bit tracking, I decided to take a look at how page
splitting works. I found that all entries are walked, even though we can
infer, for instance, that:
- If a level-3 entry is walked, it means the parent level-2 entry is split
- If a split just succeeded in a table entry, it means all child nodes
are already split
The idea of this patchset is to introduce new walk flags to skip pagetable
levels 0-3.
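To make the idea concrete, here is a toy model of a walker that honours
per-level skip flags. It is only a sketch: every toy_* name, the flag
encoding and the tiny fan-out are made up for illustration and are not the
definitions used in kvm_pgtable.h or pgtable.c. The walker still descends
through every table, but the visitor callback (modelled as a counter) is
not invoked on levels whose skip bit is set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding: one skip bit per level (not the kernel's values). */
#define TOY_WALK_SKIP_LEVEL(l)	(1u << (l))
#define TOY_LEVELS		4	/* levels 0-3 */
#define TOY_FANOUT		4	/* entries per table, tiny on purpose */

struct toy_walk {
	uint32_t flags;
	unsigned long visits[TOY_LEVELS];	/* callback calls per level */
};

/*
 * Walk a fully populated toy tree. Child tables are always descended into;
 * only the per-entry callback is skipped on flagged levels.
 */
static void toy_walk_level(struct toy_walk *w, unsigned int level)
{
	assert(level < TOY_LEVELS);

	for (unsigned int i = 0; i < TOY_FANOUT; i++) {
		if (!(w->flags & TOY_WALK_SKIP_LEVEL(level)))
			w->visits[level]++;	/* stands in for the walker callback */

		if (level + 1 < TOY_LEVELS)
			toy_walk_level(w, level + 1);
	}
}

/* Total number of callback invocations for a given set of skip flags. */
static unsigned long toy_total_visits(uint32_t flags)
{
	struct toy_walk w = { .flags = flags };
	unsigned long total = 0;

	toy_walk_level(&w, 0);
	for (unsigned int l = 0; l < TOY_LEVELS; l++)
		total += w.visits[l];

	return total;
}
```

In this model, toy_total_visits(0) counts a callback for every entry, while
toy_total_visits(TOY_WALK_SKIP_LEVEL(3)) drops the leaf level, which is
where most of the entries are.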
The idea of skipping child nodes was also tested, but it was marginally
slower than just skipping levels, so it was discarded.
Optimization measured on two scenarios involving eager-splitting on a
VM with 1 memslot of 16GB:
- Scenario 1: No manual protect, whole memslot split at dirty-track enable
(KVM_SET_USER_MEMORY_REGION2 ioctl with KVM_MEM_LOG_DIRTY_PAGES)
- Split happens only once, whole region
- Evaluates improved batch performance of splitting
- Scenario 2: Manual protect, split happens during every dirty-bit clean
(KVM_CLEAR_DIRTY_LOG ioctl), average for 2 iterations.
- Split called multiple times, for smaller 64-page sections.
- Evaluate improved performance for multiple calls
Scenario 1, improvement on dirty-track enable ioctl for the memslot:
- Memory was already split (4k pages): -44.01% runtime (stdev 2.80%)
- THP backed memory: -24.66% runtime (stdev 1.21%)
- 16x1GB hugetlb memory: -24.78% runtime (stdev 0.85%)
Scenario 2, improvement on dirty-log clean ioctl for the memslot:
- Memory was already split (4k pages): -38.98% runtime (stdev 1.91%)
- THP backed memory: -25.49% runtime (stdev 0.65%)
- 16x1GB hugetlb memory: -24.24% runtime (stdev 0.65%)
To collect the numbers above, the following script was run on both vanilla
and patched kernels, with kernel parameter 'default_hugepagesz=1G', on a
TX2 with 32GB RAM.
--- dirty_test.sh
#!/bin/bash
filename=$(uname -r |cut -d'-' -f 4-)

run_test(){
    uname -a
    cat /proc/cmdline

    #prepare
    sudo bash -c 'echo 64 > /proc/sys/vm/nr_hugepages'

    ./dirty_log_perf_test -g -b 64G
    ./dirty_log_perf_test -g -b 64G -s anonymous_thp
    ./dirty_log_perf_test -g -b 64G -s shared_hugetlb

    ./dirty_log_perf_test -b 64G
    ./dirty_log_perf_test -b 64G -s anonymous_thp
    ./dirty_log_perf_test -b 64G -s shared_hugetlb
}

run_test 2>&1 | tee ${filename}
---
The dirty_log_perf_test command above is the standard KVM selftest found in
the kernel tree. It tested the following guest modes:
Testing guest mode: PA-bits:40, VA-bits:48, 4K pages
Testing guest mode: PA-bits:40, VA-bits:48, 64K pages
Testing guest mode: PA-bits:36, VA-bits:48, 4K pages
Testing guest mode: PA-bits:36, VA-bits:48, 64K pages
The performance numbers from the modes above were used to calculate the
average and stdev shown in the optimization results.
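As a sketch of how a "-N% runtime" figure can be derived, the helpers below
compute the relative runtime change for one guest mode and the mean across
modes. This is one plausible reading of the procedure, not the exact
tooling used: the function names are hypothetical, and averaging the
per-mode percentages (rather than the raw runtimes) is an assumption. The
stdev would be taken over the same per-mode values.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Relative runtime change in percent for one guest mode.
 * Negative means the patched kernel was faster than vanilla.
 */
static double toy_runtime_change_pct(double vanilla, double patched)
{
	assert(vanilla > 0.0);
	return (patched - vanilla) * 100.0 / vanilla;
}

/* Arithmetic mean of n per-mode values. */
static double toy_mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];

	return sum / (double)n;
}
```

For example, a mode that takes 100 time units on vanilla and 75 on the
patched kernel yields -25% from toy_runtime_change_pct().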
Changes since v1:
- Fixed inverted flag verification priority (Sashiko)
- Fixed incorrectly skipping POST call if level was skipped (Sashiko), and for that:
  - New pre-patch that changes goto-out -> return to avoid re-testing walk_continue
v1 Link: https://lore.kernel.org/lkml/20260610202112.2695205-2-leo.bras@arm.com/
Changes since RFC:
- Changed approach from return value to walk flags (Will Deacon)
- Discarded skip_child approach (Oliver Upton)
- Measured on real hardware, and from userspace perspective (Marc Zyngier)
- Better explanation of what and how numbers were collected
RFC Link: https://lore.kernel.org/all/20260515195904.2466381-1-leo.bras@arm.com/
Thanks!
Leo
Leonardo Bras (3):
KVM: arm64: Avoid re-testing walk_continue
KVM: arm64: Introduce KVM_PGTABLE_WALK_SKIP_LEVEL* walk flags
KVM: arm64: Make stage2_split_walker() skip unnecessary walks
arch/arm64/include/asm/kvm_pgtable.h | 13 +++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 28 +++++++++++++++++++++-------
2 files changed, 34 insertions(+), 7 deletions(-)
base-commit: 66affa37cfac0aec061cc4bcf4a065b0c52f7e19
--
2.54.0
On Thu, Jun 18, 2026 at 02:14:41PM +0100, Leonardo Bras wrote:
> While playing with dirty-bit tracking, I decided to take a look at how page
> splitting works. I found that all entries are walked, even though we can
> infer, for instance, that:
> - If a level-3 entry is walked, it means the parent level-2 entry is split
> - If a split just succeeded in a table entry, it means all child nodes
> are already split
>
> The idea of this patchset is to introduce new walk flags to skip pagetable
> levels 0-3.
>
> The idea of skipping child nodes was also tested, but it was marginally
> slower than just skipping levels, so it was discarded.
>
> Optimization measured on two scenarios involving eager-splitting on a
> VM with 1 memslot of 16GB:
> - Scenario 1: No manual protect, whole memslot split at dirty-track enable
> (KVM_SET_USER_MEMORY_REGION2 ioctl with KVM_MEM_LOG_DIRTY_PAGES)
> - Split happens only once, whole region
> - Evaluates improved batch performance of splitting
> - Scenario 2: Manual protect, split happens during every dirty-bit clean
> (KVM_CLEAR_DIRTY_LOG ioctl), average for 2 iterations.
> - Split called multiple times, for smaller 64-page sections.
> - Evaluate improved performance for multiple calls
>
> Scenario 1, improvement on dirty-track enable ioctl for the memslot:
> - Memory was already split (4k pages): -44.01% runtime (stdev 2.80%)
> - THP backed memory: -24.66% runtime (stdev 1.21%)
> - 16x1GB hugetlb memory: -24.78% runtime (stdev 0.85%)
>
> Scenario 2, improvement on dirty-log clean ioctl for the memslot:
> - Memory was already split (4k pages): -38.98% runtime (stdev 1.91%)
> - THP backed memory: -25.49% runtime (stdev 0.65%)
> - 16x1GB hugetlb memory: -24.24% runtime (stdev 0.65%)
>
> To collect the numbers above, the following script was run on both vanilla
> and patched kernels, with kernel parameter 'default_hugepagesz=1G', on a
> TX2 with 32GB RAM.
>
> --- dirty_test.sh
> #!/bin/bash
> filename=$(uname -r |cut -d'-' -f 4-)
>
> run_test(){
> uname -a
> cat /proc/cmdline
>
> #prepare
> sudo bash -c 'echo 64 > /proc/sys/vm/nr_hugepages'
>
> ./dirty_log_perf_test -g -b 64G
> ./dirty_log_perf_test -g -b 64G -s anonymous_thp
> ./dirty_log_perf_test -g -b 64G -s shared_hugetlb
>
> ./dirty_log_perf_test -b 64G
> ./dirty_log_perf_test -b 64G -s anonymous_thp
> ./dirty_log_perf_test -b 64G -s shared_hugetlb
> }
>
> run_test 2>&1 | tee ${filename}
> ---
s/64G/16G/ on above script
>
> The dirty_log_perf_test command above is the standard KVM selftest found in
> the kernel tree. It tested the following guest modes:
> Testing guest mode: PA-bits:40, VA-bits:48, 4K pages
> Testing guest mode: PA-bits:40, VA-bits:48, 64K pages
> Testing guest mode: PA-bits:36, VA-bits:48, 4K pages
> Testing guest mode: PA-bits:36, VA-bits:48, 64K pages
>
> The performance numbers from the modes above were used to calculate the
> average and stdev shown in the optimization results.
>
> Changes since v1:
> - Fixed inverted flag verification priority (Sashiko)
> - Fixed incorrectly skipping POST call if level was skipped (Sashiko), and for that:
>   - New pre-patch that changes goto-out -> return to avoid re-testing walk_continue
> v1 Link: https://lore.kernel.org/lkml/20260610202112.2695205-2-leo.bras@arm.com/
>
> Changes since RFC:
> - Changed approach from return value to walk flags (Will Deacon)
> - Discarded skip_child approach (Oliver Upton)
> - Measured on real hardware, and from userspace perspective (Marc Zyngier)
> - Better explanation of what and how numbers were collected
> RFC Link: https://lore.kernel.org/all/20260515195904.2466381-1-leo.bras@arm.com/
>
> Thanks!
> Leo
>
> Leonardo Bras (3):
> KVM: arm64: Avoid re-testing walk_continue
> KVM: arm64: Introduce KVM_PGTABLE_WALK_SKIP_LEVEL* walk flags
> KVM: arm64: Make stage2_split_walker() skip unnecessary walks
>
> arch/arm64/include/asm/kvm_pgtable.h | 13 +++++++++++++
> arch/arm64/kvm/hyp/pgtable.c | 28 +++++++++++++++++++++-------
> 2 files changed, 34 insertions(+), 7 deletions(-)
>
>
> base-commit: 66affa37cfac0aec061cc4bcf4a065b0c52f7e19
> --
> 2.54.0
>