Hi All,

This is a new version following on from the v6 RFC at [1] which itself is based
on Yang Shi's work. On systems with BBML2_NOABORT support, it causes the linear
map to be mapped with large blocks, even when rodata=full, and leads to some
nice performance improvements.

I've tested this on an AmpereOne system (a VM with 12G RAM) in all 3 possible
modes by hacking the BBML2 feature detection code:

  - mode 1: All CPUs support BBML2 so the linear map uses large mappings
  - mode 2: Boot CPU does not support BBML2 so linear map uses pte mappings
  - mode 3: Boot CPU supports BBML2 but secondaries do not so linear map
    initially uses large mappings but is then repainted to use pte mappings

In all cases, mm selftests run and no regressions are observed. In all cases,
ptdump of linear map is as expected:

Mode 1:
=======
---[ Linear Mapping start ]---
0xffff000000000000-0xffff000000200000        2M PMD   RW NX SHD AF     BLK UXN MEM/NORMAL-TAGGED
0xffff000000200000-0xffff000000210000       64K PTE   RW NX SHD AF CON     UXN MEM/NORMAL-TAGGED
0xffff000000210000-0xffff000000400000     1984K PTE   ro NX SHD AF         UXN MEM/NORMAL
0xffff000000400000-0xffff000002400000       32M PMD   ro NX SHD AF     BLK UXN MEM/NORMAL
0xffff000002400000-0xffff000002550000     1344K PTE   ro NX SHD AF         UXN MEM/NORMAL
0xffff000002550000-0xffff000002600000      704K PTE   RW NX SHD AF CON     UXN MEM/NORMAL-TAGGED
0xffff000002600000-0xffff000004000000       26M PMD   RW NX SHD AF     BLK UXN MEM/NORMAL-TAGGED
0xffff000004000000-0xffff000040000000      960M PMD   RW NX SHD AF CON BLK UXN MEM/NORMAL-TAGGED
0xffff000040000000-0xffff000140000000        4G PUD   RW NX SHD AF     BLK UXN MEM/NORMAL-TAGGED
0xffff000140000000-0xffff000142000000       32M PMD   RW NX SHD AF CON BLK UXN MEM/NORMAL-TAGGED
0xffff000142000000-0xffff000142120000     1152K PTE   RW NX SHD AF CON     UXN MEM/NORMAL-TAGGED
0xffff000142120000-0xffff000142128000       32K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000142128000-0xffff000142159000      196K PTE   ro NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000142159000-0xffff000142160000       28K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000142160000-0xffff000142240000      896K PTE   RW NX SHD AF CON     UXN MEM/NORMAL-TAGGED
0xffff000142240000-0xffff00014224e000       56K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff00014224e000-0xffff000142250000        8K PTE   ro NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000142250000-0xffff000142260000       64K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000142260000-0xffff000142280000      128K PTE   RW NX SHD AF CON     UXN MEM/NORMAL-TAGGED
0xffff000142280000-0xffff000142288000       32K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000142288000-0xffff000142290000       32K PTE   ro NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000142290000-0xffff0001422a0000       64K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff0001422a0000-0xffff000142465000     1812K PTE   ro NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000142465000-0xffff000142470000       44K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000142470000-0xffff000142600000     1600K PTE   RW NX SHD AF CON     UXN MEM/NORMAL-TAGGED
0xffff000142600000-0xffff000144000000       26M PMD   RW NX SHD AF     BLK UXN MEM/NORMAL-TAGGED
0xffff000144000000-0xffff000180000000      960M PMD   RW NX SHD AF CON BLK UXN MEM/NORMAL-TAGGED
0xffff000180000000-0xffff000181a00000       26M PMD   RW NX SHD AF     BLK UXN MEM/NORMAL-TAGGED
0xffff000181a00000-0xffff000181b90000     1600K PTE   RW NX SHD AF CON     UXN MEM/NORMAL-TAGGED
0xffff000181b90000-0xffff000181b9d000       52K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000181b9d000-0xffff000181c80000      908K PTE   ro NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000181c80000-0xffff000181c90000       64K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000181c90000-0xffff000181ca0000       64K PTE   RW NX SHD AF CON     UXN MEM/NORMAL-TAGGED
0xffff000181ca0000-0xffff000181dbd000     1140K PTE   ro NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000181dbd000-0xffff000181dc0000       12K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000181dc0000-0xffff000181e00000      256K PTE   RW NX SHD AF CON     UXN MEM/NORMAL-TAGGED
0xffff000181e00000-0xffff000182000000        2M PMD   RW NX SHD AF     BLK UXN MEM/NORMAL-TAGGED
0xffff000182000000-0xffff0001c0000000      992M PMD   RW NX SHD AF CON BLK UXN MEM/NORMAL-TAGGED
0xffff0001c0000000-0xffff000300000000        5G PUD   RW NX SHD AF     BLK UXN MEM/NORMAL-TAGGED
0xffff000300000000-0xffff008000000000      500G PUD
0xffff008000000000-0xffff800000000000   130560G PGD
---[ Linear Mapping end ]---

Mode 3:
=======
---[ Linear Mapping start ]---
0xffff000000000000-0xffff000000210000     2112K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000000210000-0xffff000000400000     1984K PTE   ro NX SHD AF         UXN MEM/NORMAL
0xffff000000400000-0xffff000002400000       32M PMD   ro NX SHD AF     BLK UXN MEM/NORMAL
0xffff000002400000-0xffff000002550000     1344K PTE   ro NX SHD AF         UXN MEM/NORMAL
0xffff000002550000-0xffff000143a61000  5264452K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000143a61000-0xffff000143c61000        2M PTE   ro NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000143c61000-0xffff000181b9a000  1015012K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000181b9a000-0xffff000181d9a000        2M PTE   ro NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000181d9a000-0xffff000300000000  6261144K PTE   RW NX SHD AF         UXN MEM/NORMAL-TAGGED
0xffff000300000000-0xffff008000000000      500G PUD
0xffff008000000000-0xffff800000000000   130560G PGD
---[ Linear Mapping end ]---


Performance Testing
===================

Yang Shi has gathered some compelling results which are detailed in the commit
log for patch #3. Additionally I have run this through a random selection of
benchmarks on AmpereOne. None show any regressions, and various benchmarks show
statistically significant improvement.
I'm just showing those improvements here:

+----------------------+----------------------------------------------------------+-------------------------+
| Benchmark            | Result Class                                             | Improvement vs 6.17-rc1 |
+======================+==========================================================+=========================+
| micromm/vmalloc      | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |              (I) -9.00% |
|                      | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |              (I) -6.93% |
|                      | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |              (I) -6.77% |
|                      | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |              (I) -4.63% |
+----------------------+----------------------------------------------------------+-------------------------+
| mmtests/hackbench    | process-sockets-30 (seconds)                             |              (I) -2.96% |
+----------------------+----------------------------------------------------------+-------------------------+
| mmtests/kernbench    | syst-192 (seconds)                                       |             (I) -12.77% |
+----------------------+----------------------------------------------------------+-------------------------+
| pts/perl-benchmark   | Test: Interpreter (Seconds)                              |              (I) -4.86% |
+----------------------+----------------------------------------------------------+-------------------------+
| pts/pgbench          | Scale: 1 Clients: 1 Read Write (TPS)                     |               (I) 5.07% |
|                      | Scale: 1 Clients: 1 Read Write - Latency (ms)            |              (I) -4.72% |
|                      | Scale: 100 Clients: 1000 Read Write (TPS)                |               (I) 2.58% |
|                      | Scale: 100 Clients: 1000 Read Write - Latency (ms)       |              (I) -2.52% |
+----------------------+----------------------------------------------------------+-------------------------+
| pts/sqlite-speedtest | Timed Time - Size 1,000 (Seconds)                        |              (I) -2.68% |
+----------------------+----------------------------------------------------------+-------------------------+


Changes since v6 [1]
====================

- Patch 1: Minor refactor to implement walk_kernel_page_table_range() in terms
  of walk_kernel_page_table_range_lockless(). This also led to adding a *pmd
  argument to the lockless variant for consistency (per Catalin).
- Misc function/variable renames to improve clarity and consistency.
- Share the same synchronization flag between idmap_kpti_install_ng_mappings and
  wait_linear_map_split_to_ptes, which allows removal of bbml2_ptes[] to save
  ~20K from kernel image.
- Only take pgtable_split_lock and enter lazy mmu mode once for both splits.
- Only walk the pgtable once for the common "split single page" case.
- Bypass split to contpmd and contpte when splitting linear map to ptes.


Applies on v6.17-rc3.


[1] https://lore.kernel.org/linux-arm-kernel/20250805081350.3854670-1-ryan.roberts@arm.com/

Thanks,
Ryan

Dev Jain (1):
  arm64: Enable permission change on arm64 kernel block mappings

Ryan Roberts (3):
  arm64: mm: Optimize split_kernel_leaf_mapping()
  arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs
  arm64: mm: Optimize linear_map_split_to_ptes()

Yang Shi (2):
  arm64: cpufeature: add AmpereOne to BBML2 allow list
  arm64: mm: support large block mapping when rodata=full

 arch/arm64/include/asm/cpufeature.h |   2 +
 arch/arm64/include/asm/mmu.h        |   3 +
 arch/arm64/include/asm/pgtable.h    |   5 +
 arch/arm64/kernel/cpufeature.c      |  12 +-
 arch/arm64/mm/mmu.c                 | 418 +++++++++++++++++++++++++++-
 arch/arm64/mm/pageattr.c            | 157 ++++++++---
 arch/arm64/mm/proc.S                |  27 +-
 include/linux/pagewalk.h            |   3 +
 mm/pagewalk.c                       |  36 ++-
 9 files changed, 599 insertions(+), 64 deletions(-)

--
2.43.0
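
A note on reading the ptdump listings in the cover letter: the size column is simply the difference between the two addresses of a range, so the very long mode 3 entries can be sanity-checked by hand. The following is a standalone userspace C sketch, not kernel code. The helper names are made up for illustration, and the 4K page size is an assumption inferred from the 2M PMD blocks in the dump.

```c
#include <assert.h>
#include <stdint.h>

/* Size of a ptdump range [start, end) in KiB, as printed in the size column. */
static uint64_t range_kib(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return (end - start) >> 10;
}

/*
 * Number of page-sized entries needed to map [start, end) at PTE level.
 * Assumption: 4K base pages (shift of 12).
 */
static uint64_t range_pages_4k(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return (end - start) >> 12;
}
```

For the mode 3 range 0xffff000002550000-0xffff000143a61000, range_kib() gives 5264452, which matches the 5264452K shown by ptdump, and range_pages_4k() gives 1316113 individual PTEs for that one range.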
On 29/08/25 5:22 pm, Ryan Roberts wrote:
> Hi All,
>
> This is a new version following on from the v6 RFC at [1] which itself is based
> on Yang Shi's work. On systems with BBML2_NOABORT support, it causes the linear
> map to be mapped with large blocks, even when rodata=full, and leads to some
> nice performance improvements.
>
> [...]

Hi Yang and Ryan,

I observe there are various callsites which will ultimately use
update_range_prot() (from patch 1), but they do not check the return value.
I am listing the ones I could find:

set_memory_ro() in bpf_jit_comp.c
set_memory_valid() in kernel_map_pages() in pageattr.c
set_direct_map_invalid_noflush() in vm_reset_perms() in vmalloc.c
set_direct_map_default_noflush() in vm_reset_perms() in vmalloc.c, and in
secretmem.c
(the secretmem.c ones should be safe as explained in the comments therein)

The first one I think can be handled easily by returning -EFAULT.

For the second, we are already returning in case of !can_set_direct_map, which
renders DEBUG_PAGEALLOC useless. So maybe it is safe to ignore the ret from
set_memory_valid?

For the third, the call chain is a sequence of must-succeed void functions.
Notably, when using vfree(), we may have to allocate a single pagetable page
for splitting.

I am wondering whether we can just have a warn_on_once or something for the
case when we fail to allocate a pagetable page. Or, Ryan had suggested in an
off-the-list conversation that we can maintain a cache of PTE tables for every
PMD block mapping, which will give us the same memory consumption as we do
today, but I am not sure if this is worth it. x86 can already handle splitting
but due to the callchains I have described above, it has the same problem, and
the code has been working for years :)
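
To make the concern about unchecked return values concrete, here is a minimal userspace sketch contrasting a caller that ignores the result with one that converts a failure into -EFAULT, as suggested for the first callsite. Everything in it is hypothetical: fake_set_memory_ro() only imitates a permission change that may first need to allocate a page-table page to split a block mapping, and it is not the kernel's set_memory_ro().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a region of the linear map. */
struct fake_range {
	bool already_split;	/* mapping is already at page granularity */
	bool alloc_fails;	/* simulate page-table allocation failure */
	bool read_only;		/* resulting permission state */
};

/*
 * Imitates a permission change on a region that may be covered by a block
 * mapping: a split needs one page-table page and can fail with -ENOMEM.
 */
static int fake_set_memory_ro(struct fake_range *r)
{
	assert(r);
	if (!r->already_split) {
		if (r->alloc_fails)
			return -ENOMEM;
		r->already_split = true;
	}
	r->read_only = true;
	return 0;
}

/* Caller that drops the error: reports success although the region is still RW. */
static int finalize_unchecked(struct fake_range *r)
{
	(void)fake_set_memory_ro(r);
	return 0;
}

/* Caller that checks the result and reports -EFAULT on any failure. */
static int finalize_checked(struct fake_range *r)
{
	if (fake_set_memory_ro(r))
		return -EFAULT;
	return 0;
}
```

With alloc_fails set, finalize_unchecked() returns 0 while read_only stays false, which is the silent failure mode being discussed; finalize_checked() returns -EFAULT instead.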
On 01/09/2025 06:04, Dev Jain wrote:
>
> On 29/08/25 5:22 pm, Ryan Roberts wrote:
>> Hi All,
>>
>> [...]
>
> Hi Yang and Ryan,
>
> I observe there are various callsites which will ultimately use
> update_range_prot() (from patch 1),
> but they do not check the return value.
> I am listing the ones I could find:

So your concern is that prior to patch #3 in this series, any error returned by
__change_memory_common() would be due to programming error only. But patch #3
introduces the possibility of dynamic error (-ENOMEM) due to the need to
allocate pgtable memory to split a mapping?

There is a WARN_ON_ONCE(ret) for the return code of split_kernel_leaf_mapping()
which will at least make the error visible, but I agree it's not a great
solution.

>
> set_memory_ro() in bpf_jit_comp.c

There is a set_memory_rw() for the same region of memory directly above this,
which will return -EFAULT on failure. If that one succeeded, then the pgtable
must already be appropriately split for set_memory_ro() so that should never
fail in practice. I agree with improving the robustness of the code by returning
-EFAULT (or just propagate the error?) as you suggest though.

> set_memory_valid() in kernel_map_pages() in pageattr.c

This is used by CONFIG_DEBUG_PAGEALLOC to make pages in the linear map invalid
while they are not in use to catch programming errors. So failing to make a page
invalid during freeing would not technically lead to a huge issue; it just
reduces our capability of catching an errant access to that free memory.

In principle, if we were able to make the memory invalid, we should therefore be
able to make it valid again, because the mappings should be sufficiently split
already. But that doesn't actually work, because we might be allocating a
smaller order than was freed so we might not have split at free-time to the
granularity that is required at allocation-time.

But as you say, for CONFIG_DEBUG_PAGEALLOC we disable this whole path anyway, so
no issue here.

> set_direct_map_invalid_noflush() in vm_reset_perms() in vmalloc.c
> set_direct_map_default_noflush() in vm_reset_perms() in vmalloc.c, and in
> secretmem.c
> (the secretmem.c ones should be safe as explained in the comments therein)

Agreed for secretmem. vmalloc looks like a problem though...
If vmalloc was only setting the linear map back to default permissions, I guess
this wouldn't be an issue because we must have split the linear map successfully
when changing away from default permissions in the first place. But the fact
that it is unconditionally setting the linear map pages to invalid then back to
default causes issues; I guess even without the risk of -ENOMEM, this will cause
the linear map to be split to PTEs over time as vmalloc allocs and frees?

We probably need to think through how we can solve this. It's not clear to me
why vm_reset_perms wants to unconditionally transiently set to invalid?

>
> The first one I think can be handled easily by returning -EFAULT.
>
> For the second, we are already returning in case of !can_set_direct_map, which
> renders DEBUG_PAGEALLOC useless. So maybe it is
> safe to ignore the ret from set_memory_valid?
>
> For the third, the call chain is a sequence of must-succeed void functions.
> Notably, when using vfree(), we may have to allocate a single
> pagetable page for splitting.
>
> I am wondering whether we can just have a warn_on_once or something for the case
> when we fail to allocate a pagetable page. Or, Ryan had
> suggested in an off-the-list conversation that we can maintain a cache of PTE
> tables for every PMD block mapping, which will give us
> the same memory consumption as we do today, but not sure if this is worth it.
> x86 can already handle splitting but due to the callchains
> I have described above, it has the same problem, and the code has been working
> for years :)

I think it's preferable to avoid having to keep a cache of pgtable memory if we
can...

Thanks,
Ryan
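
The worry that an unconditional "invalid, then default" sequence gradually splits the linear map can be illustrated with a toy userspace model. This is a sketch under stated assumptions, not the kernel's data structures: all toy_* names are hypothetical, each PMD is either one block or 512 PTEs (4K pages assumed), and split mappings are never merged back, which is an assumption of the model rather than a claim about the series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PMDS		8	/* toy linear map: 8 PMD-sized regions */
#define TOY_PTES_PER_PMD	512	/* 4K pages per 2M block (assumed) */

enum toy_prot { TOY_DEFAULT, TOY_INVALID };

struct toy_pmd {
	bool is_table;				/* false: single block mapping */
	enum toy_prot block_prot;		/* valid while !is_table */
	enum toy_prot pte[TOY_PTES_PER_PMD];	/* valid while is_table */
};

struct toy_linear_map {
	struct toy_pmd pmd[TOY_NR_PMDS];
	int tables_left;	/* page-table pages the "allocator" can still give */
};

/* Change one page's permission, splitting the covering block if required. */
static int toy_set_page_prot(struct toy_linear_map *m, size_t page,
			     enum toy_prot prot)
{
	assert(page < (size_t)TOY_NR_PMDS * TOY_PTES_PER_PMD);
	struct toy_pmd *p = &m->pmd[page / TOY_PTES_PER_PMD];

	if (!p->is_table) {
		if (p->block_prot == prot)
			return 0;	/* already correct: no split needed */
		if (m->tables_left == 0)
			return -ENOMEM;	/* cannot allocate the PTE table */
		m->tables_left--;
		for (size_t i = 0; i < TOY_PTES_PER_PMD; i++)
			p->pte[i] = p->block_prot;
		p->is_table = true;
	}
	p->pte[page % TOY_PTES_PER_PMD] = prot;
	return 0;
}

/* Models the pattern discussed above: set invalid, then back to default. */
static int toy_reset_perms(struct toy_linear_map *m, size_t page)
{
	int ret = toy_set_page_prot(m, page, TOY_INVALID);

	if (ret)
		return ret;
	return toy_set_page_prot(m, page, TOY_DEFAULT);
}

/* Count how many PMD regions have been split down to PTEs. */
static int toy_count_split(const struct toy_linear_map *m)
{
	int n = 0;

	for (size_t i = 0; i < TOY_NR_PMDS; i++)
		n += m->pmd[i].is_table ? 1 : 0;
	return n;
}
```

In this model a page that already has default permissions still forces a split of its block, because the transient invalid state differs from the block's permission; a reset that only ever set default permissions would hit the early return and leave the block intact. Once the toy allocator runs dry, the same call fails with -ENOMEM, which mirrors the must-succeed problem raised for vfree().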
On 9/1/25 1:03 AM, Ryan Roberts wrote:
> On 01/09/2025 06:04, Dev Jain wrote:
>> On 29/08/25 5:22 pm, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> [...]
>>>
>> Hi Yang and Ryan,
>>
>> I observe there are various callsites which will ultimately use
>> update_range_prot() (from patch 1),
>> that they do not check the return value. I am listing the ones I could find:
> So your concern is that prior to patch #3 in this series, any error returned by
> __change_memory_common() would be due to programming error only. But patch #3
> introduces the possibility of dynamic error (-ENOMEM) due to the need to
> allocate pgtable memory to split a mapping?
>
> There is a WARN_ON_ONCE(ret) for the return code of split_kernel_leaf_mapping()
> which will at least make the error visible, but I agree it's not a great solution.
>
>> set_memory_ro() in bpf_jit_comp.c

Do you mean arch/arm64/net/bpf_jit_comp.c? If so I think you can just check the
return value then return -EFAULT just like what the above set_memory_rw() does.

> There is a set_memory_rw() for the same region of memory directly above this,
> which will return -EFAULT on failure. If that one succeeded, then the pgtable
> must already be appropriately split for set_memory_ro() so that should never
> fail in practice. I agree with improving the robustness of the code by returning
> -EFAULT (or just propagate the error?) as you suggest though.

Yeah, I agree. This one should be easy to resolve.

>> set_memory_valid() in kernel_map_pages() in pageattr.c
> This is used by CONFIG_DEBUG_PAGEALLOC to make pages in the linear map invalid
> while they are not in use to catch programming errors. So if making a page
> invalid during freeing fails would not technically lead to a huge issue, it just
> reduces our capability of catching an errant access to that free memory.
>
> In principle, if we were able to make the memory invalid, we should therefore be
> able to make it valid again, because the mappings should be sufficiently split
> already. But that doesn't actually work, because we might be allocating a
> smaller order than was freed so we might not have split at free-time to the
> granularity that is required at allocation-time.
>
> But as you say, for CONFIG_DEBUG_PAGEALLOC we disable this whole path anyway, so
> no issue here.

Yes, agreed.

>> set_direct_map_invalid_noflush() in vm_reset_perms() in vmalloc.c
>> set_direct_map_default_noflush() in vm_reset_perms() in vmalloc.c, and in
>> secretmem.c
>> (the secretmem.c ones should be safe as explained in the comments therein)
> Agreed for secretmem. vmalloc looks like a problem though...
>
> If vmalloc was only setting the linear map back to default permissions, I guess
> this wouldn't be an issue because we must have split the linear map successfully
> when changing away from default permissions in the first place. But the fact

Yes, agreed.

> that it is unconditionally setting the linear map pages to invalid then back to
> default causes issues; I guess even without the risk of -ENOMEM, this will cause
> the linear map to be split to PTEs over time as vmalloc allocs and frees?

It is possible. However, vm_reset_perms() is not called that often. In theory,
plenty of other operations, for example loading and unloading modules, can also
cause the linear mapping to be split over time. So this one is not that special
IMHO.

> We probably need to think through how we can solve this. It's not clear to me
> why vm_reset_perms wants to unconditionally transiently set to invalid?

It seems like vm_reset_perms() is only called when the VM_FLUSH_RESET_PERMS flag
is passed, and that flag is only passed for secretmem and hyperv. It sounds like
a preventive security measure to me.

>> The first one I think can be handled easily by returning -EFAULT.

It may not be that simple. set_direct_map_invalid_noflush() is called on a
per-page basis, and so is update_range_prot(). If the split requires allocating
multiple page table pages, some of the pages may have their permissions changed
(page table page allocation succeeded) while the rest are skipped due to page
table page allocation failure. vfree() would need to handle that case by setting
the page permissions back before returning any errno. Anyway, it sounds like a
general problem rather than an ARM-specific one.

>> For the second, we are already returning in case of !can_set_direct_map, which
>> renders DEBUG_PAGEALLOC useless. So maybe it is
>> safe to ignore the ret from set_memory_valid?
>>
>> For the third, the call chain is a sequence of must-succeed void functions.
>> Notably, when using vfree(), we may have to allocate a single
>> pagetable page for splitting.
>>
>> I am wondering whether we can just have a warn_on_once or something for the case
>> when we fail to allocate a pagetable page. Or, Ryan had
>> suggested in an off-the-list conversation that we can maintain a cache of PTE
>> tables for every PMD block mapping, which will give us
>> the same memory consumption as we do today, but not sure if this is worth it.
>> x86 can already handle splitting but due to the callchains
>> I have described above, it has the same problem, and the code has been working
>> for years :)
> I think it's preferable to avoid having to keep a cache of pgtable memory if we
> can...

Yes, I agree. We simply don't know how many pages we need to cache, and it still
can't guarantee 100% allocation success.

Thanks,
Yang

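The partial-failure concern above (some pages already changed when a later page
table allocation fails) can be illustrated with a small user-space model. This
is only a sketch: struct toy_page, toy_set_valid() and toy_invalidate_range()
are hypothetical names, not kernel APIs, and the allocation failure is injected
through the fail_at parameter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a linear-map page: only tracks whether it is valid. */
struct toy_page {
	bool valid;
};

/*
 * Hypothetical per-page update. fail_at simulates a page table allocation
 * failure (-ENOMEM) for page index fail_at; pass a negative value for "never".
 */
static int toy_set_valid(struct toy_page *pages, size_t idx, bool valid,
			 long fail_at)
{
	if (fail_at >= 0 && (size_t)fail_at == idx)
		return -ENOMEM;
	pages[idx].valid = valid;
	return 0;
}

/*
 * Invalidate nr pages one at a time. On failure, restore the pages that were
 * already changed so the caller never sees a half-updated range.
 */
static int toy_invalidate_range(struct toy_page *pages, size_t nr, long fail_at)
{
	size_t i;
	int ret;

	for (i = 0; i < nr; i++) {
		ret = toy_set_valid(pages, i, false, fail_at);
		if (ret)
			goto rollback;
	}
	return 0;

rollback:
	/* Pages [0, i) were already updated, so undoing them cannot fail here. */
	while (i-- > 0)
		toy_set_valid(pages, i, true, -1);
	return ret;
}
```

With four valid pages and a failure injected at index 2, the call returns
-ENOMEM and all four pages are still valid afterwards. Whether vfree() could
actually report such an error to its callers is a separate question that the
model does not address.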
>>> I am wondering whether we can just have a warn_on_once or something
>>> for the case
>>> when we fail to allocate a pagetable page. Or, Ryan had
>>> suggested in an off-the-list conversation that we can maintain a
>>> cache of PTE
>>> tables for every PMD block mapping, which will give us
>>> the same memory consumption as we do today, but not sure if this is
>>> worth it.
>>> x86 can already handle splitting but due to the callchains
>>> I have described above, it has the same problem, and the code has
>>> been working
>>> for years :)
>> I think it's preferable to avoid having to keep a cache of pgtable
>> memory if we
>> can...
>
> Yes, I agree. We simply don't know how many pages we need to cache,
> and it still can't guarantee 100% allocation success.

This is wrong... We can know how many pages will be needed for splitting the
linear mapping to PTEs in the worst case once the linear mapping is finalized.
But it may require a few hundred megabytes of memory to guarantee allocation
success. I don't think it is worth it for such a rare corner case.

Thanks,
Yang

On 03/09/2025 01:50, Yang Shi wrote:
>>>> I am wondering whether we can just have a warn_on_once or something for the
>>>> case
>>>> when we fail to allocate a pagetable page. Or, Ryan had
>>>> suggested in an off-the-list conversation that we can maintain a cache of PTE
>>>> tables for every PMD block mapping, which will give us
>>>> the same memory consumption as we do today, but not sure if this is worth it.
>>>> x86 can already handle splitting but due to the callchains
>>>> I have described above, it has the same problem, and the code has been working
>>>> for years :)
>>> I think it's preferable to avoid having to keep a cache of pgtable memory if we
>>> can...
>>
>> Yes, I agree. We simply don't know how many pages we need to cache, and it
>> still can't guarantee 100% allocation success.
>
> This is wrong... We can know how many pages will be needed for splitting linear
> mapping to PTEs for the worst case once linear mapping is finalized. But it may
> require a few hundred megabytes memory to guarantee allocation success. I don't
> think it is worth for such rare corner case.

Indeed, we know exactly how much memory we need for pgtables to map the linear
map by pte - that's exactly what we are doing today. So we _could_ keep a cache.
We would still get the benefit of improved performance but we would lose the
benefit of reduced memory.

I think we need to solve the vm_reset_perms() problem somehow, before we can
enable this.

Thanks,
Ryan

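As a rough check on the "few hundred megabytes" figure, the worst case can be
estimated with simple arithmetic. The sketch below is not kernel code. It
assumes a 4K granule with 8-byte descriptors, so one page table page holds 512
entries, and it assumes the whole of RAM starts out mapped by PUD-level blocks,
which makes the result an upper bound. The toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE		4096ULL			/* assumed 4K granule */
#define TOY_PTRS_PER_TABLE	(TOY_PAGE_SIZE / 8)	/* 512 eight-byte entries */

/* Number of tables needed to cover `bytes` when each table covers `span`. */
static uint64_t toy_tables_needed(uint64_t bytes, uint64_t span)
{
	assert(span != 0);
	return (bytes + span - 1) / span;
}

/*
 * Upper bound on the extra page table memory (in bytes) needed to take a
 * linear map that is fully block-mapped at PUD level down to PTEs.
 */
static uint64_t toy_worst_case_split_bytes(uint64_t ram_bytes)
{
	uint64_t pte_span = TOY_PAGE_SIZE * TOY_PTRS_PER_TABLE;	/* 2 MiB per PTE table */
	uint64_t pmd_span = pte_span * TOY_PTRS_PER_TABLE;	/* 1 GiB per PMD table */
	uint64_t tables;

	tables = toy_tables_needed(ram_bytes, pte_span) +
		 toy_tables_needed(ram_bytes, pmd_span);
	return tables * TOY_PAGE_SIZE;
}
```

Under these assumptions, the 12G test VM from the cover letter would need about
24 MiB (6144 PTE tables plus 12 PMD tables), and a 128 GiB machine about
256 MiB, i.e. roughly 1/512 of RAM.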
On 04/09/2025 14:14, Ryan Roberts wrote:
> On 03/09/2025 01:50, Yang Shi wrote:
>> [...]
>> This is wrong... We can know how many pages will be needed for splitting linear
>> mapping to PTEs for the worst case once linear mapping is finalized. But it may
>> require a few hundred megabytes memory to guarantee allocation success. I don't
>> think it is worth for such rare corner case.
>
> Indeed, we know exactly how much memory we need for pgtables to map the linear
> map by pte - that's exactly what we are doing today. So we _could_ keep a cache.
> We would still get the benefit of improved performance but we would lose the
> benefit of reduced memory.
>
> I think we need to solve the vm_reset_perms() problem somehow, before we can
> enable this.

Sorry I realise this was not very clear... I am saying I think we need to fix it
somehow. A cache would likely work. But I'd prefer to avoid it if we can find a
better solution.

On 9/4/25 6:16 AM, Ryan Roberts wrote:
> On 04/09/2025 14:14, Ryan Roberts wrote:
>> [...]
>> I think we need to solve the vm_reset_perms() problem somehow, before we can
>> enable this.
> Sorry I realise this was not very clear... I am saying I think we need to fix it
> somehow. A cache would likely work. But I'd prefer to avoid it if we can find a
> better solution.

Took a deeper look at vm_reset_perms(). It was introduced by commit 868b104d7379
("mm/vmalloc: Add flag for freeing of special permsissions").
The VM_FLUSH_RESET_PERMS flag is supposed to be set if the vmalloc memory is RO
and/or ROX. So set_memory_ro() or set_memory_rox() is supposed to follow
vmalloc(), and the page table should already be split before reaching vfree().
I think this is why vm_reset_perms() doesn't check the return value.

I scrutinized all the callsites with the VM_FLUSH_RESET_PERMS flag set. Most of
them are followed by set_memory_ro() or set_memory_rox(). But there are 3 places
where I don't see set_memory_ro()/set_memory_rox() being called:

1. BPF trampoline allocation. The BPF trampoline calls
   arch_protect_bpf_trampoline(). The generic implementation does call
   set_memory_rox(), but the x86 and arm64 implementations simply return 0.
   For x86, that is because the execmem cache is used and it does call
   set_memory_rox(). ARM64 doesn't need to split the page table before this
   series, so it should never fail. I think we just need to use the generic
   implementation (remove the arm64 implementation) if this series is merged.

2. BPF dispatcher. It calls execmem_alloc(), which has VM_FLUSH_RESET_PERMS
   set. But it is used for an rw allocation, so VM_FLUSH_RESET_PERMS should be
   unnecessary IIUC. So it doesn't matter even if vm_reset_perms() fails.

3. kprobe. S390's alloc_insn_page() does call set_memory_rox(); x86 also
   called set_memory_rox() before switching to the execmem cache, and the
   execmem cache calls set_memory_rox(). I don't know why ARM64 doesn't call it.

So I think we just need to fix #1 and #3 per the above analysis. If this
analysis looks correct to you guys, I will prepare two patches to fix them.

Thanks,
Yang

> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86
> also called set_memory_rox() before switching to execmem cache. The
> execmem cache calls set_memory_rox(). I don't know why ARM64 doesn't
> call it.

While trying to find the proper Fixes tag for this, I figured out why ARM64
doesn't call it. ARM64 actually called set_memory_ro() before commit
10d5e97c1bf8 ("arm64: use PAGE_KERNEL_ROX directly in alloc_insn_page"), but
that commit removed it. It seems the author and reviewers overlooked that
set_memory_ro() also changes the direct map permission. So I believe adding
set_memory_rox() is the right fix.

Thanks,
Yang

On 9/4/25 10:47 AM, Yang Shi wrote:
> [...]
> So I think we just need to fix #1 and #3 per the above analysis. If
> this analysis look correct to you guys, I will prepare two patches to
> fix them.

Tested the below patch with bpftrace kfunc (allocate bpf trampoline) and
kprobes. It seems to work well.

diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
index 0c5d408afd95..c4f8c4750f1e 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -10,6 +10,7 @@
 
 #define pr_fmt(fmt) "kprobes: " fmt
 
+#include <linux/execmem.h>
 #include <linux/extable.h>
 #include <linux/kasan.h>
 #include <linux/kernel.h>
@@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
 static void __kprobes
 post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, struct pt_regs *);
 
+void *alloc_insn_page(void)
+{
+	void *page;
+
+	page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
+	if (!page)
+		return NULL;
+	set_memory_rox((unsigned long)page, 1);
+	return page;
+}
+
 static void __kprobes arch_prepare_ss_slot(struct kprobe *p)
 {
 	kprobe_opcode_t *addr = p->ainsn.xol_insn;
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 52ffe115a8c4..3e301bc2cd66 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void *image, unsigned int size)
 	bpf_prog_pack_free(image, size);
 }
 
-int arch_protect_bpf_trampoline(void *image, unsigned int size)
-{
-	return 0;
-}
-
 int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image,
 				void *ro_image_end, const struct btf_func_model *m,
 				u32 flags, struct bpf_tramp_links *tlinks,

On 04/09/2025 22:49, Yang Shi wrote:
> On 9/4/25 10:47 AM, Yang Shi wrote:
>> On 9/4/25 6:16 AM, Ryan Roberts wrote:
>>> [...]
>>> Sorry I realise this was not very clear... I am saying I think we need to fix it
>>> somehow. A cache would likely work. But I'd prefer to avoid it if we can find a
>>> better solution.
>>
>> Took a deeper look at vm_reset_perms(). It was introduced by commit
>> 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions"). The
>> VM_FLUSH_RESET_PERMS flag is supposed to be set if the vmalloc memory is RO
>> and/or ROX. So set_memory_ro() or set_memory_rox() is supposed to follow up
>> vmalloc(). So the page table should be already split before reaching vfree().
>> I think this is why vm_reset_perms() doesn't check return value.

If vm_reset_perms() is assuming it can't/won't fail, I think it should at least
output a warning if it does?

>> I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set.

Just checking; I think you made a comment before about there only being a few
sites that set VM_FLUSH_RESET_PERMS. But one of them is the helper,
set_vm_flush_reset_perms(). So just making sure you also followed to the places
that use that helper?

>> The most
>> of them has set_memory_ro() or set_memory_rox() followed.

And are all callsites calling set_memory_*() for the entire cell that was
allocated by vmalloc? If there are cases where it only calls that for a portion
of it, then it's not guaranteed that the memory is correctly split.

>> But there are 3
>> places I don't see set_memory_ro()/set_memory_rox() is called.
>>
>> 1. BPF trampoline allocation. The BPF trampoline calls
>> arch_protect_bpf_trampoline(). The generic implementation does call
>> set_memory_rox(). But the x86 and arm64 implementation just simply return 0.
>> For x86, it is because execmem cache is used and it does call
>> set_memory_rox(). ARM64 doesn't need to split page table before this series,
>> so it should never fail. I think we just need to use the generic
>> implementation (remove arm64 implementation) if this series is merged.

I know zero about BPF. But it looks like the allocation happens in
arch_alloc_bpf_trampoline(), which for arm64, calls bpf_prog_pack_alloc(). And
for small sizes, it grabs some memory from a "pack".
So doesn't this mean that you are calling set_memory_rox() for a sub-region of the cell, so that doesn't actually help at vm_reset_perms()-time? >> >> 2. BPF dispatcher. It calls execmem_alloc which has VM_FLUSH_RESET_PERMS set. >> But it is used for rw allocation, so VM_FLUSH_RESET_PERMS should be >> unnecessary IIUC. So it doesn't matter even though vm_reset_perms() fails. >> >> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 also >> called set_memory_rox() before switching to execmem cache. The execmem cache >> calls set_memory_rox(). I don't know why ARM64 doesn't call it. >> >> So I think we just need to fix #1 and #3 per the above analysis. If this >> analysis look correct to you guys, I will prepare two patches to fix them. This all seems quite fragile. I find it interesting that vm_reset_perms() is doing break-before-make; it sets the PTEs as invalid, then flushes the TLB, then sets them to default. But for arm64, at least, I think break-before-make is not required. We are only changing the permissions so that can be done on live mappings; essentially change the sequence to; set default, flush TLB. If we do that, then if the memory was already default, then there is no need to do anything (so no chance of allocation failure). If the memory was not default, then it must have already been split to make it non-default, in which case we can also gurrantee that no allocations are required. What am I missing? Thanks, Ryan > > Tested the below patch with bpftrace kfunc (allocate bpf trampoline) and > kprobes. It seems work well. 
> > diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/ > kprobes.c > index 0c5d408afd95..c4f8c4750f1e 100644 > --- a/arch/arm64/kernel/probes/kprobes.c > +++ b/arch/arm64/kernel/probes/kprobes.c > @@ -10,6 +10,7 @@ > > #define pr_fmt(fmt) "kprobes: " fmt > > +#include <linux/execmem.h> > #include <linux/extable.h> > #include <linux/kasan.h> > #include <linux/kernel.h> > @@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk); > static void __kprobes > post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, struct pt_regs *); > > +void *alloc_insn_page(void) > +{ > + void *page; > + > + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); > + if (!page) > + return NULL; > + set_memory_rox((unsigned long)page, 1); > + return page; > +} > + > static void __kprobes arch_prepare_ss_slot(struct kprobe *p) > { > kprobe_opcode_t *addr = p->ainsn.xol_insn; > diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c > index 52ffe115a8c4..3e301bc2cd66 100644 > --- a/arch/arm64/net/bpf_jit_comp.c > +++ b/arch/arm64/net/bpf_jit_comp.c > @@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void *image, unsigned int > size) > bpf_prog_pack_free(image, size); > } > > -int arch_protect_bpf_trampoline(void *image, unsigned int size) > -{ > - return 0; > -} > - > int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image, > void *ro_image_end, const struct btf_func_model *m, > u32 flags, struct bpf_tramp_links *tlinks, > > >> >> Thanks, >> Yang >> >>> >>> >>>> Thanks, >>>> Ryan >>>> >>>>> Thanks, >>>>> Yang >>>>> >>>>>> Thanks, >>>>>> Yang >>>>>> >>>>>>> Thanks, >>>>>>> Ryan >>>>>>> >>>>>>> >> >
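For reference, the worst-case cost of the pgtable cache option discussed above can be estimated with simple arithmetic. The sketch below is only an illustration: it assumes a 4K granule with 512 entries per table (so one PTE table covers 2 MiB and one PMD table covers 1 GiB), ignores holes and alignment in the physical memory map, and uses hypothetical function names that do not come from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions (not from the thread): 4K granule, 512 entries per table. */
#define PAGE_SIZE_BYTES 4096ULL
#define PTRS_PER_TABLE  512ULL
#define PMD_SPAN_BYTES  (PAGE_SIZE_BYTES * PTRS_PER_TABLE) /* 2 MiB per PTE table */
#define PUD_SPAN_BYTES  (PMD_SPAN_BYTES * PTRS_PER_TABLE)  /* 1 GiB per PMD table */

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/* PTE tables needed to map `bytes` of linear map entirely with 4K pages. */
uint64_t pte_tables_needed(uint64_t bytes)
{
	return div_round_up(bytes, PMD_SPAN_BYTES);
}

/* PMD tables needed if every 1 GiB PUD block also has to be split. */
uint64_t pmd_tables_needed(uint64_t bytes)
{
	return div_round_up(bytes, PUD_SPAN_BYTES);
}

/* Total memory a "guaranteed split" cache would have to hold, in bytes. */
uint64_t split_cache_bytes(uint64_t bytes)
{
	return (pte_tables_needed(bytes) + pmd_tables_needed(bytes)) *
	       PAGE_SIZE_BYTES;
}
```

Under these assumptions, a 12 GiB linear map needs 6144 PTE tables (about 24 MiB), and a 128 GiB one needs 65536 PTE tables (about 256 MiB), which is in line with the "few hundred megabytes" estimate quoted above.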
On 9/8/25 9:34 AM, Ryan Roberts wrote:
> On 04/09/2025 22:49, Yang Shi wrote:
>> On 9/4/25 10:47 AM, Yang Shi wrote:
[...]
>>> Took a deeper look at vm_reset_perms(). It was introduced by commit
>>> 868b104d7379 ("mm/vmalloc: Add flag for freeing of special
>>> permsissions"). The VM_FLUSH_RESET_PERMS flag is supposed to be set
>>> if the vmalloc memory is RO and/or ROX. So set_memory_ro() or
>>> set_memory_rox() is supposed to follow up vmalloc(). So the page
>>> table should be already split before reaching vfree(). I think this
>>> is why vm_reset_perms() doesn't check the return value.
> If vm_reset_perms() is assuming it can't/won't fail, I think it should
> at least output a warning if it does?

It should. In any case, a warning will be raised if the split fails, so we
do have some mitigation.

>>> I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set.
> Just checking; I think you made a comment before about there only being
> a few sites that set VM_FLUSH_RESET_PERMS. But one of them is the
> helper, set_vm_flush_reset_perms(). So just making sure you also
> followed to the places that use that helper?

Yes, I did.

>>> Most of them are followed by set_memory_ro() or set_memory_rox().
> And are all callsites calling set_memory_*() for the entire cell that
> was allocated by vmalloc? If there are cases where it only calls that
> for a portion of it, then it's not guaranteed that the memory is
> correctly split.

Yes, all callsites call set_memory_*() for the entire range.

>>> 1. BPF trampoline allocation. The BPF trampoline calls
>>> arch_protect_bpf_trampoline(). The generic implementation does call
>>> set_memory_rox(). But the x86 and arm64 implementations simply
>>> return 0.
[...]
> I know zero about BPF. But it looks like the allocation happens in
> arch_alloc_bpf_trampoline(), which for arm64, calls
> bpf_prog_pack_alloc(). And for small sizes, it grabs some memory from a
> "pack". So doesn't this mean that you are calling set_memory_rox() for
> a sub-region of the cell, so that doesn't actually help at
> vm_reset_perms()-time?

Took a deeper look at the bpf pack allocator. The "pack" is allocated by
alloc_new_pack(), which does:
    bpf_jit_alloc_exec()
    set_vm_flush_reset_perms()
    set_memory_rox()

If the size is greater than the pack size, it calls:
    bpf_jit_alloc_exec()
    set_vm_flush_reset_perms()
    set_memory_rox()

So it looks like the bpf trampoline is fine and we don't need to do
anything. It should be removed from the list. I didn't look deeply enough
at the bpf pack allocator in the first place.

[...]
>>> So I think we just need to fix #1 and #3 per the above analysis. If
>>> this analysis looks correct to you guys, I will prepare two patches
>>> to fix them.
> This all seems quite fragile. I find it interesting that
> vm_reset_perms() is doing break-before-make; it sets the PTEs as
> invalid, then flushes the TLB, then sets them to default. But for
> arm64, at least, I think break-before-make is not required. We are only
> changing the permissions so that can be done on live mappings;
> essentially change the sequence to: set default, flush TLB.

Yeah, I agree it is a little bit fragile. I think this is the "contract"
for vmalloc users: if you allocate ROX memory via vmalloc, you are required
to call set_memory_*(). But there is nothing to enforce that the "contract"
is followed. I don't think this is the only such case in the kernel,
though.

> If we do that, then if the memory was already default, then there is no
> need to do anything (so no chance of allocation failure). If the memory
> was not default, then it must have already been split to make it
> non-default, in which case we can also guarantee that no allocations
> are required.
>
> What am I missing?

The comment says:
    Set direct map to something invalid so that it won't be cached if
    there are any accesses after the TLB flush, then flush the TLB and
    reset the direct map permissions to the default.

IIUC, it guarantees that the direct map can't be cached in the TLB after
the TLB flush from _vm_unmap_aliases(), by setting the entries invalid,
because the TLB never caches invalid entries. Skipping the step that sets
the direct map invalid seems to break this. Or can "changing permissions on
live mappings" on ARM64 achieve the same goal?

Thanks,
Yang
On 08/09/2025 19:31, Yang Shi wrote:
> On 9/8/25 9:34 AM, Ryan Roberts wrote:
[...]
>> This all seems quite fragile. I find it interesting that
>> vm_reset_perms() is doing break-before-make; it sets the PTEs as
>> invalid, then flushes the TLB, then sets them to default. But for
>> arm64, at least, I think break-before-make is not required. We are
>> only changing the permissions so that can be done on live mappings;
>> essentially change the sequence to: set default, flush TLB.
[...]
>> What am I missing?
>
> The comment says:
>     Set direct map to something invalid so that it won't be cached if
>     there are any accesses after the TLB flush, then flush the TLB and
>     reset the direct map permissions to the default.
>
> IIUC, it guarantees that the direct map can't be cached in the TLB
> after the TLB flush from _vm_unmap_aliases(), by setting the entries
> invalid, because the TLB never caches invalid entries. Skipping the
> step that sets the direct map invalid seems to break this. Or can
> "changing permissions on live mappings" on ARM64 achieve the same goal?

Here's my understanding of the intent of the code:

Let's say we start with some memory that has been mapped RO. Our goal is to
reset the memory back to RW and ensure that no TLB entry remains for the
old RO mapping. There are 2 ways to do that:

Approach 1 (used in current code):
1. set the PTE to invalid
2. invalidate any TLB entry for the VA
3. set the PTE to RW

Approach 2:
1. set the PTE to RW
2. invalidate any TLB entry for the VA

The benefit of approach 1 is that it is guaranteed to be impossible for
different CPUs to have different translations for the same VA in their
respective TLBs. With approach 2, it's possible that between steps 1 and 2,
one CPU has a RO entry and another CPU has a RW entry. But that will get
fixed once the TLB is flushed - it's not really an issue.

(There is probably also an obscure way to end up with 2 TLB entries (one
with RO and one with RW) for the same CPU, but the arm64 architecture
permits that as long as it's only a permission mismatch).

Anyway, approach 2 is used when changing memory permissions on user
mappings, so I don't see why we can't take the same approach here. That
would solve this whole class of issue for us.

Thanks,
Ryan
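The difference between the two approaches can be made concrete with a toy model. The sketch below is a user-space simulation only, not kernel code: it assumes a single VA, two CPUs, one TLB slot per CPU, and that an invalid PTE is never cached. All names are hypothetical. After every step, CPU 1 touches the VA so that any transient disagreement between the two TLBs is observed.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2

/* PERM_NONE stands for an invalid PTE or an empty TLB slot. */
enum perm { PERM_NONE, PERM_RO, PERM_RW };

struct model {
	enum perm pte;          /* page-table entry for the single VA */
	enum perm tlb[NR_CPUS]; /* each CPU's cached translation */
};

struct outcome {
	bool transient_mismatch; /* did two CPUs ever cache different perms? */
	bool stale_ro_at_end;    /* is the old RO entry still cached anywhere? */
};

/* On a TLB miss the CPU walks the table; an invalid PTE leaves the slot empty. */
static void touch(struct model *m, int cpu)
{
	if (m->tlb[cpu] == PERM_NONE)
		m->tlb[cpu] = m->pte;
}

static void flush_all(struct model *m)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		m->tlb[cpu] = PERM_NONE;
}

static bool mismatch(const struct model *m)
{
	return m->tlb[0] != PERM_NONE && m->tlb[1] != PERM_NONE &&
	       m->tlb[0] != m->tlb[1];
}

/* CPU 1 accesses the VA after each step; report whether the TLBs disagree. */
static bool step(struct model *m)
{
	touch(m, 1);
	return mismatch(m);
}

static struct outcome finish(const struct model *m, bool seen)
{
	struct outcome o = {
		.transient_mismatch = seen,
		.stale_ro_at_end = m->tlb[0] == PERM_RO || m->tlb[1] == PERM_RO,
	};
	return o;
}

/* Approach 1: invalidate the PTE, flush, then install RW. */
struct outcome approach1(void)
{
	struct model m = { .pte = PERM_RO, .tlb = { PERM_RO, PERM_NONE } };
	bool seen = false;

	m.pte = PERM_NONE; seen |= step(&m);
	flush_all(&m);     seen |= step(&m);
	m.pte = PERM_RW;   seen |= step(&m);
	return finish(&m, seen);
}

/* Approach 2: install RW on the live mapping, then flush. */
struct outcome approach2(void)
{
	struct model m = { .pte = PERM_RO, .tlb = { PERM_RO, PERM_NONE } };
	bool seen = false;

	m.pte = PERM_RW;   seen |= step(&m);
	flush_all(&m);     seen |= step(&m);
	return finish(&m, seen);
}
```

In this model both approaches end with no stale RO entry; only approach 2 passes through a state where the two CPUs disagree on the permission, which is the transient case described above as acceptable on arm64.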
On 9/9/25 7:36 AM, Ryan Roberts wrote:
> On 08/09/2025 19:31, Yang Shi wrote:
[...]
>> IIUC, it guarantees that the direct map can't be cached in the TLB
>> after the TLB flush from _vm_unmap_aliases(), by setting the entries
>> invalid, because the TLB never caches invalid entries. Skipping the
>> step that sets the direct map invalid seems to break this. Or can
>> "changing permissions on live mappings" on ARM64 achieve the same
>> goal?
> Here's my understanding of the intent of the code:
>
> Let's say we start with some memory that has been mapped RO. Our goal
> is to reset the memory back to RW and ensure that no TLB entry remains
> for the old RO mapping. There are 2 ways to do that:
>
> Approach 1 (used in current code):
> 1. set the PTE to invalid
> 2. invalidate any TLB entry for the VA
> 3. set the PTE to RW
>
> Approach 2:
> 1. set the PTE to RW
> 2. invalidate any TLB entry for the VA

IIUC, the intent of the code is "reset direct map permission *without*
leaving a RW+X window". The TLB flush call actually flushes both the VA and
the direct map together. So if that is the intent, approach #2 may leave
the VA with X permission while the direct map is RW at the same time, which
seems to break that intent.

Thanks,
Yang

> The benefit of approach 1 is that it is guaranteed to be impossible for
> different CPUs to have different translations for the same VA in their
> respective TLBs.
[...]
> Anyway, approach 2 is used when changing memory permissions on user
> mappings, so I don't see why we can't take the same approach here. That
> would solve this whole class of issue for us.
>
> Thanks,
> Ryan
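The RW+X concern raised here can also be expressed as a toy model. The sketch below is a user-space illustration under loudly stated assumptions, not a description of real hardware or of mm/vmalloc.c: it assumes one page with two translations (an executable vmalloc alias and a direct-map entry that starts RO), that a TLB entry for the executable alias may still be cached until the combined flush, and that a single flush covers both translations. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum dm_perm { DM_INVALID, DM_RO, DM_RW };

struct page_state {
	bool alias_x_cached; /* executable vmalloc alias may still sit in a TLB */
	enum dm_perm direct; /* direct-map PTE for the same physical page */
};

/* The window exists when the page is executable via the alias and writable via the direct map. */
static bool wx_window(const struct page_state *s)
{
	return s->alias_x_cached && s->direct == DM_RW;
}

/* Assumption: one flush drops cached translations for both mappings. */
static void flush_both(struct page_state *s)
{
	s->alias_x_cached = false;
}

/* Current ordering: direct map invalid, flush, then direct map RW. */
bool invalid_first_has_wx_window(void)
{
	struct page_state s = { .alias_x_cached = true, .direct = DM_RO };
	bool seen = false;

	s.direct = DM_INVALID; seen |= wx_window(&s);
	flush_both(&s);        seen |= wx_window(&s);
	s.direct = DM_RW;      seen |= wx_window(&s);
	return seen;
}

/* Proposed ordering: direct map RW first, then flush. */
bool rw_first_has_wx_window(void)
{
	struct page_state s = { .alias_x_cached = true, .direct = DM_RO };
	bool seen = false;

	s.direct = DM_RW; seen |= wx_window(&s);
	flush_both(&s);   seen |= wx_window(&s);
	return seen;
}
```

In this model only the RW-first ordering passes through a state where the page is both executable and writable. Whether such a stale executable entry can really exist at that point, and whether it matters, is the question the thread is debating; the model does not settle it.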
On 09/09/2025 16:32, Yang Shi wrote: > > > On 9/9/25 7:36 AM, Ryan Roberts wrote: >> On 08/09/2025 19:31, Yang Shi wrote: >>> >>> On 9/8/25 9:34 AM, Ryan Roberts wrote: >>>> On 04/09/2025 22:49, Yang Shi wrote: >>>>> On 9/4/25 10:47 AM, Yang Shi wrote: >>>>>> On 9/4/25 6:16 AM, Ryan Roberts wrote: >>>>>>> On 04/09/2025 14:14, Ryan Roberts wrote: >>>>>>>> On 03/09/2025 01:50, Yang Shi wrote: >>>>>>>>>>>> I am wondering whether we can just have a warn_on_once or something >>>>>>>>>>>> for the >>>>>>>>>>>> case >>>>>>>>>>>> when we fail to allocate a pagetable page. Or, Ryan had >>>>>>>>>>>> suggested in an off-the-list conversation that we can maintain a cache >>>>>>>>>>>> of PTE >>>>>>>>>>>> tables for every PMD block mapping, which will give us >>>>>>>>>>>> the same memory consumption as we do today, but not sure if this is >>>>>>>>>>>> worth it. >>>>>>>>>>>> x86 can already handle splitting but due to the callchains >>>>>>>>>>>> I have described above, it has the same problem, and the code has been >>>>>>>>>>>> working >>>>>>>>>>>> for years :) >>>>>>>>>>> I think it's preferable to avoid having to keep a cache of pgtable >>>>>>>>>>> memory >>>>>>>>>>> if we >>>>>>>>>>> can... >>>>>>>>>> Yes, I agree. We simply don't know how many pages we need to cache, >>>>>>>>>> and it >>>>>>>>>> still can't guarantee 100% allocation success. >>>>>>>>> This is wrong... We can know how many pages will be needed for splitting >>>>>>>>> linear >>>>>>>>> mapping to PTEs for the worst case once linear mapping is finalized. >>>>>>>>> But it >>>>>>>>> may >>>>>>>>> require a few hundred megabytes memory to guarantee allocation success. I >>>>>>>>> don't >>>>>>>>> think it is worth for such rare corner case. >>>>>>>> Indeed, we know exactly how much memory we need for pgtables to map the >>>>>>>> linear >>>>>>>> map by pte - that's exactly what we are doing today. So we _could_ keep a >>>>>>>> cache. 
>>>>>>>> We would still get the benefit of improved performance but we would lose >>>>>>>> the >>>>>>>> benefit of reduced memory. >>>>>>>> >>>>>>>> I think we need to solve the vm_reset_perms() problem somehow, before we >>>>>>>> can >>>>>>>> enable this. >>>>>>> Sorry I realise this was not very clear... I am saying I think we need to >>>>>>> fix it >>>>>>> somehow. A cache would likely work. But I'd prefer to avoid it if we can >>>>>>> find a >>>>>>> better solution. >>>>>> Took a deeper look at vm_reset_perms(). It was introduced by commit >>>>>> 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions"). >>>>>> The >>>>>> VM_FLUSH_RESET_PERMS flag is supposed to be set if the vmalloc memory is RO >>>>>> and/or ROX. So set_memory_ro() or set_memory_rox() is supposed to follow up >>>>>> vmalloc(). So the page table should be already split before reaching vfree(). >>>>>> I think this why vm_reset_perms() doesn't not check return value. >>>> If vm_reset_perms() is assuming it can't/won't fail, I think it should at least >>>> output a warning if it does? >>> It should. Anyway warning will be raised if split fails. We have somehow >>> mitigation. >>> >>>>>> I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set. >>>> Just checking; I think you made a comment before about there only being a few >>>> sites that set VM_FLUSH_RESET_PERMS. But one of them is the helper, >>>> set_vm_flush_reset_perms(). So just making sure you also followed to the places >>>> that use that helper? >>> Yes, I did. >>> >>>>>> The most >>>>>> of them has set_memory_ro() or set_memory_rox() followed. >>>> And are all callsites calling set_memory_*() for the entire cell that was >>>> allocated by vmalloc? If there are cases where it only calls that for a portion >>>> of it, then it's not gurranteed that the memory is correctly split. >>> Yes, all callsites call set_memory_*() for the entire range. 
>>> >>>>>> But there are 3 >>>>>> places I don't see set_memory_ro()/set_memory_rox() is called. >>>>>> >>>>>> 1. BPF trampoline allocation. The BPF trampoline calls >>>>>> arch_protect_bpf_trampoline(). The generic implementation does call >>>>>> set_memory_rox(). But the x86 and arm64 implementation just simply return 0. >>>>>> For x86, it is because execmem cache is used and it does call >>>>>> set_memory_rox(). ARM64 doesn't need to split page table before this series, >>>>>> so it should never fail. I think we just need to use the generic >>>>>> implementation (remove arm64 implementation) if this series is merged. >>>> I know zero about BPF. But it looks like the allocation happens in >>>> arch_alloc_bpf_trampoline(), which for arm64, calls bpf_prog_pack_alloc(). And >>>> for small sizes, it grabs some memory from a "pack". So doesn't this mean that >>>> you are calling set_memory_rox() for a sub-region of the cell, so that doesn't >>>> actually help at vm_reset_perms()-time? >>> Took a deeper look at bpf pack allocator. The "pack" is allocated by >>> alloc_new_pack(), which does: >>> bpf_jit_alloc_exec() >>> set_vm_flush_reset_perms() >>> set_memory_rox() >>> >>> If the size is greater than the pack size, it calls: >>> bpf_jit_alloc_exec() >>> set_vm_flush_reset_perms() >>> set_memory_rox() >>> >>> So it looks like bpf trampoline is good, and we don't need do anything. It >>> should be removed from the list. I didn't look deep enough for bpf pack >>> allocator in the first place. >>> >>>>>> 2. BPF dispatcher. It calls execmem_alloc which has VM_FLUSH_RESET_PERMS set. >>>>>> But it is used for rw allocation, so VM_FLUSH_RESET_PERMS should be >>>>>> unnecessary IIUC. So it doesn't matter even though vm_reset_perms() fails. >>>>>> >>>>>> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 also >>>>>> called set_memory_rox() before switching to execmem cache. The execmem cache >>>>>> calls set_memory_rox(). I don't know why ARM64 doesn't call it. 
>>>>>> >>>>>> So I think we just need to fix #1 and #3 per the above analysis. If this >>>>>> analysis look correct to you guys, I will prepare two patches to fix them. >>>> This all seems quite fragile. I find it interesting that vm_reset_perms() is >>>> doing break-before-make; it sets the PTEs as invalid, then flushes the TLB, >>>> then >>>> sets them to default. But for arm64, at least, I think break-before-make is not >>>> required. We are only changing the permissions so that can be done on live >>>> mappings; essentially change the sequence to; set default, flush TLB. >>> Yeah, I agree it is a little bit fragile. I think this is the "contract" for >>> vmalloc users. You allocate ROX memory via vmalloc, you are required to call >>> set_memory_*(). But there is nothing to guarantee the "contract" is followed. >>> But I don't think this is the only case in kernel. >>> >>>> If we do that, then if the memory was already default, then there is no need to >>>> do anything (so no chance of allocation failure). If the memory was not >>>> default, >>>> then it must have already been split to make it non-default, in which case we >>>> can also gurrantee that no allocations are required. >>>> >>>> What am I missing? >>> The comment says: >>> Set direct map to something invalid so that it won't be cached if there are any >>> accesses after the TLB flush, then flush the TLB and reset the direct map >>> permissions to the default. >>> >>> IIUC, it guarantees the direct map can't be cached in TLB after TLB flush from >>> _vm_unmap_aliases() by setting them invalid because TLB never cache invalid >>> entries. Skipping set direct map to invalid seems break this. Or "changing >>> permission on live mappings" on ARM64 can achieve the same goal? >> Here's my understanding of the intent of the code: >> >> Let's say we start with some memory that has been mapped RO. 
Our goal is to >> reset the memory back to RW and ensure that no TLB entry remains in the TLB for >> the old RO mapping. There are 2 ways to do that: > > > >> >> Approach 1 (used in current code): >> 1. set PTE to invalid >> 2. invalidate any TLB entry for the VA >> 3. set the PTE to RW >> >> Approach 2: >> 1. set the PTE to RW >> 2. invalidate any TLB entry for the VA > > IIUC, the intent of the code is "reset direct map permission *without* leaving a > RW+X window". The TLB flush call actually flushes both VA and direct map together. > So if this is the intent, approach #2 may have VA with X permission but direct > map may be RW at the mean time. It seems break the intent. Ahh! Thanks, it's starting to make more sense now. Though on first sight it seems a bit mad to me to form a tlb flush range that covers all the direct map pages and all the lazy vunmap regions. Is that intended to be a perf optimization or something else? It's not clear from the history. Could this be split into 2 operations? 1. unmap the aliases (+ tlbi the aliases). 2. set the direct memory back to default (+ tlbi the direct map region). The only 2 potential problems I can think of are; - Performance: 2 tlbis instead of 1, but conversely we probably avoid flushing a load of TLB entries that we didn't really need to. - Given there is now no lock around the tlbis (currently it's under vmap_purge_lock) is there a race where a new alias can appear between steps 1 and 2? I don't think so, because the memory is allocated to the current mapping so how is it going to get re-mapped? Could this solve it? > > Thanks, > Yang > >> >> The benefit of approach 1 is that it is guarranteed that it is impossible for >> different CPUs to have different translations for the same VA in their >> respective TLB. But for approach 2, it's possible that between steps 1 and 2, 1 >> CPU has a RO entry and another CPU has a RW entry. But that will get fixed once >> the TLB is flushed - it's not really an issue. 
>> >> (There is probably also an obscure way to end up with 2 TLB entries (one with RO >> and one with RW) for the same CPU, but the arm64 architecture permits that as >> long as it's only a permission mismatch). >> >> Anyway, approach 2 is used when changing memory permissions on user mappings, so >> I don't see why we can't take the same approach here. That would solve this >> whole class of issue for us. >> >> Thanks, >> Ryan >> >> >>> Thanks, >>> Yang >>> >>>> Thanks, >>>> Ryan >>>> >>>> >>>>> Tested the below patch with bpftrace kfunc (allocate bpf trampoline) and >>>>> kprobes. It seems work well. >>>>> >>>>> diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/ >>>>> kprobes.c >>>>> index 0c5d408afd95..c4f8c4750f1e 100644 >>>>> --- a/arch/arm64/kernel/probes/kprobes.c >>>>> +++ b/arch/arm64/kernel/probes/kprobes.c >>>>> @@ -10,6 +10,7 @@ >>>>> >>>>> #define pr_fmt(fmt) "kprobes: " fmt >>>>> >>>>> +#include <linux/execmem.h> >>>>> #include <linux/extable.h> >>>>> #include <linux/kasan.h> >>>>> #include <linux/kernel.h> >>>>> @@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk); >>>>> static void __kprobes >>>>> post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, struct pt_regs >>>>> *); >>>>> >>>>> +void *alloc_insn_page(void) >>>>> +{ >>>>> + void *page; >>>>> + >>>>> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); >>>>> + if (!page) >>>>> + return NULL; >>>>> + set_memory_rox((unsigned long)page, 1); >>>>> + return page; >>>>> +} >>>>> + >>>>> static void __kprobes arch_prepare_ss_slot(struct kprobe *p) >>>>> { >>>>> kprobe_opcode_t *addr = p->ainsn.xol_insn; >>>>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c >>>>> index 52ffe115a8c4..3e301bc2cd66 100644 >>>>> --- a/arch/arm64/net/bpf_jit_comp.c >>>>> +++ b/arch/arm64/net/bpf_jit_comp.c >>>>> @@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void *image, unsigned int >>>>> size) >>>>> bpf_prog_pack_free(image, size); >>>>> 
} >>>>> >>>>> -int arch_protect_bpf_trampoline(void *image, unsigned int size) >>>>> -{ >>>>> - return 0; >>>>> -} >>>>> - >>>>> int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image, >>>>> void *ro_image_end, const struct >>>>> btf_func_model *m, >>>>> u32 flags, struct bpf_tramp_links *tlinks, >>>>> >>>>> >>>>>> Thanks, >>>>>> Yang >>>>>> >>>>>>>> Thanks, >>>>>>>> Ryan >>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Yang >>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Yang >>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Ryan >>>>>>>>>>> >>>>>>>>>>> >
On 9/9/25 9:32 AM, Ryan Roberts wrote: > On 09/09/2025 16:32, Yang Shi wrote: >> >> On 9/9/25 7:36 AM, Ryan Roberts wrote: >>> On 08/09/2025 19:31, Yang Shi wrote: >>>> On 9/8/25 9:34 AM, Ryan Roberts wrote: >>>>> On 04/09/2025 22:49, Yang Shi wrote: >>>>>> On 9/4/25 10:47 AM, Yang Shi wrote: >>>>>>> On 9/4/25 6:16 AM, Ryan Roberts wrote: >>>>>>>> On 04/09/2025 14:14, Ryan Roberts wrote: >>>>>>>>> On 03/09/2025 01:50, Yang Shi wrote: >>>>>>>>>>>>> I am wondering whether we can just have a warn_on_once or something >>>>>>>>>>>>> for the >>>>>>>>>>>>> case >>>>>>>>>>>>> when we fail to allocate a pagetable page. Or, Ryan had >>>>>>>>>>>>> suggested in an off-the-list conversation that we can maintain a cache >>>>>>>>>>>>> of PTE >>>>>>>>>>>>> tables for every PMD block mapping, which will give us >>>>>>>>>>>>> the same memory consumption as we do today, but not sure if this is >>>>>>>>>>>>> worth it. >>>>>>>>>>>>> x86 can already handle splitting but due to the callchains >>>>>>>>>>>>> I have described above, it has the same problem, and the code has been >>>>>>>>>>>>> working >>>>>>>>>>>>> for years :) >>>>>>>>>>>> I think it's preferable to avoid having to keep a cache of pgtable >>>>>>>>>>>> memory >>>>>>>>>>>> if we >>>>>>>>>>>> can... >>>>>>>>>>> Yes, I agree. We simply don't know how many pages we need to cache, >>>>>>>>>>> and it >>>>>>>>>>> still can't guarantee 100% allocation success. >>>>>>>>>> This is wrong... We can know how many pages will be needed for splitting >>>>>>>>>> linear >>>>>>>>>> mapping to PTEs for the worst case once linear mapping is finalized. >>>>>>>>>> But it >>>>>>>>>> may >>>>>>>>>> require a few hundred megabytes memory to guarantee allocation success. I >>>>>>>>>> don't >>>>>>>>>> think it is worth for such rare corner case. >>>>>>>>> Indeed, we know exactly how much memory we need for pgtables to map the >>>>>>>>> linear >>>>>>>>> map by pte - that's exactly what we are doing today. So we _could_ keep a >>>>>>>>> cache. 
>>>>>>>>> We would still get the benefit of improved performance but we would lose >>>>>>>>> the >>>>>>>>> benefit of reduced memory. >>>>>>>>> >>>>>>>>> I think we need to solve the vm_reset_perms() problem somehow, before we >>>>>>>>> can >>>>>>>>> enable this. >>>>>>>> Sorry I realise this was not very clear... I am saying I think we need to >>>>>>>> fix it >>>>>>>> somehow. A cache would likely work. But I'd prefer to avoid it if we can >>>>>>>> find a >>>>>>>> better solution. >>>>>>> Took a deeper look at vm_reset_perms(). It was introduced by commit >>>>>>> 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions"). >>>>>>> The >>>>>>> VM_FLUSH_RESET_PERMS flag is supposed to be set if the vmalloc memory is RO >>>>>>> and/or ROX. So set_memory_ro() or set_memory_rox() is supposed to follow up >>>>>>> vmalloc(). So the page table should be already split before reaching vfree(). >>>>>>> I think this why vm_reset_perms() doesn't not check return value. >>>>> If vm_reset_perms() is assuming it can't/won't fail, I think it should at least >>>>> output a warning if it does? >>>> It should. Anyway warning will be raised if split fails. We have somehow >>>> mitigation. >>>> >>>>>>> I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set. >>>>> Just checking; I think you made a comment before about there only being a few >>>>> sites that set VM_FLUSH_RESET_PERMS. But one of them is the helper, >>>>> set_vm_flush_reset_perms(). So just making sure you also followed to the places >>>>> that use that helper? >>>> Yes, I did. >>>> >>>>>>> The most >>>>>>> of them has set_memory_ro() or set_memory_rox() followed. >>>>> And are all callsites calling set_memory_*() for the entire cell that was >>>>> allocated by vmalloc? If there are cases where it only calls that for a portion >>>>> of it, then it's not gurranteed that the memory is correctly split. >>>> Yes, all callsites call set_memory_*() for the entire range. 
>>>> >>>>>>> But there are 3 >>>>>>> places I don't see set_memory_ro()/set_memory_rox() is called. >>>>>>> >>>>>>> 1. BPF trampoline allocation. The BPF trampoline calls >>>>>>> arch_protect_bpf_trampoline(). The generic implementation does call >>>>>>> set_memory_rox(). But the x86 and arm64 implementation just simply return 0. >>>>>>> For x86, it is because execmem cache is used and it does call >>>>>>> set_memory_rox(). ARM64 doesn't need to split page table before this series, >>>>>>> so it should never fail. I think we just need to use the generic >>>>>>> implementation (remove arm64 implementation) if this series is merged. >>>>> I know zero about BPF. But it looks like the allocation happens in >>>>> arch_alloc_bpf_trampoline(), which for arm64, calls bpf_prog_pack_alloc(). And >>>>> for small sizes, it grabs some memory from a "pack". So doesn't this mean that >>>>> you are calling set_memory_rox() for a sub-region of the cell, so that doesn't >>>>> actually help at vm_reset_perms()-time? >>>> Took a deeper look at bpf pack allocator. The "pack" is allocated by >>>> alloc_new_pack(), which does: >>>> bpf_jit_alloc_exec() >>>> set_vm_flush_reset_perms() >>>> set_memory_rox() >>>> >>>> If the size is greater than the pack size, it calls: >>>> bpf_jit_alloc_exec() >>>> set_vm_flush_reset_perms() >>>> set_memory_rox() >>>> >>>> So it looks like bpf trampoline is good, and we don't need do anything. It >>>> should be removed from the list. I didn't look deep enough for bpf pack >>>> allocator in the first place. >>>> >>>>>>> 2. BPF dispatcher. It calls execmem_alloc which has VM_FLUSH_RESET_PERMS set. >>>>>>> But it is used for rw allocation, so VM_FLUSH_RESET_PERMS should be >>>>>>> unnecessary IIUC. So it doesn't matter even though vm_reset_perms() fails. >>>>>>> >>>>>>> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 also >>>>>>> called set_memory_rox() before switching to execmem cache. The execmem cache >>>>>>> calls set_memory_rox(). 
I don't know why ARM64 doesn't call it. >>>>>>> >>>>>>> So I think we just need to fix #1 and #3 per the above analysis. If this >>>>>>> analysis look correct to you guys, I will prepare two patches to fix them. >>>>> This all seems quite fragile. I find it interesting that vm_reset_perms() is >>>>> doing break-before-make; it sets the PTEs as invalid, then flushes the TLB, >>>>> then >>>>> sets them to default. But for arm64, at least, I think break-before-make is not >>>>> required. We are only changing the permissions so that can be done on live >>>>> mappings; essentially change the sequence to; set default, flush TLB. >>>> Yeah, I agree it is a little bit fragile. I think this is the "contract" for >>>> vmalloc users. You allocate ROX memory via vmalloc, you are required to call >>>> set_memory_*(). But there is nothing to guarantee the "contract" is followed. >>>> But I don't think this is the only case in kernel. >>>> >>>>> If we do that, then if the memory was already default, then there is no need to >>>>> do anything (so no chance of allocation failure). If the memory was not >>>>> default, >>>>> then it must have already been split to make it non-default, in which case we >>>>> can also gurrantee that no allocations are required. >>>>> >>>>> What am I missing? >>>> The comment says: >>>> Set direct map to something invalid so that it won't be cached if there are any >>>> accesses after the TLB flush, then flush the TLB and reset the direct map >>>> permissions to the default. >>>> >>>> IIUC, it guarantees the direct map can't be cached in TLB after TLB flush from >>>> _vm_unmap_aliases() by setting them invalid because TLB never cache invalid >>>> entries. Skipping set direct map to invalid seems break this. Or "changing >>>> permission on live mappings" on ARM64 can achieve the same goal? >>> Here's my understanding of the intent of the code: >>> >>> Let's say we start with some memory that has been mapped RO. 
Our goal is to >>> reset the memory back to RW and ensure that no TLB entry remains in the TLB for >>> the old RO mapping. There are 2 ways to do that: >> >> >>> Approach 1 (used in current code): >>> 1. set PTE to invalid >>> 2. invalidate any TLB entry for the VA >>> 3. set the PTE to RW >>> >>> Approach 2: >>> 1. set the PTE to RW >>> 2. invalidate any TLB entry for the VA >> IIUC, the intent of the code is "reset direct map permission *without* leaving a >> RW+X window". The TLB flush call actually flushes both VA and direct map together. >> So if this is the intent, approach #2 may have VA with X permission but direct >> map may be RW at the mean time. It seems break the intent. > Ahh! Thanks, it's starting to make more sense now. > > Though on first sight it seems a bit mad to me to form a tlb flush range that > covers all the direct map pages and all the lazy vunmap regions. Is that > intended to be a perf optimization or something else? It's not clear from the > history. I think it should be mainly performance driven. I can't see how come two TLB flushes (for vmap and direct map respectively) don't work if I don't miss something. > > > Could this be split into 2 operations? > > 1. unmap the aliases (+ tlbi the aliases). > 2. set the direct memory back to default (+ tlbi the direct map region). > > The only 2 potential problems I can think of are; > > - Performance: 2 tlbis instead of 1, but conversely we probably avoid flushing > a load of TLB entries that we didn't really need to. The two tlbis should work. But performance is definitely a concern. It may be hard to justify how much performance impact caused by over flush, but multiple TLBIs is definitely not preferred, particularly on some large scale machines. We have experienced some scalability issues with TLBI due to the large core count on Ampere systems. 
> > - Given there is now no lock around the tlbis (currently it's under > vmap_purge_lock) is there a race where a new alias can appear between steps 1 > and 2? I don't think so, because the memory is allocated to the current mapping > so how is it going to get re-mapped? Yes, I agree. I don't think the race is real. The physical pages will not be freed until vm_reset_perms() is done. The VA may be reallocated, but it will be mapped to different physical pages. > > > Could this solve it? I think it could. But the potential performance impact (two TLBIs) is a real concern. Anyway the vmalloc user should call set_memory_*() for any RO/ROX mapping, set_memory_*() should split the page table before reaching vm_reset_perms() so it should not fail. If set_memory_*() is not called, it is a bug, it should be fixed, like ARM64 kprobes. It is definitely welcome to make it more robust, although the warning from split may mitigate this somehow. But I don't think this should be a blocker for this series IMHO. Thanks, Yang > > > >> Thanks, >> Yang >> >>> The benefit of approach 1 is that it is guarranteed that it is impossible for >>> different CPUs to have different translations for the same VA in their >>> respective TLB. But for approach 2, it's possible that between steps 1 and 2, 1 >>> CPU has a RO entry and another CPU has a RW entry. But that will get fixed once >>> the TLB is flushed - it's not really an issue. >>> >>> (There is probably also an obscure way to end up with 2 TLB entries (one with RO >>> and one with RW) for the same CPU, but the arm64 architecture permits that as >>> long as it's only a permission mismatch). >>> >>> Anyway, approach 2 is used when changing memory permissions on user mappings, so >>> I don't see why we can't take the same approach here. That would solve this >>> whole class of issue for us. 
>>> >>> Thanks, >>> Ryan >>> >>> >>>> Thanks, >>>> Yang >>>> >>>>> Thanks, >>>>> Ryan >>>>> >>>>> >>>>>> Tested the below patch with bpftrace kfunc (allocate bpf trampoline) and >>>>>> kprobes. It seems work well. >>>>>> >>>>>> diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/ >>>>>> kprobes.c >>>>>> index 0c5d408afd95..c4f8c4750f1e 100644 >>>>>> --- a/arch/arm64/kernel/probes/kprobes.c >>>>>> +++ b/arch/arm64/kernel/probes/kprobes.c >>>>>> @@ -10,6 +10,7 @@ >>>>>> >>>>>> #define pr_fmt(fmt) "kprobes: " fmt >>>>>> >>>>>> +#include <linux/execmem.h> >>>>>> #include <linux/extable.h> >>>>>> #include <linux/kasan.h> >>>>>> #include <linux/kernel.h> >>>>>> @@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk); >>>>>> static void __kprobes >>>>>> post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, struct pt_regs >>>>>> *); >>>>>> >>>>>> +void *alloc_insn_page(void) >>>>>> +{ >>>>>> + void *page; >>>>>> + >>>>>> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); >>>>>> + if (!page) >>>>>> + return NULL; >>>>>> + set_memory_rox((unsigned long)page, 1); >>>>>> + return page; >>>>>> +} >>>>>> + >>>>>> static void __kprobes arch_prepare_ss_slot(struct kprobe *p) >>>>>> { >>>>>> kprobe_opcode_t *addr = p->ainsn.xol_insn; >>>>>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c >>>>>> index 52ffe115a8c4..3e301bc2cd66 100644 >>>>>> --- a/arch/arm64/net/bpf_jit_comp.c >>>>>> +++ b/arch/arm64/net/bpf_jit_comp.c >>>>>> @@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void *image, unsigned int >>>>>> size) >>>>>> bpf_prog_pack_free(image, size); >>>>>> } >>>>>> >>>>>> -int arch_protect_bpf_trampoline(void *image, unsigned int size) >>>>>> -{ >>>>>> - return 0; >>>>>> -} >>>>>> - >>>>>> int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image, >>>>>> void *ro_image_end, const struct >>>>>> btf_func_model *m, >>>>>> u32 flags, struct bpf_tramp_links *tlinks, >>>>>> >>>>>> >>>>>>> 
Thanks, >>>>>>> Yang >>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Ryan >>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Yang >>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Yang >>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Ryan >>>>>>>>>>>> >>>>>>>>>>>>
>>> IIUC, the intent of the code is "reset direct map permission >>> *without* leaving a >>> RW+X window". The TLB flush call actually flushes both VA and direct >>> map together. >>> So if this is the intent, approach #2 may have VA with X permission >>> but direct >>> map may be RW at the mean time. It seems break the intent. >> Ahh! Thanks, it's starting to make more sense now. >> >> Though on first sight it seems a bit mad to me to form a tlb flush >> range that >> covers all the direct map pages and all the lazy vunmap regions. Is that >> intended to be a perf optimization or something else? It's not clear >> from the >> history. > > I think it should be mainly performance driven. I can't see how come > two TLB flushes (for vmap and direct map respectively) don't work if I > don't miss something. > >> >> >> Could this be split into 2 operations? >> >> 1. unmap the aliases (+ tlbi the aliases). >> 2. set the direct memory back to default (+ tlbi the direct map region). >> >> The only 2 potential problems I can think of are; >> >> - Performance: 2 tlbis instead of 1, but conversely we probably >> avoid flushing >> a load of TLB entries that we didn't really need to. > > The two tlbis should work. But performance is definitely a concern. It > may be hard to justify how much performance impact caused by over > flush, but multiple TLBIs is definitely not preferred, particularly on > some large scale machines. We have experienced some scalability issues > with TLBI due to the large core count on Ampere systems. >> >> - Given there is now no lock around the tlbis (currently it's under >> vmap_purge_lock) is there a race where a new alias can appear between >> steps 1 >> and 2? I don't think so, because the memory is allocated to the >> current mapping >> so how is it going to get re-mapped? > > Yes, I agree. I don't think the race is real. The physical pages will > not be freed until vm_reset_perms() is done. 
The VA may be > reallocated, but it will be mapped to different physical pages. > >> >> >> Could this solve it? > > I think it could. But the potential performance impact (two TLBIs) is > a real concern. > > Anyway the vmalloc user should call set_memory_*() for any RO/ROX > mapping, set_memory_*() should split the page table before reaching > vm_reset_perms() so it should not fail. If set_memory_*() is not > called, it is a bug, it should be fixed, like ARM64 kprobes. > > It is definitely welcome to make it more robust, although the warning > from split may mitigate this somehow. But I don't think this should be > a blocker for this series IMHO. Hi Ryan & Catalin, Any more concerns about this? Shall we move forward with v8? We can include the fix to kprobes in v8 or I can send it separately, either is fine to me. Hopefully we can make v6.18. Thanks, Yang > > Thanks, > Yang > >> >> >> >>> Thanks, >>> Yang >>> >>>> The benefit of approach 1 is that it is guarranteed that it is >>>> impossible for >>>> different CPUs to have different translations for the same VA in their >>>> respective TLB. But for approach 2, it's possible that between >>>> steps 1 and 2, 1 >>>> CPU has a RO entry and another CPU has a RW entry. But that will >>>> get fixed once >>>> the TLB is flushed - it's not really an issue. >>>> >>>> (There is probably also an obscure way to end up with 2 TLB entries >>>> (one with RO >>>> and one with RW) for the same CPU, but the arm64 architecture >>>> permits that as >>>> long as it's only a permission mismatch). >>>> >>>> Anyway, approach 2 is used when changing memory permissions on user >>>> mappings, so >>>> I don't see why we can't take the same approach here. That would >>>> solve this >>>> whole class of issue for us. >>>> >>>> Thanks, >>>> Ryan >>>> >>>> >>>>> Thanks, >>>>> Yang >>>>> >>>>>> Thanks, >>>>>> Ryan >>>>>> >>>>>> >>>>>>> Tested the below patch with bpftrace kfunc (allocate bpf >>>>>>> trampoline) and >>>>>>> kprobes. 
It seems work well. >>>>>>> >>>>>>> diff --git a/arch/arm64/kernel/probes/kprobes.c >>>>>>> b/arch/arm64/kernel/probes/ >>>>>>> kprobes.c >>>>>>> index 0c5d408afd95..c4f8c4750f1e 100644 >>>>>>> --- a/arch/arm64/kernel/probes/kprobes.c >>>>>>> +++ b/arch/arm64/kernel/probes/kprobes.c >>>>>>> @@ -10,6 +10,7 @@ >>>>>>> >>>>>>> #define pr_fmt(fmt) "kprobes: " fmt >>>>>>> >>>>>>> +#include <linux/execmem.h> >>>>>>> #include <linux/extable.h> >>>>>>> #include <linux/kasan.h> >>>>>>> #include <linux/kernel.h> >>>>>>> @@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, >>>>>>> kprobe_ctlblk); >>>>>>> static void __kprobes >>>>>>> post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, >>>>>>> struct pt_regs >>>>>>> *); >>>>>>> >>>>>>> +void *alloc_insn_page(void) >>>>>>> +{ >>>>>>> + void *page; >>>>>>> + >>>>>>> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); >>>>>>> + if (!page) >>>>>>> + return NULL; >>>>>>> + set_memory_rox((unsigned long)page, 1); >>>>>>> + return page; >>>>>>> +} >>>>>>> + >>>>>>> static void __kprobes arch_prepare_ss_slot(struct kprobe *p) >>>>>>> { >>>>>>> kprobe_opcode_t *addr = p->ainsn.xol_insn; >>>>>>> diff --git a/arch/arm64/net/bpf_jit_comp.c >>>>>>> b/arch/arm64/net/bpf_jit_comp.c >>>>>>> index 52ffe115a8c4..3e301bc2cd66 100644 >>>>>>> --- a/arch/arm64/net/bpf_jit_comp.c >>>>>>> +++ b/arch/arm64/net/bpf_jit_comp.c >>>>>>> @@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void >>>>>>> *image, unsigned int >>>>>>> size) >>>>>>> bpf_prog_pack_free(image, size); >>>>>>> } >>>>>>> >>>>>>> -int arch_protect_bpf_trampoline(void *image, unsigned int size) >>>>>>> -{ >>>>>>> - return 0; >>>>>>> -} >>>>>>> - >>>>>>> int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, >>>>>>> void *ro_image, >>>>>>> void *ro_image_end, const struct >>>>>>> btf_func_model *m, >>>>>>> u32 flags, struct >>>>>>> bpf_tramp_links *tlinks, >>>>>>> >>>>>>> >>>>>>>> Thanks, >>>>>>>> Yang >>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Ryan >>>>>>>>>> >>>>>>>>>>> 
Hi Yang,

Sorry for the slow reply; I'm just getting back to this...

On 11/09/2025 23:03, Yang Shi wrote:
> Hi Ryan & Catalin,
>
> Any more concerns about this?

I've been trying to convince myself of your assertion that all users that set
VM_FLUSH_RESET_PERMS also call set_memory_*() for the entire range that was
returned by vmalloc. I agree that if that is the contract and everyone is
following it, then there is no problem here.

But I haven't been able to convince myself...

Some examples (these might intersect with examples you previously raised):

1. bpf_dispatcher_change_prog() -> bpf_jit_alloc_exec() -> execmem_alloc() ->
   sets VM_FLUSH_RESET_PERMS. But I don't see it calling set_memory_*() for
   rw_image.

2. module_memory_alloc() -> execmem_alloc_rw() -> execmem_alloc() -> sets
   VM_FLUSH_RESET_PERMS (note that execmem_force_rw() is a nop for arm64).
   set_memory_*() is not called until much later on in module_set_memory().
   Another error in the meantime could cause the memory to be vfreed before
   that point.

3. Where set_vm_flush_reset_perms() is used for the range, it is called before
   set_memory_*(), which might then fail to split prior to vfree.

But I guess as long as set_memory_*() is never successfully called for a
*sub-range* of the vmalloc'ed region, then for all of the above issues, the
memory must still be RW at vfree-time, so this issue should be benign... I think?

In summary this all looks horribly fragile. But I *think* it works. It would be
good to clean it all up and have some clearly documented rules regardless. But I
think that could be a follow-up series.

> Shall we move forward with v8?

Yes. Do you want me to post that, or would you prefer to do it? I'm happy to do
it; there are a few other tidy-ups in pageattr.c that I spotted and want to make.

> We can include the
> fix to kprobes in v8 or I can send it separately, either is fine to me.

Post it on list, and I'll also incorporate it into the series.

> Hopefully we can make v6.18.

It's probably getting a bit late now. Anyway, I'll aim to get v8 out tomorrow or
Friday and we will see what Will thinks.

Thanks,
Ryan

>
> Thanks,
> Yang
On 9/17/25 9:28 AM, Ryan Roberts wrote:
> Hi Yang,
>
> Sorry for the slow reply; I'm just getting back to this...
>
> On 11/09/2025 23:03, Yang Shi wrote:
>> Hi Ryan & Catalin,
>>
>> Any more concerns about this?
> I've been trying to convince myself of your assertion that all users that set
> VM_FLUSH_RESET_PERMS also call set_memory_*() for the entire range that was
> returned by vmalloc. I agree that if that is the contract and everyone is
> following it, then there is no problem here.
>
> But I haven't been able to convince myself...
>
> Some examples (these might intersect with examples you previously raised):
>
> 1. bpf_dispatcher_change_prog() -> bpf_jit_alloc_exec() -> execmem_alloc() ->
> sets VM_FLUSH_RESET_PERMS. But I don't see it calling set_memory_*() for rw_image.

Yes, it doesn't call set_memory_*(). I spotted this in the earlier email. But the
memory is actually RW, so it should be OK to miss the call. The later
set_direct_map_invalid call in vfree() may fail, but the set_direct_map_default
call will set the RW permission back. But I don't think it has to use
execmem_alloc(); plain vmalloc() should be good enough.

> 2. module_memory_alloc() -> execmem_alloc_rw() -> execmem_alloc() -> sets
> VM_FLUSH_RESET_PERMS (note that execmem_force_rw() is nop for arm64).
> set_memory_*() is not called until much later on in module_set_memory(). Another
> error in the meantime could cause the memory to be vfreed before that point.

IIUC, execmem_alloc_rw() is used to allocate memory for a module's text and data
sections. The code sets mod->mem[type].is_rox according to the type of the
section: true for text, false for data. If it is true, set_memory_rox() is
called later, *after* the insns are copied to the memory. So the memory is
still RW before that point.

> 3. When set_vm_flush_reset_perms() is set for the range, it is called before
> set_memory_*() which might then fail to split prior to vfree.

Yes, all call sites check the return value and bail out if set_memory_*() fails,
unless I'm missing something.

> But I guess as long as set_memory_*() is never successfully called for a
> *sub-range* of the vmalloc'ed region, then for all of the above issues, the
> memory must still be RW at vfree-time, so this issue should be benign... I think?

Yes, that is true.

> In summary this all looks horribly fragile. But I *think* it works. It would be
> good to clean it all up and have some clearly documented rules regardless. But I
> think that could be a follow up series.

Yeah, absolutely agreed.

>> Shall we move forward with v8?
> Yes; Do you want me to post that or would you prefer to do it? I'm happy to do
> it; there are a few other tidy ups in pageattr.c I want to make which I spotted.

I actually already have v8 ready in my tree. I removed pageattr_pgd_entry and
pageattr_pud_entry in pageattr.c and fixed pmd_leaf/pud_leaf as you suggested.
Is that the cleanup you had in mind?

I also rebased it on top of Shijie's series
(https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?id=bfbbb0d3215f),
which has been picked up by Will.

>> We can include the
>> fix to kprobes in v8 or I can send it separately, either is fine to me.
> Post it on list, and I'll also incorporate into the series.

I can include it in the v8 series.

>> Hopefully we can make v6.18.
> It's probably getting a bit late now. Anyway, I'll aim to get v8 out tomorrow or
> Friday and we will see what Will thinks.

Thank you. I can post v8 today.

Thanks,
Yang
On 17/09/2025 18:21, Yang Shi wrote:
> On 9/17/25 9:28 AM, Ryan Roberts wrote:
>> But I guess as long as set_memory_*() is never successfully called for a
>> *sub-range* of the vmalloc'ed region, then for all of the above issues, the
>> memory must still be RW at vfree-time, so this issue should be benign... I think?
>
> Yes, that is true.

So to summarise, all freshly vmalloc'ed memory starts as RW. set_memory_*() may
only be called if VM_FLUSH_RESET_PERMS has already been set. If set_memory_*()
is called at all, the first call MUST be for the whole range.

If those requirements are all met, then if VM_FLUSH_RESET_PERMS was set but
set_memory_*() was never called, the worst that can happen is for both the
set_direct_map_invalid() and set_direct_map_default() calls to fail due to not
enough memory. But that is safe because the memory was always RW. If
set_memory_*() was called for the whole range and failed, it's the same as if it
was never called. If it was called for the whole range and succeeded, then the
split must have happened already and set_direct_map_invalid() and
set_direct_map_default() will therefore definitely succeed.

The only way this could be a problem is if someone vmallocs a range then
performs a set_memory_*() on a sub-region without having first done it for the
whole region. But we have not found any evidence that there are any users that
do that.

In fact, by that logic, I think alloc_insn_page() must also be safe; it only
allocates 1 page, so if set_memory_*() is subsequently called for it, it must by
definition be covering the whole allocation; 1 page is the smallest amount that
can be protected.

So I agree we are safe.

>> Yes; Do you want me to post that or would you prefer to do it? I'm happy to do
>> it; there are a few other tidy ups in pageattr.c I want to make which I spotted.
>
> I actually already have v8 ready in my tree. I removed pageattr_pgd_entry and
> pageattr_pud_entry in pageattr.c and fixed pmd_leaf/pud_leaf as you suggested.
> Is that the cleanup you had in mind?

I was also going to fix up the comment in change_memory_common(), which is now
stale.

> Thank you. I can post v8 today.

OK great - I'll leave it all to you then - thanks!
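The contract summarised above can be sketched as a small userspace C model. This is only an illustration of the argument as stated in the thread, not kernel code: every `model_*` name is hypothetical, and treating "already split" as a property of the whole region is a simplifying assumption made to mirror that argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_PAGES 8

/* Hypothetical model of one vmalloc'ed region; not a kernel structure. */
struct model_region {
	size_t npages;
	bool fully_split;             /* a whole-range call has succeeded */
	bool non_rw[MODEL_MAX_PAGES]; /* linear-map alias is no longer RW */
};

/* Fresh memory starts RW and unsplit. */
static void model_alloc(struct model_region *r, size_t npages)
{
	assert(npages > 0 && npages <= MODEL_MAX_PAGES);
	r->npages = npages;
	r->fully_split = false;
	for (size_t i = 0; i < MODEL_MAX_PAGES; i++)
		r->non_rw[i] = false;
}

/*
 * Stand-in for set_memory_*() making pages [start, start + n) non-RW.
 * split_ok == false simulates the split failing (for example for lack of
 * memory); in that case nothing is changed.
 */
static int model_set_memory(struct model_region *r, size_t start, size_t n,
			    bool split_ok)
{
	if (n == 0 || start >= r->npages || n > r->npages - start)
		return -1;
	if (!r->fully_split && !split_ok)
		return -1;
	if (start == 0 && n == r->npages)
		r->fully_split = true;
	for (size_t i = start; i < start + n; i++)
		r->non_rw[i] = true;
	return 0;
}

/*
 * At vfree time the permission reset either cannot fail (the region is
 * already split) or, if it does fail, must find every page still RW.
 */
static bool model_free_is_benign(const struct model_region *r)
{
	if (r->fully_split)
		return true;
	for (size_t i = 0; i < r->npages; i++)
		if (r->non_rw[i])
			return false;
	return true;
}
```

In this model a region that is never touched, or whose whole-range call fails, or whose whole-range call succeeds, is always benign at free time. Only a successful first call on a strict sub-range (for example pages 1-2 of a 4-page region) makes `model_free_is_benign()` return false, and a 1-page region can never hit that case.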
On 9/17/25 11:58 AM, Ryan Roberts wrote:
> So to summarise, all freshly vmalloc'ed memory starts as RW. set_memory_*() may
> only be called if VM_FLUSH_RESET_PERMS has already been set. If set_memory_*()
> is called at all, the first call MUST be for the whole range.

Whether the default permission is RW depends on the type passed to
execmem_alloc(). It is defined by execmem_info in arch/arm64/mm/init.c. For
arm64, modules and BPF have PAGE_KERNEL permission (RW) by default, but kprobes
is PAGE_KERNEL_ROX (ROX).

> If those requirements are all met, then if VM_FLUSH_RESET_PERMS was set but
> set_memory_*() was never called, the worst that can happen is for both the
> set_direct_map_invalid() and set_direct_map_default() calls to fail due to not
> enough memory. But that is safe because the memory was always RW. If
> set_memory_*() was called for the whole range and failed, it's the same as if it
> was never called. If it was called for the whole range and succeeded, then the
> split must have happened already and set_direct_map_invalid() and
> set_direct_map_default() will therefore definitely succeed.
>
> The only way this could be a problem is if someone vmallocs a range then
> performs a set_memory_*() on a sub-region without having first done it for the
> whole region. But we have not found any evidence that there are any users that
> do that.

Yes, exactly.

> In fact, by that logic, I think alloc_insn_page() must also be safe; it only
> allocates 1 page, so if set_memory_*() is subsequently called for it, it must by
> definition be covering the whole allocation; 1 page is the smallest amount that
> can be protected.

Yes, but the kprobes default permission is ROX.

> I was also going to fix up the comment in change_memory_common(), which is now
> stale.

Oops, I missed that in my v8. Please just comment on v8; I can fix it up later.

Thanks,
Yang
On 17/09/2025 20:15, Yang Shi wrote:
> On 9/17/25 11:58 AM, Ryan Roberts wrote:
>> So to summarise, all freshly vmalloc'ed memory starts as RW. set_memory_*() may
>> only be called if VM_FLUSH_RESET_PERMS has already been set. If set_memory_*()
>> is called at all, the first call MUST be for the whole range.
>
> Whether the default permission is RW depends on the type passed to
> execmem_alloc(). It is defined by execmem_info in arch/arm64/mm/init.c. For
> arm64, modules and BPF have PAGE_KERNEL permission (RW) by default, but kprobes
> is PAGE_KERNEL_ROX (ROX).

Perhaps I missed it, but as far as I could tell the prot that the arch sets for
the type only determines the prot that is set for the vmalloc map. It doesn't
look like the linear map is modified at all... which feels like a bug to me
since the linear map will be RW while the vmalloc map will be ROX... I guess I
must have missed something...

> Oops, I missed that in my v8. Please just comment on v8; I can fix it up later.

Ahh no biggy. If there is a chance Will will take the series, let's not hold it
up for a comment.
On 9/17/25 12:40 PM, Ryan Roberts wrote:
> Perhaps I missed it, but as far as I could tell the prot that the arch sets for
> the type only determines the prot that is set for the vmalloc map. It doesn't
> look like the linear map is modified at all... which feels like a bug to me
> since the linear map will be RW while the vmalloc map will be ROX... I guess I
> must have missed something...

Yes, it only sets the permission for the vmalloc area. set_memory_*() must be
called to change the permission for the direct map.

> Ahh no biggy. If there is a chance Will will take the series, let's not hold it
> up for a comment.

Yeah, sure, thank you.

Yang
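The point about the two aliases can be illustrated with another small userspace C model. Again this is a hedged sketch with hypothetical `model_*` names, not the kernel API. The per-type defaults are the ones stated in the thread; the assumption that set_memory_rox() leaves the linear-map alias read-only is mine and is flagged in the code.

```c
#include <stdbool.h>

/* Hypothetical allocation types and permissions; not kernel definitions. */
enum model_type { MODEL_MODULE, MODEL_BPF, MODEL_KPROBES };
enum model_prot { MODEL_RW, MODEL_RO, MODEL_ROX };

/* One allocation seen through its two kernel aliases. */
struct model_mapping {
	enum model_prot vmalloc_alias;
	enum model_prot linear_alias;
};

/* Defaults as described in the thread: only kprobes starts as ROX. */
static enum model_prot model_default_prot(enum model_type type)
{
	return type == MODEL_KPROBES ? MODEL_ROX : MODEL_RW;
}

/* Allocation sets the vmalloc alias only; the linear map stays RW. */
static struct model_mapping model_execmem_alloc(enum model_type type)
{
	struct model_mapping m = {
		.vmalloc_alias = model_default_prot(type),
		.linear_alias = MODEL_RW,
	};
	return m;
}

/*
 * Stand-in for set_memory_rox() on the whole allocation.
 * Assumption: the linear-map alias ends up read-only afterwards.
 */
static void model_set_memory_rox(struct model_mapping *m)
{
	m->vmalloc_alias = MODEL_ROX;
	m->linear_alias = MODEL_RO;
}

/* True while non-writable memory is still writable via the linear map. */
static bool model_has_writable_alias(const struct model_mapping *m)
{
	return m->vmalloc_alias != MODEL_RW && m->linear_alias == MODEL_RW;
}
```

Under these assumptions a fresh `MODEL_KPROBES` allocation reports a writable alias until `model_set_memory_rox()` is applied, which is the gap the alloc_insn_page() change earlier in the thread closes, while fresh module and BPF allocations report none because both aliases are still RW.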