[v7] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

[PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 5 months, 1 week ago

Hi All,

This is a new version following on from the v6 RFC at [1] which itself is based
on Yang Shi's work. On systems with BBML2_NOABORT support, it causes the linear
map to be mapped with large blocks, even when rodata=full, and leads to some
nice performance improvements.

I've tested this on an AmpereOne system (a VM with 12G RAM) in all 3 possible
modes by hacking the BBML2 feature detection code:

  - mode 1: All CPUs support BBML2 so the linear map uses large mappings
  - mode 2: Boot CPU does not support BBML2 so linear map uses pte mappings
  - mode 3: Boot CPU supports BBML2 but secondaries do not so linear map
    initially uses large mappings but is then repainted to use pte mappings
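
For illustration, the three modes can be sketched as a small decision
function. This is a standalone sketch, not code from the series: the enum and
function names are hypothetical, and the real detection logic lives in the
cpufeature code.

```c
#include <stdbool.h>

/* Hypothetical names, used only to illustrate the three test modes. */
enum linear_map_mode {
	LINEAR_MAP_LARGE,     /* mode 1: large mappings are kept */
	LINEAR_MAP_PTE,       /* mode 2: pte mappings from the start */
	LINEAR_MAP_REPAINTED, /* mode 3: large at first, then split to ptes */
};

static enum linear_map_mode
pick_linear_map_mode(bool boot_cpu_bbml2, bool all_secondaries_bbml2)
{
	/* The boot CPU decides how the linear map is first created. */
	if (!boot_cpu_bbml2)
		return LINEAR_MAP_PTE;

	/* A secondary without BBML2 forces a repaint to pte mappings. */
	if (!all_secondaries_bbml2)
		return LINEAR_MAP_REPAINTED;

	return LINEAR_MAP_LARGE;
}
```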

In all cases, mm selftests run and no regressions are observed. In all cases,
ptdump of linear map is as expected:

Mode 1:
=======
---[ Linear Mapping start ]---
0xffff000000000000-0xffff000000200000           2M PMD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000000200000-0xffff000000210000          64K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000000210000-0xffff000000400000        1984K PTE       ro NX SHD AF            UXN    MEM/NORMAL
0xffff000000400000-0xffff000002400000          32M PMD       ro NX SHD AF        BLK UXN    MEM/NORMAL
0xffff000002400000-0xffff000002550000        1344K PTE       ro NX SHD AF            UXN    MEM/NORMAL
0xffff000002550000-0xffff000002600000         704K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000002600000-0xffff000004000000          26M PMD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000004000000-0xffff000040000000         960M PMD       RW NX SHD AF    CON BLK UXN    MEM/NORMAL-TAGGED
0xffff000040000000-0xffff000140000000           4G PUD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000140000000-0xffff000142000000          32M PMD       RW NX SHD AF    CON BLK UXN    MEM/NORMAL-TAGGED
0xffff000142000000-0xffff000142120000        1152K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000142120000-0xffff000142128000          32K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142128000-0xffff000142159000         196K PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142159000-0xffff000142160000          28K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142160000-0xffff000142240000         896K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000142240000-0xffff00014224e000          56K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff00014224e000-0xffff000142250000           8K PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142250000-0xffff000142260000          64K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142260000-0xffff000142280000         128K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000142280000-0xffff000142288000          32K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142288000-0xffff000142290000          32K PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142290000-0xffff0001422a0000          64K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff0001422a0000-0xffff000142465000        1812K PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142465000-0xffff000142470000          44K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142470000-0xffff000142600000        1600K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000142600000-0xffff000144000000          26M PMD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000144000000-0xffff000180000000         960M PMD       RW NX SHD AF    CON BLK UXN    MEM/NORMAL-TAGGED
0xffff000180000000-0xffff000181a00000          26M PMD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000181a00000-0xffff000181b90000        1600K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000181b90000-0xffff000181b9d000          52K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181b9d000-0xffff000181c80000         908K PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181c80000-0xffff000181c90000          64K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181c90000-0xffff000181ca0000          64K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000181ca0000-0xffff000181dbd000        1140K PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181dbd000-0xffff000181dc0000          12K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181dc0000-0xffff000181e00000         256K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000181e00000-0xffff000182000000           2M PMD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000182000000-0xffff0001c0000000         992M PMD       RW NX SHD AF    CON BLK UXN    MEM/NORMAL-TAGGED
0xffff0001c0000000-0xffff000300000000           5G PUD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000300000000-0xffff008000000000         500G PUD
0xffff008000000000-0xffff800000000000      130560G PGD
---[ Linear Mapping end ]---

Mode 3:
=======
---[ Linear Mapping start ]---
0xffff000000000000-0xffff000000210000        2112K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000000210000-0xffff000000400000        1984K PTE       ro NX SHD AF            UXN    MEM/NORMAL
0xffff000000400000-0xffff000002400000          32M PMD       ro NX SHD AF        BLK UXN    MEM/NORMAL
0xffff000002400000-0xffff000002550000        1344K PTE       ro NX SHD AF            UXN    MEM/NORMAL
0xffff000002550000-0xffff000143a61000     5264452K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000143a61000-0xffff000143c61000           2M PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000143c61000-0xffff000181b9a000     1015012K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181b9a000-0xffff000181d9a000           2M PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181d9a000-0xffff000300000000     6261144K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000300000000-0xffff008000000000         500G PUD
0xffff008000000000-0xffff800000000000      130560G PGD
---[ Linear Mapping end ]---
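
The sizes in the dumps follow from the 4K granule: a contiguous (CON) run is
16 entries, so 64K at PTE level and 32M at PMD level, while BLK entries are 2M
(PMD) or 1G (PUD). The sketch below picks the largest such size that fits an
aligned range. It is a simplification under stated assumptions: the helper name
is hypothetical, only the virtual address is checked (a real mapping also needs
a suitably aligned physical address), and permission boundaries are ignored.

```c
#include <stddef.h>
#include <stdint.h>

#define SZ_4K  0x1000ULL
#define SZ_64K (16 * SZ_4K)   /* 16 contiguous ptes */
#define SZ_2M  0x200000ULL    /* one PMD block */
#define SZ_32M (16 * SZ_2M)   /* 16 contiguous PMD blocks */
#define SZ_1G  0x40000000ULL  /* one PUD block */

/*
 * Return the largest mapping size usable at addr without crossing end,
 * or 0 if not even a single 4K page fits.
 */
static uint64_t largest_mapping(uint64_t addr, uint64_t end)
{
	static const uint64_t sizes[] = { SZ_1G, SZ_32M, SZ_2M, SZ_64K, SZ_4K };
	size_t i;

	if (end <= addr)
		return 0;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		/* Needs natural alignment and enough room before end. */
		if ((addr & (sizes[i] - 1)) == 0 && end - addr >= sizes[i])
			return sizes[i];
	}
	return 0;
}
```

For example, the 64K range starting at 0xffff000000200000 in the mode 1 dump
is too small for a 2M block but fits exactly one contiguous pte run.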


Performance Testing
===================

Yang Shi has gathered some compelling results which are detailed in the commit
log for patch #3. Additionally I have run this through a random selection of
benchmarks on AmpereOne. None show any regressions, and various benchmarks show
statistically significant improvement. I'm just showing those improvements here:

+----------------------+----------------------------------------------------------+-------------------------+
| Benchmark            | Result Class                                             | Improvement vs 6.17-rc1 |
+======================+==========================================================+=========================+
| micromm/vmalloc      | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |              (I) -9.00% |
|                      | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |              (I) -6.93% |
|                      | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |              (I) -6.77% |
|                      | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |              (I) -4.63% |
+----------------------+----------------------------------------------------------+-------------------------+
| mmtests/hackbench    | process-sockets-30 (seconds)                             |              (I) -2.96% |
+----------------------+----------------------------------------------------------+-------------------------+
| mmtests/kernbench    | syst-192 (seconds)                                       |             (I) -12.77% |
+----------------------+----------------------------------------------------------+-------------------------+
| pts/perl-benchmark   | Test: Interpreter (Seconds)                              |              (I) -4.86% |
+----------------------+----------------------------------------------------------+-------------------------+
| pts/pgbench          | Scale: 1 Clients: 1 Read Write (TPS)                     |               (I) 5.07% |
|                      | Scale: 1 Clients: 1 Read Write - Latency (ms)            |              (I) -4.72% |
|                      | Scale: 100 Clients: 1000 Read Write (TPS)                |               (I) 2.58% |
|                      | Scale: 100 Clients: 1000 Read Write - Latency (ms)       |              (I) -2.52% |
+----------------------+----------------------------------------------------------+-------------------------+
| pts/sqlite-speedtest | Timed Time - Size 1,000 (Seconds)                        |              (I) -2.68% |
+----------------------+----------------------------------------------------------+-------------------------+


Changes since v6 [1]
====================

- Patch 1: Minor refactor to implement walk_kernel_page_table_range() in terms
  of walk_kernel_page_table_range_lockless(). This also led to adding a *pmd
  argument to the lockless variant for consistency (per Catalin).
- Misc function/variable renames to improve clarity and consistency.
- Share same synchronization flag between idmap_kpti_install_ng_mappings and
  wait_linear_map_split_to_ptes, which allows removal of bbml2_ptes[] to save
  ~20K from kernel image.
- Only take pgtable_split_lock and enter lazy mmu mode once for both splits.
- Only walk the pgtable once for the common "split single page" case.
- Bypass split to contpmd and contpte when splitting linear map to ptes.


Applies on v6.17-rc3.


[1] https://lore.kernel.org/linux-arm-kernel/20250805081350.3854670-1-ryan.roberts@arm.com/

Thanks,
Ryan

Dev Jain (1):
  arm64: Enable permission change on arm64 kernel block mappings

Ryan Roberts (3):
  arm64: mm: Optimize split_kernel_leaf_mapping()
  arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs
  arm64: mm: Optimize linear_map_split_to_ptes()

Yang Shi (2):
  arm64: cpufeature: add AmpereOne to BBML2 allow list
  arm64: mm: support large block mapping when rodata=full

 arch/arm64/include/asm/cpufeature.h |   2 +
 arch/arm64/include/asm/mmu.h        |   3 +
 arch/arm64/include/asm/pgtable.h    |   5 +
 arch/arm64/kernel/cpufeature.c      |  12 +-
 arch/arm64/mm/mmu.c                 | 418 +++++++++++++++++++++++++++-
 arch/arm64/mm/pageattr.c            | 157 ++++++++---
 arch/arm64/mm/proc.S                |  27 +-
 include/linux/pagewalk.h            |   3 +
 mm/pagewalk.c                       |  36 ++-
 9 files changed, 599 insertions(+), 64 deletions(-)

--
2.43.0

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Dev Jain 5 months, 1 week ago

Hi Yang and Ryan,

I observe that various callsites which will ultimately use update_range_prot() (from patch 1)
do not check the return value. I am listing the ones I could find:

set_memory_ro() in bpf_jit_comp.c
set_memory_valid() in kernel_map_pages() in pageattr.c
set_direct_map_invalid_noflush() in vm_reset_perms() in vmalloc.c
set_direct_map_default_noflush() in vm_reset_perms() in vmalloc.c, and in secretmem.c
(the secretmem.c ones should be safe as explained in the comments therein)

The first one I think can be handled easily by returning -EFAULT.
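
A minimal sketch of that pattern, assuming the caller can return an error. The
function names here are hypothetical stand-ins and not the actual
bpf_jit_comp.c code; the permission-change call is passed in as a function
pointer so the example is self-contained.

```c
#include <errno.h>

/* Stand-ins for set_memory_ro(): one succeeds, one fails to split. */
static int stub_set_ro_ok(unsigned long addr, int numpages)
{
	(void)addr;
	(void)numpages;
	return 0;
}

static int stub_set_ro_enomem(unsigned long addr, int numpages)
{
	(void)addr;
	(void)numpages;
	return -ENOMEM;
}

/*
 * Hypothetical caller: instead of ignoring the result, report -EFAULT
 * when the region could not be made read-only.
 */
static int finalize_image_ro(unsigned long addr, int numpages,
			     int (*set_ro)(unsigned long, int))
{
	if (set_ro(addr, numpages))
		return -EFAULT;
	return 0;
}
```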

For the second, we already return early in case of !can_set_direct_map, which renders DEBUG_PAGEALLOC useless. So maybe it is
safe to ignore the return value from set_memory_valid()?

For the third, the call chain is a sequence of must-succeed void functions. Notably, when using vfree(), we may have to allocate a single
pagetable page for splitting.

I am wondering whether we can just have a WARN_ON_ONCE() or something for the case when we fail to allocate a pagetable page. Or, Ryan had
suggested in an off-list conversation that we can maintain a cache of PTE tables for every PMD block mapping, which will give us
the same memory consumption as we do today, but I am not sure if this is worth it. x86 can already handle splitting, but due to the callchains
I have described above, it has the same problem, and the code has been working for years :)
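
To make the cache idea concrete, here is a rough sketch assuming one PTE table
is reserved up front per PMD block mapping, so that a later split never has to
allocate. All names and sizes are hypothetical; nothing like this exists in the
series as posted.

```c
#include <stddef.h>

#define NR_BLOCKS     4   /* hypothetical number of PMD block mappings */
#define PTRS_PER_TBL  512 /* entries in one table with 4K pages */

struct pte_table {
	unsigned long entries[PTRS_PER_TBL];
};

struct split_cache {
	struct pte_table tables[NR_BLOCKS]; /* one reserved table per block */
	unsigned char used[NR_BLOCKS];      /* set once the table is consumed */
};

/*
 * Hand out the table reserved for block_idx. Returns NULL if the index is
 * out of range or the block has already been split.
 */
static struct pte_table *split_cache_take(struct split_cache *c,
					  size_t block_idx)
{
	if (block_idx >= NR_BLOCKS || c->used[block_idx])
		return NULL;

	c->used[block_idx] = 1;
	return &c->tables[block_idx];
}
```

The reserved tables cost the same memory as mapping everything with ptes, which
is the trade-off described above.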

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 5 months, 1 week ago

On 01/09/2025 06:04, Dev Jain wrote:
> 
> On 29/08/25 5:22 pm, Ryan Roberts wrote:
>> Hi All,
>>
>> This is a new version following on from the v6 RFC at [1] which itself is based
>> on Yang Shi's work. On systems with BBML2_NOABORT support, it causes the linear
>> map to be mapped with large blocks, even when rodata=full, and leads to some
>> nice performance improvements.
>>
>> I've tested this on an AmpereOne system (a VM with 12G RAM) in all 3 possible
>> modes by hacking the BBML2 feature detection code:
>>
>>    - mode 1: All CPUs support BBML2 so the linear map uses large mappings
>>    - mode 2: Boot CPU does not support BBML2 so linear map uses pte mappings
>>    - mode 3: Boot CPU supports BBML2 but secondaries do not so linear map
>>      initially uses large mappings but is then repainted to use pte mappings
>>
>> In all cases, mm selftests run and no regressions are observed. In all cases,
>> ptdump of linear map is as expected:
>>
>> Mode 1:
>> =======
>> ---[ Linear Mapping start ]---
>> 0xffff000000000000-0xffff000000200000           2M PMD       RW NX SHD
>> AF        BLK UXN    MEM/NORMAL-TAGGED
>> 0xffff000000200000-0xffff000000210000          64K PTE       RW NX SHD AF   
>> CON     UXN    MEM/NORMAL-TAGGED
>> 0xffff000000210000-0xffff000000400000        1984K PTE       ro NX SHD
>> AF            UXN    MEM/NORMAL
>> 0xffff000000400000-0xffff000002400000          32M PMD       ro NX SHD
>> AF        BLK UXN    MEM/NORMAL
>> 0xffff000002400000-0xffff000002550000        1344K PTE       ro NX SHD
>> AF            UXN    MEM/NORMAL
>> 0xffff000002550000-0xffff000002600000         704K PTE       RW NX SHD AF   
>> CON     UXN    MEM/NORMAL-TAGGED
>> 0xffff000002600000-0xffff000004000000          26M PMD       RW NX SHD
>> AF        BLK UXN    MEM/NORMAL-TAGGED
>> 0xffff000004000000-0xffff000040000000         960M PMD       RW NX SHD AF   
>> CON BLK UXN    MEM/NORMAL-TAGGED
>> 0xffff000040000000-0xffff000140000000           4G PUD       RW NX SHD
>> AF        BLK UXN    MEM/NORMAL-TAGGED
>> 0xffff000140000000-0xffff000142000000          32M PMD       RW NX SHD AF   
>> CON BLK UXN    MEM/NORMAL-TAGGED
>> 0xffff000142000000-0xffff000142120000        1152K PTE       RW NX SHD AF   
>> CON     UXN    MEM/NORMAL-TAGGED
>> 0xffff000142120000-0xffff000142128000          32K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000142128000-0xffff000142159000         196K PTE       ro NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000142159000-0xffff000142160000          28K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000142160000-0xffff000142240000         896K PTE       RW NX SHD AF   
>> CON     UXN    MEM/NORMAL-TAGGED
>> 0xffff000142240000-0xffff00014224e000          56K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff00014224e000-0xffff000142250000           8K PTE       ro NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000142250000-0xffff000142260000          64K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000142260000-0xffff000142280000         128K PTE       RW NX SHD AF   
>> CON     UXN    MEM/NORMAL-TAGGED
>> 0xffff000142280000-0xffff000142288000          32K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000142288000-0xffff000142290000          32K PTE       ro NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000142290000-0xffff0001422a0000          64K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff0001422a0000-0xffff000142465000        1812K PTE       ro NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000142465000-0xffff000142470000          44K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000142470000-0xffff000142600000        1600K PTE       RW NX SHD AF   
>> CON     UXN    MEM/NORMAL-TAGGED
>> 0xffff000142600000-0xffff000144000000          26M PMD       RW NX SHD
>> AF        BLK UXN    MEM/NORMAL-TAGGED
>> 0xffff000144000000-0xffff000180000000         960M PMD       RW NX SHD AF   
>> CON BLK UXN    MEM/NORMAL-TAGGED
>> 0xffff000180000000-0xffff000181a00000          26M PMD       RW NX SHD
>> AF        BLK UXN    MEM/NORMAL-TAGGED
>> 0xffff000181a00000-0xffff000181b90000        1600K PTE       RW NX SHD AF   
>> CON     UXN    MEM/NORMAL-TAGGED
>> 0xffff000181b90000-0xffff000181b9d000          52K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000181b9d000-0xffff000181c80000         908K PTE       ro NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000181c80000-0xffff000181c90000          64K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000181c90000-0xffff000181ca0000          64K PTE       RW NX SHD AF   
>> CON     UXN    MEM/NORMAL-TAGGED
>> 0xffff000181ca0000-0xffff000181dbd000        1140K PTE       ro NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000181dbd000-0xffff000181dc0000          12K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000181dc0000-0xffff000181e00000         256K PTE       RW NX SHD AF   
>> CON     UXN    MEM/NORMAL-TAGGED
>> 0xffff000181e00000-0xffff000182000000           2M PMD       RW NX SHD
>> AF        BLK UXN    MEM/NORMAL-TAGGED
>> 0xffff000182000000-0xffff0001c0000000         992M PMD       RW NX SHD AF   
>> CON BLK UXN    MEM/NORMAL-TAGGED
>> 0xffff0001c0000000-0xffff000300000000           5G PUD       RW NX SHD
>> AF        BLK UXN    MEM/NORMAL-TAGGED
>> 0xffff000300000000-0xffff008000000000         500G PUD
>> 0xffff008000000000-0xffff800000000000      130560G PGD
>> ---[ Linear Mapping end ]---
>>
>> Mode 3:
>> =======
>> ---[ Linear Mapping start ]---
>> 0xffff000000000000-0xffff000000210000        2112K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000000210000-0xffff000000400000        1984K PTE       ro NX SHD
>> AF            UXN    MEM/NORMAL
>> 0xffff000000400000-0xffff000002400000          32M PMD       ro NX SHD
>> AF        BLK UXN    MEM/NORMAL
>> 0xffff000002400000-0xffff000002550000        1344K PTE       ro NX SHD
>> AF            UXN    MEM/NORMAL
>> 0xffff000002550000-0xffff000143a61000     5264452K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000143a61000-0xffff000143c61000           2M PTE       ro NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000143c61000-0xffff000181b9a000     1015012K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000181b9a000-0xffff000181d9a000           2M PTE       ro NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000181d9a000-0xffff000300000000     6261144K PTE       RW NX SHD
>> AF            UXN    MEM/NORMAL-TAGGED
>> 0xffff000300000000-0xffff008000000000         500G PUD
>> 0xffff008000000000-0xffff800000000000      130560G PGD
>> ---[ Linear Mapping end ]---
>>
>>
>> Performance Testing
>> ===================
>>
>> Yang Shi has gathered some compelling results which are detailed in the commit
>> log for patch #3. Additionally I have run this through a random selection of
>> benchmarks on AmpereOne. None show any regressions, and various benchmarks show
>> statistically significant improvement. I'm just showing those improvements here:
>>
>> +----------------------+----------------------------------------------------------+-------------------------+
>> | Benchmark            | Result Class                                             | Improvement vs 6.17-rc1 |
>> +======================+==========================================================+=========================+
>> | micromm/vmalloc      | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |              (I) -9.00% |
>> |                      | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |              (I) -6.93% |
>> |                      | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |              (I) -6.77% |
>> |                      | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |              (I) -4.63% |
>> +----------------------+----------------------------------------------------------+-------------------------+
>> | mmtests/hackbench    | process-sockets-30 (seconds)                             |              (I) -2.96% |
>> +----------------------+----------------------------------------------------------+-------------------------+
>> | mmtests/kernbench    | syst-192 (seconds)                                       |             (I) -12.77% |
>> +----------------------+----------------------------------------------------------+-------------------------+
>> | pts/perl-benchmark   | Test: Interpreter (Seconds)                              |              (I) -4.86% |
>> +----------------------+----------------------------------------------------------+-------------------------+
>> | pts/pgbench          | Scale: 1 Clients: 1 Read Write (TPS)                     |               (I) 5.07% |
>> |                      | Scale: 1 Clients: 1 Read Write - Latency (ms)            |              (I) -4.72% |
>> |                      | Scale: 100 Clients: 1000 Read Write (TPS)                |               (I) 2.58% |
>> |                      | Scale: 100 Clients: 1000 Read Write - Latency (ms)       |              (I) -2.52% |
>> +----------------------+----------------------------------------------------------+-------------------------+
>> | pts/sqlite-speedtest | Timed Time - Size 1,000 (Seconds)                        |              (I) -2.68% |
>> +----------------------+----------------------------------------------------------+-------------------------+
>>
>>
>> Changes since v6 [1]
>> ====================
>>
>> - Patch 1: Minor refactor to implement walk_kernel_page_table_range() in terms
>>    of walk_kernel_page_table_range_lockless(). Also led to adding *pmd argument
>>    to the lockless variant for consistency (per Catalin).
>> - Misc function/variable renames to improve clarity and consistency.
>> - Share same synchronization flag between idmap_kpti_install_ng_mappings and
>>    wait_linear_map_split_to_ptes, which allows removal of bbml2_ptes[] to save
>>    ~20K from kernel image.
>> - Only take pgtable_split_lock and enter lazy mmu mode once for both splits.
>> - Only walk the pgtable once for the common "split single page" case.
>> - Bypass split to contpmd and contpte when splitting linear map to ptes.
>>
>>
>> Applies on v6.17-rc3.
>>
>>
>> [1] https://lore.kernel.org/linux-arm-kernel/20250805081350.3854670-1-ryan.roberts@arm.com/
>>
>> Thanks,
>> Ryan
>>
>> Dev Jain (1):
>>    arm64: Enable permission change on arm64 kernel block mappings
>>
>> Ryan Roberts (3):
>>    arm64: mm: Optimize split_kernel_leaf_mapping()
>>    arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs
>>    arm64: mm: Optimize linear_map_split_to_ptes()
>>
>> Yang Shi (2):
>>    arm64: cpufeature: add AmpereOne to BBML2 allow list
>>    arm64: mm: support large block mapping when rodata=full
>>
>>   arch/arm64/include/asm/cpufeature.h |   2 +
>>   arch/arm64/include/asm/mmu.h        |   3 +
>>   arch/arm64/include/asm/pgtable.h    |   5 +
>>   arch/arm64/kernel/cpufeature.c      |  12 +-
>>   arch/arm64/mm/mmu.c                 | 418 +++++++++++++++++++++++++++-
>>   arch/arm64/mm/pageattr.c            | 157 ++++++++---
>>   arch/arm64/mm/proc.S                |  27 +-
>>   include/linux/pagewalk.h            |   3 +
>>   mm/pagewalk.c                       |  36 ++-
>>   9 files changed, 599 insertions(+), 64 deletions(-)
>>
>> -- 
>> 2.43.0
>>
> 
> Hi Yang and Ryan,
> 
> I observe that various callsites which ultimately use update_range_prot()
> (from patch 1) do not check the return value. I am listing the ones I could
> find:

So your concern is that prior to patch #3 in this series, any error returned by
__change_memory_common() would be due to programming error only. But patch #3
introduces the possibility of dynamic error (-ENOMEM) due to the need to
allocate pgtable memory to split a mapping?

There is a WARN_ON_ONCE(ret) for the return code of split_kernel_leaf_mapping()
which will at least make the error visible, but I agree it's not a great solution.

> 
> set_memory_ro() in bpf_jit_comp.c

There is a set_memory_rw() for the same region of memory directly above this,
which will return -EFAULT on failure. If that one succeeded, then the pgtable
must already be appropriately split for set_memory_ro() so that should never
fail in practice. I agree with improving the robustness of the code by returning
-EFAULT (or just propagate the error?) as you suggest though.
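
To make the suggested hardening concrete, here is a minimal userspace sketch (not kernel code) of checking both calls. The model_* names and the two result variables are hypothetical stand-ins for set_memory_rw()/set_memory_ro() and exist only so the control flow can be exercised outside the kernel:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical knobs: the value each modelled call returns (0 or -errno). */
static int model_rw_result;
static int model_ro_result;

/* Stand-in for set_memory_rw(); assumption, not the kernel function. */
static int model_set_memory_rw(unsigned long addr, int numpages)
{
	(void)addr;
	(void)numpages;
	return model_rw_result;
}

/* Stand-in for set_memory_ro(); assumption, not the kernel function. */
static int model_set_memory_ro(unsigned long addr, int numpages)
{
	(void)addr;
	(void)numpages;
	return model_ro_result;
}

/* Make the region writable, (notionally) patch it, then make it read-only. */
int model_finalize_image(unsigned long addr, int numpages)
{
	assert(numpages > 0);

	if (model_set_memory_rw(addr, numpages))
		return -EFAULT;

	/* ... the image would be patched here ... */

	/* Previously unchecked: treat a failure the same way as for rw. */
	if (model_set_memory_ro(addr, numpages))
		return -EFAULT;

	return 0;
}
```

Whether to return -EFAULT or propagate the underlying error is a design choice; the sketch follows the -EFAULT convention already used for the set_memory_rw() call.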

> set_memory_valid() in kernel_map_pages() in pageattr.c

This is used by CONFIG_DEBUG_PAGEALLOC to make pages in the linear map invalid
while they are not in use to catch programming errors. So failing to make a page
invalid during freeing would not technically lead to a huge issue; it just
reduces our capability of catching an errant access to that free memory.

In principle, if we were able to make the memory invalid, we should therefore be
able to make it valid again, because the mappings should be sufficiently split
already. But that doesn't actually work, because we might be allocating a
smaller order than was freed so we might not have split at free-time to the
granularity that is required at allocation-time.

But as you say, for CONFIG_DEBUG_PAGEALLOC we disable this whole path anyway, so
no issue here.
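
For illustration only, the free-order versus alloc-order mismatch can be modelled in userspace as below. It assumes a 4K granule with 2M PMD blocks and a range that currently sits inside a PMD block mapping; model_range_needs_split() is a made-up helper, not a kernel function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL
#define MODEL_PMD_SIZE  (512ULL * MODEL_PAGE_SIZE) /* 2M block, 4K pages */

/*
 * Returns true if changing the attributes of the 2^order pages starting at
 * addr would require splitting a PMD block that currently maps them.
 */
bool model_range_needs_split(uint64_t addr, unsigned int order)
{
	uint64_t len = MODEL_PAGE_SIZE << order;

	assert(addr % MODEL_PAGE_SIZE == 0);

	/* Only a block-aligned, whole-block range can be changed in place. */
	return (addr % MODEL_PMD_SIZE != 0) || (len % MODEL_PMD_SIZE != 0);
}
```

In this model, freeing an aligned order-9 range needs no split, but later allocating a single order-0 page out of it does, which is the point at which a pgtable allocation could fail.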

> set_direct_map_invalid_noflush() in vm_reset_perms() in vmalloc.c
> set_direct_map_default_noflush() in vm_reset_perms() in vmalloc.c, and in
> secretmem.c
> (the secretmem.c ones should be safe as explained in the comments therein)

Agreed for secretmem. vmalloc looks like a problem though...

If vmalloc was only setting the linear map back to default permissions, I guess
this wouldn't be an issue because we must have split the linear map successfully
when changing away from default permissions in the first place. But the fact
that it is unconditionally setting the linear map pages to invalid then back to
default causes issues; I guess even without the risk of -ENOMEM, this will cause
the linear map to be split to PTEs over time as vmalloc allocs and frees?

We probably need to think through how we can solve this. It's not clear to me
why vm_reset_perms wants to unconditionally transiently set to invalid?

> 
> The first one I think can be handled easily by returning -EFAULT.
> 
> For the second, we are already returning in case of !can_set_direct_map, which
> renders DEBUG_PAGEALLOC useless. So maybe it is
> safe to ignore the ret from set_memory_valid?
> 
> For the third, the call chain is a sequence of must-succeed void functions.
> Notably, when using vfree(), we may have to allocate a single
> pagetable page for splitting.
> 
> I am wondering whether we can just have a warn_on_once or something for the case
> when we fail to allocate a pagetable page. Or, Ryan had
> suggested in an off-the-list conversation that we can maintain a cache of PTE
> tables for every PMD block mapping, which will give us
> the same memory consumption as we do today, but not sure if this is worth it.
> x86 can already handle splitting but due to the callchains
> I have described above, it has the same problem, and the code has been working
> for years :)

I think it's preferable to avoid having to keep a cache of pgtable memory if we
can...

Thanks,
Ryan

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 5 months, 1 week ago


On 9/1/25 1:03 AM, Ryan Roberts wrote:
> On 01/09/2025 06:04, Dev Jain wrote:
>> On 29/08/25 5:22 pm, Ryan Roberts wrote:
>>> Hi All,
>>> [...]
>> Hi Yang and Ryan,
>>
>> I observe that various callsites which ultimately use update_range_prot()
>> (from patch 1) do not check the return value. I am listing the ones I could
>> find:
> So your concern is that prior to patch #3 in this series, any error returned by
> __change_memory_common() would be due to programming error only. But patch #3
> introduces the possibility of dynamic error (-ENOMEM) due to the need to
> allocate pgtable memory to split a mapping?
>
> There is a WARN_ON_ONCE(ret) for the return code of split_kernel_leaf_mapping()
> which will at least make the error visible, but I agree it's not a great solution.
>
>> set_memory_ro() in bpf_jit_comp.c

Do you mean arch/arm64/net/bpf_jit_comp.c? If so I think you can just 
check the return value then return -EFAULT just like what the above 
set_memory_rw() does.

> There is a set_memory_rw() for the same region of memory directly above this,
> which will return -EFAULT on failure. If that one succeeded, then the pgtable
> must already be appropriately split for set_memory_ro() so that should never
> fail in practice. I agree with improving the robustness of the code by returning
> -EFAULT (or just propagate the error?) as you suggest though.

Yeah, I agree. This one should be easy to resolve.

>
>> set_memory_valid() in kernel_map_pages() in pageattr.c
> This is used by CONFIG_DEBUG_PAGEALLOC to make pages in the linear map invalid
> while they are not in use to catch programming errors. So failing to make a page
> invalid during freeing would not technically lead to a huge issue; it just
> reduces our capability of catching an errant access to that free memory.
>
> In principle, if we were able to make the memory invalid, we should therefore be
> able to make it valid again, because the mappings should be sufficiently split
> already. But that doesn't actually work, because we might be allocating a
> smaller order than was freed so we might not have split at free-time to the
> granularity that is required at allocation-time.
>
> But as you say, for CONFIG_DEBUG_PAGEALLOC we disable this whole path anyway, so
> no issue here.

Yes, agreed.

>
>> set_direct_map_invalid_noflush() in vm_reset_perms() in vmalloc.c
>> set_direct_map_default_noflush() in vm_reset_perms() in vmalloc.c, and in
>> secretmem.c
>> (the secretmem.c ones should be safe as explained in the comments therein)
> Agreed for secretmem. vmalloc looks like a problem though...
>
> If vmalloc was only setting the linear map back to default permissions, I guess
> this wouldn't be an issue because we must have split the linear map successfully
> when changing away from default permissions in the first place. But the fact

Yes, agreed.

> that it is unconditionally setting the linear map pages to invalid then back to
> default causes issues; I guess even without the risk of -ENOMEM, this will cause
> the linear map to be split to PTEs over time as vmalloc allocs and frees?

It is possible. However, vm_reset_perms() is not called that often.
Theoretically there are plenty of other operations, for example
loading/unloading modules, that can cause the linear mapping to be split
over time. So this one is not that special IMHO.

>
> We probably need to think through how we can solve this. It's not clear to me
> why vm_reset_perms wants to unconditionally transiently set to invalid?

It seems like vm_reset_perms() is only called when the VM_FLUSH_RESET_PERMS
flag is passed. It is only passed in for secretmem and hyperv. It sounds
like a preventive security measure to me.

>
>> The first one I think can be handled easily by returning -EFAULT.

It may not be that simple. set_direct_map_invalid_noflush() is called on a
per-page basis, and so is update_range_prot(). If the split requires
allocating multiple page table pages, some of the pages may have their
permissions changed (page table page allocation succeeded) while the rest
are skipped due to page table page allocation failure. vfree() would need to
handle that case by setting the pages' permissions back before returning any
errno.

Anyway it sounds like a general problem rather than ARM specific.
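
As a rough userspace sketch of the rollback described above (not how vfree() is implemented today), assume a hypothetical per-page helper that can fail partway through a range. All model_* names and the fail_at knob are invented for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy state: valid[i] models whether page i is valid in the linear map. */
struct model_pages {
	bool *valid;
	size_t nr;
	size_t fail_at; /* index where the modelled pgtable alloc fails; >= nr means never */
};

/* Stand-in for a per-page permission change that may need a pgtable page. */
static int model_set_invalid(struct model_pages *p, size_t i)
{
	if (i == p->fail_at)
		return -ENOMEM;
	p->valid[i] = false;
	return 0;
}

/* Invalidate every page, or leave all of them valid and return the error. */
int model_invalidate_all(struct model_pages *p)
{
	assert(p != NULL && p->valid != NULL);

	for (size_t i = 0; i < p->nr; i++) {
		int ret = model_set_invalid(p, i);

		if (ret) {
			/*
			 * Undo the pages already changed. Their mappings were
			 * split on the way in, so restoring them is modelled
			 * as needing no further allocation.
			 */
			while (i-- > 0)
				p->valid[i] = true;
			return ret;
		}
	}
	return 0;
}
```

The sketch only shows the all-or-nothing shape; a real caller in a void call chain would still need somewhere to report the error.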

>>
>> For the second, we are already returning in case of !can_set_direct_map, which
>> renders DEBUG_PAGEALLOC useless. So maybe it is
>> safe to ignore the ret from set_memory_valid?
>>
>> For the third, the call chain is a sequence of must-succeed void functions.
>> Notably, when using vfree(), we may have to allocate a single
>> pagetable page for splitting.
>>
>> I am wondering whether we can just have a warn_on_once or something for the case
>> when we fail to allocate a pagetable page. Or, Ryan had
>> suggested in an off-the-list conversation that we can maintain a cache of PTE
>> tables for every PMD block mapping, which will give us
>> the same memory consumption as we do today, but not sure if this is worth it.
>> x86 can already handle splitting but due to the callchains
>> I have described above, it has the same problem, and the code has been working
>> for years :)
> I think it's preferable to avoid having to keep a cache of pgtable memory if we
> can...

Yes, I agree. We simply don't know how many pages we need to cache, and 
it still can't guarantee 100% allocation success.

Thanks,
Yang

>
> Thanks,
> Ryan
>
>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 5 months, 1 week ago

>>>
>>>
>>> I am wondering whether we can just have a warn_on_once or something for the
>>> case when we fail to allocate a pagetable page. Or, Ryan had
>>> suggested in an off-the-list conversation that we can maintain a cache of PTE
>>> tables for every PMD block mapping, which will give us
>>> the same memory consumption as we do today, but not sure if this is worth it.
>>> x86 can already handle splitting but due to the callchains
>>> I have described above, it has the same problem, and the code has been working
>>> for years :)
>> I think it's preferable to avoid having to keep a cache of pgtable memory if we
>> can...
>
> Yes, I agree. We simply don't know how many pages we need to cache, and it
> still can't guarantee 100% allocation success.

This is wrong... We can know how many pages will be needed to split the
linear mapping to PTEs in the worst case once the linear mapping is
finalized. But it may require a few hundred megabytes of memory to
guarantee allocation success. I don't think it is worth it for such a rare
corner case.

Thanks,
Yang

>
> Thanks,
> Yang
>
>>
>> Thanks,
>> Ryan
>>
>>
>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 5 months ago

On 03/09/2025 01:50, Yang Shi wrote:
>>>>
>>>>
>>>> I am wondering whether we can just have a warn_on_once or something for the
>>>> case
>>>> when we fail to allocate a pagetable page. Or, Ryan had
>>>> suggested in an off-the-list conversation that we can maintain a cache of PTE
>>>> tables for every PMD block mapping, which will give us
>>>> the same memory consumption as we do today, but not sure if this is worth it.
>>>> x86 can already handle splitting but due to the callchains
>>>> I have described above, it has the same problem, and the code has been working
>>>> for years :)
>>> I think it's preferable to avoid having to keep a cache of pgtable memory if we
>>> can...
>>
>> Yes, I agree. We simply don't know how many pages we need to cache, and it
>> still can't guarantee 100% allocation success.
> 
> This is wrong... We can know how many pages will be needed to split the linear
> mapping to PTEs in the worst case once the linear mapping is finalized. But it may
> require a few hundred megabytes of memory to guarantee allocation success. I don't
> think it is worth it for such a rare corner case.

Indeed, we know exactly how much memory we need for pgtables to map the linear
map by pte - that's exactly what we are doing today. So we _could_ keep a cache.
We would still get the benefit of improved performance but we would lose the
benefit of reduced memory.
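
A rough worked calculation of that pgtable cost, as a sketch assuming a 4K granule (512 entries per table), a linear map that starts out entirely PUD-block mapped, and a RAM size that is a multiple of 1G. The function name is made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE      4096ULL
#define MODEL_PTRS_PER_TABLE 512ULL
#define MODEL_PMD_SIZE       (MODEL_PTRS_PER_TABLE * MODEL_PAGE_SIZE) /* 2M */
#define MODEL_PUD_SIZE       (MODEL_PTRS_PER_TABLE * MODEL_PMD_SIZE)  /* 1G */

/*
 * Bytes of page-table memory needed to split ram_bytes of PUD-block-mapped
 * linear map all the way down to ptes: one PMD table per 1G block plus one
 * PTE table per 2M block, each table being one 4K page.
 */
uint64_t model_worst_case_pgtable_bytes(uint64_t ram_bytes)
{
	assert(ram_bytes % MODEL_PUD_SIZE == 0);

	uint64_t pte_tables = ram_bytes / MODEL_PMD_SIZE;
	uint64_t pmd_tables = ram_bytes / MODEL_PUD_SIZE;

	return (pte_tables + pmd_tables) * MODEL_PAGE_SIZE;
}
```

Under these assumptions the cost is a little over 1/512 of RAM: about 24M for a 12G VM and about 256M for 128G, which is in line with the "few hundred megabytes" estimate for larger systems.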

I think we need to solve the vm_reset_perms() problem somehow, before we can
enable this.

Thanks,
Ryan

> 
> Thanks,
> Yang
> 
>>
>> Thanks,
>> Yang
>>
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>
>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 5 months ago

On 04/09/2025 14:14, Ryan Roberts wrote:
> On 03/09/2025 01:50, Yang Shi wrote:
>>>>>
>>>>>
>>>>> I am wondering whether we can just have a warn_on_once or something for the
>>>>> case
>>>>> when we fail to allocate a pagetable page. Or, Ryan had
>>>>> suggested in an off-the-list conversation that we can maintain a cache of PTE
>>>>> tables for every PMD block mapping, which will give us
>>>>> the same memory consumption as we do today, but not sure if this is worth it.
>>>>> x86 can already handle splitting but due to the callchains
>>>>> I have described above, it has the same problem, and the code has been working
>>>>> for years :)
>>>> I think it's preferable to avoid having to keep a cache of pgtable memory if we
>>>> can...
>>>
>>> Yes, I agree. We simply don't know how many pages we need to cache, and it
>>> still can't guarantee 100% allocation success.
>>
>> This is wrong... We can know how many pages will be needed to split the linear
>> mapping to PTEs in the worst case once the linear mapping is finalized. But it may
>> require a few hundred megabytes of memory to guarantee allocation success. I don't
>> think it is worth it for such a rare corner case.
> 
> Indeed, we know exactly how much memory we need for pgtables to map the linear
> map by pte - that's exactly what we are doing today. So we _could_ keep a cache.
> We would still get the benefit of improved performance but we would lose the
> benefit of reduced memory.
> 
> I think we need to solve the vm_reset_perms() problem somehow, before we can
> enable this.

Sorry I realise this was not very clear... I am saying I think we need to fix it
somehow. A cache would likely work. But I'd prefer to avoid it if we can find a
better solution.


> 
> Thanks,
> Ryan
> 
>>
>> Thanks,
>> Yang
>>
>>>
>>> Thanks,
>>> Yang
>>>
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>
>>
>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 5 months ago

On 9/4/25 6:16 AM, Ryan Roberts wrote:
> On 04/09/2025 14:14, Ryan Roberts wrote:
>> On 03/09/2025 01:50, Yang Shi wrote:
>>>>>>
>>>>>> I am wondering whether we can just have a warn_on_once or something for the
>>>>>> case
>>>>>> when we fail to allocate a pagetable page. Or, Ryan had
>>>>>> suggested in an off-the-list conversation that we can maintain a cache of PTE
>>>>>> tables for every PMD block mapping, which will give us
>>>>>> the same memory consumption as we do today, but not sure if this is worth it.
>>>>>> x86 can already handle splitting but due to the callchains
>>>>>> I have described above, it has the same problem, and the code has been working
>>>>>> for years :)
>>>>> I think it's preferable to avoid having to keep a cache of pgtable memory if we
>>>>> can...
>>>> Yes, I agree. We simply don't know how many pages we need to cache, and it
>>>> still can't guarantee 100% allocation success.
>>> This is wrong... We can know how many pages will be needed to split the linear
>>> mapping to PTEs in the worst case once the linear mapping is finalized. But it may
>>> require a few hundred megabytes of memory to guarantee allocation success. I don't
>>> think it is worth it for such a rare corner case.
>> Indeed, we know exactly how much memory we need for pgtables to map the linear
>> map by pte - that's exactly what we are doing today. So we _could_ keep a cache.
>> We would still get the benefit of improved performance but we would lose the
>> benefit of reduced memory.
>>
>> I think we need to solve the vm_reset_perms() problem somehow, before we can
>> enable this.
> Sorry I realise this was not very clear... I am saying I think we need to fix it
> somehow. A cache would likely work. But I'd prefer to avoid it if we can find a
> better solution.

Took a deeper look at vm_reset_perms(). It was introduced by commit 
868b104d7379 ("mm/vmalloc: Add flag for freeing of special 
permsissions"). The VM_FLUSH_RESET_PERMS flag is supposed to be set if 
the vmalloc memory is RO and/or ROX. So set_memory_ro() or 
set_memory_rox() is supposed to follow up vmalloc(). So the page table 
should be already split before reaching vfree(). I think this is why 
vm_reset_perms() doesn't check the return value.

I scrutinized all the callsites with the VM_FLUSH_RESET_PERMS flag set. 
Most of them are followed by set_memory_ro() or set_memory_rox(). But there 
are 3 places where I don't see set_memory_ro()/set_memory_rox() being called.

1. BPF trampoline allocation. The BPF trampoline calls 
arch_protect_bpf_trampoline(). The generic implementation does call 
set_memory_rox(). But the x86 and arm64 implementations simply 
return 0. For x86, it is because execmem cache is used and it does call 
set_memory_rox(). ARM64 doesn't need to split page table before this 
series, so it should never fail. I think we just need to use the generic 
implementation (remove arm64 implementation) if this series is merged.

2. BPF dispatcher. It calls execmem_alloc which has VM_FLUSH_RESET_PERMS 
set. But it is used for rw allocation, so VM_FLUSH_RESET_PERMS should be 
unnecessary IIUC. So it doesn't matter even if vm_reset_perms() fails.

3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 also 
called set_memory_rox() before switching to execmem cache. The execmem 
cache calls set_memory_rox(). I don't know why ARM64 doesn't call it.

So I think we just need to fix #1 and #3 per the above analysis. If this 
analysis looks correct to you guys, I will prepare two patches to fix them.

Thanks,
Yang


Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 4 months, 3 weeks ago

>
>
> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 
> also called set_memory_rox() before switching to execmem cache. The 
> execmem cache calls set_memory_rox(). I don't know why ARM64 doesn't 
> call it.

While trying to find the proper Fixes tag for this, I happened to 
figure out why ARM64 doesn't call it. ARM64 actually called 
set_memory_ro() before commit 10d5e97c1bf8 ("arm64: use PAGE_KERNEL_ROX 
directly in alloc_insn_page"). But that commit removed it. It seems like 
the author and reviewers overlooked that set_memory_ro() also changes the 
direct map permission. So I believe adding set_memory_rox() is the right fix.

Thanks,
Yang

>
> So I think we just need to fix #1 and #3 per the above analysis. If 
> this analysis look correct to you guys, I will prepare two patches to 
> fix them.
>
> Thanks,
> Yang
>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 5 months ago


On 9/4/25 10:47 AM, Yang Shi wrote:
>
>
> On 9/4/25 6:16 AM, Ryan Roberts wrote:
>> On 04/09/2025 14:14, Ryan Roberts wrote:
>>> On 03/09/2025 01:50, Yang Shi wrote:
>>>>>>>
>>>>>>> I am wondering whether we can just have a warn_on_once or 
>>>>>>> something for the
>>>>>>> case
>>>>>>> when we fail to allocate a pagetable page. Or, Ryan had
>>>>>>> suggested in an off-the-list conversation that we can maintain a 
>>>>>>> cache of PTE
>>>>>>> tables for every PMD block mapping, which will give us
>>>>>>> the same memory consumption as we do today, but not sure if this 
>>>>>>> is worth it.
>>>>>>> x86 can already handle splitting but due to the callchains
>>>>>>> I have described above, it has the same problem, and the code 
>>>>>>> has been working
>>>>>>> for years :)
>>>>>> I think it's preferable to avoid having to keep a cache of 
>>>>>> pgtable memory if we
>>>>>> can...
>>>>> Yes, I agree. We simply don't know how many pages we need to 
>>>>> cache, and it
>>>>> still can't guarantee 100% allocation success.
>>>> This is wrong... We can know how many pages will be needed for 
>>>> splitting linear
>>>> mapping to PTEs for the worst case once linear mapping is 
>>>> finalized. But it may
>>>> require a few hundred megabytes memory to guarantee allocation 
>>>> success. I don't
>>>> think it is worth for such rare corner case.
>>> Indeed, we know exactly how much memory we need for pgtables to map 
>>> the linear
>>> map by pte - that's exactly what we are doing today. So we _could_ 
>>> keep a cache.
>>> We would still get the benefit of improved performance but we would 
>>> lose the
>>> benefit of reduced memory.
>>>
>>> I think we need to solve the vm_reset_perms() problem somehow, 
>>> before we can
>>> enable this.
>> Sorry I realise this was not very clear... I am saying I think we 
>> need to fix it
>> somehow. A cache would likely work. But I'd prefer to avoid it if we 
>> can find a
>> better solution.
>
> Took a deeper look at vm_reset_perms(). It was introduced by commit 
> 868b104d7379 ("mm/vmalloc: Add flag for freeing of special 
> permsissions"). The VM_FLUSH_RESET_PERMS flag is supposed to be set if 
> the vmalloc memory is RO and/or ROX. So set_memory_ro() or 
> set_memory_rox() is supposed to follow up vmalloc(). So the page table 
> should be already split before reaching vfree(). I think this why 
> vm_reset_perms() doesn't not check return value.
>
> I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set. 
> The most of them has set_memory_ro() or set_memory_rox() followed. But 
> there are 3 places I don't see set_memory_ro()/set_memory_rox() is 
> called.
>
> 1. BPF trampoline allocation. The BPF trampoline calls 
> arch_protect_bpf_trampoline(). The generic implementation does call 
> set_memory_rox(). But the x86 and arm64 implementation just simply 
> return 0. For x86, it is because execmem cache is used and it does 
> call set_memory_rox(). ARM64 doesn't need to split page table before 
> this series, so it should never fail. I think we just need to use the 
> generic implementation (remove arm64 implementation) if this series is 
> merged.
>
> 2. BPF dispatcher. It calls execmem_alloc which has 
> VM_FLUSH_RESET_PERMS set. But it is used for rw allocation, so 
> VM_FLUSH_RESET_PERMS should be unnecessary IIUC. So it doesn't matter 
> even though vm_reset_perms() fails.
>
> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 
> also called set_memory_rox() before switching to execmem cache. The 
> execmem cache calls set_memory_rox(). I don't know why ARM64 doesn't 
> call it.
>
> So I think we just need to fix #1 and #3 per the above analysis. If 
> this analysis look correct to you guys, I will prepare two patches to 
> fix them.

Tested the patch below with bpftrace kfunc (which allocates a bpf trampoline) 
and kprobes. It seems to work well.

diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
index 0c5d408afd95..c4f8c4750f1e 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -10,6 +10,7 @@

  #define pr_fmt(fmt) "kprobes: " fmt

+#include <linux/execmem.h>
  #include <linux/extable.h>
  #include <linux/kasan.h>
  #include <linux/kernel.h>
@@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
  static void __kprobes
  post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, struct pt_regs *);

+void *alloc_insn_page(void)
+{
+       void *page;
+
+       page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
+       if (!page)
+               return NULL;
+       set_memory_rox((unsigned long)page, 1);
+       return page;
+}
+
  static void __kprobes arch_prepare_ss_slot(struct kprobe *p)
  {
         kprobe_opcode_t *addr = p->ainsn.xol_insn;
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 52ffe115a8c4..3e301bc2cd66 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void *image, unsigned int size)
         bpf_prog_pack_free(image, size);
  }

-int arch_protect_bpf_trampoline(void *image, unsigned int size)
-{
-       return 0;
-}
-
  int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image,
                                 void *ro_image_end, const struct btf_func_model *m,
                                 u32 flags, struct bpf_tramp_links *tlinks,



Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 5 months ago

On 04/09/2025 22:49, Yang Shi wrote:
> 
> 
> On 9/4/25 10:47 AM, Yang Shi wrote:
>>
>>
>> On 9/4/25 6:16 AM, Ryan Roberts wrote:
>>> On 04/09/2025 14:14, Ryan Roberts wrote:
>>>> On 03/09/2025 01:50, Yang Shi wrote:
>>>>>>>>
>>>>>>>> I am wondering whether we can just have a warn_on_once or something for the
>>>>>>>> case
>>>>>>>> when we fail to allocate a pagetable page. Or, Ryan had
>>>>>>>> suggested in an off-the-list conversation that we can maintain a cache
>>>>>>>> of PTE
>>>>>>>> tables for every PMD block mapping, which will give us
>>>>>>>> the same memory consumption as we do today, but not sure if this is
>>>>>>>> worth it.
>>>>>>>> x86 can already handle splitting but due to the callchains
>>>>>>>> I have described above, it has the same problem, and the code has been
>>>>>>>> working
>>>>>>>> for years :)
>>>>>>> I think it's preferable to avoid having to keep a cache of pgtable memory
>>>>>>> if we
>>>>>>> can...
>>>>>> Yes, I agree. We simply don't know how many pages we need to cache, and it
>>>>>> still can't guarantee 100% allocation success.
>>>>> This is wrong... We can know how many pages will be needed for splitting
>>>>> linear
>>>>> mapping to PTEs for the worst case once linear mapping is finalized. But it
>>>>> may
>>>>> require a few hundred megabytes memory to guarantee allocation success. I
>>>>> don't
>>>>> think it is worth for such rare corner case.
>>>> Indeed, we know exactly how much memory we need for pgtables to map the linear
>>>> map by pte - that's exactly what we are doing today. So we _could_ keep a
>>>> cache.
>>>> We would still get the benefit of improved performance but we would lose the
>>>> benefit of reduced memory.
>>>>
>>>> I think we need to solve the vm_reset_perms() problem somehow, before we can
>>>> enable this.
>>> Sorry I realise this was not very clear... I am saying I think we need to fix it
>>> somehow. A cache would likely work. But I'd prefer to avoid it if we can find a
>>> better solution.
>>
>> Took a deeper look at vm_reset_perms(). It was introduced by commit
>> 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions"). The
>> VM_FLUSH_RESET_PERMS flag is supposed to be set if the vmalloc memory is RO
>> and/or ROX. So set_memory_ro() or set_memory_rox() is supposed to follow up
>> vmalloc(). So the page table should be already split before reaching vfree().
>> I think this why vm_reset_perms() doesn't not check return value.

If vm_reset_perms() is assuming it can't/won't fail, I think it should at least
output a warning if it does?

>>
>> I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set. 

Just checking; I think you made a comment before about there only being a few
sites that set VM_FLUSH_RESET_PERMS. But one of them is the helper,
set_vm_flush_reset_perms(). So just making sure you also followed to the places
that use that helper?

>> The most
>> of them has set_memory_ro() or set_memory_rox() followed. 

And are all callsites calling set_memory_*() for the entire cell that was
allocated by vmalloc? If there are cases where it only calls that for a portion
of it, then it's not guaranteed that the memory is correctly split.

>> But there are 3
>> places I don't see set_memory_ro()/set_memory_rox() is called.
>>
>> 1. BPF trampoline allocation. The BPF trampoline calls
>> arch_protect_bpf_trampoline(). The generic implementation does call
>> set_memory_rox(). But the x86 and arm64 implementation just simply return 0.
>> For x86, it is because execmem cache is used and it does call
>> set_memory_rox(). ARM64 doesn't need to split page table before this series,
>> so it should never fail. I think we just need to use the generic
>> implementation (remove arm64 implementation) if this series is merged.

I know zero about BPF. But it looks like the allocation happens in
arch_alloc_bpf_trampoline(), which for arm64, calls bpf_prog_pack_alloc(). And
for small sizes, it grabs some memory from a "pack". So doesn't this mean that
you are calling set_memory_rox() for a sub-region of the cell, so that doesn't
actually help at vm_reset_perms()-time?

>>
>> 2. BPF dispatcher. It calls execmem_alloc which has VM_FLUSH_RESET_PERMS set.
>> But it is used for rw allocation, so VM_FLUSH_RESET_PERMS should be
>> unnecessary IIUC. So it doesn't matter even though vm_reset_perms() fails.
>>
>> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 also
>> called set_memory_rox() before switching to execmem cache. The execmem cache
>> calls set_memory_rox(). I don't know why ARM64 doesn't call it.
>>
>> So I think we just need to fix #1 and #3 per the above analysis. If this
>> analysis look correct to you guys, I will prepare two patches to fix them.

This all seems quite fragile. I find it interesting that vm_reset_perms() is
doing break-before-make; it sets the PTEs as invalid, then flushes the TLB, then
sets them to default. But for arm64, at least, I think break-before-make is not
required. We are only changing the permissions so that can be done on live
mappings; essentially change the sequence to: set default, flush TLB.

If we do that, then if the memory was already default, then there is no need to
do anything (so no chance of allocation failure). If the memory was not default,
then it must have already been split to make it non-default, in which case we
can also guarantee that no allocations are required.

What am I missing?
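
To make the argument concrete, here is a minimal user-space sketch. It is
only a toy model under stated assumptions: one linear-map block that is
either unsplit or already backed by a PTE table, and a counter for
page-table allocations. All toy_* names are hypothetical and none of them
are kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_perm { TOY_DEFAULT, TOY_ROX };

/* Toy stand-in for one linear-map block covering the page of interest. */
struct toy_block {
	bool split;		/* true once a PTE table replaced the block */
	enum toy_perm perm;	/* permission of the page of interest */
	int allocs;		/* page-table pages allocated so far */
};

/* Changing one page inside an unsplit block needs a PTE table first. */
static void toy_split_if_needed(struct toy_block *b)
{
	if (!b->split) {
		b->split = true;
		b->allocs++;
	}
}

/* Models a set_memory_rox()-style change: the split happens here. */
static void toy_set_rox(struct toy_block *b)
{
	toy_split_if_needed(b);
	b->perm = TOY_ROX;
}

/* Current sequence: invalid, flush, default. Returns allocations needed. */
static int toy_reset_via_invalid(struct toy_block *b)
{
	int before = b->allocs;

	toy_split_if_needed(b);	/* marking one page invalid needs a PTE */
	b->perm = TOY_DEFAULT;
	return b->allocs - before;
}

/* Proposed sequence: set default on the live mapping, then flush. */
static int toy_reset_live(struct toy_block *b)
{
	int before = b->allocs;

	if (b->perm != TOY_DEFAULT) {
		assert(b->split);	/* non-default implies an earlier split */
		b->perm = TOY_DEFAULT;
	}
	return b->allocs - before;
}
```

In this model toy_reset_live() never allocates, whether or not the caller
ever changed the permissions, while toy_reset_via_invalid() allocates
whenever the block was never split.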

Thanks,
Ryan


> 
> Tested the below patch with bpftrace kfunc (allocate bpf trampoline) and
> kprobes. It seems work well.
> 
> diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/
> kprobes.c
> index 0c5d408afd95..c4f8c4750f1e 100644
> --- a/arch/arm64/kernel/probes/kprobes.c
> +++ b/arch/arm64/kernel/probes/kprobes.c
> @@ -10,6 +10,7 @@
> 
>  #define pr_fmt(fmt) "kprobes: " fmt
> 
> +#include <linux/execmem.h>
>  #include <linux/extable.h>
>  #include <linux/kasan.h>
>  #include <linux/kernel.h>
> @@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>  static void __kprobes
>  post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, struct pt_regs *);
> 
> +void *alloc_insn_page(void)
> +{
> +       void *page;
> +
> +       page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
> +       if (!page)
> +               return NULL;
> +       set_memory_rox((unsigned long)page, 1);
> +       return page;
> +}
> +
>  static void __kprobes arch_prepare_ss_slot(struct kprobe *p)
>  {
>         kprobe_opcode_t *addr = p->ainsn.xol_insn;
> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> index 52ffe115a8c4..3e301bc2cd66 100644
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void *image, unsigned int
> size)
>         bpf_prog_pack_free(image, size);
>  }
> 
> -int arch_protect_bpf_trampoline(void *image, unsigned int size)
> -{
> -       return 0;
> -}
> -
>  int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image,
>                                 void *ro_image_end, const struct btf_func_model *m,
>                                 u32 flags, struct bpf_tramp_links *tlinks,
> 
> 

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 5 months ago


On 9/8/25 9:34 AM, Ryan Roberts wrote:
> On 04/09/2025 22:49, Yang Shi wrote:
>>
>> On 9/4/25 10:47 AM, Yang Shi wrote:
>>>
>>> On 9/4/25 6:16 AM, Ryan Roberts wrote:
>>>> On 04/09/2025 14:14, Ryan Roberts wrote:
>>>>> On 03/09/2025 01:50, Yang Shi wrote:
>>>>>>>>> I am wondering whether we can just have a warn_on_once or something for the
>>>>>>>>> case
>>>>>>>>> when we fail to allocate a pagetable page. Or, Ryan had
>>>>>>>>> suggested in an off-the-list conversation that we can maintain a cache
>>>>>>>>> of PTE
>>>>>>>>> tables for every PMD block mapping, which will give us
>>>>>>>>> the same memory consumption as we do today, but not sure if this is
>>>>>>>>> worth it.
>>>>>>>>> x86 can already handle splitting but due to the callchains
>>>>>>>>> I have described above, it has the same problem, and the code has been
>>>>>>>>> working
>>>>>>>>> for years :)
>>>>>>>> I think it's preferable to avoid having to keep a cache of pgtable memory
>>>>>>>> if we
>>>>>>>> can...
>>>>>>> Yes, I agree. We simply don't know how many pages we need to cache, and it
>>>>>>> still can't guarantee 100% allocation success.
>>>>>> This is wrong... We can know how many pages will be needed for splitting
>>>>>> linear
>>>>>> mapping to PTEs for the worst case once linear mapping is finalized. But it
>>>>>> may
>>>>>> require a few hundred megabytes memory to guarantee allocation success. I
>>>>>> don't
>>>>>> think it is worth for such rare corner case.
>>>>> Indeed, we know exactly how much memory we need for pgtables to map the linear
>>>>> map by pte - that's exactly what we are doing today. So we _could_ keep a
>>>>> cache.
>>>>> We would still get the benefit of improved performance but we would lose the
>>>>> benefit of reduced memory.
>>>>>
>>>>> I think we need to solve the vm_reset_perms() problem somehow, before we can
>>>>> enable this.
>>>> Sorry I realise this was not very clear... I am saying I think we need to fix it
>>>> somehow. A cache would likely work. But I'd prefer to avoid it if we can find a
>>>> better solution.
>>> Took a deeper look at vm_reset_perms(). It was introduced by commit
>>> 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions"). The
>>> VM_FLUSH_RESET_PERMS flag is supposed to be set if the vmalloc memory is RO
>>> and/or ROX. So set_memory_ro() or set_memory_rox() is supposed to follow up
>>> vmalloc(). So the page table should be already split before reaching vfree().
>>> I think this why vm_reset_perms() doesn't not check return value.
> If vm_reset_perms() is assuming it can't/won't fail, I think it should at least
> output a warning if it does?

It should. In any case, a warning will be raised if the split fails, so we 
have some mitigation.

>
>>> I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set.
> Just checking; I think you made a comment before about there only being a few
> sites that set VM_FLUSH_RESET_PERMS. But one of them is the helper,
> set_vm_flush_reset_perms(). So just making sure you also followed to the places
> that use that helper?

Yes, I did.

>
>>> The most
>>> of them has set_memory_ro() or set_memory_rox() followed.
> And are all callsites calling set_memory_*() for the entire cell that was
> allocated by vmalloc? If there are cases where it only calls that for a portion
> of it, then it's not gurranteed that the memory is correctly split.

Yes, all callsites call set_memory_*() for the entire range.

>
>>> But there are 3
>>> places I don't see set_memory_ro()/set_memory_rox() is called.
>>>
>>> 1. BPF trampoline allocation. The BPF trampoline calls
>>> arch_protect_bpf_trampoline(). The generic implementation does call
>>> set_memory_rox(). But the x86 and arm64 implementation just simply return 0.
>>> For x86, it is because execmem cache is used and it does call
>>> set_memory_rox(). ARM64 doesn't need to split page table before this series,
>>> so it should never fail. I think we just need to use the generic
>>> implementation (remove arm64 implementation) if this series is merged.
> I know zero about BPF. But it looks like the allocation happens in
> arch_alloc_bpf_trampoline(), which for arm64, calls bpf_prog_pack_alloc(). And
> for small sizes, it grabs some memory from a "pack". So doesn't this mean that
> you are calling set_memory_rox() for a sub-region of the cell, so that doesn't
> actually help at vm_reset_perms()-time?

Took a deeper look at bpf pack allocator. The "pack" is allocated by 
alloc_new_pack(), which does:
bpf_jit_alloc_exec()
set_vm_flush_reset_perms()
set_memory_rox()

If the size is greater than the pack size, it calls:
bpf_jit_alloc_exec()
set_vm_flush_reset_perms()
set_memory_rox()

So it looks like the bpf trampoline is good, and we don't need to do anything. 
It should be removed from the list. I didn't look deeply enough into the bpf 
pack allocator in the first place.

>
>>> 2. BPF dispatcher. It calls execmem_alloc which has VM_FLUSH_RESET_PERMS set.
>>> But it is used for rw allocation, so VM_FLUSH_RESET_PERMS should be
>>> unnecessary IIUC. So it doesn't matter even though vm_reset_perms() fails.
>>>
>>> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 also
>>> called set_memory_rox() before switching to execmem cache. The execmem cache
>>> calls set_memory_rox(). I don't know why ARM64 doesn't call it.
>>>
>>> So I think we just need to fix #1 and #3 per the above analysis. If this
>>> analysis look correct to you guys, I will prepare two patches to fix them.
> This all seems quite fragile. I find it interesting that vm_reset_perms() is
> doing break-before-make; it sets the PTEs as invalid, then flushes the TLB, then
> sets them to default. But for arm64, at least, I think break-before-make is not
> required. We are only changing the permissions so that can be done on live
> mappings; essentially change the sequence to; set default, flush TLB.

Yeah, I agree it is a little bit fragile. I think this is the "contract" 
for vmalloc users: if you allocate ROX memory via vmalloc, you are required 
to call set_memory_*(). But there is nothing to guarantee the "contract" 
is followed. That said, I don't think this is the only such case in the kernel.

>
> If we do that, then if the memory was already default, then there is no need to
> do anything (so no chance of allocation failure). If the memory was not default,
> then it must have already been split to make it non-default, in which case we
> can also gurrantee that no allocations are required.
>
> What am I missing?

The comment says:
Set direct map to something invalid so that it won't be cached if there 
are any accesses after the TLB flush, then flush the TLB and reset the 
direct map permissions to the default.

IIUC, it guarantees the direct map can't be cached in the TLB after the TLB 
flush from _vm_unmap_aliases() by setting the entries invalid, because the 
TLB never caches invalid entries. Skipping the step that sets the direct map 
to invalid seems to break this. Or can "changing permission on live 
mappings" on ARM64 achieve the same goal?

Thanks,
Yang

> Thanks,
> Ryan
>
>
>> Tested the below patch with bpftrace kfunc (allocate bpf trampoline) and
>> kprobes. It seems work well.
>>
>> diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/
>> kprobes.c
>> index 0c5d408afd95..c4f8c4750f1e 100644
>> --- a/arch/arm64/kernel/probes/kprobes.c
>> +++ b/arch/arm64/kernel/probes/kprobes.c
>> @@ -10,6 +10,7 @@
>>
>>   #define pr_fmt(fmt) "kprobes: " fmt
>>
>> +#include <linux/execmem.h>
>>   #include <linux/extable.h>
>>   #include <linux/kasan.h>
>>   #include <linux/kernel.h>
>> @@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>>   static void __kprobes
>>   post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, struct pt_regs *);
>>
>> +void *alloc_insn_page(void)
>> +{
>> +       void *page;
>> +
>> +       page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
>> +       if (!page)
>> +               return NULL;
>> +       set_memory_rox((unsigned long)page, 1);
>> +       return page;
>> +}
>> +
>>   static void __kprobes arch_prepare_ss_slot(struct kprobe *p)
>>   {
>>          kprobe_opcode_t *addr = p->ainsn.xol_insn;
>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
>> index 52ffe115a8c4..3e301bc2cd66 100644
>> --- a/arch/arm64/net/bpf_jit_comp.c
>> +++ b/arch/arm64/net/bpf_jit_comp.c
>> @@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void *image, unsigned int
>> size)
>>          bpf_prog_pack_free(image, size);
>>   }
>>
>> -int arch_protect_bpf_trampoline(void *image, unsigned int size)
>> -{
>> -       return 0;
>> -}
>> -
>>   int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image,
>>                                  void *ro_image_end, const struct btf_func_model *m,
>>                                  u32 flags, struct bpf_tramp_links *tlinks,
>>
>>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 5 months ago

On 08/09/2025 19:31, Yang Shi wrote:
> 
> 
> On 9/8/25 9:34 AM, Ryan Roberts wrote:
>> On 04/09/2025 22:49, Yang Shi wrote:
>>>
>>> On 9/4/25 10:47 AM, Yang Shi wrote:
>>>>
>>>> On 9/4/25 6:16 AM, Ryan Roberts wrote:
>>>>> On 04/09/2025 14:14, Ryan Roberts wrote:
>>>>>> On 03/09/2025 01:50, Yang Shi wrote:
>>>>>>>>>> I am wondering whether we can just have a warn_on_once or something
>>>>>>>>>> for the
>>>>>>>>>> case
>>>>>>>>>> when we fail to allocate a pagetable page. Or, Ryan had
>>>>>>>>>> suggested in an off-the-list conversation that we can maintain a cache
>>>>>>>>>> of PTE
>>>>>>>>>> tables for every PMD block mapping, which will give us
>>>>>>>>>> the same memory consumption as we do today, but not sure if this is
>>>>>>>>>> worth it.
>>>>>>>>>> x86 can already handle splitting but due to the callchains
>>>>>>>>>> I have described above, it has the same problem, and the code has been
>>>>>>>>>> working
>>>>>>>>>> for years :)
>>>>>>>>> I think it's preferable to avoid having to keep a cache of pgtable memory
>>>>>>>>> if we
>>>>>>>>> can...
>>>>>>>> Yes, I agree. We simply don't know how many pages we need to cache, and it
>>>>>>>> still can't guarantee 100% allocation success.
>>>>>>> This is wrong... We can know how many pages will be needed for splitting
>>>>>>> linear
>>>>>>> mapping to PTEs for the worst case once linear mapping is finalized. But it
>>>>>>> may
>>>>>>> require a few hundred megabytes memory to guarantee allocation success. I
>>>>>>> don't
>>>>>>> think it is worth for such rare corner case.
>>>>>> Indeed, we know exactly how much memory we need for pgtables to map the
>>>>>> linear
>>>>>> map by pte - that's exactly what we are doing today. So we _could_ keep a
>>>>>> cache.
>>>>>> We would still get the benefit of improved performance but we would lose the
>>>>>> benefit of reduced memory.
>>>>>>
>>>>>> I think we need to solve the vm_reset_perms() problem somehow, before we can
>>>>>> enable this.
>>>>> Sorry I realise this was not very clear... I am saying I think we need to
>>>>> fix it
>>>>> somehow. A cache would likely work. But I'd prefer to avoid it if we can
>>>>> find a
>>>>> better solution.
>>>> Took a deeper look at vm_reset_perms(). It was introduced by commit
>>>> 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions"). The
>>>> VM_FLUSH_RESET_PERMS flag is supposed to be set if the vmalloc memory is RO
>>>> and/or ROX. So set_memory_ro() or set_memory_rox() is supposed to follow up
>>>> vmalloc(). So the page table should be already split before reaching vfree().
>>>> I think this why vm_reset_perms() doesn't not check return value.
>> If vm_reset_perms() is assuming it can't/won't fail, I think it should at least
>> output a warning if it does?
> 
> It should. Anyway warning will be raised if split fails. We have somehow
> mitigation.
> 
>>
>>>> I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set.
>> Just checking; I think you made a comment before about there only being a few
>> sites that set VM_FLUSH_RESET_PERMS. But one of them is the helper,
>> set_vm_flush_reset_perms(). So just making sure you also followed to the places
>> that use that helper?
> 
> Yes, I did.
> 
>>
>>>> The most
>>>> of them has set_memory_ro() or set_memory_rox() followed.
>> And are all callsites calling set_memory_*() for the entire cell that was
>> allocated by vmalloc? If there are cases where it only calls that for a portion
>> of it, then it's not gurranteed that the memory is correctly split.
> 
> Yes, all callsites call set_memory_*() for the entire range.
> 
>>
>>>> But there are 3
>>>> places I don't see set_memory_ro()/set_memory_rox() is called.
>>>>
>>>> 1. BPF trampoline allocation. The BPF trampoline calls
>>>> arch_protect_bpf_trampoline(). The generic implementation does call
>>>> set_memory_rox(). But the x86 and arm64 implementation just simply return 0.
>>>> For x86, it is because execmem cache is used and it does call
>>>> set_memory_rox(). ARM64 doesn't need to split page table before this series,
>>>> so it should never fail. I think we just need to use the generic
>>>> implementation (remove arm64 implementation) if this series is merged.
>> I know zero about BPF. But it looks like the allocation happens in
>> arch_alloc_bpf_trampoline(), which for arm64, calls bpf_prog_pack_alloc(). And
>> for small sizes, it grabs some memory from a "pack". So doesn't this mean that
>> you are calling set_memory_rox() for a sub-region of the cell, so that doesn't
>> actually help at vm_reset_perms()-time?
> 
> Took a deeper look at bpf pack allocator. The "pack" is allocated by
> alloc_new_pack(), which does:
> bpf_jit_alloc_exec()
> set_vm_flush_reset_perms()
> set_memory_rox()
> 
> If the size is greater than the pack size, it calls:
> bpf_jit_alloc_exec()
> set_vm_flush_reset_perms()
> set_memory_rox()
> 
> So it looks like bpf trampoline is good, and we don't need do anything. It
> should be removed from the list. I didn't look deep enough for bpf pack
> allocator in the first place.
> 
>>
>>>> 2. BPF dispatcher. It calls execmem_alloc which has VM_FLUSH_RESET_PERMS set.
>>>> But it is used for rw allocation, so VM_FLUSH_RESET_PERMS should be
>>>> unnecessary IIUC. So it doesn't matter even though vm_reset_perms() fails.
>>>>
>>>> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 also
>>>> called set_memory_rox() before switching to execmem cache. The execmem cache
>>>> calls set_memory_rox(). I don't know why ARM64 doesn't call it.
>>>>
>>>> So I think we just need to fix #1 and #3 per the above analysis. If this
>>>> analysis look correct to you guys, I will prepare two patches to fix them.
>> This all seems quite fragile. I find it interesting that vm_reset_perms() is
>> doing break-before-make; it sets the PTEs as invalid, then flushes the TLB, then
>> sets them to default. But for arm64, at least, I think break-before-make is not
>> required. We are only changing the permissions so that can be done on live
>> mappings; essentially change the sequence to; set default, flush TLB.
> 
> Yeah, I agree it is a little bit fragile. I think this is the "contract" for
> vmalloc users. You allocate ROX memory via vmalloc, you are required to call
> set_memory_*(). But there is nothing to guarantee the "contract" is followed.
> But I don't think this is the only case in kernel.
> 
>>
>> If we do that, then if the memory was already default, then there is no need to
>> do anything (so no chance of allocation failure). If the memory was not default,
>> then it must have already been split to make it non-default, in which case we
>> can also gurrantee that no allocations are required.
>>
>> What am I missing?
> 
> The comment says:
> Set direct map to something invalid so that it won't be cached if there are any
> accesses after the TLB flush, then flush the TLB and reset the direct map
> permissions to the default.
> 
> IIUC, it guarantees the direct map can't be cached in TLB after TLB flush from
> _vm_unmap_aliases() by setting them invalid because TLB never cache invalid
> entries. Skipping set direct map to invalid seems break this. Or "changing
> permission on live mappings" on ARM64 can achieve the same goal?

Here's my understanding of the intent of the code:

Let's say we start with some memory that has been mapped RO. Our goal is to
reset the memory back to RW and ensure that no TLB entry remains in the TLB for
the old RO mapping. There are 2 ways to do that:

Approach 1 (used in current code):
1. set PTE to invalid
2. invalidate any TLB entry for the VA
3. set the PTE to RW

Approach 2:
1. set the PTE to RW
2. invalidate any TLB entry for the VA

The benefit of approach 1 is that it is guaranteed to be impossible for
different CPUs to have different translations for the same VA in their
respective TLBs. But for approach 2, it's possible that between steps 1 and 2,
one CPU has an RO entry and another CPU has an RW entry. But that will get fixed
once the TLB is flushed - it's not really an issue.

(There is probably also an obscure way to end up with 2 TLB entries (one with RO
and one with RW) for the same CPU, but the arm64 architecture permits that as
long as it's only a permission mismatch).

Anyway, approach 2 is used when changing memory permissions on user mappings, so
I don't see why we can't take the same approach here. That would solve this
whole class of issue for us.
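
To make the two sequences concrete, here is a toy user-space model; it is only
a sketch, not kernel code. Every name in it (toy_mmu, toy_access, toy_tlbi,
toy_reset_bbm, toy_reset_live) is hypothetical, and it assumes one PTE, two
CPUs, and a TLB that only ever caches valid entries.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2

enum perm { PERM_INVALID, PERM_RO, PERM_RW };

struct toy_mmu {
	enum perm pte;          /* the single page-table entry for the VA */
	enum perm tlb[NR_CPUS]; /* per-CPU cached copy; PERM_INVALID = no entry */
};

/* A CPU touches the VA: on a TLB miss it caches the PTE, but only if valid. */
static enum perm toy_access(struct toy_mmu *m, int cpu)
{
	if (m->tlb[cpu] == PERM_INVALID && m->pte != PERM_INVALID)
		m->tlb[cpu] = m->pte;
	return m->tlb[cpu];
}

/* Broadcast invalidate: drop the cached entry on every CPU. */
static void toy_tlbi(struct toy_mmu *m)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		m->tlb[cpu] = PERM_INVALID;
}

/* True if two CPUs currently hold different valid translations. */
static bool toy_tlbs_disagree(const struct toy_mmu *m)
{
	enum perm seen = PERM_INVALID;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (m->tlb[cpu] == PERM_INVALID)
			continue;
		if (seen != PERM_INVALID && seen != m->tlb[cpu])
			return true;
		seen = m->tlb[cpu];
	}
	return false;
}

/* Approach 1: invalid, tlbi, RW. racing_cpu accesses the VA after each step. */
static bool toy_reset_bbm(struct toy_mmu *m, int racing_cpu)
{
	bool disagreed = false;

	m->pte = PERM_INVALID;
	toy_access(m, racing_cpu);
	disagreed |= toy_tlbs_disagree(m);
	toy_tlbi(m);
	toy_access(m, racing_cpu);
	disagreed |= toy_tlbs_disagree(m);
	m->pte = PERM_RW;
	toy_access(m, racing_cpu);
	disagreed |= toy_tlbs_disagree(m);
	return disagreed;
}

/* Approach 2: RW, tlbi. racing_cpu accesses the VA between the two steps. */
static bool toy_reset_live(struct toy_mmu *m, int racing_cpu)
{
	bool disagreed;

	m->pte = PERM_RW;
	toy_access(m, racing_cpu);
	disagreed = toy_tlbs_disagree(m);
	toy_tlbi(m);
	return disagreed;
}
```

Starting from a RO PTE that CPU 0 has cached and CPU 1 has not, the model
reports a transient RO/RW disagreement only for toy_reset_live(), and in both
cases no RO entry survives the sequence.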

Thanks,
Ryan


> 
> Thanks,
> Yang
> 
>> Thanks,
>> Ryan
>>
>>
>>> Tested the below patch with bpftrace kfunc (allocate bpf trampoline) and
>>> kprobes. It seems work well.
>>>
>>> diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/
>>> kprobes.c
>>> index 0c5d408afd95..c4f8c4750f1e 100644
>>> --- a/arch/arm64/kernel/probes/kprobes.c
>>> +++ b/arch/arm64/kernel/probes/kprobes.c
>>> @@ -10,6 +10,7 @@
>>>
>>>   #define pr_fmt(fmt) "kprobes: " fmt
>>>
>>> +#include <linux/execmem.h>
>>>   #include <linux/extable.h>
>>>   #include <linux/kasan.h>
>>>   #include <linux/kernel.h>
>>> @@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>>>   static void __kprobes
>>>   post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, struct pt_regs
>>> *);
>>>
>>> +void *alloc_insn_page(void)
>>> +{
>>> +       void *page;
>>> +
>>> +       page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
>>> +       if (!page)
>>> +               return NULL;
>>> +       set_memory_rox((unsigned long)page, 1);
>>> +       return page;
>>> +}
>>> +
>>>   static void __kprobes arch_prepare_ss_slot(struct kprobe *p)
>>>   {
>>>          kprobe_opcode_t *addr = p->ainsn.xol_insn;
>>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
>>> index 52ffe115a8c4..3e301bc2cd66 100644
>>> --- a/arch/arm64/net/bpf_jit_comp.c
>>> +++ b/arch/arm64/net/bpf_jit_comp.c
>>> @@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void *image, unsigned int
>>> size)
>>>          bpf_prog_pack_free(image, size);
>>>   }
>>>
>>> -int arch_protect_bpf_trampoline(void *image, unsigned int size)
>>> -{
>>> -       return 0;
>>> -}
>>> -
>>>   int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image,
>>>                                  void *ro_image_end, const struct
>>> btf_func_model *m,
>>>                                  u32 flags, struct bpf_tramp_links *tlinks,
>>>
>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>>> Thanks,
>>>>>>> Yang
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>>
>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 5 months ago


On 9/9/25 7:36 AM, Ryan Roberts wrote:
> On 08/09/2025 19:31, Yang Shi wrote:
>>
>> On 9/8/25 9:34 AM, Ryan Roberts wrote:
>>> On 04/09/2025 22:49, Yang Shi wrote:
>>>> On 9/4/25 10:47 AM, Yang Shi wrote:
>>>>> On 9/4/25 6:16 AM, Ryan Roberts wrote:
>>>>>> On 04/09/2025 14:14, Ryan Roberts wrote:
>>>>>>> On 03/09/2025 01:50, Yang Shi wrote:
>>>>>>>>>>> I am wondering whether we can just have a warn_on_once or something
>>>>>>>>>>> for the
>>>>>>>>>>> case
>>>>>>>>>>> when we fail to allocate a pagetable page. Or, Ryan had
>>>>>>>>>>> suggested in an off-the-list conversation that we can maintain a cache
>>>>>>>>>>> of PTE
>>>>>>>>>>> tables for every PMD block mapping, which will give us
>>>>>>>>>>> the same memory consumption as we do today, but not sure if this is
>>>>>>>>>>> worth it.
>>>>>>>>>>> x86 can already handle splitting but due to the callchains
>>>>>>>>>>> I have described above, it has the same problem, and the code has been
>>>>>>>>>>> working
>>>>>>>>>>> for years :)
>>>>>>>>>> I think it's preferable to avoid having to keep a cache of pgtable memory
>>>>>>>>>> if we
>>>>>>>>>> can...
>>>>>>>>> Yes, I agree. We simply don't know how many pages we need to cache, and it
>>>>>>>>> still can't guarantee 100% allocation success.
>>>>>>>> This is wrong... We can know how many pages will be needed for splitting
>>>>>>>> linear
>>>>>>>> mapping to PTEs for the worst case once linear mapping is finalized. But it
>>>>>>>> may
>>>>>>>> require a few hundred megabytes memory to guarantee allocation success. I
>>>>>>>> don't
>>>>>>>> think it is worth for such rare corner case.
>>>>>>> Indeed, we know exactly how much memory we need for pgtables to map the
>>>>>>> linear
>>>>>>> map by pte - that's exactly what we are doing today. So we _could_ keep a
>>>>>>> cache.
>>>>>>> We would still get the benefit of improved performance but we would lose the
>>>>>>> benefit of reduced memory.
>>>>>>>
>>>>>>> I think we need to solve the vm_reset_perms() problem somehow, before we can
>>>>>>> enable this.
>>>>>> Sorry I realise this was not very clear... I am saying I think we need to
>>>>>> fix it
>>>>>> somehow. A cache would likely work. But I'd prefer to avoid it if we can
>>>>>> find a
>>>>>> better solution.
>>>>> Took a deeper look at vm_reset_perms(). It was introduced by commit
>>>>> 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions"). The
>>>>> VM_FLUSH_RESET_PERMS flag is supposed to be set if the vmalloc memory is RO
>>>>> and/or ROX. So set_memory_ro() or set_memory_rox() is supposed to follow up
>>>>> vmalloc(). So the page table should be already split before reaching vfree().
>>>>> I think this why vm_reset_perms() doesn't not check return value.
>>> If vm_reset_perms() is assuming it can't/won't fail, I think it should at least
>>> output a warning if it does?
>> It should. Anyway warning will be raised if split fails. We have somehow
>> mitigation.
>>
>>>>> I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set.
>>> Just checking; I think you made a comment before about there only being a few
>>> sites that set VM_FLUSH_RESET_PERMS. But one of them is the helper,
>>> set_vm_flush_reset_perms(). So just making sure you also followed to the places
>>> that use that helper?
>> Yes, I did.
>>
>>>>> The most
>>>>> of them has set_memory_ro() or set_memory_rox() followed.
>>> And are all callsites calling set_memory_*() for the entire cell that was
>>> allocated by vmalloc? If there are cases where it only calls that for a portion
>>> of it, then it's not gurranteed that the memory is correctly split.
>> Yes, all callsites call set_memory_*() for the entire range.
>>
>>>>> But there are 3
>>>>> places I don't see set_memory_ro()/set_memory_rox() is called.
>>>>>
>>>>> 1. BPF trampoline allocation. The BPF trampoline calls
>>>>> arch_protect_bpf_trampoline(). The generic implementation does call
>>>>> set_memory_rox(). But the x86 and arm64 implementation just simply return 0.
>>>>> For x86, it is because execmem cache is used and it does call
>>>>> set_memory_rox(). ARM64 doesn't need to split page table before this series,
>>>>> so it should never fail. I think we just need to use the generic
>>>>> implementation (remove arm64 implementation) if this series is merged.
>>> I know zero about BPF. But it looks like the allocation happens in
>>> arch_alloc_bpf_trampoline(), which for arm64, calls bpf_prog_pack_alloc(). And
>>> for small sizes, it grabs some memory from a "pack". So doesn't this mean that
>>> you are calling set_memory_rox() for a sub-region of the cell, so that doesn't
>>> actually help at vm_reset_perms()-time?
>> Took a deeper look at bpf pack allocator. The "pack" is allocated by
>> alloc_new_pack(), which does:
>> bpf_jit_alloc_exec()
>> set_vm_flush_reset_perms()
>> set_memory_rox()
>>
>> If the size is greater than the pack size, it calls:
>> bpf_jit_alloc_exec()
>> set_vm_flush_reset_perms()
>> set_memory_rox()
>>
>> So it looks like bpf trampoline is good, and we don't need do anything. It
>> should be removed from the list. I didn't look deep enough for bpf pack
>> allocator in the first place.
>>
>>>>> 2. BPF dispatcher. It calls execmem_alloc which has VM_FLUSH_RESET_PERMS set.
>>>>> But it is used for rw allocation, so VM_FLUSH_RESET_PERMS should be
>>>>> unnecessary IIUC. So it doesn't matter even though vm_reset_perms() fails.
>>>>>
>>>>> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 also
>>>>> called set_memory_rox() before switching to execmem cache. The execmem cache
>>>>> calls set_memory_rox(). I don't know why ARM64 doesn't call it.
>>>>>
>>>>> So I think we just need to fix #1 and #3 per the above analysis. If this
>>>>> analysis look correct to you guys, I will prepare two patches to fix them.
>>> This all seems quite fragile. I find it interesting that vm_reset_perms() is
>>> doing break-before-make; it sets the PTEs as invalid, then flushes the TLB, then
>>> sets them to default. But for arm64, at least, I think break-before-make is not
>>> required. We are only changing the permissions so that can be done on live
>>> mappings; essentially change the sequence to; set default, flush TLB.
>> Yeah, I agree it is a little bit fragile. I think this is the "contract" for
>> vmalloc users. You allocate ROX memory via vmalloc, you are required to call
>> set_memory_*(). But there is nothing to guarantee the "contract" is followed.
>> But I don't think this is the only case in kernel.
>>
>>> If we do that, then if the memory was already default, then there is no need to
>>> do anything (so no chance of allocation failure). If the memory was not default,
>>> then it must have already been split to make it non-default, in which case we
>>> can also gurrantee that no allocations are required.
>>>
>>> What am I missing?
>> The comment says:
>> Set direct map to something invalid so that it won't be cached if there are any
>> accesses after the TLB flush, then flush the TLB and reset the direct map
>> permissions to the default.
>>
>> IIUC, it guarantees the direct map can't be cached in TLB after TLB flush from
>> _vm_unmap_aliases() by setting them invalid because TLB never cache invalid
>> entries. Skipping set direct map to invalid seems break this. Or "changing
>> permission on live mappings" on ARM64 can achieve the same goal?
> Here's my understanding of the intent of the code:
>
> Let's say we start with some memory that has been mapped RO. Our goal is to
> reset the memory back to RW and ensure that no TLB entry remains in the TLB for
> the old RO mapping. There are 2 ways to do that:



>
> Approach 1 (used in current code):
> 1. set PTE to invalid
> 2. invalidate any TLB entry for the VA
> 3. set the PTE to RW
>
> Approach 2:
> 1. set the PTE to RW
> 2. invalidate any TLB entry for the VA

IIUC, the intent of the code is "reset direct map permission *without*
leaving a RW+X window". The TLB flush call actually flushes both the VA and
the direct map together.
So if this is the intent, with approach #2 the VA may still have X permission
while the direct map is already RW in the meantime. That seems to break the
intent.
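
A toy user-space sketch of that window, under stated assumptions: it is not
kernel code, all names (toy_page, toy_wx_window, toy_flush_all,
toy_reset_invalid_first, toy_reset_rw_first) are hypothetical, and it assumes
a single flush that drops the stale executable alias entry together with the
direct map entry.

```c
#include <assert.h>
#include <stdbool.h>

enum dmap_perm { DMAP_INVALID, DMAP_RO, DMAP_RW };

struct toy_page {
	bool alias_x_cached;       /* a stale executable alias TLB entry may exist */
	enum dmap_perm direct_map; /* current direct-map permission */
};

/* The page is writable via the direct map while still executable via the alias. */
static bool toy_wx_window(const struct toy_page *p)
{
	return p->alias_x_cached && p->direct_map == DMAP_RW;
}

/* One flush covering both the alias range and the direct map range. */
static void toy_flush_all(struct toy_page *p)
{
	p->alias_x_cached = false;
}

/* Current order: direct map invalid, flush, direct map back to default (RW). */
static bool toy_reset_invalid_first(struct toy_page *p)
{
	bool window = false;

	p->direct_map = DMAP_INVALID;
	window |= toy_wx_window(p);
	toy_flush_all(p);
	window |= toy_wx_window(p);
	p->direct_map = DMAP_RW;
	window |= toy_wx_window(p);
	return window;
}

/* Approach #2: direct map straight to RW, then flush. */
static bool toy_reset_rw_first(struct toy_page *p)
{
	bool window = false;

	p->direct_map = DMAP_RW;
	window |= toy_wx_window(p); /* alias X entry is still live here */
	toy_flush_all(p);
	window |= toy_wx_window(p);
	return window;
}
```

In this model only toy_reset_rw_first() ever observes the RW+X state, which is
the concern described above.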

Thanks,
Yang

>
> The benefit of approach 1 is that it is guarranteed that it is impossible for
> different CPUs to have different translations for the same VA in their
> respective TLB. But for approach 2, it's possible that between steps 1 and 2, 1
> CPU has a RO entry and another CPU has a RW entry. But that will get fixed once
> the TLB is flushed - it's not really an issue.
>
> (There is probably also an obscure way to end up with 2 TLB entries (one with RO
> and one with RW) for the same CPU, but the arm64 architecture permits that as
> long as it's only a permission mismatch).
>
> Anyway, approach 2 is used when changing memory permissions on user mappings, so
> I don't see why we can't take the same approach here. That would solve this
> whole class of issue for us.
>
> Thanks,
> Ryan
>
>
>> Thanks,
>> Yang
>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>> Tested the below patch with bpftrace kfunc (allocate bpf trampoline) and
>>>> kprobes. It seems work well.
>>>>
>>>> diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/
>>>> kprobes.c
>>>> index 0c5d408afd95..c4f8c4750f1e 100644
>>>> --- a/arch/arm64/kernel/probes/kprobes.c
>>>> +++ b/arch/arm64/kernel/probes/kprobes.c
>>>> @@ -10,6 +10,7 @@
>>>>
>>>>    #define pr_fmt(fmt) "kprobes: " fmt
>>>>
>>>> +#include <linux/execmem.h>
>>>>    #include <linux/extable.h>
>>>>    #include <linux/kasan.h>
>>>>    #include <linux/kernel.h>
>>>> @@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>>>>    static void __kprobes
>>>>    post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, struct pt_regs
>>>> *);
>>>>
>>>> +void *alloc_insn_page(void)
>>>> +{
>>>> +       void *page;
>>>> +
>>>> +       page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
>>>> +       if (!page)
>>>> +               return NULL;
>>>> +       set_memory_rox((unsigned long)page, 1);
>>>> +       return page;
>>>> +}
>>>> +
>>>>    static void __kprobes arch_prepare_ss_slot(struct kprobe *p)
>>>>    {
>>>>           kprobe_opcode_t *addr = p->ainsn.xol_insn;
>>>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
>>>> index 52ffe115a8c4..3e301bc2cd66 100644
>>>> --- a/arch/arm64/net/bpf_jit_comp.c
>>>> +++ b/arch/arm64/net/bpf_jit_comp.c
>>>> @@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void *image, unsigned int
>>>> size)
>>>>           bpf_prog_pack_free(image, size);
>>>>    }
>>>>
>>>> -int arch_protect_bpf_trampoline(void *image, unsigned int size)
>>>> -{
>>>> -       return 0;
>>>> -}
>>>> -
>>>>    int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image,
>>>>                                   void *ro_image_end, const struct
>>>> btf_func_model *m,
>>>>                                   u32 flags, struct bpf_tramp_links *tlinks,
>>>>
>>>>
>>>>> Thanks,
>>>>> Yang
>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 5 months ago

On 09/09/2025 16:32, Yang Shi wrote:
> 
> 
> On 9/9/25 7:36 AM, Ryan Roberts wrote:
>> On 08/09/2025 19:31, Yang Shi wrote:
>>>
>>> On 9/8/25 9:34 AM, Ryan Roberts wrote:
>>>> On 04/09/2025 22:49, Yang Shi wrote:
>>>>> On 9/4/25 10:47 AM, Yang Shi wrote:
>>>>>> On 9/4/25 6:16 AM, Ryan Roberts wrote:
>>>>>>> On 04/09/2025 14:14, Ryan Roberts wrote:
>>>>>>>> On 03/09/2025 01:50, Yang Shi wrote:
>>>>>>>>>>>> I am wondering whether we can just have a warn_on_once or something
>>>>>>>>>>>> for the
>>>>>>>>>>>> case
>>>>>>>>>>>> when we fail to allocate a pagetable page. Or, Ryan had
>>>>>>>>>>>> suggested in an off-the-list conversation that we can maintain a cache
>>>>>>>>>>>> of PTE
>>>>>>>>>>>> tables for every PMD block mapping, which will give us
>>>>>>>>>>>> the same memory consumption as we do today, but not sure if this is
>>>>>>>>>>>> worth it.
>>>>>>>>>>>> x86 can already handle splitting but due to the callchains
>>>>>>>>>>>> I have described above, it has the same problem, and the code has been
>>>>>>>>>>>> working
>>>>>>>>>>>> for years :)
>>>>>>>>>>> I think it's preferable to avoid having to keep a cache of pgtable
>>>>>>>>>>> memory
>>>>>>>>>>> if we
>>>>>>>>>>> can...
>>>>>>>>>> Yes, I agree. We simply don't know how many pages we need to cache,
>>>>>>>>>> and it
>>>>>>>>>> still can't guarantee 100% allocation success.
>>>>>>>>> This is wrong... We can know how many pages will be needed for splitting
>>>>>>>>> linear
>>>>>>>>> mapping to PTEs for the worst case once linear mapping is finalized.
>>>>>>>>> But it
>>>>>>>>> may
>>>>>>>>> require a few hundred megabytes memory to guarantee allocation success. I
>>>>>>>>> don't
>>>>>>>>> think it is worth for such rare corner case.
>>>>>>>> Indeed, we know exactly how much memory we need for pgtables to map the
>>>>>>>> linear
>>>>>>>> map by pte - that's exactly what we are doing today. So we _could_ keep a
>>>>>>>> cache.
>>>>>>>> We would still get the benefit of improved performance but we would lose
>>>>>>>> the
>>>>>>>> benefit of reduced memory.
>>>>>>>>
>>>>>>>> I think we need to solve the vm_reset_perms() problem somehow, before we
>>>>>>>> can
>>>>>>>> enable this.
>>>>>>> Sorry I realise this was not very clear... I am saying I think we need to
>>>>>>> fix it
>>>>>>> somehow. A cache would likely work. But I'd prefer to avoid it if we can
>>>>>>> find a
>>>>>>> better solution.
>>>>>> Took a deeper look at vm_reset_perms(). It was introduced by commit
>>>>>> 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions").
>>>>>> The
>>>>>> VM_FLUSH_RESET_PERMS flag is supposed to be set if the vmalloc memory is RO
>>>>>> and/or ROX. So set_memory_ro() or set_memory_rox() is supposed to follow up
>>>>>> vmalloc(). So the page table should be already split before reaching vfree().
>>>>>> I think this why vm_reset_perms() doesn't not check return value.
>>>> If vm_reset_perms() is assuming it can't/won't fail, I think it should at least
>>>> output a warning if it does?
>>> It should. Anyway warning will be raised if split fails. We have somehow
>>> mitigation.
>>>
>>>>>> I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set.
>>>> Just checking; I think you made a comment before about there only being a few
>>>> sites that set VM_FLUSH_RESET_PERMS. But one of them is the helper,
>>>> set_vm_flush_reset_perms(). So just making sure you also followed to the places
>>>> that use that helper?
>>> Yes, I did.
>>>
>>>>>> The most
>>>>>> of them has set_memory_ro() or set_memory_rox() followed.
>>>> And are all callsites calling set_memory_*() for the entire cell that was
>>>> allocated by vmalloc? If there are cases where it only calls that for a portion
>>>> of it, then it's not gurranteed that the memory is correctly split.
>>> Yes, all callsites call set_memory_*() for the entire range.
>>>
>>>>>> But there are 3
>>>>>> places I don't see set_memory_ro()/set_memory_rox() is called.
>>>>>>
>>>>>> 1. BPF trampoline allocation. The BPF trampoline calls
>>>>>> arch_protect_bpf_trampoline(). The generic implementation does call
>>>>>> set_memory_rox(). But the x86 and arm64 implementation just simply return 0.
>>>>>> For x86, it is because execmem cache is used and it does call
>>>>>> set_memory_rox(). ARM64 doesn't need to split page table before this series,
>>>>>> so it should never fail. I think we just need to use the generic
>>>>>> implementation (remove arm64 implementation) if this series is merged.
>>>> I know zero about BPF. But it looks like the allocation happens in
>>>> arch_alloc_bpf_trampoline(), which for arm64, calls bpf_prog_pack_alloc(). And
>>>> for small sizes, it grabs some memory from a "pack". So doesn't this mean that
>>>> you are calling set_memory_rox() for a sub-region of the cell, so that doesn't
>>>> actually help at vm_reset_perms()-time?
>>> Took a deeper look at bpf pack allocator. The "pack" is allocated by
>>> alloc_new_pack(), which does:
>>> bpf_jit_alloc_exec()
>>> set_vm_flush_reset_perms()
>>> set_memory_rox()
>>>
>>> If the size is greater than the pack size, it calls:
>>> bpf_jit_alloc_exec()
>>> set_vm_flush_reset_perms()
>>> set_memory_rox()
>>>
>>> So it looks like bpf trampoline is good, and we don't need do anything. It
>>> should be removed from the list. I didn't look deep enough for bpf pack
>>> allocator in the first place.
>>>
>>>>>> 2. BPF dispatcher. It calls execmem_alloc which has VM_FLUSH_RESET_PERMS set.
>>>>>> But it is used for rw allocation, so VM_FLUSH_RESET_PERMS should be
>>>>>> unnecessary IIUC. So it doesn't matter even though vm_reset_perms() fails.
>>>>>>
>>>>>> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 also
>>>>>> called set_memory_rox() before switching to execmem cache. The execmem cache
>>>>>> calls set_memory_rox(). I don't know why ARM64 doesn't call it.
>>>>>>
>>>>>> So I think we just need to fix #1 and #3 per the above analysis. If this
>>>>>> analysis look correct to you guys, I will prepare two patches to fix them.
>>>> This all seems quite fragile. I find it interesting that vm_reset_perms() is
>>>> doing break-before-make; it sets the PTEs as invalid, then flushes the TLB,
>>>> then
>>>> sets them to default. But for arm64, at least, I think break-before-make is not
>>>> required. We are only changing the permissions so that can be done on live
>>>> mappings; essentially change the sequence to; set default, flush TLB.
>>> Yeah, I agree it is a little bit fragile. I think this is the "contract" for
>>> vmalloc users. You allocate ROX memory via vmalloc, you are required to call
>>> set_memory_*(). But there is nothing to guarantee the "contract" is followed.
>>> But I don't think this is the only case in kernel.
>>>
>>>> If we do that, then if the memory was already default, then there is no need to
>>>> do anything (so no chance of allocation failure). If the memory was not
>>>> default,
>>>> then it must have already been split to make it non-default, in which case we
>>>> can also gurrantee that no allocations are required.
>>>>
>>>> What am I missing?
>>> The comment says:
>>> Set direct map to something invalid so that it won't be cached if there are any
>>> accesses after the TLB flush, then flush the TLB and reset the direct map
>>> permissions to the default.
>>>
>>> IIUC, it guarantees the direct map can't be cached in TLB after TLB flush from
>>> _vm_unmap_aliases() by setting them invalid because TLB never cache invalid
>>> entries. Skipping set direct map to invalid seems break this. Or "changing
>>> permission on live mappings" on ARM64 can achieve the same goal?
>> Here's my understanding of the intent of the code:
>>
>> Let's say we start with some memory that has been mapped RO. Our goal is to
>> reset the memory back to RW and ensure that no TLB entry remains in the TLB for
>> the old RO mapping. There are 2 ways to do that:
> 
> 
> 
>>
>> Approach 1 (used in current code):
>> 1. set PTE to invalid
>> 2. invalidate any TLB entry for the VA
>> 3. set the PTE to RW
>>
>> Approach 2:
>> 1. set the PTE to RW
>> 2. invalidate any TLB entry for the VA
> 
> IIUC, the intent of the code is "reset direct map permission *without* leaving a
> RW+X window". The TLB flush call actually flushes both VA and direct map together.
> So if this is the intent, approach #2 may have VA with X permission but direct
> map may be RW at the mean time. It seems break the intent.

Ahh! Thanks, it's starting to make more sense now.

Though at first sight it seems a bit mad to me to form a TLB flush range that
covers all the direct map pages and all the lazy vunmap regions. Is that
intended to be a perf optimization or something else? It's not clear from the
history.


Could this be split into 2 operations?

1. unmap the aliases (+ tlbi the aliases).
2. set the direct memory back to default (+ tlbi the direct map region).

The only 2 potential problems I can think of are:

 - Performance: 2 tlbis instead of 1, but conversely we probably avoid flushing
a load of TLB entries that we didn't really need to.

 - Given there would then be no lock around the tlbis (currently it's under
vmap_purge_lock), is there a race where a new alias can appear between steps 1
and 2? I don't think so, because the memory is still allocated to the current
mapping, so how is it going to get re-mapped?
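
As a rough illustration of the two-operation idea, here is a toy user-space
sketch. It is not kernel code: toy_state, toy_wx, toy_unmap_aliases,
toy_reset_direct_map and toy_split_reset are hypothetical names, and the model
assumes no new alias appears between the two steps.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_perm { TOY_RO, TOY_RW };

struct toy_state {
	bool alias_x_cached;      /* stale executable alias TLB entry may exist */
	bool dmap_stale_cached;   /* stale non-default direct-map TLB entry may exist */
	enum toy_perm direct_map; /* current direct-map permission */
	int nr_tlbi;              /* number of TLB invalidations issued */
};

/* Writable via the direct map while still executable via the alias. */
static bool toy_wx(const struct toy_state *s)
{
	return s->alias_x_cached && s->direct_map == TOY_RW;
}

/* Step 1: unmap the aliases and invalidate only the alias range. */
static void toy_unmap_aliases(struct toy_state *s)
{
	s->alias_x_cached = false;
	s->nr_tlbi++;
}

/* Step 2: live permission change on the direct map, then invalidate it. */
static void toy_reset_direct_map(struct toy_state *s)
{
	s->direct_map = TOY_RW;
	s->dmap_stale_cached = false;
	s->nr_tlbi++;
}

/* Runs both steps; returns true if a RW+X window was ever observed. */
static bool toy_split_reset(struct toy_state *s)
{
	bool window = toy_wx(s);

	toy_unmap_aliases(s);
	window |= toy_wx(s);
	toy_reset_direct_map(s);
	window |= toy_wx(s);
	return window;
}
```

In the model the alias entry is gone before the direct map becomes RW, so no
RW+X window is seen and the direct map is never made invalid, at the cost of a
second tlbi.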


Could this solve it?



> 
> Thanks,
> Yang
> 
>>
>> The benefit of approach 1 is that it is guarranteed that it is impossible for
>> different CPUs to have different translations for the same VA in their
>> respective TLB. But for approach 2, it's possible that between steps 1 and 2, 1
>> CPU has a RO entry and another CPU has a RW entry. But that will get fixed once
>> the TLB is flushed - it's not really an issue.
>>
>> (There is probably also an obscure way to end up with 2 TLB entries (one with RO
>> and one with RW) for the same CPU, but the arm64 architecture permits that as
>> long as it's only a permission mismatch).
>>
>> Anyway, approach 2 is used when changing memory permissions on user mappings, so
>> I don't see why we can't take the same approach here. That would solve this
>> whole class of issue for us.
>>
>> Thanks,
>> Ryan
>>
>>
>>> Thanks,
>>> Yang
>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>>> Tested the below patch with bpftrace kfunc (allocate bpf trampoline) and
>>>>> kprobes. It seems work well.
>>>>>
>>>>> diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/
>>>>> kprobes.c
>>>>> index 0c5d408afd95..c4f8c4750f1e 100644
>>>>> --- a/arch/arm64/kernel/probes/kprobes.c
>>>>> +++ b/arch/arm64/kernel/probes/kprobes.c
>>>>> @@ -10,6 +10,7 @@
>>>>>
>>>>>    #define pr_fmt(fmt) "kprobes: " fmt
>>>>>
>>>>> +#include <linux/execmem.h>
>>>>>    #include <linux/extable.h>
>>>>>    #include <linux/kasan.h>
>>>>>    #include <linux/kernel.h>
>>>>> @@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>>>>>    static void __kprobes
>>>>>    post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, struct pt_regs
>>>>> *);
>>>>>
>>>>> +void *alloc_insn_page(void)
>>>>> +{
>>>>> +       void *page;
>>>>> +
>>>>> +       page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
>>>>> +       if (!page)
>>>>> +               return NULL;
>>>>> +       set_memory_rox((unsigned long)page, 1);
>>>>> +       return page;
>>>>> +}
>>>>> +
>>>>>    static void __kprobes arch_prepare_ss_slot(struct kprobe *p)
>>>>>    {
>>>>>           kprobe_opcode_t *addr = p->ainsn.xol_insn;
>>>>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
>>>>> index 52ffe115a8c4..3e301bc2cd66 100644
>>>>> --- a/arch/arm64/net/bpf_jit_comp.c
>>>>> +++ b/arch/arm64/net/bpf_jit_comp.c
>>>>> @@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void *image, unsigned int
>>>>> size)
>>>>>           bpf_prog_pack_free(image, size);
>>>>>    }
>>>>>
>>>>> -int arch_protect_bpf_trampoline(void *image, unsigned int size)
>>>>> -{
>>>>> -       return 0;
>>>>> -}
>>>>> -
>>>>>    int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image,
>>>>>                                   void *ro_image_end, const struct
>>>>> btf_func_model *m,
>>>>>                                   u32 flags, struct bpf_tramp_links *tlinks,
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Yang
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>>
>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 5 months ago


On 9/9/25 9:32 AM, Ryan Roberts wrote:
> On 09/09/2025 16:32, Yang Shi wrote:
>>
>> On 9/9/25 7:36 AM, Ryan Roberts wrote:
>>> On 08/09/2025 19:31, Yang Shi wrote:
>>>> On 9/8/25 9:34 AM, Ryan Roberts wrote:
>>>>> On 04/09/2025 22:49, Yang Shi wrote:
>>>>>> On 9/4/25 10:47 AM, Yang Shi wrote:
>>>>>>> On 9/4/25 6:16 AM, Ryan Roberts wrote:
>>>>>>>> On 04/09/2025 14:14, Ryan Roberts wrote:
>>>>>>>>> On 03/09/2025 01:50, Yang Shi wrote:
>>>>>>>>>>>>> I am wondering whether we can just have a warn_on_once or something
>>>>>>>>>>>>> for the
>>>>>>>>>>>>> case
>>>>>>>>>>>>> when we fail to allocate a pagetable page. Or, Ryan had
>>>>>>>>>>>>> suggested in an off-the-list conversation that we can maintain a cache
>>>>>>>>>>>>> of PTE
>>>>>>>>>>>>> tables for every PMD block mapping, which will give us
>>>>>>>>>>>>> the same memory consumption as we do today, but not sure if this is
>>>>>>>>>>>>> worth it.
>>>>>>>>>>>>> x86 can already handle splitting but due to the callchains
>>>>>>>>>>>>> I have described above, it has the same problem, and the code has been
>>>>>>>>>>>>> working
>>>>>>>>>>>>> for years :)
>>>>>>>>>>>> I think it's preferable to avoid having to keep a cache of pgtable
>>>>>>>>>>>> memory
>>>>>>>>>>>> if we
>>>>>>>>>>>> can...
>>>>>>>>>>> Yes, I agree. We simply don't know how many pages we need to cache,
>>>>>>>>>>> and it
>>>>>>>>>>> still can't guarantee 100% allocation success.
>>>>>>>>>> This is wrong... We can know how many pages will be needed for splitting
>>>>>>>>>> linear
>>>>>>>>>> mapping to PTEs for the worst case once linear mapping is finalized.
>>>>>>>>>> But it
>>>>>>>>>> may
>>>>>>>>>> require a few hundred megabytes memory to guarantee allocation success. I
>>>>>>>>>> don't
>>>>>>>>>> think it is worth for such rare corner case.
>>>>>>>>> Indeed, we know exactly how much memory we need for pgtables to map the
>>>>>>>>> linear
>>>>>>>>> map by pte - that's exactly what we are doing today. So we _could_ keep a
>>>>>>>>> cache.
>>>>>>>>> We would still get the benefit of improved performance but we would lose
>>>>>>>>> the
>>>>>>>>> benefit of reduced memory.
>>>>>>>>>
>>>>>>>>> I think we need to solve the vm_reset_perms() problem somehow, before we
>>>>>>>>> can
>>>>>>>>> enable this.
>>>>>>>> Sorry I realise this was not very clear... I am saying I think we need to
>>>>>>>> fix it
>>>>>>>> somehow. A cache would likely work. But I'd prefer to avoid it if we can
>>>>>>>> find a
>>>>>>>> better solution.
>>>>>>> Took a deeper look at vm_reset_perms(). It was introduced by commit
>>>>>>> 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions").
>>>>>>> The
>>>>>>> VM_FLUSH_RESET_PERMS flag is supposed to be set if the vmalloc memory is RO
>>>>>>> and/or ROX. So set_memory_ro() or set_memory_rox() is supposed to follow up
>>>>>>> vmalloc(). So the page table should be already split before reaching vfree().
>>>>>>> I think this why vm_reset_perms() doesn't not check return value.
>>>>> If vm_reset_perms() is assuming it can't/won't fail, I think it should at least
>>>>> output a warning if it does?
>>>> It should. Anyway warning will be raised if split fails. We have somehow
>>>> mitigation.
>>>>
>>>>>>> I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set.
>>>>> Just checking; I think you made a comment before about there only being a few
>>>>> sites that set VM_FLUSH_RESET_PERMS. But one of them is the helper,
>>>>> set_vm_flush_reset_perms(). So just making sure you also followed to the places
>>>>> that use that helper?
>>>> Yes, I did.
>>>>
>>>>>>> The most
>>>>>>> of them has set_memory_ro() or set_memory_rox() followed.
>>>>> And are all callsites calling set_memory_*() for the entire cell that was
>>>>> allocated by vmalloc? If there are cases where it only calls that for a portion
>>>>> of it, then it's not gurranteed that the memory is correctly split.
>>>> Yes, all callsites call set_memory_*() for the entire range.
>>>>
>>>>>>> But there are 3
>>>>>>> places I don't see set_memory_ro()/set_memory_rox() is called.
>>>>>>>
>>>>>>> 1. BPF trampoline allocation. The BPF trampoline calls
>>>>>>> arch_protect_bpf_trampoline(). The generic implementation does call
>>>>>>> set_memory_rox(). But the x86 and arm64 implementation just simply return 0.
>>>>>>> For x86, it is because execmem cache is used and it does call
>>>>>>> set_memory_rox(). ARM64 doesn't need to split page table before this series,
>>>>>>> so it should never fail. I think we just need to use the generic
>>>>>>> implementation (remove arm64 implementation) if this series is merged.
>>>>> I know zero about BPF. But it looks like the allocation happens in
>>>>> arch_alloc_bpf_trampoline(), which for arm64, calls bpf_prog_pack_alloc(). And
>>>>> for small sizes, it grabs some memory from a "pack". So doesn't this mean that
>>>>> you are calling set_memory_rox() for a sub-region of the cell, so that doesn't
>>>>> actually help at vm_reset_perms()-time?
>>>> Took a deeper look at bpf pack allocator. The "pack" is allocated by
>>>> alloc_new_pack(), which does:
>>>> bpf_jit_alloc_exec()
>>>> set_vm_flush_reset_perms()
>>>> set_memory_rox()
>>>>
>>>> If the size is greater than the pack size, it calls:
>>>> bpf_jit_alloc_exec()
>>>> set_vm_flush_reset_perms()
>>>> set_memory_rox()
>>>>
>>>> So it looks like bpf trampoline is good, and we don't need do anything. It
>>>> should be removed from the list. I didn't look deep enough for bpf pack
>>>> allocator in the first place.
>>>>
>>>>>>> 2. BPF dispatcher. It calls execmem_alloc which has VM_FLUSH_RESET_PERMS set.
>>>>>>> But it is used for rw allocation, so VM_FLUSH_RESET_PERMS should be
>>>>>>> unnecessary IIUC. So it doesn't matter even though vm_reset_perms() fails.
>>>>>>>
>>>>>>> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 also
>>>>>>> called set_memory_rox() before switching to execmem cache. The execmem cache
>>>>>>> calls set_memory_rox(). I don't know why ARM64 doesn't call it.
>>>>>>>
>>>>>>> So I think we just need to fix #1 and #3 per the above analysis. If this
>>>>>>> analysis look correct to you guys, I will prepare two patches to fix them.
>>>>> This all seems quite fragile. I find it interesting that vm_reset_perms() is
>>>>> doing break-before-make; it sets the PTEs as invalid, then flushes the TLB,
>>>>> then
>>>>> sets them to default. But for arm64, at least, I think break-before-make is not
>>>>> required. We are only changing the permissions so that can be done on live
>>>>> mappings; essentially change the sequence to; set default, flush TLB.
>>>> Yeah, I agree it is a little bit fragile. I think this is the "contract" for
>>>> vmalloc users. You allocate ROX memory via vmalloc, you are required to call
>>>> set_memory_*(). But there is nothing to guarantee the "contract" is followed.
>>>> But I don't think this is the only case in kernel.
>>>>
>>>>> If we do that, then if the memory was already default, then there is no need to
>>>>> do anything (so no chance of allocation failure). If the memory was not
>>>>> default,
>>>>> then it must have already been split to make it non-default, in which case we
>>>>> can also guarantee that no allocations are required.
>>>>>
>>>>> What am I missing?
>>>> The comment says:
>>>> Set direct map to something invalid so that it won't be cached if there are any
>>>> accesses after the TLB flush, then flush the TLB and reset the direct map
>>>> permissions to the default.
>>>>
>>>> IIUC, it guarantees the direct map can't be cached in TLB after TLB flush from
>>>> _vm_unmap_aliases() by setting them invalid because TLB never cache invalid
>>>> entries. Skipping set direct map to invalid seems break this. Or "changing
>>>> permission on live mappings" on ARM64 can achieve the same goal?
>>> Here's my understanding of the intent of the code:
>>>
>>> Let's say we start with some memory that has been mapped RO. Our goal is to
>>> reset the memory back to RW and ensure that no TLB entry remains in the TLB for
>>> the old RO mapping. There are 2 ways to do that:
>>
>>
>>> Approach 1 (used in current code):
>>> 1. set PTE to invalid
>>> 2. invalidate any TLB entry for the VA
>>> 3. set the PTE to RW
>>>
>>> Approach 2:
>>> 1. set the PTE to RW
>>> 2. invalidate any TLB entry for the VA
>> IIUC, the intent of the code is "reset direct map permission *without* leaving a
>> RW+X window". The TLB flush call actually flushes both VA and direct map together.
>> So if this is the intent, approach #2 may leave the VA with X permission while
>> the direct map is RW in the meantime. That seems to break the intent.
> Ahh! Thanks, it's starting to make more sense now.
>
> Though on first sight it seems a bit mad to me to form a tlb flush range that
> covers all the direct map pages and all the lazy vunmap regions. Is that
> intended to be a perf optimization or something else? It's not clear from the
> history.

I think it is mainly performance driven. Unless I'm missing something, 
I can't see why two TLB flushes (one for vmap and one for the direct 
map) wouldn't work.

>
>
> Could this be split into 2 operations?
>
> 1. unmap the aliases (+ tlbi the aliases).
> 2. set the direct memory back to default (+ tlbi the direct map region).
>
> The only 2 potential problems I can think of are;
>
>   - Performance: 2 tlbis instead of 1, but conversely we probably avoid flushing
> a load of TLB entries that we didn't really need to.

The two tlbis should work. But performance is definitely a concern. It 
may be hard to quantify the performance impact of over-flushing, 
but multiple TLBIs are definitely not preferred, particularly on 
large-scale machines. We have experienced scalability issues with 
TLBI due to the large core count on Ampere systems.
>
>   - Given there is now no lock around the tlbis (currently it's under
> vmap_purge_lock) is there a race where a new alias can appear between steps 1
> and 2? I don't think so, because the memory is allocated to the current mapping
> so how is it going to get re-mapped?

Yes, I agree. I don't think the race is real. The physical pages will 
not be freed until vm_reset_perms() is done. The VA may be reallocated, 
but it will be mapped to different physical pages.

>
>
> Could this solve it?

I think it could. But the potential performance impact (two TLBIs) is a 
real concern.
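
To make the trade-off concrete, here is a toy user-space model (not 
kernel code; struct range, struct flush_stats, combined_flush() and 
split_flush() are all made-up names). It assumes the current code 
issues a single flush over one range spanning both the aliases and the 
direct map pages, and compares that with one flush per range:

```c
#include <assert.h>

/* Toy model only: none of these names are kernel APIs. */
struct range {
	unsigned long start, end;	/* [start, end) */
};

struct flush_stats {
	unsigned int tlbis;		/* invalidation operations issued */
	unsigned long bytes_flushed;	/* total VA span they cover */
};

static void model_flush(struct flush_stats *s, struct range r)
{
	assert(r.end >= r.start);
	s->tlbis++;
	s->bytes_flushed += r.end - r.start;
}

/* Assumed current scheme: one flush over the hull of both ranges. */
struct flush_stats combined_flush(struct range alias, struct range direct)
{
	struct flush_stats s = {0, 0};
	struct range hull;

	hull.start = alias.start < direct.start ? alias.start : direct.start;
	hull.end = alias.end > direct.end ? alias.end : direct.end;
	model_flush(&s, hull);
	return s;
}

/* Split scheme: one flush for the aliases, one for the direct map. */
struct flush_stats split_flush(struct range alias, struct range direct)
{
	struct flush_stats s = {0, 0};

	model_flush(&s, alias);
	model_flush(&s, direct);
	return s;
}
```

The model only counts operations and covered span. It says nothing 
about the real cost of a TLBI broadcast on a large machine, which is 
the part that worries me.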

Anyway, a vmalloc user should call set_memory_*() for any RO/ROX 
mapping, and set_memory_*() should split the page table before reaching 
vm_reset_perms(), so the latter should not fail. If set_memory_*() is 
not called, that is a bug and should be fixed, as with ARM64 kprobes.

Making it more robust is definitely welcome, although the warning 
from a failed split mitigates this somewhat. But I don't think this 
should be a blocker for this series IMHO.

Thanks,
Yang

>
>
>
>> Thanks,
>> Yang
>>
>>> The benefit of approach 1 is that it is guaranteed to be impossible for
>>> different CPUs to have different translations for the same VA in their
>>> respective TLB. But for approach 2, it's possible that between steps 1 and 2, 1
>>> CPU has a RO entry and another CPU has a RW entry. But that will get fixed once
>>> the TLB is flushed - it's not really an issue.
>>>
>>> (There is probably also an obscure way to end up with 2 TLB entries (one with RO
>>> and one with RW) for the same CPU, but the arm64 architecture permits that as
>>> long as it's only a permission mismatch).
>>>
>>> Anyway, approach 2 is used when changing memory permissions on user mappings, so
>>> I don't see why we can't take the same approach here. That would solve this
>>> whole class of issue for us.
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>
>>>>>> Tested the below patch with bpftrace kfunc (allocate bpf trampoline) and
>>>>>> kprobes. It seems work well.
>>>>>>
>>>>>> diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/
>>>>>> kprobes.c
>>>>>> index 0c5d408afd95..c4f8c4750f1e 100644
>>>>>> --- a/arch/arm64/kernel/probes/kprobes.c
>>>>>> +++ b/arch/arm64/kernel/probes/kprobes.c
>>>>>> @@ -10,6 +10,7 @@
>>>>>>
>>>>>>     #define pr_fmt(fmt) "kprobes: " fmt
>>>>>>
>>>>>> +#include <linux/execmem.h>
>>>>>>     #include <linux/extable.h>
>>>>>>     #include <linux/kasan.h>
>>>>>>     #include <linux/kernel.h>
>>>>>> @@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>>>>>>     static void __kprobes
>>>>>>     post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, struct pt_regs
>>>>>> *);
>>>>>>
>>>>>> +void *alloc_insn_page(void)
>>>>>> +{
>>>>>> +       void *page;
>>>>>> +
>>>>>> +       page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
>>>>>> +       if (!page)
>>>>>> +               return NULL;
>>>>>> +       set_memory_rox((unsigned long)page, 1);
>>>>>> +       return page;
>>>>>> +}
>>>>>> +
>>>>>>     static void __kprobes arch_prepare_ss_slot(struct kprobe *p)
>>>>>>     {
>>>>>>            kprobe_opcode_t *addr = p->ainsn.xol_insn;
>>>>>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
>>>>>> index 52ffe115a8c4..3e301bc2cd66 100644
>>>>>> --- a/arch/arm64/net/bpf_jit_comp.c
>>>>>> +++ b/arch/arm64/net/bpf_jit_comp.c
>>>>>> @@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void *image, unsigned int
>>>>>> size)
>>>>>>            bpf_prog_pack_free(image, size);
>>>>>>     }
>>>>>>
>>>>>> -int arch_protect_bpf_trampoline(void *image, unsigned int size)
>>>>>> -{
>>>>>> -       return 0;
>>>>>> -}
>>>>>> -
>>>>>>     int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image,
>>>>>>                                    void *ro_image_end, const struct
>>>>>> btf_func_model *m,
>>>>>>                                    u32 flags, struct bpf_tramp_links *tlinks,
>>>>>>
>>>>>>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 4 months, 4 weeks ago

>>> IIUC, the intent of the code is "reset direct map permission 
>>> *without* leaving a
>>> RW+X window". The TLB flush call actually flushes both VA and direct 
>>> map together.
>>> So if this is the intent, approach #2 may have VA with X permission 
>>> but direct
>>> map may be RW at the mean time. It seems break the intent.
>> Ahh! Thanks, it's starting to make more sense now.
>>
>> Though on first sight it seems a bit mad to me to form a tlb flush 
>> range that
>> covers all the direct map pages and all the lazy vunmap regions. Is that
>> intended to be a perf optimization or something else? It's not clear 
>> from the
>> history.
>
> I think it should be mainly performance driven. I can't see how come 
> two TLB flushes (for vmap and direct map respectively) don't work if I 
> don't miss something.
>
>>
>>
>> Could this be split into 2 operations?
>>
>> 1. unmap the aliases (+ tlbi the aliases).
>> 2. set the direct memory back to default (+ tlbi the direct map region).
>>
>> The only 2 potential problems I can think of are;
>>
>>   - Performance: 2 tlbis instead of 1, but conversely we probably 
>> avoid flushing
>> a load of TLB entries that we didn't really need to.
>
> The two tlbis should work. But performance is definitely a concern. It 
> may be hard to justify how much performance impact caused by over 
> flush, but multiple TLBIs is definitely not preferred, particularly on 
> some large scale machines. We have experienced some scalability issues 
> with TLBI due to the large core count on Ampere systems.
>>
>>   - Given there is now no lock around the tlbis (currently it's under
>> vmap_purge_lock) is there a race where a new alias can appear between 
>> steps 1
>> and 2? I don't think so, because the memory is allocated to the 
>> current mapping
>> so how is it going to get re-mapped?
>
> Yes, I agree. I don't think the race is real. The physical pages will 
> not be freed until vm_reset_perms() is done. The VA may be 
> reallocated, but it will be mapped to different physical pages.
>
>>
>>
>> Could this solve it?
>
> I think it could. But the potential performance impact (two TLBIs) is 
> a real concern.
>
> Anyway the vmalloc user should call set_memory_*() for any RO/ROX 
> mapping, set_memory_*() should split the page table before reaching 
> vm_reset_perms() so it should not fail. If set_memory_*() is not 
> called, it is a bug, it should be fixed, like ARM64 kprobes.
>
> It is definitely welcome to make it more robust, although the warning 
> from split may mitigate this somehow. But I don't think this should be 
> a blocker for this series IMHO.

Hi Ryan & Catalin,

Any more concerns about this? Shall we move forward with v8? We can 
include the kprobes fix in v8 or I can send it separately; either is 
fine with me. Hopefully we can make v6.18.

Thanks,
Yang


Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 4 months, 3 weeks ago

Hi Yang,

Sorry for the slow reply; I'm just getting back to this...

On 11/09/2025 23:03, Yang Shi wrote:
> Hi Ryan & Catalin,
> 
> Any more concerns about this? 

I've been trying to convince myself of your assertion that all users that set
VM_FLUSH_RESET_PERMS also call set_memory_*() for the entire range that was
returned by vmalloc. I agree that if that is the contract and everyone is
following it, then there is no problem here.

But I haven't been able to convince myself...

Some examples (these might intersect with examples you previously raised):

1. bpf_dispatcher_change_prog() -> bpf_jit_alloc_exec() -> execmem_alloc() ->
sets VM_FLUSH_RESET_PERMS. But I don't see it calling set_memory_*() for rw_image.

2. module_memory_alloc() -> execmem_alloc_rw() -> execmem_alloc() -> sets
VM_FLUSH_RESET_PERMS (note that execmem_force_rw() is nop for arm64).
set_memory_*() is not called until much later on in module_set_memory(). Another
error in the meantime could cause the memory to be vfreed before that point.

3. When set_vm_flush_reset_perms() is used for the range, it is called before
set_memory_*(), which might then fail to split prior to vfree.

But I guess as long as set_memory_*() is never successfully called for a
*sub-range* of the vmalloc'ed region, then for all of the above issues, the
memory must still be RW at vfree-time, so this issue should be benign... I think?

In summary this all looks horribly fragile. But I *think* it works. It would be
good to clean it all up and have some clearly documented rules regardless. But I
think that could be a follow up series.

> Shall we move forward with v8? 

Yes; do you want me to post that or would you prefer to do it? I'm happy to do
it; there are a few other tidy-ups in pageattr.c that I spotted and want to make.

> We can include the
> fix to kprobes in v8 or I can send it separately, either is fine to me.

Post it on list, and I'll also incorporate into the series.

> Hopefully we can make v6.18.

It's probably getting a bit late now. Anyway, I'll aim to get v8 out tomorrow or
Friday and we will see what Will thinks.

Thanks,
Ryan

> 
> Thanks,
> Yang
>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 4 months, 3 weeks ago

On 9/17/25 9:28 AM, Ryan Roberts wrote:
> Hi Yang,
>
> Sorry for the slow reply; I'm just getting back to this...
>
> On 11/09/2025 23:03, Yang Shi wrote:
>> Hi Ryan & Catalin,
>>
>> Any more concerns about this?
> I've been trying to convince myself of your assertion that all users that set
> VM_FLUSH_RESET_PERMS also call set_memory_*() for the entire range that was
> returned by vmalloc. I agree that if that is the contract and everyone is
> following it, then there is no problem here.
>
> But I haven't been able to convince myself...
>
> Some examples (these might intersect with examples you previously raised):
>
> 1. bpf_dispatcher_change_prog() -> bpf_jit_alloc_exec() -> execmem_alloc() ->
> sets VM_FLUSH_RESET_PERMS. But I don't see it calling set_memory_*() for rw_image.

Yes, it doesn't call set_memory_*(). I spotted this in the earlier 
email. But the memory is actually RW, so it should be OK to omit the 
call. The later set_direct_map_invalid call in vfree() may fail, but 
the set_direct_map_default call will set the RW permission back. That 
said, I don't think it has to use execmem_alloc(); plain vmalloc() 
should be good enough.

>
> 2. module_memory_alloc() -> execmem_alloc_rw() -> execmem_alloc() -> sets
> VM_FLUSH_RESET_PERMS (note that execmem_force_rw() is nop for arm64).
> set_memory_*() is not called until much later on in module_set_memory(). Another
> error in the meantime could cause the memory to be vfreed before that point.

IIUC, execmem_alloc_rw() is used to allocate memory for modules' text 
and data sections. The code sets mod->mem[type].is_rox according to the 
section type: true for text, false for data. If it is true, 
set_memory_rox() is called later, *after* the insns are copied to the 
memory. So the memory is still RW before that point.
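
A toy user-space model of that ordering, as I understand it (this is 
not the module loader code; model_mod_mem, model_alloc() and 
model_load() are made-up names, and the early_error flag stands in for 
any failure between allocation and protection):

```c
#include <assert.h>
#include <stdbool.h>

enum model_perm { MODEL_RW, MODEL_ROX };

struct model_mod_mem {
	bool is_rox;		/* true for text, false for data */
	enum model_perm perm;	/* current permission of the allocation */
};

/* The allocation always starts RW so the section can be copied in. */
struct model_mod_mem model_alloc(bool is_text)
{
	struct model_mod_mem m = { .is_rox = is_text, .perm = MODEL_RW };
	return m;
}

/*
 * Load one section and return the permission the memory has when the
 * function is done with it. On an early error the caller would free
 * the memory, so the return value is the permission at free time.
 */
enum model_perm model_load(struct model_mod_mem *m, bool early_error)
{
	/* Contents are copied while the memory is still writable. */
	assert(m->perm == MODEL_RW);

	if (early_error)
		return m->perm;		/* freed here: still RW */

	if (m->is_rox)
		m->perm = MODEL_ROX;	/* stands in for set_memory_rox() */
	return m->perm;
}
```

In this model the only way to reach the free path with non-RW memory 
is to have gone through the whole-range protection step first.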

>
> 3. When set_vm_flush_reset_perms() is set for the range, it is called before
> set_memory_*() which might then fail to split prior to vfree.

Yes, but unless I'm missing something, all call sites check the return 
value and bail out if set_memory_*() fails.

>
> But I guess as long as set_memory_*() is never successfully called for a
> *sub-range* of the vmalloc'ed region, then for all of the above issues, the
> memory must still be RW at vfree-time, so this issue should be benign... I think?

Yes, it is true.

>
> In summary this all looks horribly fragile. But I *think* it works. It would be
> good to clean it all up and have some clearly documented rules regardless. But I
> think that could be a follow up series.

Yeah, absolutely agreed.

>
>> Shall we move forward with v8?
> Yes; do you want me to post that or would you prefer to do it? I'm happy to do
> it; there are a few other tidy-ups in pageattr.c that I spotted and want to make.

I actually have v8 ready in my tree. I removed pageattr_pgd_entry 
and pageattr_pud_entry in pageattr.c and fixed pmd_leaf/pud_leaf as you 
suggested. Is that the cleanup you had in mind? I also rebased 
it on top of Shijie's series 
(https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?id=bfbbb0d3215f) 
which has been picked up by Will.

>
>> We can include the
>> fix to kprobes in v8 or I can send it separately, either is fine to me.
> Post it on list, and I'll also incorporate into the series.

I can include it in v8 series.

>
>> Hopefully we can make v6.18.
> It's probably getting a bit late now. Anyway, I'll aim to get v8 out tomorrow or
> Friday and we will see what Will thinks.

Thank you. I can post v8 today.

Thanks,
Yang

>
> Thanks,
> Ryan
>
>> Thanks,
>> Yang
>>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 4 months, 3 weeks ago

On 17/09/2025 18:21, Yang Shi wrote:
> 
> 
> On 9/17/25 9:28 AM, Ryan Roberts wrote:
>> Hi Yang,
>>
>> Sorry for the slow reply; I'm just getting back to this...
>>
>> On 11/09/2025 23:03, Yang Shi wrote:
>>> Hi Ryan & Catalin,
>>>
>>> Any more concerns about this?
>> I've been trying to convince myself of your assertion that all users that set
>> VM_FLUSH_RESET_PERMS also call set_memory_*() for the entire range that was
>> returned by vmalloc. I agree that if that is the contract and everyone is
>> following it, then there is no problem here.
>>
>> But I haven't been able to convince myself...
>>
>> Some examples (these might intersect with examples you previously raised):
>>
>> 1. bpf_dispatcher_change_prog() -> bpf_jit_alloc_exec() -> execmem_alloc() ->
>> sets VM_FLUSH_RESET_PERMS. But I don't see it calling set_memory_*() for
>> rw_image.
> 
> Yes, it doesn't call set_memory_*(). I spotted this in the earlier email. But it
> is actually RW, so it should be ok to miss the call. The later
> set_direct_map_invalid call in vfree() may fail, but set_direct_map_default call
> will set RW permission back. But I think it doesn't have to use execmem_alloc(),
> the plain vmalloc() should be good enough.
> 
>>
>> 2. module_memory_alloc() -> execmem_alloc_rw() -> execmem_alloc() -> sets
>> VM_FLUSH_RESET_PERMS (note that execmem_force_rw() is nop for arm64).
>> set_memory_*() is not called until much later on in module_set_memory(). Another
>> error in the meantime could cause the memory to be vfreed before that point.
> 
> IIUC, execmem_alloc_rw() is used to allocate memory for modules' text section
> and data section. The code will set mod->mem[type].is_rox according to the type
> of the section. It is true for text, false for data. Then set_memory_rox() will
> be called later if it is true *after* insns are copied to the memory. So it is
> still RW before that point.
> 
>>
>> 3. When set_vm_flush_reset_perms() is set for the range, it is called before
>> set_memory_*() which might then fail to split prior to vfree.
> 
> Yes, all call sites check the return value and bail out if set_memory_*() failed
> if I don't miss anything.
> 
>>
>> But I guess as long as set_memory_*() is never successfully called for a
>> *sub-range* of the vmalloc'ed region, then for all of the above issues, the
>> memory must still be RW at vfree-time, so this issue should be benign... I think?
> 
> Yes, it is true.

So to summarise, all freshly vmalloc'ed memory starts as RW. set_memory_*() may
only be called if VM_FLUSH_RESET_PERMS has already been set. If set_memory_*()
is called at all, the first call MUST be for the whole range.

If those requirements are all met, then if VM_FLUSH_RESET_PERMS was set but
set_memory_*() was never called, the worst that can happen is for both the
set_direct_map_invalid() and set_direct_map_default() calls to fail due to not
enough memory. But that is safe because the memory was always RW. If
set_memory_*() was called for the whole range and failed, it's the same as if it
was never called. If it was called for the whole range and succeeded, then the
split must have happened already and set_direct_map_invalid() and
set_direct_map_default() will therefore definitely succeed.

The only way this could be a problem is if someone vmallocs a range then
performs a set_memory_*() on a sub-region without having first done it for the
whole region. But we have not found any evidence that there are any users that
do that.
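
To check my own reasoning, here is a toy user-space model of the 
linear-map alias for one vmalloc'ed region (not kernel code; struct 
region, region_alloc(), region_set_ro() and region_reset() are made-up 
names, and split_can_alloc stands in for whether the page-table 
allocation needed by a split succeeds). It assumes set_memory_*() is 
only ever called for the whole region:

```c
#include <assert.h>
#include <stdbool.h>

enum perm { PERM_RW, PERM_RO };

struct region {
	bool split;	/* linear map already split to page granularity */
	enum perm perm;	/* current linear-map permission */
};

/* Freshly vmalloc'ed memory: large mapping intact, default RW. */
struct region region_alloc(void)
{
	struct region r = { .split = false, .perm = PERM_RW };
	return r;
}

/* Whole-region permission change; needs a split first, which may fail. */
int region_set_ro(struct region *r, bool split_can_alloc)
{
	if (!r->split) {
		if (!split_can_alloc)
			return -1;	/* nothing changed: still RW */
		r->split = true;
	}
	r->perm = PERM_RO;
	return 0;
}

/* Free-time reset. Returns true if the region ends up RW. */
bool region_reset(struct region *r, bool split_can_alloc)
{
	/* Invariant: non-default permissions imply an earlier split. */
	assert(r->split || r->perm == PERM_RW);

	if (!r->split && !split_can_alloc)
		return r->perm == PERM_RW;	/* reset failed, but benign */

	r->split = true;
	r->perm = PERM_RW;
	return true;
}
```

Every path through region_reset() returns true, which is the "safe" 
conclusion above. The model breaks as soon as a sub-region call is 
allowed, because then a region could be partly RO without the rest 
being split.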

In fact, by that logic, I think alloc_insn_page() must also be safe; it only
allocates 1 page, so if set_memory_*() is subsequently called for it, it must by
definition be covering the whole allocation; 1 page is the smallest amount that
can be protected.

So I agree we are safe.


> 
>>
>> In summary this all looks horribly fragile. But I *think* it works. It would be
>> good to clean it all up and have some clearly documented rules regardless. But I
>> think that could be a follow up series.
> 
> Yeah, absolutely agreed.
> 
>>
>>> Shall we move forward with v8?
>> Yes; do you want me to post that or would you prefer to do it? I'm happy to do
>> it; there are a few other tidy-ups in pageattr.c that I spotted and want to make.
> 
> I actually just had v8 ready in my tree. I removed pageattr_pgd_entry and
> pageattr_pud_entry in pageattr.c and fixed pmd_leaf/pud_leaf as you suggested.
> Is it the cleanup you are supposed to do? 

I was also going to fix up the comment in change_memory_common() which is now stale.

> And I also rebased it on top of
> Shijie's series (https://git.kernel.org/pub/scm/linux/kernel/git/arm64/
> linux.git/commit/?id=bfbbb0d3215f) which has been picked up by Will.
> 
>>
>>> We can include the
>>> fix to kprobes in v8 or I can send it separately, either is fine to me.
>> Post it on list, and I'll also incorporate into the series.
> 
> I can include it in v8 series.
> 
>>
>>> Hopefully we can make v6.18.
>> It's probably getting a bit late now. Anyway, I'll aim to get v8 out tomorrow or
>> Friday and we will see what Will thinks.
> 
> Thank you. I can post v8 today.

OK great - I'll leave it all to you then - thanks!

> 
> Thanks,
> Yang
> 
>>
>> Thanks,
>> Ryan
>>
>>> Thanks,
>>> Yang
>>>
>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 4 months, 3 weeks ago


On 9/17/25 11:58 AM, Ryan Roberts wrote:
> On 17/09/2025 18:21, Yang Shi wrote:
>>
>> On 9/17/25 9:28 AM, Ryan Roberts wrote:
>>> Hi Yang,
>>>
>>> Sorry for the slow reply; I'm just getting back to this...
>>>
>>> On 11/09/2025 23:03, Yang Shi wrote:
>>>> Hi Ryan & Catalin,
>>>>
>>>> Any more concerns about this?
>>> I've been trying to convince myself of your assertion that all users that set
>>> VM_FLUSH_RESET_PERMS also call set_memory_*() for the entire range that was
>>> returned by vmalloc. I agree that if that is the contract and everyone is
>>> following it, then there is no problem here.
>>>
>>> But I haven't been able to convince myself...
>>>
>>> Some examples (these might intersect with examples you previously raised):
>>>
>>> 1. bpf_dispatcher_change_prog() -> bpf_jit_alloc_exec() -> execmem_alloc() ->
>>> sets VM_FLUSH_RESET_PERMS. But I don't see it calling set_memory_*() for
>>> rw_image.
>> Yes, it doesn't call set_memory_*(). I spotted this in the earlier email. But it
>> is actually RW, so it should be ok to miss the call. The later
>> set_direct_map_invalid call in vfree() may fail, but set_direct_map_default call
>> will set RW permission back. But I think it doesn't have to use execmem_alloc(),
>> the plain vmalloc() should be good enough.
>>
>>> 2. module_memory_alloc() -> execmem_alloc_rw() -> execmem_alloc() -> sets
>>> VM_FLUSH_RESET_PERMS (note that execmem_force_rw() is nop for arm64).
>>> set_memory_*() is not called until much later on in module_set_memory(). Another
>>> error in the meantime could cause the memory to be vfreed before that point.
>> IIUC, execmem_alloc_rw() is used to allocate memory for modules' text section
>> and data section. The code will set mod->mem[type].is_rox according to the type
>> of the section. It is true for text, false for data. Then set_memory_rox() will
>> be called later if it is true *after* insns are copied to the memory. So it is
>> still RW before that point.
>>
>>> 3. When set_vm_flush_reset_perms() is set for the range, it is called before
>>> set_memory_*() which might then fail to split prior to vfree.
>> Yes, all call sites check the return value and bail out if set_memory_*() failed
>> if I don't miss anything.
>>
>>> But I guess as long as set_memory_*() is never successfully called for a
>>> *sub-range* of the vmalloc'ed region, then for all of the above issues, the
>>> memory must still be RW at vfree-time, so this issue should be benign... I think?
>> Yes, it is true.
> So to summarise, all freshly vmalloc'ed memory starts as RW. set_memory_*() may
> only be called if VM_FLUSH_RESET_PERMS has already been set. If set_memory_*()
> is called at all, the first call MUST be for the whole range.

Whether the default permission is RW or not depends on the type passed 
in by execmem_alloc(). It is defined by execmem_info in 
arch/arm64/mm/init.c. For ARM64, module and BPF have PAGE_KERNEL 
permission (RW) by default, but kprobes is PAGE_KERNEL_ROX (ROX).
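
Roughly, as a toy lookup table (the names below are made-up stand-ins, 
not the real enum or pgprot values; the real table is execmem_info in 
arch/arm64/mm/init.c):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the execmem types discussed here. */
enum model_type { MODEL_MODULE, MODEL_BPF, MODEL_KPROBES, MODEL_NR_TYPES };

/* Hypothetical stand-ins for PAGE_KERNEL and PAGE_KERNEL_ROX. */
enum model_prot { MODEL_PAGE_KERNEL, MODEL_PAGE_KERNEL_ROX };

/* Default protection per type, as described above for ARM64. */
static const enum model_prot model_default_prot[MODEL_NR_TYPES] = {
	[MODEL_MODULE]	= MODEL_PAGE_KERNEL,
	[MODEL_BPF]	= MODEL_PAGE_KERNEL,
	[MODEL_KPROBES]	= MODEL_PAGE_KERNEL_ROX,
};

/* True if an allocation of this type starts out RW. */
bool model_starts_rw(enum model_type type)
{
	assert(type < MODEL_NR_TYPES);
	return model_default_prot[type] == MODEL_PAGE_KERNEL;
}
```

So "freshly allocated memory starts as RW" holds for module and BPF in 
this table, but not for kprobes.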

> If those requirements are all met, then if VM_FLUSH_RESET_PERMS was set but
> set_memory_*() was never called, the worst that can happen is for both the
> set_direct_map_invalid() and set_direct_map_default() calls to fail due to not
> enough memory. But that is safe because the memory was always RW. If
> set_memory_*() was called for the whole range and failed, it's the same as if it
> was never called. If it was called for the whole range and succeeded, then the
> split must have happened already and set_direct_map_invalid() and
> set_direct_map_default() will therefore definitely succeed.
>
> The only way this could be a problem is if someone vmallocs a range then
> performs a set_memory_*() on a sub-region without having first done it for the
> whole region. But we have not found any evidence that there are any users that
> do that.

Yes, exactly.

>
> In fact, by that logic, I think alloc_insn_page() must also be safe; it only
> allocates 1 page, so if set_memory_*() is subsequently called for it, it must by
> definition be covering the whole allocation; 1 page is the smallest amount that
> can be protected.

Yes, but the kprobes default permission is ROX.

>
> So I agree we are safe.
>
>
>>> In summary this all looks horribly fragile. But I *think* it works. It would be
>>> good to clean it all up and have some clearly documented rules regardless. But I
>>> think that could be a follow up series.
>> Yeah, absolutely agreed.
>>
>>>> Shall we move forward with v8?
>>> Yes; Do you want me to post that or would you prefer to do it? I'm happy to do
>>> it; there are a few other tidy ups in pageattr.c I want to make which I spotted.
>> I actually just had v8 ready in my tree. I removed pageattr_pgd_entry and
>> pageattr_pud_entry in pageattr.c and fixed pmd_leaf/pud_leaf as you suggested.
>> Is that the cleanup you were planning to do?
> I was also going to fix up the comment in change_memory_common() which is now stale.

Oops, I missed that in my v8. Please just comment on v8; I can fix it
up later.

Thanks,
Yang


>
>> And I also rebased it on top of Shijie's series
>> (https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?id=bfbbb0d3215f)
>> which has been picked up by Will.
>>
>>>> We can include the
>>>> fix to kprobes in v8 or I can send it separately, either is fine to me.
>>> Post it on list, and I'll also incorporate into the series.
>> I can include it in v8 series.
>>
>>>> Hopefully we can make v6.18.
>>> It's probably getting a bit late now. Anyway, I'll aim to get v8 out tomorrow or
>>> Friday and we will see what Will thinks.
>> Thank you. I can post v8 today.
> OK great - I'll leave it all to you then - thanks!
>
>> Thanks,
>> Yang
>>
>>> Thanks,
>>> Ryan
>>>
>>>> Thanks,
>>>> Yang
>>>>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 4 months, 3 weeks ago

On 17/09/2025 20:15, Yang Shi wrote:
> 
> 
> On 9/17/25 11:58 AM, Ryan Roberts wrote:
>> On 17/09/2025 18:21, Yang Shi wrote:
>>>
>>> On 9/17/25 9:28 AM, Ryan Roberts wrote:
>>>> Hi Yang,
>>>>
>>>> Sorry for the slow reply; I'm just getting back to this...
>>>>
>>>> On 11/09/2025 23:03, Yang Shi wrote:
>>>>> Hi Ryan & Catalin,
>>>>>
>>>>> Any more concerns about this?
>>>> I've been trying to convince myself of your assertion that all users that set
>>>> VM_FLUSH_RESET_PERMS also call set_memory_*() for the entire range that was
>>>> returned by vmalloc. I agree that if that is the contract and everyone is
>>>> following it, then there is no problem here.
>>>>
>>>> But I haven't been able to convince myself...
>>>>
>>>> Some examples (these might intersect with examples you previously raised):
>>>>
>>>> 1. bpf_dispatcher_change_prog() -> bpf_jit_alloc_exec() -> execmem_alloc() ->
>>>> sets VM_FLUSH_RESET_PERMS. But I don't see it calling set_memory_*() for
>>>> rw_image.
>>> Yes, it doesn't call set_memory_*(). I spotted this in the earlier email. But it
>>> is actually RW, so it should be ok to miss the call. The later
>>> set_direct_map_invalid call in vfree() may fail, but set_direct_map_default call
>>> will set RW permission back. But I think it doesn't have to use execmem_alloc(),
>>> the plain vmalloc() should be good enough.
>>>
>>>> 2. module_memory_alloc() -> execmem_alloc_rw() -> execmem_alloc() -> sets
>>>> VM_FLUSH_RESET_PERMS (note that execmem_force_rw() is nop for arm64).
>>>> set_memory_*() is not called until much later on in module_set_memory(). Another
>>>> error in the meantime could cause the memory to be vfreed before that point.
>>> IIUC, execmem_alloc_rw() is used to allocate memory for modules' text section
>>> and data section. The code will set mod->mem[type].is_rox according to the type
>>> of the section. It is true for text, false for data. Then set_memory_rox() will
>>> be called later if it is true *after* insns are copied to the memory. So it is
>>> still RW before that point.
>>>
>>>> 3. When set_vm_flush_reset_perms() is set for the range, it is called before
>>>> set_memory_*() which might then fail to split prior to vfree.
>>> Yes, unless I'm missing something, all call sites check the return value and
>>> bail out if set_memory_*() failed.
>>>
>>>> But I guess as long as set_memory_*() is never successfully called for a
>>>> *sub-range* of the vmalloc'ed region, then for all of the above issues, the
>>>> memory must still be RW at vfree-time, so this issue should be benign... I
>>>> think?
>>> Yes, it is true.
>> So to summarise, all freshly vmalloc'ed memory starts as RW. set_memory_*() may
>> only be called if VM_FLUSH_RESET_PERMS has already been set. If set_memory_*()
>> is called at all, the first call MUST be for the whole range.
> 
> Whether the default permission is RW or not depends on the type passed in by
> execmem_alloc(). It is defined by execmem_info in arch/arm64/mm/init.c. For
> ARM64, module and BPF have PAGE_KERNEL permission (RW) by default, but kprobes
> is PAGE_KERNEL_ROX (ROX).

Perhaps I missed it, but as far as I could tell the prot that the arch sets for
the type only determines the prot that is set for the vmalloc map. It doesn't
look like the linear map is modified at all... which feels like a bug to me
since the linear map will be RW while the vmalloc map will be ROX... I guess I
must have missed something...

> 
>> If those requirements are all met, then if VM_FLUSH_RESET_PERMS was set but
>> set_memory_*() was never called, the worst that can happen is for both the
>> set_direct_map_invalid() and set_direct_map_default() calls to fail due to not
>> enough memory. But that is safe because the memory was always RW. If
>> set_memory_*() was called for the whole range and failed, it's the same as if it
>> was never called. If it was called for the whole range and succeeded, then the
>> split must have happened already and set_direct_map_invalid() and
>> set_direct_map_default() will therefore definitely succeed.
>>
>> The only way this could be a problem is if someone vmallocs a range then
>> performs a set_memory_*() on a sub-region without having first done it for the
>> whole region. But we have not found any evidence that there are any users that
>> do that.
> 
> Yes, exactly.
> 
>>
>> In fact, by that logic, I think alloc_insn_page() must also be safe; it only
>> allocates 1 page, so if set_memory_*() is subsequently called for it, it must by
>> definition be covering the whole allocation; 1 page is the smallest amount that
>> can be protected.
> 
> Yes, but kprobes default permission is ROX.
> 
>>
>> So I agree we are safe.
>>
>>
>>>> In summary this all looks horribly fragile. But I *think* it works. It would be
>>>> good to clean it all up and have some clearly documented rules regardless. But I
>>>> think that could be a follow up series.
>>> Yeah, absolutely agreed.
>>>
>>>>> Shall we move forward with v8?
>>>> Yes; Do you want me to post that or would you prefer to do it? I'm happy to do
>>>> it; there are a few other tidy ups in pageattr.c I want to make which I
>>>> spotted.
>>> I actually just had v8 ready in my tree. I removed pageattr_pgd_entry and
>>> pageattr_pud_entry in pageattr.c and fixed pmd_leaf/pud_leaf as you suggested.
>>> Is that the cleanup you were planning to do?
>> I was also going to fix up the comment in change_memory_common() which is now
>> stale.
> 
> Oops, I missed that in my v8. Please just comment on v8; I can fix it up later.

Ahh no biggy. If there is a chance Will will take the series, let's not hold it
up for a comment.

> 
> Thanks,
> Yang
> 
> 
>>
>>> And I also rebased it on top of Shijie's series
>>> (https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?id=bfbbb0d3215f)
>>> which has been picked up by Will.
>>>
>>>>> We can include the
>>>>> fix to kprobes in v8 or I can send it separately, either is fine to me.
>>>> Post it on list, and I'll also incorporate into the series.
>>> I can include it in v8 series.
>>>
>>>>> Hopefully we can make v6.18.
>>>> It's probably getting a bit late now. Anyway, I'll aim to get v8 out tomorrow or
>>>> Friday and we will see what Will thinks.
>>> Thank you. I can post v8 today.
>> OK great - I'll leave it all to you then - thanks!
>>
>>> Thanks,
>>> Yang
>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>> Thanks,
>>>>> Yang
>>>>>
>

Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 4 months, 3 weeks ago


On 9/17/25 12:40 PM, Ryan Roberts wrote:
> On 17/09/2025 20:15, Yang Shi wrote:
>>
>> On 9/17/25 11:58 AM, Ryan Roberts wrote:
>>> On 17/09/2025 18:21, Yang Shi wrote:
>>>> On 9/17/25 9:28 AM, Ryan Roberts wrote:
>>>>> Hi Yang,
>>>>>
>>>>> Sorry for the slow reply; I'm just getting back to this...
>>>>>
>>>>> On 11/09/2025 23:03, Yang Shi wrote:
>>>>>> Hi Ryan & Catalin,
>>>>>>
>>>>>> Any more concerns about this?
>>>>> I've been trying to convince myself of your assertion that all users that set
>>>>> VM_FLUSH_RESET_PERMS also call set_memory_*() for the entire range that was
>>>>> returned by vmalloc. I agree that if that is the contract and everyone is
>>>>> following it, then there is no problem here.
>>>>>
>>>>> But I haven't been able to convince myself...
>>>>>
>>>>> Some examples (these might intersect with examples you previously raised):
>>>>>
>>>>> 1. bpf_dispatcher_change_prog() -> bpf_jit_alloc_exec() -> execmem_alloc() ->
>>>>> sets VM_FLUSH_RESET_PERMS. But I don't see it calling set_memory_*() for
>>>>> rw_image.
>>>> Yes, it doesn't call set_memory_*(). I spotted this in the earlier email. But it
>>>> is actually RW, so it should be ok to miss the call. The later
>>>> set_direct_map_invalid call in vfree() may fail, but set_direct_map_default call
>>>> will set RW permission back. But I think it doesn't have to use execmem_alloc(),
>>>> the plain vmalloc() should be good enough.
>>>>
>>>>> 2. module_memory_alloc() -> execmem_alloc_rw() -> execmem_alloc() -> sets
>>>>> VM_FLUSH_RESET_PERMS (note that execmem_force_rw() is nop for arm64).
>>>>> set_memory_*() is not called until much later on in module_set_memory(). Another
>>>>> error in the meantime could cause the memory to be vfreed before that point.
>>>> IIUC, execmem_alloc_rw() is used to allocate memory for modules' text section
>>>> and data section. The code will set mod->mem[type].is_rox according to the type
>>>> of the section. It is true for text, false for data. Then set_memory_rox() will
>>>> be called later if it is true *after* insns are copied to the memory. So it is
>>>> still RW before that point.
>>>>
>>>>> 3. When set_vm_flush_reset_perms() is set for the range, it is called before
>>>>> set_memory_*() which might then fail to split prior to vfree.
>>>> Yes, unless I'm missing something, all call sites check the return value and
>>>> bail out if set_memory_*() failed.
>>>>
>>>>> But I guess as long as set_memory_*() is never successfully called for a
>>>>> *sub-range* of the vmalloc'ed region, then for all of the above issues, the
>>>>> memory must still be RW at vfree-time, so this issue should be benign... I
>>>>> think?
>>>> Yes, it is true.
>>> So to summarise, all freshly vmalloc'ed memory starts as RW. set_memory_*() may
>>> only be called if VM_FLUSH_RESET_PERMS has already been set. If set_memory_*()
>>> is called at all, the first call MUST be for the whole range.
>> Whether the default permission is RW or not depends on the type passed in by
>> execmem_alloc(). It is defined by execmem_info in arch/arm64/mm/init.c. For
>> ARM64, module and BPF have PAGE_KERNEL permission (RW) by default, but kprobes
>> is PAGE_KERNEL_ROX (ROX).
> Perhaps I missed it, but as far as I could tell the prot that the arch sets for
> the type only determines the prot that is set for the vmalloc map. It doesn't
> look like the linear map is modified at all... which feels like a bug to me
> since the linear map will be RW while the vmalloc map will be ROX... I guess I
> must have missed something...

Yes, it only sets the permission for the vmalloc area. set_memory_*()
must be called to change the permission for the direct map.
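
A minimal userspace sketch of the two aliases, in case it helps. It assumes rodata=full, so that making the vmalloc alias read-only is mirrored onto the linear map alias; the toy_* names and the struct are hypothetical and only model the behaviour discussed here.

```c
#include <assert.h>

/* Hypothetical protection states for illustration only. */
enum toy_prot { TOY_RW, TOY_RO, TOY_ROX };

/* One physical page seen through its two kernel aliases. */
struct toy_page {
	enum toy_prot vmalloc_prot; /* alias in the vmalloc area */
	enum toy_prot linear_prot;  /* alias in the linear (direct) map */
};

/* Allocation applies the requested prot to the vmalloc alias only; the
 * linear map alias keeps its default RW permission. */
static struct toy_page toy_alloc(enum toy_prot prot)
{
	struct toy_page p = { .vmalloc_prot = prot, .linear_prot = TOY_RW };
	return p;
}

/* Models set_memory_rox() on the vmalloc address: the vmalloc alias becomes
 * ROX and, assuming rodata=full, the linear map alias loses write access. */
static void toy_set_memory_rox(struct toy_page *p)
{
	p->vmalloc_prot = TOY_ROX;
	p->linear_prot = TOY_RO;
}
```

With this model, toy_alloc(TOY_ROX) leaves linear_prot at TOY_RW until toy_set_memory_rox() is called, which is the mismatch you noticed for the kprobes type.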

>
>>> If those requirements are all met, then if VM_FLUSH_RESET_PERMS was set but
>>> set_memory_*() was never called, the worst that can happen is for both the
>>> set_direct_map_invalid() and set_direct_map_default() calls to fail due to not
>>> enough memory. But that is safe because the memory was always RW. If
>>> set_memory_*() was called for the whole range and failed, it's the same as if it
>>> was never called. If it was called for the whole range and succeeded, then the
>>> split must have happened already and set_direct_map_invalid() and
>>> set_direct_map_default() will therefore definitely succeed.
>>>
>>> The only way this could be a problem is if someone vmallocs a range then
>>> performs a set_memory_*() on a sub-region without having first done it for the
>>> whole region. But we have not found any evidence that there are any users that
>>> do that.
>> Yes, exactly.
>>
>>> In fact, by that logic, I think alloc_insn_page() must also be safe; it only
>>> allocates 1 page, so if set_memory_*() is subsequently called for it, it must by
>>> definition be covering the whole allocation; 1 page is the smallest amount that
>>> can be protected.
>> Yes, but kprobes default permission is ROX.
>>
>>> So I agree we are safe.
>>>
>>>
>>>>> In summary this all looks horribly fragile. But I *think* it works. It would be
>>>>> good to clean it all up and have some clearly documented rules regardless. But I
>>>>> think that could be a follow up series.
>>>> Yeah, absolutely agreed.
>>>>
>>>>>> Shall we move forward with v8?
>>>>> Yes; Do you want me to post that or would you prefer to do it? I'm happy to do
>>>>> it; there are a few other tidy ups in pageattr.c I want to make which I
>>>>> spotted.
>>>> I actually just had v8 ready in my tree. I removed pageattr_pgd_entry and
>>>> pageattr_pud_entry in pageattr.c and fixed pmd_leaf/pud_leaf as you suggested.
>>>> Is that the cleanup you were planning to do?
>>> I was also going to fix up the comment in change_memory_common() which is now
>>> stale.
>> Oops, I missed that in my v8. Please just comment on v8; I can fix it up later.
> Ahh no biggy. If there is a chance Will will take the series, let's not hold it
> up for a comment.

Yeah, sure, thank you.

Yang

>
>> Thanks,
>> Yang
>>
>>
>>>> And I also rebased it on top of Shijie's series
>>>> (https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?id=bfbbb0d3215f)
>>>> which has been picked up by Will.
>>>>
>>>>>> We can include the
>>>>>> fix to kprobes in v8 or I can send it separately, either is fine to me.
>>>>> Post it on list, and I'll also incorporate into the series.
>>>> I can include it in v8 series.
>>>>
>>>>>> Hopefully we can make v6.18.
>>>>> It's probably getting a bit late now. Anyway, I'll aim to get v8 out tomorrow or
>>>>> Friday and we will see what Will thinks.
>>>> Thank you. I can post v8 today.
>>> OK great - I'll leave it all to you then - thanks!
>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>