[v8] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

[PATCH v8 0/5] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 2 weeks ago

On systems with BBML2_NOABORT support, it causes the linear map to be mapped
with large blocks, even when rodata=full, and leads to some nice performance
improvements.

Ryan tested v7 on an AmpereOne system (a VM with 12G RAM) in all 3 possible
modes by hacking the BBML2 feature detection code:

  - mode 1: All CPUs support BBML2 so the linear map uses large mappings
  - mode 2: Boot CPU does not support BBML2 so linear map uses pte mappings
  - mode 3: Boot CPU supports BBML2 but secondaries do not so linear map
    initially uses large mappings but is then repainted to use pte mappings
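
As a rough illustration of how the three modes relate (a sketch only, not
kernel code; LinearMapMode and linear_map_mode() are hypothetical names, and
rodata=full is assumed throughout):

```python
from enum import Enum


class LinearMapMode(Enum):
    LARGE_BLOCKS = 1      # mode 1: large mappings are kept
    PTE_FROM_BOOT = 2     # mode 2: pte mappings from the start
    REPAINTED_TO_PTE = 3  # mode 3: large mappings first, then repainted to ptes


def linear_map_mode(boot_cpu_bbml2: bool, secondaries_bbml2: bool) -> LinearMapMode:
    """Map BBML2 support on the boot and secondary CPUs to one of the modes."""
    if not boot_cpu_bbml2:
        # The boot CPU decides how the linear map is first created.
        return LinearMapMode.PTE_FROM_BOOT
    if secondaries_bbml2:
        return LinearMapMode.LARGE_BLOCKS
    # Boot CPU supports BBML2 but at least one secondary does not.
    return LinearMapMode.REPAINTED_TO_PTE
```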

In all cases, the mm selftests ran with no regressions observed, and the ptdump
of the linear map was as expected. Since v8 contains only cleanups on top of
v7, I kept Ryan's test results:

Mode 1:
=======
---[ Linear Mapping start ]---
0xffff000000000000-0xffff000000200000           2M PMD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000000200000-0xffff000000210000          64K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000000210000-0xffff000000400000        1984K PTE       ro NX SHD AF            UXN    MEM/NORMAL
0xffff000000400000-0xffff000002400000          32M PMD       ro NX SHD AF        BLK UXN    MEM/NORMAL
0xffff000002400000-0xffff000002550000        1344K PTE       ro NX SHD AF            UXN    MEM/NORMAL
0xffff000002550000-0xffff000002600000         704K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000002600000-0xffff000004000000          26M PMD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000004000000-0xffff000040000000         960M PMD       RW NX SHD AF    CON BLK UXN    MEM/NORMAL-TAGGED
0xffff000040000000-0xffff000140000000           4G PUD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000140000000-0xffff000142000000          32M PMD       RW NX SHD AF    CON BLK UXN    MEM/NORMAL-TAGGED
0xffff000142000000-0xffff000142120000        1152K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000142120000-0xffff000142128000          32K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142128000-0xffff000142159000         196K PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142159000-0xffff000142160000          28K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142160000-0xffff000142240000         896K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000142240000-0xffff00014224e000          56K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff00014224e000-0xffff000142250000           8K PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142250000-0xffff000142260000          64K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142260000-0xffff000142280000         128K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000142280000-0xffff000142288000          32K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142288000-0xffff000142290000          32K PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142290000-0xffff0001422a0000          64K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff0001422a0000-0xffff000142465000        1812K PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142465000-0xffff000142470000          44K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000142470000-0xffff000142600000        1600K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000142600000-0xffff000144000000          26M PMD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000144000000-0xffff000180000000         960M PMD       RW NX SHD AF    CON BLK UXN    MEM/NORMAL-TAGGED
0xffff000180000000-0xffff000181a00000          26M PMD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000181a00000-0xffff000181b90000        1600K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000181b90000-0xffff000181b9d000          52K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181b9d000-0xffff000181c80000         908K PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181c80000-0xffff000181c90000          64K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181c90000-0xffff000181ca0000          64K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000181ca0000-0xffff000181dbd000        1140K PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181dbd000-0xffff000181dc0000          12K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181dc0000-0xffff000181e00000         256K PTE       RW NX SHD AF    CON     UXN    MEM/NORMAL-TAGGED
0xffff000181e00000-0xffff000182000000           2M PMD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000182000000-0xffff0001c0000000         992M PMD       RW NX SHD AF    CON BLK UXN    MEM/NORMAL-TAGGED
0xffff0001c0000000-0xffff000300000000           5G PUD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000300000000-0xffff008000000000         500G PUD
0xffff008000000000-0xffff800000000000      130560G PGD
---[ Linear Mapping end ]---

Mode 3:
=======
---[ Linear Mapping start ]---
0xffff000000000000-0xffff000000210000        2112K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000000210000-0xffff000000400000        1984K PTE       ro NX SHD AF            UXN    MEM/NORMAL
0xffff000000400000-0xffff000002400000          32M PMD       ro NX SHD AF        BLK UXN    MEM/NORMAL
0xffff000002400000-0xffff000002550000        1344K PTE       ro NX SHD AF            UXN    MEM/NORMAL
0xffff000002550000-0xffff000143a61000     5264452K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000143a61000-0xffff000143c61000           2M PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000143c61000-0xffff000181b9a000     1015012K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181b9a000-0xffff000181d9a000           2M PTE       ro NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000181d9a000-0xffff000300000000     6261144K PTE       RW NX SHD AF            UXN    MEM/NORMAL-TAGGED
0xffff000300000000-0xffff008000000000         500G PUD
0xffff008000000000-0xffff800000000000      130560G PGD
---[ Linear Mapping end ]---
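
To compare dumps like the ones above, one option is to total the mapped bytes
per page-table level. The following is a minimal sketch, assuming the line
layout shown above (address range, size, level, then attributes) and treating
lines without attributes as unmapped; mapped_bytes_by_level() and SAMPLE are
illustrative names, not part of any kernel tool:

```python
import re
from collections import defaultdict

# Address range, human-readable size, level name, then optional attributes.
LINE_RE = re.compile(
    r"^0x([0-9a-f]+)-0x([0-9a-f]+)\s+\S+\s+(PGD|P4D|PUD|PMD|PTE)\b(.*)$"
)

# Three lines copied from the Mode 1 dump; the last one has no attributes.
SAMPLE = """\
0xffff000000000000-0xffff000000200000           2M PMD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000040000000-0xffff000140000000           4G PUD       RW NX SHD AF        BLK UXN    MEM/NORMAL-TAGGED
0xffff000300000000-0xffff008000000000         500G PUD
"""


def mapped_bytes_by_level(ptdump_text):
    """Sum the size of every mapped range, grouped by page-table level."""
    totals = defaultdict(int)
    for line in ptdump_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # marker lines such as "---[ Linear Mapping start ]---"
        start, end, level, attrs = match.groups()
        if not attrs.strip():
            continue  # no attributes: assumed to be an unmapped hole
        totals[level] += int(end, 16) - int(start, 16)
    return dict(totals)


if __name__ == "__main__":
    for level, size in mapped_bytes_by_level(SAMPLE).items():
        print(level, size // 2**20, "MiB")
```

Feeding it the full Mode 1 and Mode 3 dumps should show the mapped ranges
moving from PUD/PMD entries to PTE entries.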


Performance Testing
===================
* Memory use after boot
Before:
MemTotal:       258988984 kB
MemFree:        254821700 kB

After:
MemTotal:       259505132 kB
MemFree:        255410264 kB

Around 500MB more memory is free to use.  The larger the machine, the
more memory is saved.
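
The arithmetic behind that figure can be checked with a short sketch. The
meminfo values are copied from above; the page-table estimate is only an
assumption about where the saving may come from (4K pages, 8-byte
descriptors, and a 256 GiB machine are not stated in this cover letter):

```python
# Values copied from the "Before" and "After" meminfo excerpts (in kB).
BEFORE = {"MemTotal": 258988984, "MemFree": 254821700}
AFTER = {"MemTotal": 259505132, "MemFree": 255410264}


def delta_kb(before, after):
    """Per-field difference (after - before) in kB."""
    return {key: after[key] - before[key] for key in before}


def pte_table_bytes(ram_bytes, page_size=4096, desc_size=8):
    """Bytes of last-level tables needed to map ram_bytes one page at a time.

    Assumption: one desc_size-byte descriptor per page_size-byte page;
    higher-level tables are ignored because they are comparatively tiny.
    """
    return ram_bytes // page_size * desc_size


if __name__ == "__main__":
    for field, kb in delta_kb(BEFORE, AFTER).items():
        print(f"{field}: +{kb} kB (+{kb / 1024:.0f} MiB)")
    # 256 GiB is an assumed RAM size, not a figure from the cover letter.
    print(pte_table_bytes(256 * 2**30) // 2**20, "MiB of PTE tables")
```

Under those assumptions the estimate scales linearly with RAM size, which
would be consistent with larger machines saving more.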

* Memcached
We saw performance degradation when running the Memcached benchmark with
rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
With this patchset, ops/sec increased by around 3.5% and P99 latency
dropped by around 9.6%.
The gain mainly came from reduced kernel TLB misses: the kernel TLB MPKI
(misses per thousand instructions) dropped by 28.5%.

The benchmark data is now on par with rodata=on too.
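
For reference, MPKI and a relative change such as the ones quoted above can
be computed as in this small sketch. The function names are illustrative and
the numbers in the usage lines are made up for the example; they are not
measurements from this series:

```python
def mpki(misses, instructions):
    """Misses per thousand retired instructions."""
    return misses * 1000 / instructions


def percent_change(before, after):
    """Relative change from before to after, in percent (negative = lower)."""
    return (after - before) / before * 100


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only.
    baseline = mpki(misses=4_000, instructions=1_000_000)
    patched = mpki(misses=3_000, instructions=1_000_000)
    print(f"MPKI change: {percent_change(baseline, patched):.1f}%")
```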

* Disk encryption (dm-crypt) benchmark
Ran the fio benchmark with the command below on a 128G ramdisk (ext4) with
disk encryption (dm-crypt):
fio --directory=/data --random_generator=lfsr --norandommap            \
    --randrepeat 1 --status-interval=999 --rw=write --bs=4k --loops=1  \
    --ioengine=sync --iodepth=1 --numjobs=1 --fsync_on_close=1         \
    --group_reporting --thread --name=iops-test-job --eta-newline=1    \
    --size 100G

IOPS increased by 90% - 150%. The variance is high, but the worst result
with the series is still around 90% better than the best result without
it. Bandwidth increased and average clat dropped proportionally.

* Sequential file read
Read a 100G file sequentially on XFS (xfs_io read with the page cache
populated). Bandwidth increased by 150%.

Additionally, Ryan ran this through a random selection of benchmarks on
AmpereOne. None showed any regressions, and various benchmarks showed
statistically significant improvements. Only those improvements are listed here:

+----------------------+----------------------------------------------------------+-------------------------+
| Benchmark            | Result Class                                             | Improvement vs 6.17-rc1 |
+======================+==========================================================+=========================+
| micromm/vmalloc      | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |              (I) -9.00% |
|                      | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |              (I) -6.93% |
|                      | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |              (I) -6.77% |
|                      | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |              (I) -4.63% |
+----------------------+----------------------------------------------------------+-------------------------+
| mmtests/hackbench    | process-sockets-30 (seconds)                             |              (I) -2.96% |
+----------------------+----------------------------------------------------------+-------------------------+
| mmtests/kernbench    | syst-192 (seconds)                                       |             (I) -12.77% |
+----------------------+----------------------------------------------------------+-------------------------+
| pts/perl-benchmark   | Test: Interpreter (Seconds)                              |              (I) -4.86% |
+----------------------+----------------------------------------------------------+-------------------------+
| pts/pgbench          | Scale: 1 Clients: 1 Read Write (TPS)                     |               (I) 5.07% |
|                      | Scale: 1 Clients: 1 Read Write - Latency (ms)            |              (I) -4.72% |
|                      | Scale: 100 Clients: 1000 Read Write (TPS)                |               (I) 2.58% |
|                      | Scale: 100 Clients: 1000 Read Write - Latency (ms)       |              (I) -2.52% |
+----------------------+----------------------------------------------------------+-------------------------+
| pts/sqlite-speedtest | Timed Time - Size 1,000 (Seconds)                        |              (I) -2.68% |
+----------------------+----------------------------------------------------------+-------------------------+

Changes since v7 [1]
====================
- Rebased on v6.17-rc6 and Shijie's rodata series (https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?id=bfbbb0d3215f)
  which has been picked up by Will.
- Patch 1: Fixed a pmd_leaf/pud_leaf issue, since the code may need to change
  permissions for invalid entries (per Jinjiang Tu).
- Patch 1: Removed pageattr_pgd_entry and pageattr_p4d_entry per Ryan.
- Used (-1ULL) instead of -1 per Catalin.
- Added a comment explaining that arm64 lazy mmu mode allows sleeping, per Ryan.
- Squashed patch #4 in v7 into patch #3.
- Squashed patch #6 in v7 into patch #4.
- Added patch #5 to fix an arm64 kprobes bug. It guarantees set_memory_rox()
  is called before vfree(). It can be merged separately or together with
  this series.
- Collected all the R-bs and A-bs.

Changes since v6 [2]
====================
- Patch 1: Minor refactor to implement walk_kernel_page_table_range() in terms
  of walk_kernel_page_table_range_lockless(). This also led to adding a *pmd
  argument to the lockless variant for consistency (per Catalin).
- Misc function/variable renames to improve clarity and consistency.
- Share the same synchronization flag between idmap_kpti_install_ng_mappings and
  wait_linear_map_split_to_ptes, which allows removal of bbml2_ptes[] to save
  ~20K from the kernel image.
- Only take pgtable_split_lock and enter lazy mmu mode once for both splits.
- Only walk the pgtable once for the common "split single page" case.
- Bypass split to contpmd and contpte when splitting the linear map to ptes.

[1] https://lore.kernel.org/linux-arm-kernel/20250829115250.2395585-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-arm-kernel/20250805081350.3854670-1-ryan.roberts@arm.com/


Dev Jain (1):
      arm64: Enable permission change on arm64 kernel block mappings

Ryan Roberts (1):
      arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs

Yang Shi (3):
      arm64: cpufeature: add AmpereOne to BBML2 allow list
      arm64: mm: support large block mapping when rodata=full
      arm64: kprobes: call set_memory_rox() for kprobe page

 arch/arm64/include/asm/cpufeature.h |   2 +
 arch/arm64/include/asm/mmu.h        |   3 +
 arch/arm64/include/asm/pgtable.h    |   5 ++
 arch/arm64/kernel/cpufeature.c      |  12 +++-
 arch/arm64/kernel/probes/kprobes.c  |  12 ++++
 arch/arm64/mm/mmu.c                 | 422 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 arch/arm64/mm/pageattr.c            | 123 ++++++++++++++++++++++++---------
 arch/arm64/mm/proc.S                |  27 ++++++--
 include/linux/pagewalk.h            |   3 +
 mm/pagewalk.c                       |  36 ++++++----
 10 files changed, 581 insertions(+), 64 deletions(-)

Re: [PATCH v8 0/5] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Will Deacon 1 week, 6 days ago

On Wed, 17 Sep 2025 12:02:06 -0700, Yang Shi wrote:
> On systems with BBML2_NOABORT support, it causes the linear map to be mapped
> with large blocks, even when rodata=full, and leads to some nice performance
> improvements.
> 
> Ryan tested v7 on an AmpereOne system (a VM with 12G RAM) in all 3 possible
> modes by hacking the BBML2 feature detection code:
> 
> [...]

Applied patches 1 and 3 to arm64 (for-next/mm), thanks!

[1/5] arm64: Enable permission change on arm64 kernel block mappings
      https://git.kernel.org/arm64/c/a660194dd101
[3/5] arm64: mm: support large block mapping when rodata=full
      https://git.kernel.org/arm64/c/a166563e7ec3

I also picked up the BBML allow-list addition (second patch) on
for-next/cpufeature.

The fourth patch ("arm64: mm: split linear mapping if BBML2 unsupported
on secondary CPUs") has some really horrible conflicts. These are partly
due to some of the type cleanups on for-next/mm but I think mainly due
to Kevin's kpti rework that landed after -rc1.

So I think the best bet might be to leave that one for next time, if
that's ok?

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev

Re: [PATCH v8 0/5] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Yang Shi 1 week, 5 days ago


On 9/18/25 2:10 PM, Will Deacon wrote:
> On Wed, 17 Sep 2025 12:02:06 -0700, Yang Shi wrote:
>> On systems with BBML2_NOABORT support, it causes the linear map to be mapped
>> with large blocks, even when rodata=full, and leads to some nice performance
>> improvements.
>>
>> Ryan tested v7 on an AmpereOne system (a VM with 12G RAM) in all 3 possible
>> modes by hacking the BBML2 feature detection code:
>>
>> [...]
> Applied patches 1 and 3 to arm64 (for-next/mm), thanks!
>
> [1/5] arm64: Enable permission change on arm64 kernel block mappings
>        https://git.kernel.org/arm64/c/a660194dd101
> [3/5] arm64: mm: support large block mapping when rodata=full
>        https://git.kernel.org/arm64/c/a166563e7ec3
>
> I also picked up the BBML allow-list addition (second patch) on
> for-next/cpufeature.

Hi Will,

Thank you so much!

>
> The fourth patch ("arm64: mm: split linear mapping if BBML2 unsupported
> on secondary CPUs") has some really horrible conflicts. These are partly
> due to some of the type cleanups on for-next/mm but I think mainly due
> to Kevin's kpti rework that landed after -rc1.
>
> So I think the best bet might be to leave that one for next time, if
> that's ok?

I saw you and Ryan just figured out how to move forward. You guys are 
definitely more knowledgeable than me regarding the asymmetric systems.

Thanks,
Yang

>
> Cheers,

Re: [PATCH v8 0/5] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 1 week, 5 days ago

On 18/09/2025 22:10, Will Deacon wrote:
> On Wed, 17 Sep 2025 12:02:06 -0700, Yang Shi wrote:
>> On systems with BBML2_NOABORT support, it causes the linear map to be mapped
>> with large blocks, even when rodata=full, and leads to some nice performance
>> improvements.
>>
>> Ryan tested v7 on an AmpereOne system (a VM with 12G RAM) in all 3 possible
>> modes by hacking the BBML2 feature detection code:
>>
>> [...]
> 
> Applied patches 1 and 3 to arm64 (for-next/mm), thanks!
> 
> [1/5] arm64: Enable permission change on arm64 kernel block mappings
>       https://git.kernel.org/arm64/c/a660194dd101
> [3/5] arm64: mm: support large block mapping when rodata=full
>       https://git.kernel.org/arm64/c/a166563e7ec3
> 
> I also picked up the BBML allow-list addition (second patch) on
> for-next/cpufeature.
> 
> The fourth patch ("arm64: mm: split linear mapping if BBML2 unsupported
> on secondary CPUs") has some really horrible conflicts. These are partly
> due to some of the type cleanups on for-next/mm but I think mainly due
> to Kevin's kpti rework that landed after -rc1.

Thanks Will, although I'm nervous that without this patch, some platforms might
not boot; Wikipedia tells me that there are some Google, Mediatek and Qualcomm
SoCs that pair X4 CPUs (which is on the BBML2_NOABORT allow list) with A720
and/or A520 (which are not). See previous mail at [1].

I'd be happy to rebase it if you can let me know the preferred base SHA/tree?

[1]
https://lore.kernel.org/linux-arm-kernel/11f84d00-8c76-402d-bbad-014a3542992f@arm.com/

Thanks,
Ryan


> 
> So I think the best bet might be to leave that one for next time, if
> that's ok?
> 
> Cheers,

Re: [PATCH v8 0/5] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Will Deacon 1 week, 5 days ago

On Fri, Sep 19, 2025 at 11:08:47AM +0100, Ryan Roberts wrote:
> On 18/09/2025 22:10, Will Deacon wrote:
> > On Wed, 17 Sep 2025 12:02:06 -0700, Yang Shi wrote:
> >> On systems with BBML2_NOABORT support, it causes the linear map to be mapped
> >> with large blocks, even when rodata=full, and leads to some nice performance
> >> improvements.
> >>
> >> Ryan tested v7 on an AmpereOne system (a VM with 12G RAM) in all 3 possible
> >> modes by hacking the BBML2 feature detection code:
> >>
> >> [...]
> > 
> > Applied patches 1 and 3 to arm64 (for-next/mm), thanks!
> > 
> > [1/5] arm64: Enable permission change on arm64 kernel block mappings
> >       https://git.kernel.org/arm64/c/a660194dd101
> > [3/5] arm64: mm: support large block mapping when rodata=full
> >       https://git.kernel.org/arm64/c/a166563e7ec3
> > 
> > I also picked up the BBML allow-list addition (second patch) on
> > for-next/cpufeature.
> > 
> > The fourth patch ("arm64: mm: split linear mapping if BBML2 unsupported
> > on secondary CPUs") has some really horrible conflicts. These are partly
> > due to some of the type cleanups on for-next/mm but I think mainly due
> > to Kevin's kpti rework that landed after -rc1.
> 
> Thanks Will, although I'm nervous that without this patch, some platforms might
> not boot; Wikipedia tells me that there are some Google, Mediatek and Qualcomm
> SoCs that pair X4 CPUs (which is on the BBML2_NOABORT allow list) with A720
> and/or A520 (which are not). See previous mail at [1].

I'd be surprised if these SoCs are booting on the X4 but who knows.

Lemme have another look at applying the patch with fresh eyes, but I do
wonder whether having X4 on the allow list really makes any sense. Are
there any SoCs out there that _don't_ pair it with CPUs that aren't on
the allow list? (apologies for the double negative).

Will

Re: [PATCH v8 0/5] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 1 week, 5 days ago

On 19/09/2025 12:27, Will Deacon wrote:
> On Fri, Sep 19, 2025 at 11:08:47AM +0100, Ryan Roberts wrote:
>> On 18/09/2025 22:10, Will Deacon wrote:
>>> On Wed, 17 Sep 2025 12:02:06 -0700, Yang Shi wrote:
>>>> On systems with BBML2_NOABORT support, it causes the linear map to be mapped
>>>> with large blocks, even when rodata=full, and leads to some nice performance
>>>> improvements.
>>>>
>>>> Ryan tested v7 on an AmpereOne system (a VM with 12G RAM) in all 3 possible
>>>> modes by hacking the BBML2 feature detection code:
>>>>
>>>> [...]
>>>
>>> Applied patches 1 and 3 to arm64 (for-next/mm), thanks!
>>>
>>> [1/5] arm64: Enable permission change on arm64 kernel block mappings
>>>       https://git.kernel.org/arm64/c/a660194dd101
>>> [3/5] arm64: mm: support large block mapping when rodata=full
>>>       https://git.kernel.org/arm64/c/a166563e7ec3
>>>
>>> I also picked up the BBML allow-list addition (second patch) on
>>> for-next/cpufeature.
>>>
>>> The fourth patch ("arm64: mm: split linear mapping if BBML2 unsupported
>>> on secondary CPUs") has some really horrible conflicts. These are partly
>>> due to some of the type cleanups on for-next/mm but I think mainly due
>>> to Kevin's kpti rework that landed after -rc1.
>>
>> Thanks Will, although I'm nervous that without this patch, some platforms might
>> not boot; Wikipedia tells me that there are some Google, Mediatek and Qualcomm
>> SoCs that pair X4 CPUs (which is on the BBML2_NOABORT allow list) with A720
>> and/or A520 (which are not). See previous mail at [1].
> 
> I'd be surprised if these SoCs are booting on the X4 but who knows.

Ahh. You can probably tell I'm a bit naive to some of this system level stuff...
I had assumed they would want to boot on the big CPU to reduce boot time.

> 
> Lemme have another look at applying the patch with fresh eyes, but I do
> wonder whether having X4 on the allow list really makes any sense. Are
> there any SoCs out there that _don't_ pair it with CPUs that aren't on
> the allow list? (apologies for the double negative).

Hmm, that's a fair question. I'm not aware of any. So I guess the simplest
solution is to remove X4 from the allow list and ditch fourth patch.


> 
> Will

Re: [PATCH v8 0/5] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Will Deacon 1 week, 5 days ago

On Fri, Sep 19, 2025 at 12:49:22PM +0100, Ryan Roberts wrote:
> On 19/09/2025 12:27, Will Deacon wrote:
> > On Fri, Sep 19, 2025 at 11:08:47AM +0100, Ryan Roberts wrote:
> >> On 18/09/2025 22:10, Will Deacon wrote:
> >>> On Wed, 17 Sep 2025 12:02:06 -0700, Yang Shi wrote:
> >>>> On systems with BBML2_NOABORT support, it causes the linear map to be mapped
> >>>> with large blocks, even when rodata=full, and leads to some nice performance
> >>>> improvements.
> >>>>
> >>>> Ryan tested v7 on an AmpereOne system (a VM with 12G RAM) in all 3 possible
> >>>> modes by hacking the BBML2 feature detection code:
> >>>>
> >>>> [...]
> >>>
> >>> Applied patches 1 and 3 to arm64 (for-next/mm), thanks!
> >>>
> >>> [1/5] arm64: Enable permission change on arm64 kernel block mappings
> >>>       https://git.kernel.org/arm64/c/a660194dd101
> >>> [3/5] arm64: mm: support large block mapping when rodata=full
> >>>       https://git.kernel.org/arm64/c/a166563e7ec3
> >>>
> >>> I also picked up the BBML allow-list addition (second patch) on
> >>> for-next/cpufeature.
> >>>
> >>> The fourth patch ("arm64: mm: split linear mapping if BBML2 unsupported
> >>> on secondary CPUs") has some really horrible conflicts. These are partly
> >>> due to some of the type cleanups on for-next/mm but I think mainly due
> >>> to Kevin's kpti rework that landed after -rc1.
> >>
> >> Thanks Will, although I'm nervous that without this patch, some platforms might
> >> not boot; Wikipedia tells me that there are some Google, Mediatek and Qualcomm
> >> SoCs that pair X4 CPUs (which is on the BBML2_NOABORT allow list) with A720
> >> and/or A520 (which are not). See previous mail at [1].
> > 
> > I'd be surprised if these SoCs are booting on the X4 but who knows.
> 
> Ahh. You can probably tell I'm a bit naive to some of this system level stuff...
> I had assumed they would want to boot on the big CPU to reduce boot time.

One of the problems is that the boot CPU becomes CPU0 and that inevitably
means it ends up being responsible for a tonne of extra stuff (interrupts,
TZ, etc) and in many cases can't be offlined. So it's all a trade-off.

> > Lemme have another look at applying the patch with fresh eyes, but I do
> > wonder whether having X4 on the allow list really makes any sense. Are
> > there any SoCs out there that _don't_ pair it with CPUs that aren't on
> > the allow list? (apologies for the double negative).
> 
> Hmm, that's a fair question. I'm not aware of any. So I guess the simplest
> solution is to remove X4 from the allow list and ditch fourth patch.

That's probably a good idea but I have a horrible feeling we _are_ going
to need your patch once the errata start flying about :)

So how about we:

  - Remove X4 from the list
  - I try harder to apply your patch for secondary CPUs...
  - ... if I fail, we can apply it next time around

Sound reasonable?

Will

Re: [PATCH v8 0/5] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 1 week, 5 days ago

On 19/09/2025 12:56, Will Deacon wrote:
> On Fri, Sep 19, 2025 at 12:49:22PM +0100, Ryan Roberts wrote:
>> On 19/09/2025 12:27, Will Deacon wrote:
>>> On Fri, Sep 19, 2025 at 11:08:47AM +0100, Ryan Roberts wrote:
>>>> On 18/09/2025 22:10, Will Deacon wrote:
>>>>> On Wed, 17 Sep 2025 12:02:06 -0700, Yang Shi wrote:
>>>>>> On systems with BBML2_NOABORT support, it causes the linear map to be mapped
>>>>>> with large blocks, even when rodata=full, and leads to some nice performance
>>>>>> improvements.
>>>>>>
>>>>>> Ryan tested v7 on an AmpereOne system (a VM with 12G RAM) in all 3 possible
>>>>>> modes by hacking the BBML2 feature detection code:
>>>>>>
>>>>>> [...]
>>>>>
>>>>> Applied patches 1 and 3 to arm64 (for-next/mm), thanks!
>>>>>
>>>>> [1/5] arm64: Enable permission change on arm64 kernel block mappings
>>>>>       https://git.kernel.org/arm64/c/a660194dd101
>>>>> [3/5] arm64: mm: support large block mapping when rodata=full
>>>>>       https://git.kernel.org/arm64/c/a166563e7ec3
>>>>>
>>>>> I also picked up the BBML allow-list addition (second patch) on
>>>>> for-next/cpufeature.
>>>>>
>>>>> The fourth patch ("arm64: mm: split linear mapping if BBML2 unsupported
>>>>> on secondary CPUs") has some really horrible conflicts. These are partly
>>>>> due to some of the type cleanups on for-next/mm but I think mainly due
>>>>> to Kevin's kpti rework that landed after -rc1.
>>>>
>>>> Thanks Will, although I'm nervous that without this patch, some platforms might
>>>> not boot; Wikipedia tells me that there are some Google, Mediatek and Qualcomm
>>>> SoCs that pair X4 CPUs (which is on the BBML2_NOABORT allow list) with A720
>>>> and/or A520 (which are not). See previous mail at [1].
>>>
>>> I'd be surprised if these SoCs are booting on the X4 but who knows.
>>
>> Ahh. You can probably tell I'm a bit naive to some of this system level stuff...
>> I had assumed they would want to boot on the big CPU to reduce boot time.
> 
> One of the problems is that the boot CPU becomes CPU0 and that inevitably
> means it ends up being responsible for a tonne of extra stuff (interrupts,
> TZ, etc) and in many cases can't be offlined. So it's all a trade-off.
> 
>>> Lemme have another look at applying the patch with fresh eyes, but I do
>>> wonder whether having X4 on the allow list really makes any sense. Are
>>> there any SoCs out there that _don't_ pair it with CPUs that aren't on
>>> the allow list? (apologies for the double negative).
>>
>> Hmm, that's a fair question. I'm not aware of any. So I guess the simplest
>> solution is to remove X4 from the allow list and ditch fourth patch.
> 
> That's probably a good idea but I have a horrible feeling we _are_ going
> to need your patch once the errata start flying about :)
> 
> So how about we:
> 
>   - Remove X4 from the list
>   - I try harder to apply your patch for secondary CPUs...
>   - ... if I fail, we can apply it next time around
> 
> Sound reasonable?

Yeah that works for me. Cheers!

> 
> Will

Re: [PATCH v8 0/5] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Will Deacon 1 week, 5 days ago

On Fri, Sep 19, 2025 at 01:00:49PM +0100, Ryan Roberts wrote:
> On 19/09/2025 12:56, Will Deacon wrote:
> > So how about we:
> > 
> >   - Remove X4 from the list
> >   - I try harder to apply your patch for secondary CPUs...
> >   - ... if I fail, we can apply it next time around
> > 
> > Sound reasonable?
> 
> Yeah that works for me. Cheers!

So after all that, the conflict was straightforward once I sat down and
looked at it properly.

Please can you check for-next/core? I forcefully triggered the
repainting path in qemu and it booted without any problems.

Will

Re: [PATCH v8 0/5] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

Posted by Ryan Roberts 1 week, 2 days ago

On 19/09/2025 19:44, Will Deacon wrote:
> On Fri, Sep 19, 2025 at 01:00:49PM +0100, Ryan Roberts wrote:
>> On 19/09/2025 12:56, Will Deacon wrote:
>>> So how about we:
>>>
>>>   - Remove X4 from the list
>>>   - I try harder to apply your patch for secondary CPUs...
>>>   - ... if I fail, we can apply it next time around
>>>
>>> Sound reasonable?
>>
>> Yeah that works for me. Cheers!
> 
> So after all that, the conflict was straightforward once I sat down and
> looked at it properly.
> 
> Please can you check for-next/core? I forcefully triggered the
> repainting path in qemu and it booted without any problems.

Thanks Will, I took a look and didn't spot any problems. Thanks for squeezing
this in.

> 
> Will