When the lazy MMU mode was introduced eons ago, it wasn't made clear
whether such a sequence was legal:

    arch_enter_lazy_mmu_mode()
    ...
    arch_enter_lazy_mmu_mode()
    ...
    arch_leave_lazy_mmu_mode()
    ...
    arch_leave_lazy_mmu_mode()

It seems fair to say that nested calls to
arch_{enter,leave}_lazy_mmu_mode() were not expected, and most
architectures never explicitly supported it.

Ryan Roberts' series from March [1] attempted to prevent nesting from
ever occurring, and mostly succeeded. Unfortunately, a corner case
(DEBUG_PAGEALLOC) may still cause nesting to occur on arm64. Ryan
proposed [2] to address that corner case at the generic level, but this
approach received pushback; [3] then attempted to solve the issue on
arm64 only, but it was deemed too fragile.

It feels generally fragile to rely on lazy_mmu sections not nesting,
because callers of various standard mm functions cannot know whether
those functions use lazy_mmu themselves. This series therefore performs
a U-turn and adds support for nested lazy_mmu sections, on all
architectures.

The main change enabling nesting is patch 2, following the approach
suggested by Catalin Marinas [4]: have enter() return some state and
have the matching leave() take that state. In this series, the state is
only used to handle nesting, but it could be used for other purposes,
such as restoring context modified by enter(); the proposed kpkeys
framework would be an immediate user [5].

Patch overview:

* Patch 1: general cleanup - not directly related, but avoids any doubt
  regarding the expected behaviour of arch_flush_lazy_mmu_mode()
  outside x86
* Patch 2: main API change, no functional change
* Patch 3-6: nesting support for all architectures that support
  lazy_mmu
* Patch 7: clarification that nesting is supported in the documentation

Patch 4-6 are technically not required at this stage, since nesting is
only observed on arm64, but they ensure future correctness in case
nesting is (re)introduced in generic paths.
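As a userspace sketch of the state-based pattern described above (the
type and field names here are illustrative assumptions, not code from
the actual patches), nesting becomes safe because each leave() simply
restores whatever the matching enter() observed:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the per-CPU/arch "lazy MMU active" flag. */
static bool lazy_mmu_active;

/* Opaque state returned by enter() and consumed by leave(). */
typedef struct {
    bool was_active; /* true if we nested inside an outer section */
} lazy_mmu_state_t;

static lazy_mmu_state_t arch_enter_lazy_mmu_mode(void)
{
    lazy_mmu_state_t state = { .was_active = lazy_mmu_active };

    lazy_mmu_active = true;
    return state;
}

static void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state)
{
    /* An inner leave() keeps the mode on; the outer one turns it off. */
    lazy_mmu_active = state.was_active;
}
```

With a stateless API, the inner leave() would turn the mode off while
the outer section still expects it to be on; threading the state
through makes the pairing explicit.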
For instance, it could be beneficial in some configurations to enter
lazy_mmu in set_ptes() once again.

This series has been tested by running the mm kselftests on arm64 with
DEBUG_PAGEALLOC and KFENCE. It was also build-tested on other
architectures (with and without XEN_PV on x86).

- Kevin

[1] https://lore.kernel.org/all/20250303141542.3371656-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20250530140446.2387131-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20250606135654.178300-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/all/aEhKSq0zVaUJkomX@arm.com/
[5] https://lore.kernel.org/linux-hardening/20250815085512.2182322-19-kevin.brodsky@arm.com/
---
Changelog

v1..v2:
- Rebased on mm-unstable.
- Patch 2: handled new calls to enter()/leave(), clarified how the
  "flush" pattern (leave() followed by enter()) is handled.
- Patch 5,6: removed an unnecessary local variable [Alexander
  Gordeev's suggestion].
- Added Mike Rapoport's Acked-by.

v1: https://lore.kernel.org/all/20250904125736.3918646-1-kevin.brodsky@arm.com/
---
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
---
Kevin Brodsky (7):
  mm: remove arch_flush_lazy_mmu_mode()
  mm: introduce local state for lazy_mmu sections
  arm64: mm: fully support nested lazy_mmu sections
  x86/xen: support nested lazy_mmu sections (again)
  powerpc/mm: support nested lazy_mmu sections
  sparc/mm: support nested lazy_mmu sections
  mm: update lazy_mmu documentation

 arch/arm64/include/asm/pgtable.h           | 34 ++++++-------------
 .../include/asm/book3s/64/tlbflush-hash.h  | 22 ++++++++----
 arch/powerpc/mm/book3s64/hash_tlb.c        | 10 +++---
 arch/powerpc/mm/book3s64/subpage_prot.c    |  5 +--
 arch/sparc/include/asm/tlbflush_64.h       |  6 ++--
 arch/sparc/mm/tlb.c                        | 17 +++++++---
 arch/x86/include/asm/paravirt.h            |  8 ++---
 arch/x86/include/asm/paravirt_types.h      |  6 ++--
 arch/x86/include/asm/pgtable.h             |  3 +-
 arch/x86/xen/enlighten_pv.c                |  2 +-
 arch/x86/xen/mmu_pv.c                      | 13 ++++---
 fs/proc/task_mmu.c                         |  5 +--
 include/linux/mm_types.h                   |  3 ++
 include/linux/pgtable.h                    | 21 +++++++---
 mm/kasan/shadow.c                          |  4 +--
 mm/madvise.c                               | 20 ++++++-----
 mm/memory.c                                | 20 ++++++-----
 mm/migrate_device.c                        |  5 +--
 mm/mprotect.c                              |  5 +--
 mm/mremap.c                                |  5 +--
 mm/userfaultfd.c                           |  5 +--
 mm/vmalloc.c                               | 15 ++++----
 mm/vmscan.c                                | 15 ++++----
 23 files changed, 148 insertions(+), 101 deletions(-)

base-commit: b024763926d2726978dff6588b81877d000159c1
--
2.47.0
On Mon, 8 Sep 2025 08:39:24 +0100 Kevin Brodsky <kevin.brodsky@arm.com> wrote:

> The main change enabling nesting is patch 2, following the approach
> suggested by Catalin Marinas [4]: have enter() return some state and
> the matching leave() take that state.

This is so totally the correct way. Thanks.
On 09.09.25 04:16, Andrew Morton wrote:
> On Mon, 8 Sep 2025 08:39:24 +0100 Kevin Brodsky <kevin.brodsky@arm.com> wrote:
>
>> The main change enabling nesting is patch 2, following the approach
>> suggested by Catalin Marinas [4]: have enter() return some state and
>> the matching leave() take that state.
>
> This is so totally the correct way. Thanks.

Staring at this, I wonder if we could alternatively handle it like
pagefault_disable()/pagefault_enable(), having something like
current->lazy_mmu_enabled.

We wouldn't have to worry about preemption in that case I guess
(unless the arch has special requirements).

Not sure if that was already discussed, just a thought.

--
Cheers

David / dhildenb
On 09/09/2025 11:21, David Hildenbrand wrote:
> On 09.09.25 04:16, Andrew Morton wrote:
>> On Mon, 8 Sep 2025 08:39:24 +0100 Kevin Brodsky
>> <kevin.brodsky@arm.com> wrote:
>>
>>> The main change enabling nesting is patch 2, following the approach
>>> suggested by Catalin Marinas [4]: have enter() return some state and
>>> the matching leave() take that state.
>>
>> This is so totally the correct way. Thanks.
>
> Staring at this, I wonder if we could alternatively handle it like
> pagefault_disable()/pagefault_enable(), having something like
> current->lazy_mmu_enabled.
>
> We wouldn't have to worry about preemption in that case I guess
> (unless the arch has special requirements).
>
> Not sure if that was already discussed, just a thought.

Based on the outcome of the discussion with David on patch 2 [1],
there is indeed an alternative approach that we should seriously
consider. In summary:

* Keep the API stateless, handle nesting with a counter in task_struct
* Introduce new functions to temporarily disable lazy_mmu without
  impacting nesting, track that with a bool in task_struct (addresses
  the situation in mm/kasan/shadow.c and possibly some x86 cases too)
* Move as much handling as possible from arch_* to generic functions

What the new generic infrastructure would look like:

struct task_struct {
    ...
#ifdef CONFIG_ARCH_LAZY_MMU
    struct {
        uint8_t count;
        bool enabled; /* or paused, see below */
    } lazy_mmu_state;
#endif
}

* lazy_mmu_mode_enable():

    if (!lazy_mmu_state.count) {
        arch_enter_lazy_mmu_mode();
        lazy_mmu_state.enabled = true;
    }
    lazy_mmu_state.count++;

* lazy_mmu_mode_disable():

    lazy_mmu_state.count--;
    if (!lazy_mmu_state.count) {
        lazy_mmu_state.enabled = false;
        arch_leave_lazy_mmu_mode();
    } else {
        arch_flush_lazy_mmu_mode();
    }

* lazy_mmu_mode_pause():

    lazy_mmu_state.enabled = false;
    arch_leave_lazy_mmu_mode();

* lazy_mmu_mode_resume():

    arch_enter_lazy_mmu_mode();
    lazy_mmu_state.enabled = true;

The generic enable()/disable() helpers are able to handle most of the
logic, leaving only truly arch-specific code to the arch callbacks:

* Updating lazy_mmu_state
* Sanity checks on lazy_mmu_state (e.g. count underflow/overflow,
  pause()/resume() only called when count > 0, etc.)
* Bailing out if in_interrupt() (not done consistently across arch's
  at the moment)

A further improvement is to make arch code check
lazy_mmu_state.enabled to determine whether lazy_mmu is enabled at any
given point. At the moment every arch uses a different mechanism, and
this is an occasion to make them converge.

The arch callback interface remains unchanged, and we are resurrecting
arch_flush_lazy_mmu_mode() to handle the nested disable() case
(flushing must happen when exiting a section regardless of nesting):

    enable()  -> arch_enter()
    enable()  -> [nothing]
    disable() -> arch_flush()
    disable() -> arch_leave()

Note: lazy_mmu_state.enabled (set whenever lazy_mmu is actually
enabled) could be replaced with lazy_mmu_state.paused (set inside a
pause()/resume() section). I believe this is equivalent, but the
former is slightly more convenient for arch code - to be confirmed in
practice.

Any thoughts on this? Unless there are concerns, I will move towards
that approach in v3.

- Kevin

[1] https://lore.kernel.org/all/4aa28016-5678-4c66-8104-8dcc3fa2f5ce@redhat.com/t/#u
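The counter semantics sketched in that reply can be modelled in a
small userspace program (the arch_* callbacks are stubbed out as
counters purely to check the sequencing; this is my illustration, not
the actual proposed kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for current->lazy_mmu_state. */
static struct {
    uint8_t count;
    bool enabled;
} lazy_mmu_state;

/* Stub arch callbacks: just count how often each one fires. */
static int enters, leaves, flushes;

static void arch_enter_lazy_mmu_mode(void) { enters++; }
static void arch_leave_lazy_mmu_mode(void) { leaves++; }
static void arch_flush_lazy_mmu_mode(void) { flushes++; }

static void lazy_mmu_mode_enable(void)
{
    /* Only the outermost enable() actually enters the mode. */
    if (!lazy_mmu_state.count) {
        arch_enter_lazy_mmu_mode();
        lazy_mmu_state.enabled = true;
    }
    lazy_mmu_state.count++;
}

static void lazy_mmu_mode_disable(void)
{
    lazy_mmu_state.count--;
    if (!lazy_mmu_state.count) {
        lazy_mmu_state.enabled = false;
        arch_leave_lazy_mmu_mode();
    } else {
        /* Exiting a nested section still flushes pending updates. */
        arch_flush_lazy_mmu_mode();
    }
}
```

This makes the "flushing must happen when exiting a section regardless
of nesting" rule concrete: a nested disable() triggers arch_flush(),
and only the outermost disable() triggers arch_leave().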
On 09/09/2025 11:21, David Hildenbrand wrote:
> On 09.09.25 04:16, Andrew Morton wrote:
>> On Mon, 8 Sep 2025 08:39:24 +0100 Kevin Brodsky
>> <kevin.brodsky@arm.com> wrote:
>>
>>> The main change enabling nesting is patch 2, following the approach
>>> suggested by Catalin Marinas [4]: have enter() return some state and
>>> the matching leave() take that state.
>>
>> This is so totally the correct way. Thanks.
>
> Staring at this, I wonder if we could alternatively handle it like
> pagefault_disable()/pagefault_enable(), having something like
> current->lazy_mmu_enabled.
>
> We wouldn't have to worry about preemption in that case I guess
> (unless the arch has special requirements).
>
> Not sure if that was already discussed, just a thought.

That's an interesting point, I think I've addressed it in reply to
patch 2 [1].

- Kevin

[1] https://lore.kernel.org/all/47ee1df7-1602-4200-af94-475f84ca8d80@arm.com/
On Mon, Sep 08, 2025 at 08:39:24AM +0100, Kevin Brodsky wrote:
> When the lazy MMU mode was introduced eons ago, it wasn't made clear
> whether such a sequence was legal:
>
>     arch_enter_lazy_mmu_mode()
>     ...
>     arch_enter_lazy_mmu_mode()
>     ...
>     arch_leave_lazy_mmu_mode()
>     ...
>     arch_leave_lazy_mmu_mode()
>
> It seems fair to say that nested calls to
> arch_{enter,leave}_lazy_mmu_mode() were not expected, and most
> architectures never explicitly supported it.

This is compiling with CONFIG_USERFAULTFD at all commits, the series
compiles with allmodconfig, and all mm selftests are passing, so from
my side this looks good. Thanks for addressing the issues and
rebasing! :)

Cheers, Lorenzo
On 08/09/2025 18:56, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 08:39:24AM +0100, Kevin Brodsky wrote:
>> When the lazy MMU mode was introduced eons ago, it wasn't made clear
>> whether such a sequence was legal:
>>
>>     arch_enter_lazy_mmu_mode()
>>     ...
>>     arch_enter_lazy_mmu_mode()
>>     ...
>>     arch_leave_lazy_mmu_mode()
>>     ...
>>     arch_leave_lazy_mmu_mode()
>>
>> It seems fair to say that nested calls to
>> arch_{enter,leave}_lazy_mmu_mode() were not expected, and most
>> architectures never explicitly supported it.
>
> This is compiling with CONFIG_USERFAULTFD at all commits and series is
> compiling with allmodconfig plus all mm selftests are passing so from my
> side this looks good, thanks for addressing issues and rebasing! :)

Great, thank you for that extensive testing, much appreciated! Shall I
add your Reviewed-by to the whole series?

- Kevin