Despite recent efforts to prevent lazy_mmu sections from nesting, it
remains difficult to ensure that it never occurs - and in fact it
does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC).
Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
made nesting tolerable on arm64, but without truly supporting it:
the inner call to leave() disables the batching optimisation before
the outer section ends.

This patch actually enables lazy_mmu sections to nest by tracking
the nesting level in task_struct, in a similar fashion to e.g.
pagefault_{enable,disable}(). This is fully handled by the generic
lazy_mmu helpers that were recently introduced.

lazy_mmu sections were not initially intended to nest, so we need to
clarify the semantics w.r.t. the arch_*_lazy_mmu_mode() callbacks.
This patch takes the following approach:

* The outermost calls to lazy_mmu_mode_{enable,disable}() trigger
  calls to arch_{enter,leave}_lazy_mmu_mode() - this is unchanged.

* Nested calls to lazy_mmu_mode_{enable,disable}() are not forwarded
  to the arch via arch_{enter,leave} - lazy MMU remains enabled so
  the assumption is that these callbacks are not relevant. However,
  existing code may rely on a call to disable() to flush any batched
  state, regardless of nesting. arch_flush_lazy_mmu_mode() is
  therefore called in that situation.

A separate interface was recently introduced to temporarily pause
the lazy MMU mode: lazy_mmu_mode_{pause,resume}(). pause() fully
exits the mode *regardless of the nesting level*, and resume()
restores the mode at the same nesting level.

Whether the mode is actually enabled or not at any point is tracked
by a separate "active" field in task_struct; this makes it possible
to check invariants in the generic API, and to expose a new
in_lazy_mmu_mode() helper to replace the various ways arch's
currently track whether the mode is enabled (this will be done in
later patches).

In summary (nesting/active represent the values *after* the call):

    lazy_mmu_mode_enable()   -> arch_enter()  nesting=1 active=1
    lazy_mmu_mode_enable()   -> ø             nesting=2 active=1
    lazy_mmu_mode_pause()    -> arch_leave()  nesting=2 active=0
    lazy_mmu_mode_resume()   -> arch_enter()  nesting=2 active=1
    lazy_mmu_mode_disable()  -> arch_flush()  nesting=1 active=1
    lazy_mmu_mode_disable()  -> arch_leave()  nesting=0 active=0

Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
headers included by <linux/pgtable.h> to use it.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/arm64/include/asm/pgtable.h | 12 ------
include/linux/mm_types_task.h | 5 +++
include/linux/pgtable.h | 67 ++++++++++++++++++++++++++++++--
include/linux/sched.h | 16 ++++++++
4 files changed, 84 insertions(+), 16 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 54f8d6bb6f22..535435248923 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -82,18 +82,6 @@ static inline void queue_pte_barriers(void)
static inline void arch_enter_lazy_mmu_mode(void)
{
- /*
- * lazy_mmu_mode is not supposed to permit nesting. But in practice this
- * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
- * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
- * permissions on the linear map with apply_to_page_range(), which
- * re-enters lazy_mmu_mode. So we tolerate nesting in our
- * implementation. The first call to arch_leave_lazy_mmu_mode() will
- * flush and clear the flag such that the remainder of the work in the
- * outer nest behaves as if outside of lazy mmu mode. This is safe and
- * keeps tracking simple.
- */
-
if (in_interrupt())
return;
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index a82aa80c0ba4..632d404f8191 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -88,4 +88,9 @@ struct tlbflush_unmap_batch {
#endif
};
+struct lazy_mmu_state {
+ u8 nesting_level;
+ bool active;
+};
+
#endif /* _LINUX_MM_TYPES_TASK_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b5fdf32c437f..e6064e00b22d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -228,27 +228,86 @@ static inline int pmd_dirty(pmd_t pmd)
* of the lazy mode. So the implementation must assume preemption may be enabled
* and cpu migration is possible; it must take steps to be robust against this.
* (In practice, for user PTE updates, the appropriate page table lock(s) are
- * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
- * and the mode cannot be used in interrupt context.
+ * held, but for kernel PTE updates, no lock is held). The mode cannot be used
+ * in interrupt context.
+ *
+ * The lazy MMU mode is enabled for a given block of code using:
+ *
+ * lazy_mmu_mode_enable();
+ * <code>
+ * lazy_mmu_mode_disable();
+ *
+ * Nesting is permitted: <code> may itself use an enable()/disable() pair.
+ * A nested call to enable() has no functional effect; however disable() causes
+ * any batched architectural state to be flushed regardless of nesting. After a
+ * call to disable(), the caller can therefore rely on all previous page table
+ * modifications to have taken effect, but the lazy MMU mode may still be
+ * enabled.
+ *
+ * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
+ * This can be done using:
+ *
+ * lazy_mmu_mode_pause();
+ * <code>
+ * lazy_mmu_mode_resume();
+ *
+ * This sequence must only be used if the lazy MMU mode is already enabled.
+ * pause() ensures that the mode is exited regardless of the nesting level;
+ * resume() re-enters the mode at the same nesting level. <code> must not modify
+ * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
+ * helpers).
+ *
+ * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
+ * currently enabled.
*/
#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
static inline void lazy_mmu_mode_enable(void)
{
- arch_enter_lazy_mmu_mode();
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
+ /* enable() must not be called while paused */
+ VM_WARN_ON(state->nesting_level > 0 && !state->active);
+
+ if (state->nesting_level++ == 0) {
+ state->active = true;
+ arch_enter_lazy_mmu_mode();
+ }
}
static inline void lazy_mmu_mode_disable(void)
{
- arch_leave_lazy_mmu_mode();
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ VM_WARN_ON_ONCE(state->nesting_level == 0);
+ VM_WARN_ON(!state->active);
+
+ if (--state->nesting_level == 0) {
+ state->active = false;
+ arch_leave_lazy_mmu_mode();
+ } else {
+ /* Exiting a nested section */
+ arch_flush_lazy_mmu_mode();
+ }
}
static inline void lazy_mmu_mode_pause(void)
{
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ VM_WARN_ON(state->nesting_level == 0 || !state->active);
+
+ state->active = false;
arch_leave_lazy_mmu_mode();
}
static inline void lazy_mmu_mode_resume(void)
{
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ VM_WARN_ON(state->nesting_level == 0 || state->active);
+
+ state->active = true;
arch_enter_lazy_mmu_mode();
}
#else
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cbb7340c5866..11566d973f42 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1441,6 +1441,10 @@ struct task_struct {
struct page_frag task_frag;
+#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
+ struct lazy_mmu_state lazy_mmu_state;
+#endif
+
#ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info *delays;
#endif
@@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
return task_index_to_char(task_state_index(tsk));
}
+#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
+static inline bool in_lazy_mmu_mode(void)
+{
+ return current->lazy_mmu_state.active;
+}
+#else
+static inline bool in_lazy_mmu_mode(void)
+{
+ return false;
+}
+#endif
+
extern struct pid *cad_pid;
/*
--
2.47.0
On 29/10/2025 10:09, Kevin Brodsky wrote:
> Despite recent efforts to prevent lazy_mmu sections from nesting, it
> remains difficult to ensure that it never occurs - and in fact it
> does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC).
> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
> made nesting tolerable on arm64, but without truly supporting it:
> the inner call to leave() disables the batching optimisation before
> the outer section ends.
>
> This patch actually enables lazy_mmu sections to nest by tracking
> the nesting level in task_struct, in a similar fashion to e.g.
> pagefault_{enable,disable}(). This is fully handled by the generic
> lazy_mmu helpers that were recently introduced.
>
> lazy_mmu sections were not initially intended to nest, so we need to
> clarify the semantics w.r.t. the arch_*_lazy_mmu_mode() callbacks.
> This patch takes the following approach:
>
> * The outermost calls to lazy_mmu_mode_{enable,disable}() trigger
> calls to arch_{enter,leave}_lazy_mmu_mode() - this is unchanged.
>
> * Nested calls to lazy_mmu_mode_{enable,disable}() are not forwarded
> to the arch via arch_{enter,leave} - lazy MMU remains enabled so
> the assumption is that these callbacks are not relevant. However,
> existing code may rely on a call to disable() to flush any batched
> state, regardless of nesting. arch_flush_lazy_mmu_mode() is
> therefore called in that situation.
>
> A separate interface was recently introduced to temporarily pause
> the lazy MMU mode: lazy_mmu_mode_{pause,resume}(). pause() fully
> exits the mode *regardless of the nesting level*, and resume()
> restores the mode at the same nesting level.
>
> Whether the mode is actually enabled or not at any point is tracked
> by a separate "active" field in task_struct; this makes it possible
> to check invariants in the generic API, and to expose a new
> in_lazy_mmu_mode() helper to replace the various ways arch's
> currently track whether the mode is enabled (this will be done in
> later patches).
>
> In summary (nesting/active represent the values *after* the call):
>
> lazy_mmu_mode_enable() -> arch_enter() nesting=1 active=1
> lazy_mmu_mode_enable() -> ø nesting=2 active=1
> lazy_mmu_mode_pause() -> arch_leave() nesting=2 active=0
> lazy_mmu_mode_resume() -> arch_enter() nesting=2 active=1
> lazy_mmu_mode_disable() -> arch_flush() nesting=1 active=1
> lazy_mmu_mode_disable() -> arch_leave() nesting=0 active=0
>
> Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
> headers included by <linux/pgtable.h> to use it.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 12 ------
> include/linux/mm_types_task.h | 5 +++
> include/linux/pgtable.h | 67 ++++++++++++++++++++++++++++++--
> include/linux/sched.h | 16 ++++++++
> 4 files changed, 84 insertions(+), 16 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 54f8d6bb6f22..535435248923 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -82,18 +82,6 @@ static inline void queue_pte_barriers(void)
>
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> - /*
> - * lazy_mmu_mode is not supposed to permit nesting. But in practice this
> - * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
> - * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
> - * permissions on the linear map with apply_to_page_range(), which
> - * re-enters lazy_mmu_mode. So we tolerate nesting in our
> - * implementation. The first call to arch_leave_lazy_mmu_mode() will
> - * flush and clear the flag such that the remainder of the work in the
> - * outer nest behaves as if outside of lazy mmu mode. This is safe and
> - * keeps tracking simple.
> - */
> -
> if (in_interrupt())
> return;
>
> diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
> index a82aa80c0ba4..632d404f8191 100644
> --- a/include/linux/mm_types_task.h
> +++ b/include/linux/mm_types_task.h
> @@ -88,4 +88,9 @@ struct tlbflush_unmap_batch {
> #endif
> };
>
> +struct lazy_mmu_state {
> + u8 nesting_level;
> + bool active;
> +};
> +
> #endif /* _LINUX_MM_TYPES_TASK_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b5fdf32c437f..e6064e00b22d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -228,27 +228,86 @@ static inline int pmd_dirty(pmd_t pmd)
> * of the lazy mode. So the implementation must assume preemption may be enabled
> * and cpu migration is possible; it must take steps to be robust against this.
> * (In practice, for user PTE updates, the appropriate page table lock(s) are
> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> - * and the mode cannot be used in interrupt context.
> + * held, but for kernel PTE updates, no lock is held). The mode cannot be used
> + * in interrupt context.
"The mode cannot be used in interrupt context"; except it is for arm64. KFENCE
and/or DEBUG_PAGEALLOC will request the arch to change linear map permissions,
which will enter lazy mmu (now using the new generic API). This can happen in
softirq context.
> + *
> + * The lazy MMU mode is enabled for a given block of code using:
> + *
> + * lazy_mmu_mode_enable();
> + * <code>
> + * lazy_mmu_mode_disable();
> + *
> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
> + * A nested call to enable() has no functional effect; however disable() causes
> + * any batched architectural state to be flushed regardless of nesting. After a
> + * call to disable(), the caller can therefore rely on all previous page table
> + * modifications to have taken effect, but the lazy MMU mode may still be
> + * enabled.
> + *
> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
> + * This can be done using:
> + *
> + * lazy_mmu_mode_pause();
> + * <code>
> + * lazy_mmu_mode_resume();
> + *
> + * This sequence must only be used if the lazy MMU mode is already enabled.
> + * pause() ensures that the mode is exited regardless of the nesting level;
> + * resume() re-enters the mode at the same nesting level. <code> must not modify
> + * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
> + * helpers).
> + *
> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> + * currently enabled.
> */
Nice documentation!
> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> static inline void lazy_mmu_mode_enable(void)
> {
> - arch_enter_lazy_mmu_mode();
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> + /* enable() must not be called while paused */
> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> +
> + if (state->nesting_level++ == 0) {
Hmm... for the arm64 case of calling this in an interrupt, Is it safe?
If a task is calling this function and gets interrupted here, nesting_level==1
but active==false. The interrupt then calls this function and increments from 1
to 2 but arch_enter_lazy_mmu_mode() hasn't been called.
More dangerously (I think), when the interrupt handler calls
lazy_mmu_mode_disable(), it will end up calling arch_flush_lazy_mmu_mode() which
could be an issue because as far as the arch is concerned, it's not in lazy mode.
The current arm64 implementation works because setting and clearing the thread
flags is atomic.
Perhaps you need to disable preemption around the if block?
> + state->active = true;
> + arch_enter_lazy_mmu_mode();
> + }
> }
>
> static inline void lazy_mmu_mode_disable(void)
> {
> - arch_leave_lazy_mmu_mode();
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON_ONCE(state->nesting_level == 0);
> + VM_WARN_ON(!state->active);
> +
> + if (--state->nesting_level == 0) {
> + state->active = false;
> + arch_leave_lazy_mmu_mode();
> + } else {
> + /* Exiting a nested section */
> + arch_flush_lazy_mmu_mode();
> + }
> }
>
> static inline void lazy_mmu_mode_pause(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
nit: do you need the first condition? I think when nesting_level==0, we expect
to be !active?
> +
> + state->active = false;
> arch_leave_lazy_mmu_mode();
> }
>
> static inline void lazy_mmu_mode_resume(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON(state->nesting_level == 0 || state->active);
Similar argument?
> +
> + state->active = true;
> arch_enter_lazy_mmu_mode();
> }
> #else
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index cbb7340c5866..11566d973f42 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1441,6 +1441,10 @@ struct task_struct {
>
> struct page_frag task_frag;
>
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> + struct lazy_mmu_state lazy_mmu_state;
> +#endif
> +
> #ifdef CONFIG_TASK_DELAY_ACCT
> struct task_delay_info *delays;
> #endif
> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
> return task_index_to_char(task_state_index(tsk));
> }
>
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return current->lazy_mmu_state.active;
> +}
> +#else
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return false;
Just pointing out that this isn't really a correct implementation:
lazy_mmu_mode_enable()
ASSERT(in_lazy_mmu_mode()) << triggers for arches without lazy mmu
lazy_mmu_mode_disable()
Although it probably doesn't matter in practice?
Thanks,
Ryan
> +}
> +#endif
> +
> extern struct pid *cad_pid;
>
> /*
On 07/11/2025 14:59, Ryan Roberts wrote:
> On 29/10/2025 10:09, Kevin Brodsky wrote:
>> [...]
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b5fdf32c437f..e6064e00b22d 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -228,27 +228,86 @@ static inline int pmd_dirty(pmd_t pmd)
>> * of the lazy mode. So the implementation must assume preemption may be enabled
>> * and cpu migration is possible; it must take steps to be robust against this.
>> * (In practice, for user PTE updates, the appropriate page table lock(s) are
>> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
>> - * and the mode cannot be used in interrupt context.
>> + * held, but for kernel PTE updates, no lock is held). The mode cannot be used
>> + * in interrupt context.
> "The mode cannot be used in interrupt context"; except it is for arm64. KFENCE
> and/or DEBUG_PAGEALLOC will request the arch to change linear map permissions,
> which will enter lazy mmu (now using the new generic API). This can happen in
> softirq context.
Are you happy with the wording update in patch 12?
>> + *
>> + * The lazy MMU mode is enabled for a given block of code using:
>> + *
>> + * lazy_mmu_mode_enable();
>> + * <code>
>> + * lazy_mmu_mode_disable();
>> + *
>> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
>> + * A nested call to enable() has no functional effect; however disable() causes
>> + * any batched architectural state to be flushed regardless of nesting. After a
>> + * call to disable(), the caller can therefore rely on all previous page table
>> + * modifications to have taken effect, but the lazy MMU mode may still be
>> + * enabled.
>> + *
>> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
>> + * This can be done using:
>> + *
>> + * lazy_mmu_mode_pause();
>> + * <code>
>> + * lazy_mmu_mode_resume();
>> + *
>> + * This sequence must only be used if the lazy MMU mode is already enabled.
>> + * pause() ensures that the mode is exited regardless of the nesting level;
>> + * resume() re-enters the mode at the same nesting level. <code> must not modify
>> + * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
>> + * helpers).
>> + *
>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>> + * currently enabled.
>> */
> Nice documentation!
Thanks!
>> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>> static inline void lazy_mmu_mode_enable(void)
>> {
>> - arch_enter_lazy_mmu_mode();
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>> + /* enable() must not be called while paused */
>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>> +
>> + if (state->nesting_level++ == 0) {
> Hmm... for the arm64 case of calling this in an interrupt, Is it safe?
>
> If a task is calling this function and gets interrupted here, nesting_level==1
> but active==false. The interrupt then calls this function and increments from 1
> to 2 but arch_enter_lazy_mmu_mode() hasn't been called.
>
> More dangerously (I think), when the interrupt handler calls
> lazy_mmu_mode_disable(), it will end up calling arch_flush_lazy_mmu_mode() which
> could be an issue because as far as the arch is concerned, it's not in lazy mode.
>
> The current arm64 implementation works because setting and clearing the thread
> flags is atomic.
>
> Perhaps you need to disable preemption around the if block?
As you found out this is addressed in patch 12, but indeed I hadn't
realised that this patch leaves the generic API in an unsafe situation
w.r.t. interrupts. We at least need to have in_interrupt() checks in the
generic layer by the time we get to this patch.
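For instance, something along these lines (rough sketch only, not necessarily
what the next version will look like):

static inline void lazy_mmu_mode_enable(void)
{
        struct lazy_mmu_state *state = &current->lazy_mmu_state;

        /* The generic helpers become a no-op in interrupt context */
        if (in_interrupt())
                return;

        if (state->nesting_level++ == 0) {
                state->active = true;
                arch_enter_lazy_mmu_mode();
        }
}

with a matching early return in disable()/pause()/resume(), so that neither the
arch callbacks nor the task-local state are ever touched from interrupt context.
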
>> + state->active = true;
>> + arch_enter_lazy_mmu_mode();
>> + }
>> }
>>
>> static inline void lazy_mmu_mode_disable(void)
>> {
>> - arch_leave_lazy_mmu_mode();
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + VM_WARN_ON_ONCE(state->nesting_level == 0);
>> + VM_WARN_ON(!state->active);
>> +
>> + if (--state->nesting_level == 0) {
>> + state->active = false;
>> + arch_leave_lazy_mmu_mode();
>> + } else {
>> + /* Exiting a nested section */
>> + arch_flush_lazy_mmu_mode();
>> + }
>> }
>>
>> static inline void lazy_mmu_mode_pause(void)
>> {
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
> nit: do you need the first condition? I think when nesting_level==0, we expect
> to be !active?
I suppose this should never happen indeed - I was just being extra
defensive.
Either way David suggested allowing pause()/resume() to be called
outside of any section so the next version will bail out on
nesting_level == 0.
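i.e. roughly (sketch only):

static inline void lazy_mmu_mode_pause(void)
{
        struct lazy_mmu_state *state = &current->lazy_mmu_state;

        /* Not inside a lazy_mmu section: nothing to pause */
        if (state->nesting_level == 0)
                return;

        VM_WARN_ON(!state->active);

        state->active = false;
        arch_leave_lazy_mmu_mode();
}

with resume() doing the mirror check before re-entering the mode.
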
>> +
>> + state->active = false;
>> arch_leave_lazy_mmu_mode();
>> }
>>
>> static inline void lazy_mmu_mode_resume(void)
>> {
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + VM_WARN_ON(state->nesting_level == 0 || state->active);
> Similar argument?
>
>> +
>> + state->active = true;
>> arch_enter_lazy_mmu_mode();
>> }
>> #else
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index cbb7340c5866..11566d973f42 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1441,6 +1441,10 @@ struct task_struct {
>>
>> struct page_frag task_frag;
>>
>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>> + struct lazy_mmu_state lazy_mmu_state;
>> +#endif
>> +
>> #ifdef CONFIG_TASK_DELAY_ACCT
>> struct task_delay_info *delays;
>> #endif
>> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
>> return task_index_to_char(task_state_index(tsk));
>> }
>>
>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>> +static inline bool in_lazy_mmu_mode(void)
>> +{
>> + return current->lazy_mmu_state.active;
>> +}
>> +#else
>> +static inline bool in_lazy_mmu_mode(void)
>> +{
>> + return false;
> Just pointing out that this isn't really a correct implementation:
>
> lazy_mmu_mode_enable()
> ASSERT(in_lazy_mmu_mode()) << triggers for arches without lazy mmu
> lazy_mmu_mode_disable()
>
> Although it probably doesn't matter in practice?
I'd say that the expectation is invalid - lazy MMU mode can only be
enabled if the architecture supports it. In fact as you pointed out
above the API may be called in interrupt context but it will have no
effect, so this sequence would always fail in interrupt context.
Worth noting that in_lazy_mmu_mode() is only ever called from arch code
where lazy MMU is implemented. I added the fallback as a matter of
principle, but it isn't actually required.
- Kevin
On 10/11/2025 10:47, Kevin Brodsky wrote:
> On 07/11/2025 14:59, Ryan Roberts wrote:
>> On 29/10/2025 10:09, Kevin Brodsky wrote:
>>> [...]
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index b5fdf32c437f..e6064e00b22d 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -228,27 +228,86 @@ static inline int pmd_dirty(pmd_t pmd)
>>> * of the lazy mode. So the implementation must assume preemption may be enabled
>>> * and cpu migration is possible; it must take steps to be robust against this.
>>> * (In practice, for user PTE updates, the appropriate page table lock(s) are
>>> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
>>> - * and the mode cannot be used in interrupt context.
>>> + * held, but for kernel PTE updates, no lock is held). The mode cannot be used
>>> + * in interrupt context.
>> "The mode cannot be used in interrupt context"; except it is for arm64. KFENCE
>> and/or DEBUG_PAGEALLOC will request the arch to change linear map permissions,
>> which will enter lazy mmu (now using the new generic API). This can happen in
>> softirq context.
>
> Are you happy with the wording update in patch 12?
Yes!
>
>>> + *
>>> + * The lazy MMU mode is enabled for a given block of code using:
>>> + *
>>> + * lazy_mmu_mode_enable();
>>> + * <code>
>>> + * lazy_mmu_mode_disable();
>>> + *
>>> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
>>> + * A nested call to enable() has no functional effect; however disable() causes
>>> + * any batched architectural state to be flushed regardless of nesting. After a
>>> + * call to disable(), the caller can therefore rely on all previous page table
>>> + * modifications to have taken effect, but the lazy MMU mode may still be
>>> + * enabled.
>>> + *
>>> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
>>> + * This can be done using:
>>> + *
>>> + * lazy_mmu_mode_pause();
>>> + * <code>
>>> + * lazy_mmu_mode_resume();
>>> + *
>>> + * This sequence must only be used if the lazy MMU mode is already enabled.
>>> + * pause() ensures that the mode is exited regardless of the nesting level;
>>> + * resume() re-enters the mode at the same nesting level. <code> must not modify
>>> + * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
>>> + * helpers).
>>> + *
>>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>>> + * currently enabled.
>>> */
>> Nice documentation!
>
> Thanks!
>
>>> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>> static inline void lazy_mmu_mode_enable(void)
>>> {
>>> - arch_enter_lazy_mmu_mode();
>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>> +
>>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>>> + /* enable() must not be called while paused */
>>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>>> +
>>> + if (state->nesting_level++ == 0) {
>> Hmm... for the arm64 case of calling this in an interrupt, Is it safe?
>>
>> If a task is calling this function and gets interrupted here, nesting_level==1
>> but active==false. The interrupt then calls this function and increments from 1
>> to 2 but arch_enter_lazy_mmu_mode() hasn't been called.
>>
>> More dangerously (I think), when the interrupt handler calls
>> lazy_mmu_mode_disable(), it will end up calling arch_flush_lazy_mmu_mode() which
>> could be an issue because as far as the arch is concerned, it's not in lazy mode.
>>
>> The current arm64 implementation works because setting and clearing the thread
>> flags is atomic.
>>
>> Perhaps you need to disable preemption around the if block?
>
> As you found out this is addressed in patch 12, but indeed I hadn't
> realised that this patch leaves the generic API in an unsafe situation
> w.r.t. interrupts. We at least need to have in_interrupt() checks in the
> generic layer by the time we get to this patch.
Yeah that should solve it.
>
>>> + state->active = true;
>>> + arch_enter_lazy_mmu_mode();
>>> + }
>>> }
>>>
>>> static inline void lazy_mmu_mode_disable(void)
>>> {
>>> - arch_leave_lazy_mmu_mode();
>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>> +
>>> + VM_WARN_ON_ONCE(state->nesting_level == 0);
>>> + VM_WARN_ON(!state->active);
>>> +
>>> + if (--state->nesting_level == 0) {
>>> + state->active = false;
>>> + arch_leave_lazy_mmu_mode();
>>> + } else {
>>> + /* Exiting a nested section */
>>> + arch_flush_lazy_mmu_mode();
>>> + }
>>> }
>>>
>>> static inline void lazy_mmu_mode_pause(void)
>>> {
>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>> +
>>> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
>> nit: do you need the first condition? I think when nesting_level==0, we expect
>> to be !active?
>
> I suppose this should never happen indeed - I was just being extra
> defensive.
>
> Either way David suggested allowing pause()/resume() to be called
> outside of any section so the next version will bail out on
> nesting_level == 0.
Ignoring my current opinion that we don't need pause/resume at all for now: are
you suggesting that pause/resume will be completely independent of
enable/disable? I think that would be best. So enable/disable increment and
decrement the nesting_level counter regardless of whether we are paused.
nesting_level 0 => 1 enables if not paused; nesting_level 1 => 0 disables if not
paused. pause() disables if nesting_level >= 1; resume() enables if nesting_level >= 1.
Perhaps we also need nested pause/resume? Then you just end up with 2 counters;
enable_count and pause_count. Sorry if this has already been discussed.
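Something like this is the sort of shape I have in mind (completely untested,
just to illustrate the two-counter idea):

struct lazy_mmu_state {
        u8 enable_count;
        u8 pause_count;
};

static inline void lazy_mmu_mode_enable(void)
{
        struct lazy_mmu_state *state = &current->lazy_mmu_state;

        /* Only the outermost enable() enters the mode, and only if not paused */
        if (state->enable_count++ == 0 && state->pause_count == 0)
                arch_enter_lazy_mmu_mode();
}

static inline void lazy_mmu_mode_disable(void)
{
        struct lazy_mmu_state *state = &current->lazy_mmu_state;

        /* While paused the arch is not in lazy mode: nothing to leave/flush */
        if (state->pause_count > 0) {
                state->enable_count--;
                return;
        }

        if (--state->enable_count == 0)
                arch_leave_lazy_mmu_mode();
        else
                arch_flush_lazy_mmu_mode();
}

static inline void lazy_mmu_mode_pause(void)
{
        struct lazy_mmu_state *state = &current->lazy_mmu_state;

        /* Only the outermost pause() leaves the mode, and only if it was entered */
        if (state->pause_count++ == 0 && state->enable_count > 0)
                arch_leave_lazy_mmu_mode();
}

resume() would mirror pause() (re-enter the mode when the last pause() is undone
and enable_count > 0), and in_lazy_mmu_mode() would become
enable_count > 0 && pause_count == 0.
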
>
>>> +
>>> + state->active = false;
>>> arch_leave_lazy_mmu_mode();
>>> }
>>>
>>> static inline void lazy_mmu_mode_resume(void)
>>> {
>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>> +
>>> + VM_WARN_ON(state->nesting_level == 0 || state->active);
>> Similar argument?
>>
>>> +
>>> + state->active = true;
>>> arch_enter_lazy_mmu_mode();
>>> }
>>> #else
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index cbb7340c5866..11566d973f42 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1441,6 +1441,10 @@ struct task_struct {
>>>
>>> struct page_frag task_frag;
>>>
>>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>> + struct lazy_mmu_state lazy_mmu_state;
>>> +#endif
>>> +
>>> #ifdef CONFIG_TASK_DELAY_ACCT
>>> struct task_delay_info *delays;
>>> #endif
>>> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
>>> return task_index_to_char(task_state_index(tsk));
>>> }
>>>
>>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>> +static inline bool in_lazy_mmu_mode(void)
>>> +{
>>> + return current->lazy_mmu_state.active;
>>> +}
>>> +#else
>>> +static inline bool in_lazy_mmu_mode(void)
>>> +{
>>> + return false;
>> Just pointing out that this isn't really a correct implementation:
>>
>> lazy_mmu_mode_enable()
>> ASSERT(in_lazy_mmu_mode()) << triggers for arches without lazy mmu
>> lazy_mmu_mode_disable()
>>
>> Although it probably doesn't matter in practice?
>
> I'd say that the expectation is invalid - lazy MMU mode can only be
> enabled if the architecture supports it. In fact as you pointed out
> above the API may be called in interrupt context but it will have no
> effect, so this sequence would always fail in interrupt context.
Yep, but previously there was no way to query the current state so it didn't
matter. Now you have a query API so it might matter if/when people come along
and use it in unexpected ways.
>
> Worth nothing that in_lazy_mmu_mode() is only ever called from arch code
> where lazy MMU is implemented. I added the fallback as a matter of
> principle, but it isn't actually required.
Yes, I agree that's the intent. I'm just wondering if it's possible to enforce
that only arch code uses this. Perhaps add some docs to explain that it's only
intended for arches that implement lazy_mmu, and don't define it for arches that
don't, which would catch any generic users?
Thanks,
Ryan
>
> - Kevin
On 11/11/2025 10:24, Ryan Roberts wrote:
> [...]
>
>>>> + state->active = true;
>>>> + arch_enter_lazy_mmu_mode();
>>>> + }
>>>> }
>>>>
>>>> static inline void lazy_mmu_mode_disable(void)
>>>> {
>>>> - arch_leave_lazy_mmu_mode();
>>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>> +
>>>> + VM_WARN_ON_ONCE(state->nesting_level == 0);
>>>> + VM_WARN_ON(!state->active);
>>>> +
>>>> + if (--state->nesting_level == 0) {
>>>> + state->active = false;
>>>> + arch_leave_lazy_mmu_mode();
>>>> + } else {
>>>> + /* Exiting a nested section */
>>>> + arch_flush_lazy_mmu_mode();
>>>> + }
>>>> }
>>>>
>>>> static inline void lazy_mmu_mode_pause(void)
>>>> {
>>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>> +
>>>> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
>>> nit: do you need the first condition? I think when nesting_level==0, we expect
>>> to be !active?
>> I suppose this should never happen indeed - I was just being extra
>> defensive.
>>
>> Either way David suggested allowing pause()/resume() to be called
>> outside of any section so the next version will bail out on
>> nesting_level == 0.
> Ignoring my current opinion that we don't need pause/resume at all for now; Are
> you suggesting that pause/resume will be completely independent of
> enable/disable? I think that would be best. So enable/disable increment and
> decrement the nesting_level counter regardless of whether we are paused.
> nesting_level 0 => 1 enables if not paused. nesting_level 1 => 0 disables if not
> paused. pause disables nesting_level >= 1, resume enables if nesting_level >= 1.
This is something else. Currently the rules are:
[A]
// pausing forbidden
enable()
pause()
// pausing/enabling forbidden
resume()
disable()
David suggested allowing:
[B]
pause()
// pausing/enabling forbidden
resume()
Your suggestion is also allowing:
[C]
pause()
// pausing forbidden
enable()
disable()
resume()
> Perhaps we also need nested pause/resume? Then you just end up with 2 counters;
> enable_count and pause_count. Sorry if this has already been discussed.
And finally:
[D]
pause()
pause()
enable()
disable()
resume()
resume()
I don't really mind either way, but I don't see an immediate use for [C]
and [D] - the idea is that the paused section is short and controlled,
not made up of arbitrary calls. A potential downside of allowing [C] and
[D] is that it makes it harder to detect unintended nesting (fewer
VM_WARN assertions). Happy to implement it if this proves useful though.
OTOH the idea behind [B] is that it allows the caller of
pause()/resume() not to care about whether lazy MMU is actually enabled
or not - i.e. the kasan helpers would keep working even if
apply_to_page_range() didn't use lazy MMU any more.
>>>> +
>>>> + state->active = false;
>>>> arch_leave_lazy_mmu_mode();
>>>> }
>>>>
>>>> static inline void lazy_mmu_mode_resume(void)
>>>> {
>>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>> +
>>>> + VM_WARN_ON(state->nesting_level == 0 || state->active);
>>> Similar argument?
>>>
>>>> +
>>>> + state->active = true;
>>>> arch_enter_lazy_mmu_mode();
>>>> }
>>>> #else
>>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>>> index cbb7340c5866..11566d973f42 100644
>>>> --- a/include/linux/sched.h
>>>> +++ b/include/linux/sched.h
>>>> @@ -1441,6 +1441,10 @@ struct task_struct {
>>>>
>>>> struct page_frag task_frag;
>>>>
>>>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>>> + struct lazy_mmu_state lazy_mmu_state;
>>>> +#endif
>>>> +
>>>> #ifdef CONFIG_TASK_DELAY_ACCT
>>>> struct task_delay_info *delays;
>>>> #endif
>>>> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
>>>> return task_index_to_char(task_state_index(tsk));
>>>> }
>>>>
>>>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>>> +static inline bool in_lazy_mmu_mode(void)
>>>> +{
>>>> + return current->lazy_mmu_state.active;
>>>> +}
>>>> +#else
>>>> +static inline bool in_lazy_mmu_mode(void)
>>>> +{
>>>> + return false;
>>> Just pointing out that this isn't really a correct implementation:
>>>
>>> lazy_mmu_mode_enable()
>>> ASSERT(in_lazy_mmu_mode()) << triggers for arches without lazy mmu
>>> lazy_mmu_mode_disable()
>>>
>>> Although it probably doesn't matter in practice?
>> I'd say that the expectation is invalid - lazy MMU mode can only be
>> enabled if the architecture supports it. In fact as you pointed out
>> above the API may be called in interrupt context but it will have no
>> effect, so this sequence would always fail in interrupt context.
> Yep, but previously there was no way to query the current state so it didn't
> matter. Now you have a query API so it might matter if/when people come along
> and use it in unexpected ways.
I suppose the best we can do is document it alongside those helpers
(David has already suggested some documentation, see patch 11).
>> Worth nothing that in_lazy_mmu_mode() is only ever called from arch code
>> where lazy MMU is implemented. I added the fallback as a matter of
>> principle, but it isn't actually required.
> Yes, I agree that's the intent. I'm just wondering if it's possible to enforce
> that only arch code uses this. Perhaps add some docs to explain that it's only
> intended for arches that implement lazy_mmu, and don't define it for arches that
> don't, which would catch any generic users?
Yep sounds like the best option - a lot less risk of misuse if it can't
be called from generic code :) The build would still succeed on arch's
that implement it, but the kernel CI should catch such calls sooner or
later.
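i.e. just drop the generic fallback, something like:

#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
static inline bool in_lazy_mmu_mode(void)
{
        return current->lazy_mmu_state.active;
}
#endif /* deliberately no !CONFIG_ARCH_HAS_LAZY_MMU_MODE fallback */

so that any generic caller fails to build on architectures that don't select
CONFIG_ARCH_HAS_LAZY_MMU_MODE.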
- Kevin
On 11/11/2025 15:56, Kevin Brodsky wrote:
> On 11/11/2025 10:24, Ryan Roberts wrote:
>> [...]
>>
>>>>> + state->active = true;
>>>>> + arch_enter_lazy_mmu_mode();
>>>>> + }
>>>>> }
>>>>>
>>>>> static inline void lazy_mmu_mode_disable(void)
>>>>> {
>>>>> - arch_leave_lazy_mmu_mode();
>>>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>>> +
>>>>> + VM_WARN_ON_ONCE(state->nesting_level == 0);
>>>>> + VM_WARN_ON(!state->active);
>>>>> +
>>>>> + if (--state->nesting_level == 0) {
>>>>> + state->active = false;
>>>>> + arch_leave_lazy_mmu_mode();
>>>>> + } else {
>>>>> + /* Exiting a nested section */
>>>>> + arch_flush_lazy_mmu_mode();
>>>>> + }
>>>>> }
>>>>>
>>>>> static inline void lazy_mmu_mode_pause(void)
>>>>> {
>>>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>>> +
>>>>> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
>>>> nit: do you need the first condition? I think when nesting_level==0, we expect
>>>> to be !active?
>>> I suppose this should never happen indeed - I was just being extra
>>> defensive.
>>>
>>> Either way David suggested allowing pause()/resume() to be called
>>> outside of any section so the next version will bail out on
>>> nesting_level == 0.
>> Ignoring my current opinion that we don't need pause/resume at all for now; Are
>> you suggesting that pause/resume will be completely independent of
>> enable/disable? I think that would be best. So enable/disable increment and
>> decrement the nesting_level counter regardless of whether we are paused.
>> nesting_level 0 => 1 enables if not paused. nesting_level 1 => 0 disables if not
>> paused. pause disables nesting_level >= 1, resume enables if nesting_level >= 1.
>
> This is something else. Currently the rules are:
>
> [A]
>
> // pausing forbidden
> enable()
> pause()
> // pausing/enabling forbidden
> resume()
> disable()
>
> David suggested allowing:
>
> [B]
>
> pause()
> // pausing/enabling forbidden
> resume()
>
> Your suggestion is also allowing:
>
> [C]
>
> pause()
> // pausing forbidden
> enable()
> disable()
> resume()
I think the current kasan_depopulate_vmalloc_pte() path will require [C]
if CONFIG_DEBUG_PAGEALLOC is enabled on arm64. It calls __free_page() while
paused. I guess CONFIG_DEBUG_PAGEALLOC will cause __free_page() ->
debug_pagealloc_unmap_pages() ->->-> update_range_prot() -> lazy_mmu_enable().
Arguably you could move the resume() to before the __free_page(). But it just
illustrates that it's all a bit brittle at the moment...
>
>> Perhaps we also need nested pause/resume? Then you just end up with 2 counters;
>> enable_count and pause_count. Sorry if this has already been discussed.
>
> And finally:
>
> [D]
>
> pause()
> pause()
> enable()
> disable()
> resume()
> resume()
>
> I don't really mind either way, but I don't see an immediate use for [C]
> and [D] - the idea is that the paused section is short and controlled,
> not made up of arbitrary calls.
If my thinking above is correct, then I've already demonstrated that this is not
the case. So I'd be inclined to go with [D] on the basis that it is the most robust.
Keeping 2 nesting counts (enable and pause) feels pretty elegant to me and gives
the fewest opportunities for surprises.
Thanks,
Ryan
> A potential downside of allowing [C] and
> [D] is that it makes it harder to detect unintended nesting (fewer
> VM_WARN assertions). Happy to implement it if this proves useful though.
>
> OTOH the idea behind [B] is that it allows the caller of
> pause()/resume() not to care about whether lazy MMU is actually enabled
> or not - i.e. the kasan helpers would keep working even if
> apply_to_page_range() didn't use lazy MMU any more.
>
>>>>> +
>>>>> + state->active = false;
>>>>> arch_leave_lazy_mmu_mode();
>>>>> }
>>>>>
>>>>> static inline void lazy_mmu_mode_resume(void)
>>>>> {
>>>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>>> +
>>>>> + VM_WARN_ON(state->nesting_level == 0 || state->active);
>>>> Similar argument?
>>>>
>>>>> +
>>>>> + state->active = true;
>>>>> arch_enter_lazy_mmu_mode();
>>>>> }
>>>>> #else
>>>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>>>> index cbb7340c5866..11566d973f42 100644
>>>>> --- a/include/linux/sched.h
>>>>> +++ b/include/linux/sched.h
>>>>> @@ -1441,6 +1441,10 @@ struct task_struct {
>>>>>
>>>>> struct page_frag task_frag;
>>>>>
>>>>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>>>> + struct lazy_mmu_state lazy_mmu_state;
>>>>> +#endif
>>>>> +
>>>>> #ifdef CONFIG_TASK_DELAY_ACCT
>>>>> struct task_delay_info *delays;
>>>>> #endif
>>>>> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
>>>>> return task_index_to_char(task_state_index(tsk));
>>>>> }
>>>>>
>>>>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>>>> +static inline bool in_lazy_mmu_mode(void)
>>>>> +{
>>>>> + return current->lazy_mmu_state.active;
>>>>> +}
>>>>> +#else
>>>>> +static inline bool in_lazy_mmu_mode(void)
>>>>> +{
>>>>> + return false;
>>>> Just pointing out that this isn't really a correct implementation:
>>>>
>>>> lazy_mmu_mode_enable()
>>>> ASSERT(in_lazy_mmu_mode()) << triggers for arches without lazy mmu
>>>> lazy_mmu_mode_disable()
>>>>
>>>> Although it probably doesn't matter in practice?
>>> I'd say that the expectation is invalid - lazy MMU mode can only be
>>> enabled if the architecture supports it. In fact as you pointed out
>>> above the API may be called in interrupt context but it will have no
>>> effect, so this sequence would always fail in interrupt context.
>> Yep, but previously there was no way to query the current state so it didn't
>> matter. Now you have a query API so it might matter if/when people come along
>> and use it in unexpected ways.
>
> I suppose the best we can do is document it alongside those helpers
> (David has already suggested some documentation, see patch 11).
>
>>> Worth nothing that in_lazy_mmu_mode() is only ever called from arch code
>>> where lazy MMU is implemented. I added the fallback as a matter of
>>> principle, but it isn't actually required.
>> Yes, I agree that's the intent. I'm just wondering if it's possible to enforce
>> that only arch code uses this. Perhaps add some docs to explain that it's only
>> intended for arches that implement lazy_mmu, and don't define it for arches that
>> don't, which would catch any generic users?
>
> Yep sounds like the best option - a lot less risk of misuse if it can't
> be called from generic code :) The build would still succeed on arch's
> that implement it, but the kernel CI should catch such calls sooner or
> later.
>
> - Kevin
On 11/11/2025 17:03, Ryan Roberts wrote:
> On 11/11/2025 15:56, Kevin Brodsky wrote:
>> On 11/11/2025 10:24, Ryan Roberts wrote:
>>> [...]
>>>
>>>>>> + state->active = true;
>>>>>> + arch_enter_lazy_mmu_mode();
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> static inline void lazy_mmu_mode_disable(void)
>>>>>> {
>>>>>> - arch_leave_lazy_mmu_mode();
>>>>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>>>> +
>>>>>> + VM_WARN_ON_ONCE(state->nesting_level == 0);
>>>>>> + VM_WARN_ON(!state->active);
>>>>>> +
>>>>>> + if (--state->nesting_level == 0) {
>>>>>> + state->active = false;
>>>>>> + arch_leave_lazy_mmu_mode();
>>>>>> + } else {
>>>>>> + /* Exiting a nested section */
>>>>>> + arch_flush_lazy_mmu_mode();
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> static inline void lazy_mmu_mode_pause(void)
>>>>>> {
>>>>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>>>> +
>>>>>> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
>>>>> nit: do you need the first condition? I think when nesting_level==0, we expect
>>>>> to be !active?
>>>> I suppose this should never happen indeed - I was just being extra
>>>> defensive.
>>>>
>>>> Either way David suggested allowing pause()/resume() to be called
>>>> outside of any section so the next version will bail out on
>>>> nesting_level == 0.
>>> Ignoring my current opinion that we don't need pause/resume at all for now; Are
>>> you suggesting that pause/resume will be completely independent of
>>> enable/disable? I think that would be best. So enable/disable increment and
>>> decrement the nesting_level counter regardless of whether we are paused.
>>> nesting_level 0 => 1 enables if not paused. nesting_level 1 => 0 disables if not
>>> paused. pause disables nesting_level >= 1, resume enables if nesting_level >= 1.
>> This is something else. Currently the rules are:
>>
>> [A]
>>
>> // pausing forbidden
>> enable()
>> pause()
>> // pausing/enabling forbidden
>> resume()
>> disable()
>>
>> David suggested allowing:
>>
>> [B]
>>
>> pause()
>> // pausing/enabling forbidden
>> resume()
>>
>> Your suggestion is also allowing:
>>
>> [C]
>>
>> pause()
>> // pausing forbidden
>> enable()
>> disable()
>> resume()
> I think the current kasan kasan_depopulate_vmalloc_pte() path will require [C]
> if CONFIG_DEBUG_PAGEALLOC is enabled on arm64. It calls __free_page() while
> paused. I guess CONFIG_DEBUG_PAGEALLOC will cause __free_page() ->
> debug_pagealloc_unmap_pages() ->->-> update_range_prot() -> lazy_mmu_enable().
Well, I really should have tried booting with KASAN enabled before...
lazy_mmu_mode_enable() complains exactly as you predicted:
> [ 1.047587] WARNING: CPU: 0 PID: 1 at include/linux/pgtable.h:273
> update_range_prot+0x2dc/0x50c
> [ 1.048025] Modules linked in:
> [ 1.048296] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted
> 6.18.0-rc3-00012-ga901e7f479f1 #142 PREEMPT
> [ 1.048706] Hardware name: FVP Base RevC (DT)
> [ 1.048941] pstate: 11400009 (nzcV daif +PAN -UAO -TCO +DIT -SSBS
> BTYPE=--)
> [ 1.049309] pc : update_range_prot+0x2dc/0x50c
> [ 1.049631] lr : update_range_prot+0x80/0x50c
> [ 1.049950] sp : ffff8000800e6f20
> [ 1.050162] x29: ffff8000800e6fb0 x28: ffff700010014000 x27:
> ffff700010016000
> [ 1.050747] x26: 0000000000000000 x25: 0000000000000001 x24:
> 00000000008800f7
> [ 1.051308] x23: 0000000000000000 x22: fff00008000f7000 x21:
> fff00008003009f8
> [ 1.051884] x20: fff00008000f8000 x19: 1ffff0001001cdea x18:
> ffff800080769000
> [ 1.052469] x17: ffff95c63264ec00 x16: ffff8000800e7504 x15:
> 0000000000000003
> [ 1.053045] x14: ffff95c63482f000 x13: 0000000000000000 x12:
> ffff783ffc0007bf
> [ 1.053620] x11: 1ffff83ffc0007be x10: ffff783ffc0007be x9 :
> dfff800000000000
> [ 1.054203] x8 : fffd80010001f000 x7 : ffffffffffffffff x6 :
> 0000000000000001
> [ 1.054776] x5 : 0000000000000000 x4 : fff00008003009f9 x3 :
> 1ffe00010006013f
> [ 1.055348] x2 : fff0000800300000 x1 : 0000000000000001 x0 :
> 0000000000000000
> [ 1.055912] Call trace:
> [ 1.056100] update_range_prot+0x2dc/0x50c (P)
> [ 1.056478] set_memory_valid+0x44/0x70
> [ 1.056850] __kernel_map_pages+0x68/0xe4
> [ 1.057226] __free_frozen_pages+0x528/0x1180
> [ 1.057601] ___free_pages+0x11c/0x160
> [ 1.057961] __free_pages+0x14/0x20
> [ 1.058307] kasan_depopulate_vmalloc_pte+0xd4/0x184
> [ 1.058748] __apply_to_page_range+0x678/0xda8
> [ 1.059149] apply_to_existing_page_range+0x14/0x20
> [ 1.059553] kasan_release_vmalloc+0x138/0x200
> [ 1.059982] purge_vmap_node+0x1b4/0x8a0
> [ 1.060371] __purge_vmap_area_lazy+0x4f8/0x870
> [ 1.060779] _vm_unmap_aliases+0x488/0x6ec
> [ 1.061176] vm_unmap_aliases+0x1c/0x34
> [ 1.061567] change_memory_common+0x17c/0x380
> [ 1.061949] set_memory_ro+0x18/0x24
> [...]
> Arguably you could move the resume() to before the __free_page(). But it just
> illustrates that it's all a bit brittle at the moment...
Difficult to disagree. With things like DEBUG_PAGEALLOC it becomes very
hard to know what is guaranteed not to use lazy MMU.
>>> Perhaps we also need nested pause/resume? Then you just end up with 2 counters;
>>> enable_count and pause_count. Sorry if this has already been discussed.
>> And finally:
>>
>> [D]
>>
>> pause()
>> pause()
>> enable()
>> disable()
>> resume()
>> resume()
>>
>> I don't really mind either way, but I don't see an immediate use for [C]
>> and [D] - the idea is that the paused section is short and controlled,
>> not made up of arbitrary calls.
> If my thinking above is correct, then I've already demonstrated that this is not
> the case. So I'd be inclined to go with [D] on the basis that it is the most robust.
>
> Keeping 2 nesting counts (enable and pause) feels pretty elegant to me and gives
> the fewest opportunities for surprises.
Agreed, if we're going to allow enable() within a paused section, then
we might as well allow paused sections to nest too. The use-case is
clear, so I'm happy to go ahead and make those changes.
David, any thoughts?
- Kevin
>
> Thanks,
> Ryan
>
>> A potential downside of allowing [C] and
>> [D] is that it makes it harder to detect unintended nesting (fewer
>> VM_WARN assertions). Happy to implement it if this proves useful though.
>>
>> OTOH the idea behind [B] is that it allows the caller of
>> pause()/resume() not to care about whether lazy MMU is actually enabled
>> or not - i.e. the kasan helpers would keep working even if
>> apply_to_page_range() didn't use lazy MMU any more.
>>
>>>>>> +
>>>>>> + state->active = false;
>>>>>> arch_leave_lazy_mmu_mode();
>>>>>> }
>>>>>>
>>>>>> [...]
>>>
>>> I don't really mind either way, but I don't see an immediate use for [C]
>>> and [D] - the idea is that the paused section is short and controlled,
>>> not made up of arbitrary calls.
>> If my thinking above is correct, then I've already demonstrated that this is not
>> the case. So I'd be inclined to go with [D] on the basis that it is the most robust.
>>
>> Keeping 2 nesting counts (enable and pause) feels pretty elegant to me and gives
>> the fewest opportunities for surprises.
>
> Agreed, if we're going to allow enable() within a paused section, then
> we might as well allow paused sections to nest too. The use-case is
> clear, so I'm happy to go ahead and make those changes.
>
> David, any thoughts?

I don't mind allowing nesting of pause(), so works for me.

--
Cheers

David
Kevin Brodsky <kevin.brodsky@arm.com> writes:
> Despite recent efforts to prevent lazy_mmu sections from nesting, it
> remains difficult to ensure that it never occurs - and in fact it
> does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC).
> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
> made nesting tolerable on arm64, but without truly supporting it:
> the inner call to leave() disables the batching optimisation before
> the outer section ends.
>
> This patch actually enables lazy_mmu sections to nest by tracking
> the nesting level in task_struct, in a similar fashion to e.g.
> pagefault_{enable,disable}(). This is fully handled by the generic
> lazy_mmu helpers that were recently introduced.
>
> lazy_mmu sections were not initially intended to nest, so we need to
> clarify the semantics w.r.t. the arch_*_lazy_mmu_mode() callbacks.
> This patch takes the following approach:
>
> * The outermost calls to lazy_mmu_mode_{enable,disable}() trigger
> calls to arch_{enter,leave}_lazy_mmu_mode() - this is unchanged.
>
> * Nested calls to lazy_mmu_mode_{enable,disable}() are not forwarded
> to the arch via arch_{enter,leave} - lazy MMU remains enabled so
> the assumption is that these callbacks are not relevant. However,
> existing code may rely on a call to disable() to flush any batched
> state, regardless of nesting. arch_flush_lazy_mmu_mode() is
> therefore called in that situation.
>
> A separate interface was recently introduced to temporarily pause
> the lazy MMU mode: lazy_mmu_mode_{pause,resume}(). pause() fully
> exits the mode *regardless of the nesting level*, and resume()
> restores the mode at the same nesting level.
>
> Whether the mode is actually enabled or not at any point is tracked
> by a separate "active" field in task_struct; this makes it possible
> to check invariants in the generic API, and to expose a new
> in_lazy_mmu_mode() helper to replace the various ways arch's
> currently track whether the mode is enabled (this will be done in
> later patches).
>
> In summary (nesting/active represent the values *after* the call):
>
> lazy_mmu_mode_enable() -> arch_enter() nesting=1 active=1
> lazy_mmu_mode_enable() -> ø nesting=2 active=1
> lazy_mmu_mode_pause() -> arch_leave() nesting=2 active=0
> lazy_mmu_mode_resume() -> arch_enter() nesting=2 active=1
> lazy_mmu_mode_disable() -> arch_flush() nesting=1 active=1
> lazy_mmu_mode_disable() -> arch_leave() nesting=0 active=0
>
> Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
> headers included by <linux/pgtable.h> to use it.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 12 ------
> include/linux/mm_types_task.h | 5 +++
> include/linux/pgtable.h | 67 ++++++++++++++++++++++++++++++--
> include/linux/sched.h | 16 ++++++++
> 4 files changed, 84 insertions(+), 16 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 54f8d6bb6f22..535435248923 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -82,18 +82,6 @@ static inline void queue_pte_barriers(void)
>
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> - /*
> - * lazy_mmu_mode is not supposed to permit nesting. But in practice this
> - * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
> - * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
> - * permissions on the linear map with apply_to_page_range(), which
> - * re-enters lazy_mmu_mode. So we tolerate nesting in our
> - * implementation. The first call to arch_leave_lazy_mmu_mode() will
> - * flush and clear the flag such that the remainder of the work in the
> - * outer nest behaves as if outside of lazy mmu mode. This is safe and
> - * keeps tracking simple.
> - */
> -
> if (in_interrupt())
> return;
>
> diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
> index a82aa80c0ba4..632d404f8191 100644
> --- a/include/linux/mm_types_task.h
> +++ b/include/linux/mm_types_task.h
> @@ -88,4 +88,9 @@ struct tlbflush_unmap_batch {
> #endif
> };
>
> +struct lazy_mmu_state {
> + u8 nesting_level;
> + bool active;
> +};
> +
> #endif /* _LINUX_MM_TYPES_TASK_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b5fdf32c437f..e6064e00b22d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -228,27 +228,86 @@ static inline int pmd_dirty(pmd_t pmd)
> * of the lazy mode. So the implementation must assume preemption may be enabled
> * and cpu migration is possible; it must take steps to be robust against this.
> * (In practice, for user PTE updates, the appropriate page table lock(s) are
> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> - * and the mode cannot be used in interrupt context.
> + * held, but for kernel PTE updates, no lock is held). The mode cannot be used
> + * in interrupt context.
> + *
> + * The lazy MMU mode is enabled for a given block of code using:
> + *
> + * lazy_mmu_mode_enable();
> + * <code>
> + * lazy_mmu_mode_disable();
> + *
> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
> + * A nested call to enable() has no functional effect; however disable() causes
> + * any batched architectural state to be flushed regardless of nesting. After a
> + * call to disable(), the caller can therefore rely on all previous page table
> + * modifications to have taken effect, but the lazy MMU mode may still be
> + * enabled.
> + *
> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
> + * This can be done using:
> + *
> + * lazy_mmu_mode_pause();
> + * <code>
> + * lazy_mmu_mode_resume();
> + *
> + * This sequence must only be used if the lazy MMU mode is already enabled.
> + * pause() ensures that the mode is exited regardless of the nesting level;
> + * resume() re-enters the mode at the same nesting level. <code> must not modify
> + * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
> + * helpers).
> + *
> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> + * currently enabled.
> */
> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> static inline void lazy_mmu_mode_enable(void)
> {
> - arch_enter_lazy_mmu_mode();
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> + /* enable() must not be called while paused */
> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> +
> + if (state->nesting_level++ == 0) {
> + state->active = true;
> + arch_enter_lazy_mmu_mode();
> + }
> }
Some architectures disable preemption in their
arch_enter_lazy_mmu_mode(). So shouldn't state->active = true
happen after arch_enter_lazy_mmu_mode() has disabled preemption? i.e.
static inline void lazy_mmu_mode_enable(void)
{
- arch_enter_lazy_mmu_mode();
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
+ /* enable() must not be called while paused */
+ VM_WARN_ON(state->nesting_level > 0 && !state->active);
+
+ if (state->nesting_level++ == 0) {
+ arch_enter_lazy_mmu_mode();
+ state->active = true;
+ }
}
... I think it makes more sense to enable the state after the arch_*
call, right?
>
> static inline void lazy_mmu_mode_disable(void)
> {
> - arch_leave_lazy_mmu_mode();
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON_ONCE(state->nesting_level == 0);
> + VM_WARN_ON(!state->active);
> +
> + if (--state->nesting_level == 0) {
> + state->active = false;
> + arch_leave_lazy_mmu_mode();
> + } else {
> + /* Exiting a nested section */
> + arch_flush_lazy_mmu_mode();
> + }
> }
This looks ok though.
>
> static inline void lazy_mmu_mode_pause(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
> +
> + state->active = false;
> arch_leave_lazy_mmu_mode();
> }
>
> static inline void lazy_mmu_mode_resume(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON(state->nesting_level == 0 || state->active);
> +
> + state->active = true;
> arch_enter_lazy_mmu_mode();
> }
Ditto.
-ritesh
> #else
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index cbb7340c5866..11566d973f42 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1441,6 +1441,10 @@ struct task_struct {
>
> struct page_frag task_frag;
>
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> + struct lazy_mmu_state lazy_mmu_state;
> +#endif
> +
> #ifdef CONFIG_TASK_DELAY_ACCT
> struct task_delay_info *delays;
> #endif
> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
> return task_index_to_char(task_state_index(tsk));
> }
>
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return current->lazy_mmu_state.active;
> +}
> +#else
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return false;
> +}
> +#endif
> +
> extern struct pid *cad_pid;
>
> /*
> --
> 2.47.0
On Wed, Nov 05, 2025 at 02:19:03PM +0530, Ritesh Harjani wrote:
> > + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> > + * currently enabled.
> > */
> > #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> > static inline void lazy_mmu_mode_enable(void)
> > {
> > - arch_enter_lazy_mmu_mode();
> > + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> > +
> > + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> > + /* enable() must not be called while paused */
> > + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> > +
> > + if (state->nesting_level++ == 0) {
> > + state->active = true;
> > + arch_enter_lazy_mmu_mode();
> > + }
> > }
>
> Some architectures disables preemption in their
> arch_enter_lazy_mmu_mode(). So shouldn't the state->active = true should
> happen after arch_enter_lazy_mmu_mode() has disabled preemption()? i.e.
Do you have some scenario in mind that could cause an issue?
IOW, what could go wrong if the process is scheduled to another
CPU before preempt_disable() is called?
> static inline void lazy_mmu_mode_enable(void)
> {
> - arch_enter_lazy_mmu_mode();
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> + /* enable() must not be called while paused */
> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> +
> + if (state->nesting_level++ == 0) {
> + arch_enter_lazy_mmu_mode();
> + state->active = true;
> + }
> }
>
> ... I think it make more sense to enable the state after the arch_**
> call right.
But then in_lazy_mmu_mode() would return false if called from
arch_enter_lazy_mmu_mode(). Not a big problem, but still...
> > static inline void lazy_mmu_mode_disable(void)
> > {
> > - arch_leave_lazy_mmu_mode();
> > + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> > +
> > + VM_WARN_ON_ONCE(state->nesting_level == 0);
> > + VM_WARN_ON(!state->active);
> > +
> > + if (--state->nesting_level == 0) {
> > + state->active = false;
> > + arch_leave_lazy_mmu_mode();
> > + } else {
> > + /* Exiting a nested section */
> > + arch_flush_lazy_mmu_mode();
> > + }
> > }
>
> This looks ok though.
>
> >
> > static inline void lazy_mmu_mode_pause(void)
> > {
> > + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> > +
> > + VM_WARN_ON(state->nesting_level == 0 || !state->active);
> > +
> > + state->active = false;
> > arch_leave_lazy_mmu_mode();
> > }
> >
> > static inline void lazy_mmu_mode_resume(void)
> > {
> > + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> > +
> > + VM_WARN_ON(state->nesting_level == 0 || state->active);
> > +
> > + state->active = true;
> > arch_enter_lazy_mmu_mode();
> > }
>
> Ditto.
>
> -ritesh
Alexander Gordeev <agordeev@linux.ibm.com> writes:
> On Wed, Nov 05, 2025 at 02:19:03PM +0530, Ritesh Harjani wrote:
>> > + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>> > + * currently enabled.
>> > */
>> > #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>> > static inline void lazy_mmu_mode_enable(void)
>> > {
>> > - arch_enter_lazy_mmu_mode();
>> > + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> > +
>> > + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>> > + /* enable() must not be called while paused */
>> > + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>> > +
>> > + if (state->nesting_level++ == 0) {
>> > + state->active = true;
>> > + arch_enter_lazy_mmu_mode();
>> > + }
>> > }
>>
>> Some architectures disables preemption in their
>> arch_enter_lazy_mmu_mode(). So shouldn't the state->active = true should
>> happen after arch_enter_lazy_mmu_mode() has disabled preemption()? i.e.
>
> Do you have some scenario in mind that could cause an issue?
>
No, not really. But that's a deviation from what the previous arch hooks were
expecting. Although, thinking this through, I don't have any use case
where this can be a problem.
But let me re-visit some of the code paths on ppc64 lazy mmu...
Looking at the arch-specific use case, I see we always do get_cpu_var()
to access the per-cpu batch array, which disables preemption before
touching the per-cpu structure. This per-cpu structure is where we
batch PTE updates...
For e.g...
arch_enter_lazy_mmu_mode()
hpte_need_flush()
get_cpu_var() // this takes care of preempt_disable()
adds vpns to per-cpu batch[i]
put_cpu_var() //
arch_leave_lazy_mmu_mode()
> IOW, what could go wrong if the process is scheduled to another
> CPU before preempt_disable() is called?
So from the above - I don't think your sequence of updating
state->active = true
before calling the arch_enter hook should be a problem.
Based on the above, this looks mostly ok to me.
-ritesh
On 06/11/2025 16:32, Ritesh Harjani (IBM) wrote:
> Alexander Gordeev <agordeev@linux.ibm.com> writes:
>
>> On Wed, Nov 05, 2025 at 02:19:03PM +0530, Ritesh Harjani wrote:
>>>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>>>> + * currently enabled.
>>>> */
>>>> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>>> static inline void lazy_mmu_mode_enable(void)
>>>> {
>>>> - arch_enter_lazy_mmu_mode();
>>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>> +
>>>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>>>> + /* enable() must not be called while paused */
>>>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>>>> +
>>>> + if (state->nesting_level++ == 0) {
>>>> + state->active = true;
>>>> + arch_enter_lazy_mmu_mode();
>>>> + }
>>>> }
>>> Some architectures disables preemption in their
>>> arch_enter_lazy_mmu_mode(). So shouldn't the state->active = true should
>>> happen after arch_enter_lazy_mmu_mode() has disabled preemption()? i.e.
>> Do you have some scenario in mind that could cause an issue?
>>
> No not really. But that's a deviation from what previous arch hooks were
> expecting. Although thinking this through - I don't have any usecase
> where this can be a problem.
Which arch hook expectations are you referring to?
> But let me re-visit some of the code paths on ppc64 lazy mmu...
>
> Looking at the arch specific usecase I see we always do get_cpu_var()
> for accessing the per-cpu batch array which disables preemption before
> accessing the per-cpu structure.. This per-cpu structure is where we
> batch pte updates...
arch_enter() disables preemption so accesses to per-CPU variables
anywhere in the section shouldn't be an issue either way.
The bigger picture (regarding patch 9) is that what in_lazy_mmu_mode()
returns is based on the current task's state (not a per-CPU variable),
and is always false while in interrupt. As a result, whether preemption is
disabled or not should make no difference; only program order matters.
- Kevin
> For e.g...
>
> arch_enter_lazy_mmu_mode()
> hpte_need_flush()
> get_cpu_var() // this takes care of preempt_disable()
> adds vpns to per-cpu batch[i]
> put_cpu_var() //
> arch_leave_lazy_mmu_mode()
>
>> IOW, what could go wrong if the process is scheduled to another
>> CPU before preempt_disable() is called?
> So from above - I don't think your sequence to update
> state->active = true
> before calling arch_enter hook should be a problem.
> Based on above this looks mostly ok to me.
>
> -ritesh
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> For e.g...
>
> arch_enter_lazy_mmu_mode()
> hpte_need_flush()
> get_cpu_var() // this takes care of preempt_disable()
> adds vpns to per-cpu batch[i]
> put_cpu_var() //
> arch_leave_lazy_mmu_mode()
>
Sorry, here is the more accurate call sequence for the previous email.
caller()...
arch_enter_lazy_mmu_mode()
ptep_xxx_()
pte_update()
hpte_need_flush()
get_cpu_var() // this takes care of preempt_disable()
adds vpns to per-cpu batch[i]
put_cpu_var() //
arch_leave_lazy_mmu_mode()
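For readers less familiar with the ppc64 side, here is a rough C sketch of the
shape of that batching step (simplified, not the actual hpte_need_flush() code;
field names are only indicative):

	static void hpte_batch_sketch(unsigned long vpn)
	{
		/*
		 * get_cpu_var() disables preemption for the duration of the
		 * per-CPU batch update, independently of lazy_mmu_state.
		 */
		struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);

		/* queued; flushed later (batch full or arch_leave()) */
		batch->vpn[batch->index++] = vpn;

		put_cpu_var(ppc64_tlb_batch);	/* re-enables preemption */
	}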
-ritesh
On 05/11/2025 16:12, Alexander Gordeev wrote:
> On Wed, Nov 05, 2025 at 02:19:03PM +0530, Ritesh Harjani wrote:
>>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>>> + * currently enabled.
>>> */
>>> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>> static inline void lazy_mmu_mode_enable(void)
>>> {
>>> - arch_enter_lazy_mmu_mode();
>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>> +
>>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>>> + /* enable() must not be called while paused */
>>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>>> +
>>> + if (state->nesting_level++ == 0) {
>>> + state->active = true;
>>> + arch_enter_lazy_mmu_mode();
>>> + }
>>> }
>> Some architectures disables preemption in their
>> arch_enter_lazy_mmu_mode(). So shouldn't the state->active = true should
>> happen after arch_enter_lazy_mmu_mode() has disabled preemption()? i.e.
> Do you have some scenario in mind that could cause an issue?
> IOW, what could go wrong if the process is scheduled to another
> CPU before preempt_disable() is called?
I'm not sure I understand the issue either.
>> static inline void lazy_mmu_mode_enable(void)
>> {
>> - arch_enter_lazy_mmu_mode();
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>> + /* enable() must not be called while paused */
>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>> +
>> + if (state->nesting_level++ == 0) {
>> + arch_enter_lazy_mmu_mode();
>> + state->active = true;
>> + }
>> }
>>
>> ... I think it make more sense to enable the state after the arch_**
>> call right.
> But then in_lazy_mmu_mode() would return false if called from
> arch_enter_lazy_mmu_mode(). Not big problem, but still..
The ordering of nesting_level/active was the way you expected in v3, but
the conclusion of the discussion with David H [1] is that it doesn't
really matter so I simplified the ordering in v4 - the arch hooks
shouldn't call in_lazy_mmu_mode() or inspect lazy_mmu_state.
arch_enter()/arch_leave() shouldn't need it anyway since they're called
once per outer section (not in nested sections). arch_flush() could
potentially do something different when nested, but that seems unlikely.
- Kevin
[1]
https://lore.kernel.org/all/af4414b6-617c-4dc8-bddc-3ea00d1f6f3b@redhat.com/
On Thu, Nov 06, 2025 at 10:51:43AM +0000, Kevin Brodsky wrote:
> On 05/11/2025 16:12, Alexander Gordeev wrote:
> > On Wed, Nov 05, 2025 at 02:19:03PM +0530, Ritesh Harjani wrote:
> >>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> >>> + * currently enabled.
> >>> */
> >>> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> >>> static inline void lazy_mmu_mode_enable(void)
> >>> {
> >>> - arch_enter_lazy_mmu_mode();
> >>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> >>> +
> >>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> >>> + /* enable() must not be called while paused */
> >>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> >>> +
> >>> + if (state->nesting_level++ == 0) {
> >>> + state->active = true;
> >>> + arch_enter_lazy_mmu_mode();
> >>> + }
> >>> }
> >> Some architectures disables preemption in their
> >> arch_enter_lazy_mmu_mode(). So shouldn't the state->active = true should
> >> happen after arch_enter_lazy_mmu_mode() has disabled preemption()? i.e.
> > Do you have some scenario in mind that could cause an issue?
> > IOW, what could go wrong if the process is scheduled to another
> > CPU before preempt_disable() is called?
>
> I'm not sure I understand the issue either.
>
> >> static inline void lazy_mmu_mode_enable(void)
> >> {
> >> - arch_enter_lazy_mmu_mode();
> >> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> >> +
> >> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> >> + /* enable() must not be called while paused */
> >> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> >> +
> >> + if (state->nesting_level++ == 0) {
> >> + arch_enter_lazy_mmu_mode();
> >> + state->active = true;
> >> + }
> >> }
> >>
> >> ... I think it make more sense to enable the state after the arch_**
> >> call right.
> > But then in_lazy_mmu_mode() would return false if called from
> > arch_enter_lazy_mmu_mode(). Not big problem, but still..
>
> The ordering of nesting_level/active was the way you expected in v3, but
> the conclusion of the discussion with David H [1] is that it doesn't
> really matter so I simplified the ordering in v4 - the arch hooks
> shouldn't call in_lazy_mmu_mode() or inspect lazy_mmu_state.
> arch_enter()/arch_leave() shouldn't need it anyway since they're called
> once per outer section (not in nested sections). arch_flush() could
> potentially do something different when nested, but that seems unlikely.
>
> - Kevin
>
> [1]
> https://lore.kernel.org/all/af4414b6-617c-4dc8-bddc-3ea00d1f6f3b@redhat.com/
I might be misunderstanding this conversation, but it looked to me like a
discussion about the lazy_mmu_state::nesting_level value, not
lazy_mmu_state::active.
I do use the in_lazy_mmu_mode() (lazy_mmu_state::active) check from the
arch callbacks. Here is an example (and likely the only case so far) where
it hits:
static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
void *_data)
{
lazy_mmu_mode_pause();
...
if (likely(pte_none(ptep_get(ptep)))) {
/* Here set_pte() checks whether we are in lazy_mmu mode */
set_pte_at(&init_mm, addr, ptep, pte); <--- calls set_pte()
data->pages[index] = NULL;
}
...
lazy_mmu_mode_resume();
...
}
So without the in_lazy_mmu_mode() check above, the arch-specific set_pte()
implementation enters the wrong branch, which ends up in:
[ 394.503134] Call Trace:
[ 394.503137] [<00007fffe01333f4>] dump_stack_lvl+0xbc/0xf0
[ 394.503143] [<00007fffe010298c>] vpanic+0x1cc/0x418
[ 394.503149] [<00007fffe0102c7a>] panic+0xa2/0xa8
[ 394.503154] [<00007fffe01e7a8a>] check_panic_on_warn+0x8a/0xb0
[ 394.503160] [<00007fffe082d122>] end_report+0x72/0x110
[ 394.503166] [<00007fffe082d3e6>] kasan_report+0xc6/0x100
[ 394.503171] [<00007fffe01b9556>] ipte_batch_ptep_get+0x146/0x150
[ 394.503176] [<00007fffe0830096>] kasan_populate_vmalloc_pte+0xe6/0x1e0
[ 394.503183] [<00007fffe0718050>] apply_to_pte_range+0x1a0/0x570
[ 394.503189] [<00007fffe07260fa>] __apply_to_page_range+0x3ca/0x8f0
[ 394.503195] [<00007fffe0726648>] apply_to_page_range+0x28/0x40
[ 394.503201] [<00007fffe082fe34>] __kasan_populate_vmalloc+0x324/0x340
[ 394.503207] [<00007fffe076954e>] alloc_vmap_area+0x31e/0xbf0
[ 394.503213] [<00007fffe0770106>] __get_vm_area_node+0x1a6/0x2d0
[ 394.503218] [<00007fffe07716fa>] __vmalloc_node_range_noprof+0xba/0x260
[ 394.503224] [<00007fffe0771970>] __vmalloc_node_noprof+0xd0/0x110
[ 394.503229] [<00007fffe0771a22>] vmalloc_noprof+0x32/0x40
[ 394.503234] [<00007fff604eaa42>] full_fit_alloc_test+0xb2/0x3e0 [test_vmalloc]
[ 394.503241] [<00007fff604eb478>] test_func+0x488/0x760 [test_vmalloc]
[ 394.503247] [<00007fffe025ad68>] kthread+0x368/0x630
[ 394.503253] [<00007fffe01391e0>] __ret_from_fork+0xd0/0x490
[ 394.503259] [<00007fffe24e468a>] ret_from_fork+0xa/0x30
I could have cached lazy_mmu_state::active as arch-specific data
and checked that, but then what is the point of having it generalized?
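For the sake of the discussion, the shape of that arch-level check is roughly
as follows (illustrative only - the helper names below are made up, this is
not the actual s390 code):

	static inline void set_pte(pte_t *ptep, pte_t pte)
	{
		if (in_lazy_mmu_mode())
			ipte_batch_add(ptep, pte);	/* hypothetical: defer, flushed later */
		else
			set_pte_direct(ptep, pte);	/* hypothetical: write the PTE immediately */
	}

The only generic piece of information it needs is "is batching currently
allowed?", which is exactly what in_lazy_mmu_mode() answers.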
Thanks!
On 06/11/2025 15:33, Alexander Gordeev wrote:
>> [...]
>>>> static inline void lazy_mmu_mode_enable(void)
>>>> {
>>>> - arch_enter_lazy_mmu_mode();
>>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>> +
>>>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>>>> + /* enable() must not be called while paused */
>>>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>>>> +
>>>> + if (state->nesting_level++ == 0) {
>>>> + arch_enter_lazy_mmu_mode();
>>>> + state->active = true;
>>>> + }
>>>> }
>>>>
>>>> ... I think it make more sense to enable the state after the arch_**
>>>> call right.
>>> But then in_lazy_mmu_mode() would return false if called from
>>> arch_enter_lazy_mmu_mode(). Not big problem, but still..
>> The ordering of nesting_level/active was the way you expected in v3, but
>> the conclusion of the discussion with David H [1] is that it doesn't
>> really matter so I simplified the ordering in v4 - the arch hooks
>> shouldn't call in_lazy_mmu_mode() or inspect lazy_mmu_state.
>> arch_enter()/arch_leave() shouldn't need it anyway since they're called
>> once per outer section (not in nested sections). arch_flush() could
>> potentially do something different when nested, but that seems unlikely.
>>
>> - Kevin
>>
>> [1]
>> https://lore.kernel.org/all/af4414b6-617c-4dc8-bddc-3ea00d1f6f3b@redhat.com/
> I might be misunderstand this conversation, but it looked to me as a discussion
> about lazy_mmu_state::nesting_level value, not lazy_mmu_state::active.
>
> I do use in_lazy_mmu_mode() (lazy_mmu_state::active) check from the arch-
> callbacks. Here is the example (and likely the only case so far) where it hits:
Sorry I didn't mean arch callbacks in general, I meant the ones called
from lazy_mmu_mode_*, that is arch_*_lazy_mmu_mode.
Patch 8 also makes use of in_lazy_mmu_mode() in set_pte() et al. on arm64.
- Kevin
> static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> void *_data)
> {
> lazy_mmu_mode_pause();
> ...
> if (likely(pte_none(ptep_get(ptep)))) {
>
> /* Here set_pte() checks whether we are in lazy_mmu mode */
> set_pte_at(&init_mm, addr, ptep, pte); <--- calls set_pte()
> data->pages[index] = NULL;
> }
> ...
> lazy_mmu_mode_resume();
> ...
> }
>
> So without in_lazy_mmu_mode() check above the arch-specific set_pte()
> implementation enters a wrong branch, which ends up in:
>
> [ 394.503134] Call Trace:
> [ 394.503137] [<00007fffe01333f4>] dump_stack_lvl+0xbc/0xf0
> [ 394.503143] [<00007fffe010298c>] vpanic+0x1cc/0x418
> [ 394.503149] [<00007fffe0102c7a>] panic+0xa2/0xa8
> [ 394.503154] [<00007fffe01e7a8a>] check_panic_on_warn+0x8a/0xb0
> [ 394.503160] [<00007fffe082d122>] end_report+0x72/0x110
> [ 394.503166] [<00007fffe082d3e6>] kasan_report+0xc6/0x100
> [ 394.503171] [<00007fffe01b9556>] ipte_batch_ptep_get+0x146/0x150
> [ 394.503176] [<00007fffe0830096>] kasan_populate_vmalloc_pte+0xe6/0x1e0
> [ 394.503183] [<00007fffe0718050>] apply_to_pte_range+0x1a0/0x570
> [ 394.503189] [<00007fffe07260fa>] __apply_to_page_range+0x3ca/0x8f0
> [ 394.503195] [<00007fffe0726648>] apply_to_page_range+0x28/0x40
> [ 394.503201] [<00007fffe082fe34>] __kasan_populate_vmalloc+0x324/0x340
> [ 394.503207] [<00007fffe076954e>] alloc_vmap_area+0x31e/0xbf0
> [ 394.503213] [<00007fffe0770106>] __get_vm_area_node+0x1a6/0x2d0
> [ 394.503218] [<00007fffe07716fa>] __vmalloc_node_range_noprof+0xba/0x260
> [ 394.503224] [<00007fffe0771970>] __vmalloc_node_noprof+0xd0/0x110
> [ 394.503229] [<00007fffe0771a22>] vmalloc_noprof+0x32/0x40
> [ 394.503234] [<00007fff604eaa42>] full_fit_alloc_test+0xb2/0x3e0 [test_vmalloc]
> [ 394.503241] [<00007fff604eb478>] test_func+0x488/0x760 [test_vmalloc]
> [ 394.503247] [<00007fffe025ad68>] kthread+0x368/0x630
> [ 394.503253] [<00007fffe01391e0>] __ret_from_fork+0xd0/0x490
> [ 394.503259] [<00007fffe24e468a>] ret_from_fork+0xa/0x30
>
> I could have cached lazy_mmu_state::active as arch-specific data
> and check it, but then what is the point to have it generalized?
>
> Thanks!
> static inline void lazy_mmu_mode_pause(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
> +
> + state->active = false;
> arch_leave_lazy_mmu_mode();
Just one question:
Don't we want to allow for pause/resume when not enabled? Would seem
valid to me, because pause/resume code should actually not worry about
that, right?
if (!state->nesting_level) {
VM_WARN_ON(state->active);
return;
}
VM_WARN_ON(!state->active);
state->active = false;
arch_leave_lazy_mmu_mode();
--
Cheers
David / dhildenb
On 01/11/2025 12:22, David Hildenbrand wrote:
>
>> static inline void lazy_mmu_mode_pause(void)
>> {
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
>> +
>> + state->active = false;
>> arch_leave_lazy_mmu_mode();
>
> Just one question:
>
> Don't we want to allow for pause/resume when not enabled? Would seem
> valid to me, because pause/resume code should actually not worry about
> that, right?
This does sound sensible, thanks for the suggestion. The initial goal
was to allow functions that know they're called with lazy MMU enabled to
be able to pause it temporarily if they need batching disabled. But we
could generalise this to: if you know batching would break things, then
you can preemptively add a pause/resume pair, and it won't do anything
unless you're called with lazy MMU enabled.
I also like that this removes an invalid usage situation - now as
long as you have balanced enable/disable and pause/resume calls, you're
good. Will make that change in v5.
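For illustration, a hypothetical caller using such a preemptive pair (assuming
the relaxed v5 semantics, where pause()/resume() become no-ops outside a lazy
MMU section):

	static void sample_fixup_pte(unsigned long addr, pte_t *ptep, pte_t pte)
	{
		lazy_mmu_mode_pause();			/* no-op if lazy MMU is not enabled (v5) */
		set_pte_at(&init_mm, addr, ptep, pte);	/* must take effect immediately, not batched */
		lazy_mmu_mode_resume();			/* returns to the caller's previous state */
	}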
- Kevin
>
> if (!state->nesting_level) {
> VM_WARN_ON(state->active);
> return;
> }
> VM_WARN_ON(!state->active);
> state->active = false;
> arch_leave_lazy_mmu_mode();
>
On Wed, Oct 29, 2025 at 10:09:04AM +0000, Kevin Brodsky wrote:
Hi Kevin,
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return current->lazy_mmu_state.active;
Wouldn't (nesting_level > 0) be a more correct check?
Otherwise, it returns false while in paused mode.
Maybe check both nesting_level and active, and also introduce
in_lazy_mmu_paused_mode() right away to avoid any confusion?
> +}
> +#else
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return false;
> +}
> +#endif
> +
> extern struct pid *cad_pid;
>
> /*
Thanks!
On 29/10/2025 17:41, Alexander Gordeev wrote:
> On Wed, Oct 29, 2025 at 10:09:04AM +0000, Kevin Brodsky wrote:
>
> Hi Kevin,
>
>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>> +static inline bool in_lazy_mmu_mode(void)
>> +{
>> + return current->lazy_mmu_state.active;
> Whether (nesting_level > 0) is more correct check?
> Otherwise, it returns false while in paused mode.
That's exactly the intention. Lazy MMU is disabled while paused. The
users of that helper want to know if lazy MMU is currently enabled (to
decide whether to batch updates for instance); whether this is because
we are paused or not in any lazy_mmu section (nesting_level == 0) makes
no difference.
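To spell out the intended behaviour of the helpers in this patch:

	lazy_mmu_mode_enable();		/* in_lazy_mmu_mode() == true  */
	lazy_mmu_mode_pause();		/* in_lazy_mmu_mode() == false - do not batch here */
	lazy_mmu_mode_resume();		/* in_lazy_mmu_mode() == true  */
	lazy_mmu_mode_disable();	/* in_lazy_mmu_mode() == false */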
> May be check both nesting_level and active and also introduce
> in_lazy_mmu_paused_mode() right away to avoid any confusion?
Can you think of any situation where a caller would specifically want to
know that lazy MMU is paused?
- Kevin
On Thu, Oct 30, 2025 at 11:28:53AM +0100, Kevin Brodsky wrote:
> On 29/10/2025 17:41, Alexander Gordeev wrote:
> > On Wed, Oct 29, 2025 at 10:09:04AM +0000, Kevin Brodsky wrote:
> >
> > Hi Kevin,
> >
> >> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> >> +static inline bool in_lazy_mmu_mode(void)
> >> +{
> >> + return current->lazy_mmu_state.active;
> > Whether (nesting_level > 0) is more correct check?
> > Otherwise, it returns false while in paused mode.
>
> That's exactly the intention. Lazy MMU is disabled while paused. The
> users of that helper want to know if lazy MMU is currently enabled (to
> decide whether to batch updates for instance); whether this is because
> we are paused or not in any lazy_mmu section (nesting_level == 0) makes
> no difference.
>
> > May be check both nesting_level and active and also introduce
> > in_lazy_mmu_paused_mode() right away to avoid any confusion?
>
> Can you think of any situation where a caller would specifically want to
> know that lazy MMU is paused?
I thought I did, but in_lazy_mmu_mode() alone works just fine,
as you described (at least for now).
> - Kevin
Thanks!