[PATCH v2 0/5] Fix lazy mmu mode

Ryan Roberts posted 5 patches 8 months ago
Failed in applying to current master (apply log)
There is a newer version of this series
arch/sparc/include/asm/pgtable_64.h   |  2 --
arch/sparc/mm/tlb.c                   |  5 ++++-
arch/x86/include/asm/xen/hypervisor.h | 15 ++-------------
arch/x86/xen/enlighten_pv.c           |  1 -
fs/proc/task_mmu.c                    | 11 ++++-------
include/linux/pgtable.h               | 14 ++++++++------
6 files changed, 18 insertions(+), 30 deletions(-)
[PATCH v2 0/5] Fix lazy mmu mode
Posted by Ryan Roberts 8 months ago
Hi All,

I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part
of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table
walkers. While lazy mmu mode is already used for kernel mappings in a few
places, this will extend it's use significantly.

Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86,
it looks like there are a bunch of bugs, some of which may be more likely to
trigger once I extend the use of lazy mmu. So this series attempts to clarify
the requirements and fix all the bugs in advance of that series. See patch #1
commit log for all the details.

Note that I have only been able to compile test these changes but I think they
are in good enough shape for some linux-next testing.

Applies on Friday's mm-unstable (5f089a9aa987), as I assume this would be
preferred via that tree.

Changes since v1
================
  - split v1 patch #1 into v2 patch #1 and #2; per David
  - Added Acked-by tags from David and Andreas; Thanks!
  - Refined the patches which are truely fixes and added to stable to cc

Thanks,
Ryan

Ryan Roberts (5):
  mm: Fix lazy mmu docs and usage
  fs/proc/task_mmu: Reduce scope of lazy mmu region
  sparc/mm: Disable preemption in lazy mmu mode
  sparc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
  Revert "x86/xen: allow nesting of same lazy mode"

 arch/sparc/include/asm/pgtable_64.h   |  2 --
 arch/sparc/mm/tlb.c                   |  5 ++++-
 arch/x86/include/asm/xen/hypervisor.h | 15 ++-------------
 arch/x86/xen/enlighten_pv.c           |  1 -
 fs/proc/task_mmu.c                    | 11 ++++-------
 include/linux/pgtable.h               | 14 ++++++++------
 6 files changed, 18 insertions(+), 30 deletions(-)

--
2.43.0
Re: [PATCH v2 0/5] Fix lazy mmu mode
Posted by Jürgen Groß 8 months ago
On 03.03.25 15:15, Ryan Roberts wrote:
> Hi All,
> 
> I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part
> of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table
> walkers. While lazy mmu mode is already used for kernel mappings in a few
> places, this will extend it's use significantly.
> 
> Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86,
> it looks like there are a bunch of bugs, some of which may be more likely to
> trigger once I extend the use of lazy mmu. So this series attempts to clarify
> the requirements and fix all the bugs in advance of that series. See patch #1
> commit log for all the details.
> 
> Note that I have only been able to compile test these changes but I think they
> are in good enough shape for some linux-next testing.
> 
> Applies on Friday's mm-unstable (5f089a9aa987), as I assume this would be
> preferred via that tree.
> 
> Changes since v1
> ================
>    - split v1 patch #1 into v2 patch #1 and #2; per David
>    - Added Acked-by tags from David and Andreas; Thanks!
>    - Refined the patches which are truely fixes and added to stable to cc
> 
> Thanks,
> Ryan
> 
> Ryan Roberts (5):
>    mm: Fix lazy mmu docs and usage
>    fs/proc/task_mmu: Reduce scope of lazy mmu region
>    sparc/mm: Disable preemption in lazy mmu mode
>    sparc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
>    Revert "x86/xen: allow nesting of same lazy mode"
> 
>   arch/sparc/include/asm/pgtable_64.h   |  2 --
>   arch/sparc/mm/tlb.c                   |  5 ++++-
>   arch/x86/include/asm/xen/hypervisor.h | 15 ++-------------
>   arch/x86/xen/enlighten_pv.c           |  1 -
>   fs/proc/task_mmu.c                    | 11 ++++-------
>   include/linux/pgtable.h               | 14 ++++++++------
>   6 files changed, 18 insertions(+), 30 deletions(-)

For the series:

Acked-by: Juergen Gross <jgross@suse.com>


Juergen
Re: [PATCH v2 0/5] Fix lazy mmu mode
Posted by Alexander Gordeev 6 months, 3 weeks ago
On Mon, Mar 03, 2025 at 02:15:34PM +0000, Ryan Roberts wrote:

Hi Ryan,

> I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part
> of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table
> walkers. While lazy mmu mode is already used for kernel mappings in a few
> places, this will extend it's use significantly.
> 
> Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86,
> it looks like there are a bunch of bugs, some of which may be more likely to
> trigger once I extend the use of lazy mmu.

Do you have any idea about generic code issues as result of not adhering to
the originally stated requirement:

  /*
   ...
   * the PTE updates which happen during this window.  Note that using this
   * interface requires that read hazards be removed from the code.  A read
   * hazard could result in the direct mode hypervisor case, since the actual
   * write to the page tables may not yet have taken place, so reads though
   * a raw PTE pointer after it has been modified are not guaranteed to be
   * up to date.
   ...
   */

I tried to follow few code paths and at least this one does not look so good:

copy_pte_range(..., src_pte, ...)
	ret = copy_nonpresent_pte(..., src_pte, ...)
		try_restore_exclusive_pte(..., src_pte, ...)	// is_device_exclusive_entry(entry)
			restore_exclusive_pte(..., ptep, ...)
				set_pte_at(..., ptep, ...)
					set_pte(ptep, pte);	// save in lazy mmu mode

	// ret == -ENOENT

	ptent = ptep_get(src_pte);				// lazy mmu save is not observed
	ret = copy_present_ptes(..., ptent, ...);		// wrong ptent used

I am not aware whether the effort to "read hazards be removed from the code"
has ever been made and the generic code is safe in this regard.

What is your take on this?

Thanks!
Re: [PATCH v2 0/5] Fix lazy mmu mode
Posted by Ryan Roberts 6 months, 3 weeks ago
On 10/04/2025 17:07, Alexander Gordeev wrote:
> On Mon, Mar 03, 2025 at 02:15:34PM +0000, Ryan Roberts wrote:
> 
> Hi Ryan,
> 
>> I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part
>> of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table
>> walkers. While lazy mmu mode is already used for kernel mappings in a few
>> places, this will extend it's use significantly.
>>
>> Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86,
>> it looks like there are a bunch of bugs, some of which may be more likely to
>> trigger once I extend the use of lazy mmu.
> 
> Do you have any idea about generic code issues as result of not adhering to
> the originally stated requirement:
> 
>   /*
>    ...
>    * the PTE updates which happen during this window.  Note that using this
>    * interface requires that read hazards be removed from the code.  A read
>    * hazard could result in the direct mode hypervisor case, since the actual
>    * write to the page tables may not yet have taken place, so reads though
>    * a raw PTE pointer after it has been modified are not guaranteed to be
>    * up to date.
>    ...
>    */
> 
> I tried to follow few code paths and at least this one does not look so good:
> 
> copy_pte_range(..., src_pte, ...)
> 	ret = copy_nonpresent_pte(..., src_pte, ...)
> 		try_restore_exclusive_pte(..., src_pte, ...)	// is_device_exclusive_entry(entry)
> 			restore_exclusive_pte(..., ptep, ...)
> 				set_pte_at(..., ptep, ...)
> 					set_pte(ptep, pte);	// save in lazy mmu mode
> 
> 	// ret == -ENOENT
> 
> 	ptent = ptep_get(src_pte);				// lazy mmu save is not observed
> 	ret = copy_present_ptes(..., ptent, ...);		// wrong ptent used
> 
> I am not aware whether the effort to "read hazards be removed from the code"
> has ever been made and the generic code is safe in this regard.
> 
> What is your take on this?

Hmm, that looks like a bug to me, at least based on the stated requirements.
Although this is not a "read through a raw PTE *pointer*", it is a ptep_get().
The arch code can override that so I guess it has an opportunity to flush. But I
don't think any arches are currently doing that.

Probably the simplest fix is to add arch_flush_lazy_mmu_mode() before the
ptep_get()?

It won't be a problem in practice for arm64, since the pgtables are always
updated immediately. I just want to use these hooks to defer/batch barriers in
certain cases.

And this is a pre-existing issue for the arches that use lazy mmu with
device-exclusive mappings, which my extending lazy mmu into vmalloc won't
exacerbate.

Would you be willing/able to submit a fix?

Thanks,
Ryan


> 
> Thanks!
Re: [PATCH v2 0/5] Fix lazy mmu mode
Posted by Alexander Gordeev 6 months, 3 weeks ago
On Mon, Apr 14, 2025 at 02:22:53PM +0100, Ryan Roberts wrote:
> On 10/04/2025 17:07, Alexander Gordeev wrote:
> >> I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part
> >> of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table
> >> walkers. While lazy mmu mode is already used for kernel mappings in a few
> >> places, this will extend it's use significantly.
> >>
> >> Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86,
> >> it looks like there are a bunch of bugs, some of which may be more likely to
> >> trigger once I extend the use of lazy mmu.
> > 
> > Do you have any idea about generic code issues as result of not adhering to
> > the originally stated requirement:
> > 
> >   /*
> >    ...
> >    * the PTE updates which happen during this window.  Note that using this
> >    * interface requires that read hazards be removed from the code.  A read
> >    * hazard could result in the direct mode hypervisor case, since the actual
> >    * write to the page tables may not yet have taken place, so reads though
> >    * a raw PTE pointer after it has been modified are not guaranteed to be
> >    * up to date.
> >    ...
> >    */
> > 
> > I tried to follow few code paths and at least this one does not look so good:
> > 
> > copy_pte_range(..., src_pte, ...)
> > 	ret = copy_nonpresent_pte(..., src_pte, ...)
> > 		try_restore_exclusive_pte(..., src_pte, ...)	// is_device_exclusive_entry(entry)
> > 			restore_exclusive_pte(..., ptep, ...)
> > 				set_pte_at(..., ptep, ...)
> > 					set_pte(ptep, pte);	// save in lazy mmu mode
> > 
> > 	// ret == -ENOENT
> > 
> > 	ptent = ptep_get(src_pte);				// lazy mmu save is not observed
> > 	ret = copy_present_ptes(..., ptent, ...);		// wrong ptent used
> > 
> > I am not aware whether the effort to "read hazards be removed from the code"
> > has ever been made and the generic code is safe in this regard.
> > 
> > What is your take on this?
> 
> Hmm, that looks like a bug to me, at least based on the stated requirements.
> Although this is not a "read through a raw PTE *pointer*", it is a ptep_get().
> The arch code can override that so I guess it has an opportunity to flush. But I
> don't think any arches are currently doing that.
> 
> Probably the simplest fix is to add arch_flush_lazy_mmu_mode() before the
> ptep_get()?

Which would completely revert the very idea of the lazy mmu mode?
(As one would flush on every PTE page table iteration).

> It won't be a problem in practice for arm64, since the pgtables are always
> updated immediately. I just want to use these hooks to defer/batch barriers in
> certain cases.
> 
> And this is a pre-existing issue for the arches that use lazy mmu with
> device-exclusive mappings, which my extending lazy mmu into vmalloc won't
> exacerbate.
> 
> Would you be willing/able to submit a fix?

Well, we have a dozen of lazy mmu cases and I would guess it is not the
only piece of code that seems affected. I was thinking about debug feature
that could help spotting all troubled locations.

Then we could assess and decide if it is feasible to fix. Just turning the
code above into the PTE read-modify-update pattern is quite an exercise...

> Thanks,
> Ryan
Re: [PATCH v2 0/5] Fix lazy mmu mode
Posted by Ryan Roberts 6 months, 3 weeks ago
On 14/04/2025 15:04, Alexander Gordeev wrote:
> On Mon, Apr 14, 2025 at 02:22:53PM +0100, Ryan Roberts wrote:
>> On 10/04/2025 17:07, Alexander Gordeev wrote:
>>>> I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part
>>>> of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table
>>>> walkers. While lazy mmu mode is already used for kernel mappings in a few
>>>> places, this will extend it's use significantly.
>>>>
>>>> Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86,
>>>> it looks like there are a bunch of bugs, some of which may be more likely to
>>>> trigger once I extend the use of lazy mmu.
>>>
>>> Do you have any idea about generic code issues as result of not adhering to
>>> the originally stated requirement:
>>>
>>>   /*
>>>    ...
>>>    * the PTE updates which happen during this window.  Note that using this
>>>    * interface requires that read hazards be removed from the code.  A read
>>>    * hazard could result in the direct mode hypervisor case, since the actual
>>>    * write to the page tables may not yet have taken place, so reads though
>>>    * a raw PTE pointer after it has been modified are not guaranteed to be
>>>    * up to date.
>>>    ...
>>>    */
>>>
>>> I tried to follow few code paths and at least this one does not look so good:
>>>
>>> copy_pte_range(..., src_pte, ...)
>>> 	ret = copy_nonpresent_pte(..., src_pte, ...)
>>> 		try_restore_exclusive_pte(..., src_pte, ...)	// is_device_exclusive_entry(entry)
>>> 			restore_exclusive_pte(..., ptep, ...)
>>> 				set_pte_at(..., ptep, ...)
>>> 					set_pte(ptep, pte);	// save in lazy mmu mode
>>>
>>> 	// ret == -ENOENT
>>>
>>> 	ptent = ptep_get(src_pte);				// lazy mmu save is not observed
>>> 	ret = copy_present_ptes(..., ptent, ...);		// wrong ptent used
>>>
>>> I am not aware whether the effort to "read hazards be removed from the code"
>>> has ever been made and the generic code is safe in this regard.
>>>
>>> What is your take on this?
>>
>> Hmm, that looks like a bug to me, at least based on the stated requirements.
>> Although this is not a "read through a raw PTE *pointer*", it is a ptep_get().
>> The arch code can override that so I guess it has an opportunity to flush. But I
>> don't think any arches are currently doing that.
>>
>> Probably the simplest fix is to add arch_flush_lazy_mmu_mode() before the
>> ptep_get()?
> 
> Which would completely revert the very idea of the lazy mmu mode?
> (As one would flush on every PTE page table iteration).

Well yes, but this is a pretty rare path, I'm guessing?

> 
>> It won't be a problem in practice for arm64, since the pgtables are always
>> updated immediately. I just want to use these hooks to defer/batch barriers in
>> certain cases.
>>
>> And this is a pre-existing issue for the arches that use lazy mmu with
>> device-exclusive mappings, which my extending lazy mmu into vmalloc won't
>> exacerbate.
>>
>> Would you be willing/able to submit a fix?
> 
> Well, we have a dozen of lazy mmu cases and I would guess it is not the
> only piece of code that seems affected. I was thinking about debug feature
> that could help spotting all troubled locations.
> 
> Then we could assess and decide if it is feasible to fix. Just turning the
> code above into the PTE read-modify-update pattern is quite an exercise...
> 
>> Thanks,
>> Ryan