[RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements

[RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements
Posted by Ryan Roberts 5 months ago
Hi All,

I recently added support for lazy mmu mode on arm64. The series is now in
Linus's tree so should be in v6.16-rc1. But during testing in linux-next we
found some ugly corners (unexpected nesting). I was able to fix those issues by
making the arm64 implementation more permissive (like the other arches). But
this is quite fragile IMHO. So I'd rather fix the root cause and ensure that
lazy mmu mode never nests, and more importantly, that code never makes pgtable
modifications expecting them to be immediate, not knowing that it's actually in
lazy mmu mode so the changes get deferred.

The first 2 patches are unrelated, very obvious bug fixes. They don't affect
arm64 because arm64 only uses lazy mmu for kernel mappings. But I noticed them
during code review and think they should be fixed.
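
To illustrate the general shape of those fixes (a hedged sketch with
illustrative names, not the actual patch code): the pte update and the TLB
maintenance must both be complete before anyone can rely on the old
translation being gone, e.g. before the page table lock is dropped:

	pte = ptep_get_and_clear(mm, addr, ptep);
	/* Flush while the ptl is still held, not after dropping it. */
	flush_tlb_range(vma, addr, addr + PAGE_SIZE);
	pte_unmap_unlock(ptep, ptl);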

The next 3 patches are aimed at solving the nesting issue.

And the final patch is reverting the "permissive" fix I did for arm64, which is
no longer needed after the previous 3 patches.
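
For reference, the query helper introduced by patch 4 can be trivial on an
arch that already tracks the mode with a thread flag. A minimal sketch for
arm64, assuming its TIF_LAZY_MMU flag (other arches would provide their own
definitions):

	static inline bool arch_in_lazy_mmu_mode(void)
	{
		return test_thread_flag(TIF_LAZY_MMU);
	}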

I've labelled this RFC for now because it depends on the arm64 lazy mmu patches
in Linus's master, so it won't apply to mm-unstable. But I'm keen to get review,
and since I'm touching various arches and modifying some core mm stuff, that
might take a while, so I thought I'd beat the rush and get a first version out
early.

I've build-tested all the affected arches. And I've run mm selftests for the
arm64 build, with no issues (with DEBUG_PAGEALLOC and KFENCE enabled).

Applies against Linus's master branch (f66bc387efbe).

Thanks,
Ryan


Ryan Roberts (6):
  fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in
    pagemap_scan_pmd_entry()
  mm: Fix pte update and tlb maintenance ordering in
    migrate_vma_collect_pmd()
  mm: Avoid calling page allocator from apply_to_page_range()
  mm: Introduce arch_in_lazy_mmu_mode()
  mm: Avoid calling page allocator while in lazy mmu mode
  Revert "arm64/mm: Permit lazy_mmu_mode to be nested"

 arch/arm64/include/asm/pgtable.h              | 22 ++++----
 .../include/asm/book3s/64/tlbflush-hash.h     | 15 ++++++
 arch/sparc/include/asm/tlbflush_64.h          |  1 +
 arch/sparc/mm/tlb.c                           | 12 +++++
 arch/x86/include/asm/paravirt.h               |  5 ++
 arch/x86/include/asm/paravirt_types.h         |  1 +
 arch/x86/kernel/paravirt.c                    |  6 +++
 arch/x86/xen/mmu_pv.c                         |  6 +++
 fs/proc/task_mmu.c                            |  3 +-
 include/asm-generic/tlb.h                     |  2 +
 include/linux/mm.h                            |  6 +++
 include/linux/pgtable.h                       |  1 +
 kernel/bpf/arena.c                            |  6 +--
 mm/kasan/shadow.c                             |  2 +-
 mm/memory.c                                   | 54 ++++++++++++++-----
 mm/migrate_device.c                           |  3 +-
 mm/mmu_gather.c                               | 15 ++++++
 17 files changed, 128 insertions(+), 32 deletions(-)

--
2.43.0
Re: [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements
Posted by Lorenzo Stoakes 5 months ago
+cc Jann who is a specialist in all things page table-y and especially scary
edge cases :)

On Fri, May 30, 2025 at 03:04:38PM +0100, Ryan Roberts wrote:
> Hi All,
>
> I recently added support for lazy mmu mode on arm64. The series is now in
> Linus's tree so should be in v6.16-rc1. But during testing in linux-next we
> found some ugly corners (unexpected nesting). I was able to fix those issues by
> making the arm64 implementation more permissive (like the other arches). But
> this is quite fragile IMHO. So I'd rather fix the root cause and ensure that
> lazy mmu mode never nests, and more importantly, that code never makes pgtable
> modifications expecting them to be immediate, not knowing that it's actually in
> lazy mmu mode so the changes get deferred.

When you say fragile, are you confident it _works_ but perhaps not quite as well
as you want? Or are you concerned this might be broken upstream in any way?

I am thinking specifically about the proposed use in Dev's new series [0] and
obviously hoping (and assuming in fact) that it's the former :)

[0]: https://lore.kernel.org/linux-mm/20250530090407.19237-1-dev.jain@arm.com/

>
> The first 2 patches are unrelated, very obvious bug fixes. They don't affect
> arm64 because arm64 only uses lazy mmu for kernel mappings. But I noticed them
> during code review and think they should be fixed.
>
> The next 3 patches are aimed at solving the nesting issue.
>
> And the final patch is reverting the "permissive" fix I did for arm64, which is
> no longer needed after the previous 3 patches.
>
> I've labelled this RFC for now because it depends on the arm64 lazy mmu patches
> in Linus's master, so it won't apply to mm-unstable. But I'm keen to get review,
> and since I'm touching various arches and modifying some core mm stuff, that
> might take a while, so I thought I'd beat the rush and get a first version out
> early.
>
> I've build-tested all the affected arches. And I've run mm selftests for the
> arm64 build, with no issues (with DEBUG_PAGEALLOC and KFENCE enabled).
>
> Applies against Linus's master branch (f66bc387efbe).
>
> Thanks,
> Ryan
>
>
> Ryan Roberts (6):
>   fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in
>     pagemap_scan_pmd_entry()
>   mm: Fix pte update and tlb maintenance ordering in
>     migrate_vma_collect_pmd()
>   mm: Avoid calling page allocator from apply_to_page_range()
>   mm: Introduce arch_in_lazy_mmu_mode()
>   mm: Avoid calling page allocator while in lazy mmu mode
>   Revert "arm64/mm: Permit lazy_mmu_mode to be nested"
>
>  arch/arm64/include/asm/pgtable.h              | 22 ++++----
>  .../include/asm/book3s/64/tlbflush-hash.h     | 15 ++++++
>  arch/sparc/include/asm/tlbflush_64.h          |  1 +
>  arch/sparc/mm/tlb.c                           | 12 +++++
>  arch/x86/include/asm/paravirt.h               |  5 ++
>  arch/x86/include/asm/paravirt_types.h         |  1 +
>  arch/x86/kernel/paravirt.c                    |  6 +++
>  arch/x86/xen/mmu_pv.c                         |  6 +++
>  fs/proc/task_mmu.c                            |  3 +-
>  include/asm-generic/tlb.h                     |  2 +
>  include/linux/mm.h                            |  6 +++
>  include/linux/pgtable.h                       |  1 +
>  kernel/bpf/arena.c                            |  6 +--
>  mm/kasan/shadow.c                             |  2 +-
>  mm/memory.c                                   | 54 ++++++++++++++-----
>  mm/migrate_device.c                           |  3 +-
>  mm/mmu_gather.c                               | 15 ++++++
>  17 files changed, 128 insertions(+), 32 deletions(-)
>
> --
> 2.43.0
>
Re: [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements
Posted by Ryan Roberts 5 months ago
On 30/05/2025 15:47, Lorenzo Stoakes wrote:
> +cc Jann who is a specialist in all things page table-y and especially scary
> edge cases :)
> 
> On Fri, May 30, 2025 at 03:04:38PM +0100, Ryan Roberts wrote:
>> Hi All,
>>
>> I recently added support for lazy mmu mode on arm64. The series is now in
>> Linus's tree so should be in v6.16-rc1. But during testing in linux-next we
>> found some ugly corners (unexpected nesting). I was able to fix those issues by
>> making the arm64 implementation more permissive (like the other arches). But
>> this is quite fragile IMHO. So I'd rather fix the root cause and ensure that
>> lazy mmu mode never nests, and more importantly, that code never makes pgtable
>> modifications expecting them to be immediate, not knowing that it's actually in
>> lazy mmu mode so the changes get deferred.
> 
> When you say fragile, are you confident it _works_ but perhaps not quite as well
> as you want? Or are you concerned this might be broken upstream in any way?

I'm confident that it _works_ for arm64 as it is, upstream. But if Dev's series
were to go in _without_ the lazy_mmu bracketing in some manner, then it would
be broken if the config includes CONFIG_DEBUG_PAGEALLOC.

There's a lot more explanation in the later patches as to how it can be broken,
but for arm64, the situation is currently like this, because our implementation
of __change_memory_common() uses apply_to_page_range() which implicitly starts
an inner lazy_mmu_mode. We enter multiple times, but we exit on the first call
to exit. Everything works correctly, but it's not optimal because C is no longer
deferred:

arch_enter_lazy_mmu_mode()                        << outer lazy mmu region
  <do some pte changes (A)>
  alloc_pages()
    debug_pagealloc_map_pages()
      __kernel_map_pages()
        __change_memory_common()
          arch_enter_lazy_mmu_mode()              << inner lazy mmu region
            <change kernel pte to make valid (B)>
          arch_leave_lazy_mmu_mode()              << exit; complete A + B
    clear_page()
  <do some more pte changes (C)>                  << no longer in lazy mode
arch_leave_lazy_mmu_mode()                        << nop
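
For reference, the permissive behaviour that produces the above boils down to
roughly this (a paraphrase rather than the literal arm64 code;
flush_pending_updates() is a stand-in for the real barrier/flush logic):

	static inline void arch_enter_lazy_mmu_mode(void)
	{
		/* Nested enters are tolerated; the flag is simply set again. */
		set_thread_flag(TIF_LAZY_MMU);
	}

	static inline void arch_leave_lazy_mmu_mode(void)
	{
		/* Apply any deferred pte updates... */
		flush_pending_updates();
		/* ...and exit on the *first* leave, even if enters nested. */
		clear_thread_flag(TIF_LAZY_MMU);
	}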

An alternative implementation would not add the nested lazy mmu mode, so we end
up with this:

arch_enter_lazy_mmu_mode()                        << outer lazy mmu region
  <do some pte changes (A)>
  alloc_pages()
    debug_pagealloc_map_pages()
      __kernel_map_pages()
        __change_memory_common()
            <change kernel pte to make valid (B)> << deferred due to lazy mmu
    clear_page()                                  << BANG! B has not been actioned
  <do some more pte changes (C)>
arch_leave_lazy_mmu_mode()

This is clearly a much worse outcome. It's not happening today but it could in
future. That's why I'm claiming it's fragile. It's much better (IMHO) to
disallow calling the page allocator when in lazy mmu mode.
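
Concretely, that amounts to something like the below near the top of the
allocation path (a sketch; the exact hook point and form in patch 5 may
differ), using the arch_in_lazy_mmu_mode() helper from patch 4:

	/* Allocating now could trigger pte changes that get deferred. */
	VM_WARN_ON_ONCE(arch_in_lazy_mmu_mode());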

I won't speak for other arches; there may be more or less potential impact for them.

> 
> I am thinking specifically about the proposed use in Dev's new series [0] and
> obviously hoping (and assuming in fact) that it's the former :)

Dev's changes aren't directly related to this, but if a version were accepted
that didn't include the lazy mmu bracketing, that would cause non-obvious issues.

Hope that helps?

Thanks,
Ryan

> 
> [0]: https://lore.kernel.org/linux-mm/20250530090407.19237-1-dev.jain@arm.com/
> 
>>
>> The first 2 patches are unrelated, very obvious bug fixes. They don't affect
>> arm64 because arm64 only uses lazy mmu for kernel mappings. But I noticed them
>> during code review and think they should be fixed.
>>
>> The next 3 patches are aimed at solving the nesting issue.
>>
>> And the final patch is reverting the "permissive" fix I did for arm64, which is
>> no longer needed after the previous 3 patches.
>>
>> I've labelled this RFC for now because it depends on the arm64 lazy mmu patches
>> in Linus's master, so it won't apply to mm-unstable. But I'm keen to get review,
>> and since I'm touching various arches and modifying some core mm stuff, that
>> might take a while, so I thought I'd beat the rush and get a first version out
>> early.
>>
>> I've build-tested all the affected arches. And I've run mm selftests for the
>> arm64 build, with no issues (with DEBUG_PAGEALLOC and KFENCE enabled).
>>
>> Applies against Linus's master branch (f66bc387efbe).
>>
>> Thanks,
>> Ryan
>>
>>
>> Ryan Roberts (6):
>>   fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in
>>     pagemap_scan_pmd_entry()
>>   mm: Fix pte update and tlb maintenance ordering in
>>     migrate_vma_collect_pmd()
>>   mm: Avoid calling page allocator from apply_to_page_range()
>>   mm: Introduce arch_in_lazy_mmu_mode()
>>   mm: Avoid calling page allocator while in lazy mmu mode
>>   Revert "arm64/mm: Permit lazy_mmu_mode to be nested"
>>
>>  arch/arm64/include/asm/pgtable.h              | 22 ++++----
>>  .../include/asm/book3s/64/tlbflush-hash.h     | 15 ++++++
>>  arch/sparc/include/asm/tlbflush_64.h          |  1 +
>>  arch/sparc/mm/tlb.c                           | 12 +++++
>>  arch/x86/include/asm/paravirt.h               |  5 ++
>>  arch/x86/include/asm/paravirt_types.h         |  1 +
>>  arch/x86/kernel/paravirt.c                    |  6 +++
>>  arch/x86/xen/mmu_pv.c                         |  6 +++
>>  fs/proc/task_mmu.c                            |  3 +-
>>  include/asm-generic/tlb.h                     |  2 +
>>  include/linux/mm.h                            |  6 +++
>>  include/linux/pgtable.h                       |  1 +
>>  kernel/bpf/arena.c                            |  6 +--
>>  mm/kasan/shadow.c                             |  2 +-
>>  mm/memory.c                                   | 54 ++++++++++++++-----
>>  mm/migrate_device.c                           |  3 +-
>>  mm/mmu_gather.c                               | 15 ++++++
>>  17 files changed, 128 insertions(+), 32 deletions(-)
>>
>> --
>> 2.43.0
>>
Re: [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements
Posted by Mike Rapoport 5 months ago
Hi Ryan,

On Fri, May 30, 2025 at 04:55:36PM +0100, Ryan Roberts wrote:
> On 30/05/2025 15:47, Lorenzo Stoakes wrote:
> > +cc Jann who is a specialist in all things page table-y and especially scary
> > edge cases :)
> > 
> > On Fri, May 30, 2025 at 03:04:38PM +0100, Ryan Roberts wrote:
> >> Hi All,
> >>
> >> I recently added support for lazy mmu mode on arm64. The series is now in
> >> Linus's tree so should be in v6.16-rc1. But during testing in linux-next we
> >> found some ugly corners (unexpected nesting). I was able to fix those issues by
> >> making the arm64 implementation more permissive (like the other arches). But
> >> this is quite fragile IMHO. So I'd rather fix the root cause and ensure that
> >> lazy mmu mode never nests, and more importantly, that code never makes pgtable
> >> modifications expecting them to be immediate, not knowing that it's actually in
> >> lazy mmu mode so the changes get deferred.
> > 
> > When you say fragile, are you confident it _works_ but perhaps not quite as well
> > as you want? Or are you concerned this might be broken upstream in any way?
> 
> I'm confident that it _works_ for arm64 as it is, upstream. But if Dev's series
> were to go in _without_ the lazy_mmu bracketing in some manner, then it would
> be broken if the config includes CONFIG_DEBUG_PAGEALLOC.
> 
> There's a lot more explanation in the later patches as to how it can be broken,
> but for arm64, the situation is currently like this, because our implementation
> of __change_memory_common() uses apply_to_page_range() which implicitly starts
> an inner lazy_mmu_mode. We enter multiple times, but we exit on the first call
> to exit. Everything works correctly, but it's not optimal because C is no longer
> deferred:
> 
> arch_enter_lazy_mmu_mode()                        << outer lazy mmu region
>   <do some pte changes (A)>
>   alloc_pages()
>     debug_pagealloc_map_pages()
>       __kernel_map_pages()
>         __change_memory_common()
>           arch_enter_lazy_mmu_mode()              << inner lazy mmu region
>             <change kernel pte to make valid (B)>
>           arch_leave_lazy_mmu_mode()              << exit; complete A + B
>     clear_page()
>   <do some more pte changes (C)>                  << no longer in lazy mode
> arch_leave_lazy_mmu_mode()                        << nop
> 
> An alternative implementation would not add the nested lazy mmu mode, so we end
> up with this:
> 
> arch_enter_lazy_mmu_mode()                        << outer lazy mmu region
>   <do some pte changes (A)>
>   alloc_pages()
>     debug_pagealloc_map_pages()
>       __kernel_map_pages()
>         __change_memory_common()
>             <change kernel pte to make valid (B)> << deferred due to lazy mmu
>     clear_page()                                  << BANG! B has not been actioned
>   <do some more pte changes (C)>
> arch_leave_lazy_mmu_mode()
> 
> This is clearly a much worse outcome. It's not happening today but it could in
> future. That's why I'm claiming it's fragile. It's much better (IMHO) to
> disallow calling the page allocator when in lazy mmu mode.

First, I think this should be handled completely inside arch/arm64. Page
allocation has worked in lazy mmu mode on other architectures; there's no
reason it should change because of the way arm64 implements lazy mmu.

Second, DEBUG_PAGEALLOC already implies that performance is bad: for it to be
useful the kernel should be mapped with base pages, and there's a map/unmap
for every page allocation, so optimizing a few pte changes (C in your
example) won't matter much.

If there's a potential correctness issue with Dev's patches, it should be
dealt with as part of those patches, with the necessary updates to how
lazy mmu is implemented on arm64 and used in pageattr.c.

And it seems to me that adding something along the lines below to
__kernel_map_pages() would solve the DEBUG_PAGEALLOC issue:

void __kernel_map_pages(struct page *page, int numpages, int enable)
{
	unsigned long flags;
	bool lazy_mmu = false;

	if (!can_set_direct_map())
		return;

	/* Remember whether the caller was already in lazy mmu mode. */
	flags = read_thread_flags();
	if (flags & BIT(TIF_LAZY_MMU))
		lazy_mmu = true;

	/*
	 * set_memory_valid() enters and leaves lazy mmu mode internally
	 * (via apply_to_page_range()), and the leave clears TIF_LAZY_MMU.
	 */
	set_memory_valid((unsigned long)page_address(page), numpages, enable);

	/* Restore the caller's lazy mmu state. */
	if (lazy_mmu)
		set_thread_flag(TIF_LAZY_MMU);
}

> Thanks,
> Ryan

-- 
Sincerely yours,
Mike.
Re: [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements
Posted by Ryan Roberts 5 months ago
On 31/05/2025 08:46, Mike Rapoport wrote:
> Hi Ryan,
> 
> On Fri, May 30, 2025 at 04:55:36PM +0100, Ryan Roberts wrote:
>> On 30/05/2025 15:47, Lorenzo Stoakes wrote:
>>> +cc Jann who is a specialist in all things page table-y and especially scary
>>> edge cases :)
>>>
>>> On Fri, May 30, 2025 at 03:04:38PM +0100, Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>> I recently added support for lazy mmu mode on arm64. The series is now in
>>>> Linus's tree so should be in v6.16-rc1. But during testing in linux-next we
>>>> found some ugly corners (unexpected nesting). I was able to fix those issues by
>>>> making the arm64 implementation more permissive (like the other arches). But
>>>> this is quite fragile IMHO. So I'd rather fix the root cause and ensure that
>>>> lazy mmu mode never nests, and more importantly, that code never makes pgtable
>>>> modifications expecting them to be immediate, not knowing that it's actually in
>>>> lazy mmu mode so the changes get deferred.
>>>
>>> When you say fragile, are you confident it _works_ but perhaps not quite as well
>>> as you want? Or are you concerned this might be broken upstream in any way?
>>
>> I'm confident that it _works_ for arm64 as it is, upstream. But if Dev's series
>> were to go in _without_ the lazy_mmu bracketing in some manner, then it would
>> be broken if the config includes CONFIG_DEBUG_PAGEALLOC.
>>
>> There's a lot more explanation in the later patches as to how it can be broken,
>> but for arm64, the situation is currently like this, because our implementation
>> of __change_memory_common() uses apply_to_page_range() which implicitly starts
>> an inner lazy_mmu_mode. We enter multiple times, but we exit on the first call
>> to exit. Everything works correctly, but it's not optimal because C is no longer
>> deferred:
>>
>> arch_enter_lazy_mmu_mode()                        << outer lazy mmu region
>>   <do some pte changes (A)>
>>   alloc_pages()
>>     debug_pagealloc_map_pages()
>>       __kernel_map_pages()
>>         __change_memory_common()
>>           arch_enter_lazy_mmu_mode()              << inner lazy mmu region
>>             <change kernel pte to make valid (B)>
>>           arch_leave_lazy_mmu_mode()              << exit; complete A + B
>>     clear_page()
>>   <do some more pte changes (C)>                  << no longer in lazy mode
>> arch_leave_lazy_mmu_mode()                        << nop
>>
>> An alternative implementation would not add the nested lazy mmu mode, so we end
>> up with this:
>>
>> arch_enter_lazy_mmu_mode()                        << outer lazy mmu region
>>   <do some pte changes (A)>
>>   alloc_pages()
>>     debug_pagealloc_map_pages()
>>       __kernel_map_pages()
>>         __change_memory_common()
>>             <change kernel pte to make valid (B)> << deferred due to lazy mmu
>>     clear_page()                                  << BANG! B has not been actioned
>>   <do some more pte changes (C)>
>> arch_leave_lazy_mmu_mode()
>>
>> This is clearly a much worse outcome. It's not happening today but it could in
>> future. That's why I'm claiming it's fragile. It's much better (IMHO) to
>> disallow calling the page allocator when in lazy mmu mode.
> 
> First, I think this should be handled completely inside arch/arm64. Page
> allocation has worked in lazy mmu mode on other architectures; there's no
> reason it should change because of the way arm64 implements lazy mmu.
> 
> Second, DEBUG_PAGEALLOC already implies that performance is bad: for it to be
> useful the kernel should be mapped with base pages, and there's a map/unmap
> for every page allocation, so optimizing a few pte changes (C in your
> example) won't matter much.
> 
> If there's a potential correctness issue with Dev's patches, it should be
> dealt with as part of those patches, with the necessary updates to how
> lazy mmu is implemented on arm64 and used in pageattr.c.
> 
> And it seems to me that adding something along the lines below to
> __kernel_map_pages() would solve the DEBUG_PAGEALLOC issue:
> 
> void __kernel_map_pages(struct page *page, int numpages, int enable)
> {
> 	unsigned long flags;
> 	bool lazy_mmu = false;
> 
> 	if (!can_set_direct_map())
> 		return;
> 
> 	flags = read_thread_flags();
> 	if (flags & BIT(TIF_LAZY_MMU))
> 		lazy_mmu = true;
> 
> 	set_memory_valid((unsigned long)page_address(page), numpages, enable);
> 
> 	if (lazy_mmu)
> 		set_thread_flag(TIF_LAZY_MMU);
> }

Hi Mike,

I've thought about this for a bit, and concluded that you are totally right.
This is a much smaller, arm64-contained patch. Sorry for the noise here, and
thanks for the suggestion.

Thanks,
Ryan


> 
>> Thanks,
>> Ryan
>