[PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT
Posted by David Hildenbrand 9 months ago
On top of mm-unstable.

VM_PAT annoyed me too much and wasted too much of my time, let's clean
PAT handling up and remove VM_PAT.

This should sort out various issues with VM_PAT we discovered recently,
and will hopefully make the whole code more stable and easier to maintain.

In essence: we stop letting PAT mode mess with VMAs and instead lift
what to track/untrack to the MM core. We remember per VMA which pfn range
we tracked in a new struct we attach to a VMA (we have space without
exceeding 192 bytes), use a kref to share it among VMAs during
split/mremap/fork, and automatically untrack once the kref drops to 0.

This implies that we'll keep tracking a full pfn range even after partially
unmapping it, until fully unmapping it; but as that case was mostly broken
before, this at least makes it work in a way that is least intrusive to
VMA handling.

Shrinking with mremap() used to work in a hacky way, now we'll similarly
keep the original pfn range tracked even after this form of partial unmap.
Does anybody care about that? Unlikely. If we run into issues, we could
likely handle that (adjust the tracking) when our kref drops to 1 while
freeing a VMA. But it adds more complexity, so avoid that for now.

Briefly tested with the new pfnmap selftests [1].

[1] https://lkml.kernel.org/r/20250509153033.952746-1-david@redhat.com

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Peter Xu <peterx@redhat.com>

v1 -> v2:
* "mm: convert track_pfn_insert() to pfnmap_setup_cachemode*()"
 -> Call it "pfnmap_setup_cachemode()" and improve the documentation
 -> Add pfnmap_setup_cachemode_pfn()
 -> Keep checking a single PFN for PMD/PUD case and document why it's ok
* Merged memremap conversion patch with pfnmap_track() introduction patch
 -> Improve documentation
* "mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()"
 -> Adjust to code changes in mm-unstable
* Added "x86/mm/pat: inline memtype_match() into memtype_erase()"
* "mm/io-mapping: track_pfn() -> "pfnmap tracking""
 -> Adjust to code changes in mm-unstable

David Hildenbrand (11):
  x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode()
  mm: convert track_pfn_insert() to pfnmap_setup_cachemode*()
  mm: introduce pfnmap_track() and pfnmap_untrack() and use them for
    memremap
  mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  x86/mm/pat: remove old pfnmap tracking interface
  mm: remove VM_PAT
  x86/mm/pat: remove strict_prot parameter from reserve_pfn_range()
  x86/mm/pat: remove MEMTYPE_*_MATCH
  x86/mm/pat: inline memtype_match() into memtype_erase()
  drm/i915: track_pfn() -> "pfnmap tracking"
  mm/io-mapping: track_pfn() -> "pfnmap tracking"

 arch/x86/mm/pat/memtype.c          | 194 ++++-------------------------
 arch/x86/mm/pat/memtype_interval.c |  63 ++--------
 drivers/gpu/drm/i915/i915_mm.c     |   4 +-
 include/linux/mm.h                 |   4 +-
 include/linux/mm_inline.h          |   2 +
 include/linux/mm_types.h           |  11 ++
 include/linux/pgtable.h            | 127 ++++++++++---------
 include/trace/events/mmflags.h     |   4 +-
 mm/huge_memory.c                   |   5 +-
 mm/io-mapping.c                    |   2 +-
 mm/memory.c                        |  86 ++++++++++---
 mm/memremap.c                      |   8 +-
 mm/mmap.c                          |   5 -
 mm/mremap.c                        |   4 -
 mm/vma_init.c                      |  50 ++++++++
 15 files changed, 242 insertions(+), 327 deletions(-)


base-commit: c68cfbc5048ede4b10a1d3fe16f7f6192fc2c9c8
-- 
2.49.0
Re: [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT
Posted by Liam R. Howlett 8 months, 4 weeks ago
* David Hildenbrand <david@redhat.com> [250512 08:34]:
> On top of mm-unstable.
> 
> VM_PAT annoyed me too much and wasted too much of my time, let's clean
> PAT handling up and remove VM_PAT.
> 
> This should sort out various issues with VM_PAT we discovered recently,
> and will hopefully make the whole code more stable and easier to maintain.
> 
> In essence: we stop letting PAT mode mess with VMAs and instead lift
> what to track/untrack to the MM core. We remember per VMA which pfn range
> we tracked in a new struct we attach to a VMA (we have space without
> exceeding 192 bytes), use a kref to share it among VMAs during
> split/mremap/fork, and automatically untrack once the kref drops to 0.

What you do here seems to be decouple the vma start/end addresses by
abstracting them into another allocated ref counted struct.  This is
close to what we do with the anon vma name..

It took a while to understand the underlying interval tree tracking of
this change, but I think it's as good as it was.  IIRC, there was a
shrinking and matching to the end address in the interval tree, but I
failed to find that commit and code - maybe it never made it upstream.
I was able to find a thread about splitting [1], so maybe I'm mistaken.

> 
> This implies that we'll keep tracking a full pfn range even after partially
> unmapping it, until fully unmapping it; but as that case was mostly broken
> before, this at least makes it work in a way that is least intrusive to
> VMA handling.
> 
> Shrinking with mremap() used to work in a hacky way, now we'll similarly
> keep the original pfn range tracked even after this form of partial unmap.
> Does anybody care about that? Unlikely. If we run into issues, we could
> likely handle that (adjust the tracking) when our kref drops to 1 while
> freeing a VMA. But it adds more complexity, so avoid that for now.

The decoupling of the vma and ref counted range means that we could beef
up the backend to support actually tracking the correct range, which
would be nice.. but I have very little desire to work on that.


[1] https://lore.kernel.org/all/5jrd43vusvcchpk2x6mouighkfhamjpaya5fu2cvikzaieg5pq@wqccwmjs4ian/

> 
> Briefly tested with the new pfnmap selftests [1].
> 
> [1] https://lkml.kernel.org/r/20250509153033.952746-1-david@redhat.com

oh yes, that's still a pr_info() log.  I think that should be a pr_err()
at least?

> 
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Tvrtko Ursulin <tursulin@ursulin.net>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Cc: Peter Xu <peterx@redhat.com>
> 
> v1 -> v2:
> * "mm: convert track_pfn_insert() to pfnmap_setup_cachemode*()"
>  -> Call it "pfnmap_setup_cachemode()" and improve the documentation
>  -> Add pfnmap_setup_cachemode_pfn()
>  -> Keep checking a single PFN for PMD/PUD case and document why it's ok
> * Merged memremap conversion patch with pfnmap_track() introduction patch
>  -> Improve documentation
> * "mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()"
>  -> Adjust to code changes in mm-unstable
> * Added "x86/mm/pat: inline memtype_match() into memtype_erase()"
> * "mm/io-mapping: track_pfn() -> "pfnmap tracking""
>  -> Adjust to code changes in mm-unstable
> 
> David Hildenbrand (11):
>   x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode()
>   mm: convert track_pfn_insert() to pfnmap_setup_cachemode*()
>   mm: introduce pfnmap_track() and pfnmap_untrack() and use them for
>     memremap
>   mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
>   x86/mm/pat: remove old pfnmap tracking interface
>   mm: remove VM_PAT
>   x86/mm/pat: remove strict_prot parameter from reserve_pfn_range()
>   x86/mm/pat: remove MEMTYPE_*_MATCH
>   x86/mm/pat: inline memtype_match() into memtype_erase()
>   drm/i915: track_pfn() -> "pfnmap tracking"
>   mm/io-mapping: track_pfn() -> "pfnmap tracking"
> 
>  arch/x86/mm/pat/memtype.c          | 194 ++++-------------------------
>  arch/x86/mm/pat/memtype_interval.c |  63 ++--------
>  drivers/gpu/drm/i915/i915_mm.c     |   4 +-
>  include/linux/mm.h                 |   4 +-
>  include/linux/mm_inline.h          |   2 +
>  include/linux/mm_types.h           |  11 ++
>  include/linux/pgtable.h            | 127 ++++++++++---------
>  include/trace/events/mmflags.h     |   4 +-
>  mm/huge_memory.c                   |   5 +-
>  mm/io-mapping.c                    |   2 +-
>  mm/memory.c                        |  86 ++++++++++---
>  mm/memremap.c                      |   8 +-
>  mm/mmap.c                          |   5 -
>  mm/mremap.c                        |   4 -
>  mm/vma_init.c                      |  50 ++++++++
>  15 files changed, 242 insertions(+), 327 deletions(-)
> 
> 
> base-commit: c68cfbc5048ede4b10a1d3fe16f7f6192fc2c9c8
> -- 
> 2.49.0
>
Re: [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT
Posted by David Hildenbrand 8 months, 4 weeks ago
On 13.05.25 17:53, Liam R. Howlett wrote:
> * David Hildenbrand <david@redhat.com> [250512 08:34]:
>> On top of mm-unstable.
>>
>> VM_PAT annoyed me too much and wasted too much of my time, let's clean
>> PAT handling up and remove VM_PAT.
>>
>> This should sort out various issues with VM_PAT we discovered recently,
>> and will hopefully make the whole code more stable and easier to maintain.
>>
>> In essence: we stop letting PAT mode mess with VMAs and instead lift
>> what to track/untrack to the MM core. We remember per VMA which pfn range
>> we tracked in a new struct we attach to a VMA (we have space without
>> exceeding 192 bytes), use a kref to share it among VMAs during
>> split/mremap/fork, and automatically untrack once the kref drops to 0.
> 
> What you do here seems to decouple the vma start/end addresses by
> abstracting them into another allocated ref counted struct.  This is
> close to what we do with the anon vma name..

Yes, inspired by that.

> 
> It took a while to understand the underlying interval tree tracking of
> this change, but I think it's as good as it was.  IIRC, there was a
> shrinking and matching to the end address in the interval tree, but I
> failed to find that commit and code - maybe it never made it upstream.
> I was able to find a thread about splitting [1], so maybe I'm mistaken.

There was hidden code that kept memremap() shrinking working 
(adjusting the tracked range).

The leftovers are removed in patch #8.

See below.

> 
>>
>> This implies that we'll keep tracking a full pfn range even after partially
>> unmapping it, until fully unmapping it; but as that case was mostly broken
>> before, this at least makes it work in a way that is least intrusive to
>> VMA handling.
>>
>> Shrinking with mremap() used to work in a hacky way, now we'll similarly
>> keep the original pfn range tracked even after this form of partial unmap.
>> Does anybody care about that? Unlikely. If we run into issues, we could
>> likely handle that (adjust the tracking) when our kref drops to 1 while
>> freeing a VMA. But it adds more complexity, so avoid that for now.
> 
> The decoupling of the vma and ref counted range means that we could beef
> up the backend to support actually tracking the correct range, which
> would be nice.. 

Right, in patch #4 I have

"
This change implies that we'll keep tracking the original PFN range even 
after splitting + partially unmapping it: not too bad, because it was 
not working reliably before. The only thing that kind-of worked before 
was shrinking such a mapping using mremap(): we managed to adjust the 
reservation in a hacky way, now we won't adjust the reservation but
leave it around until all involved VMAs are gone.

If that ever turns out to be an issue, we could hook into VM splitting
code and split the tracking; however, that adds complexity that might
not be required, so we'll keep it simple for now.
"

Duplicating/moving/forking VMAs is now definitely better than before.

Splitting is also arguably better than before -- even a simple partial 
munmap() [1] is currently problematic, unless we're munmapping the last 
part of a VMA (-> shrinking).

Implementing splitting properly is a bit complicated if the pfnmap ctx 
has more than one ref, but it could be added if ever really required.

[1] https://lkml.kernel.org/r/20250509153033.952746-1-david@redhat.com

 > but I have very little desire to work on that.

Yep :)

> 
> 
> [1] https://lore.kernel.org/all/5jrd43vusvcchpk2x6mouighkfhamjpaya5fu2cvikzaieg5pq@wqccwmjs4ian/
> 
>>
>> Briefly tested with the new pfnmap selftests [1].
>>
>> [1] https://lkml.kernel.org/r/20250509153033.952746-1-david@redhat.com
> 
> oh yes, that's still a pr_info() log.  I think that should be a pr_err()
> at least?

I was wondering if that should actually be a WARN_ON_ONCE(). Now, it should be 
much harder to actually trigger.

Thanks!

-- 
Cheers,

David / dhildenb