[PATCH 0/4] implement lightweight guard pages

Lorenzo Stoakes posted 4 patches 1 month, 1 week ago
There is a newer version of this series
arch/alpha/include/uapi/asm/mman.h       |    3 +
arch/mips/include/uapi/asm/mman.h        |    3 +
arch/parisc/include/uapi/asm/mman.h      |    3 +
arch/xtensa/include/uapi/asm/mman.h      |    3 +
include/linux/mm_inline.h                |    2 +-
include/linux/pagewalk.h                 |   18 +-
include/linux/swapops.h                  |   26 +-
include/uapi/asm-generic/mman-common.h   |    3 +
mm/hugetlb.c                             |    3 +
mm/internal.h                            |    6 +
mm/madvise.c                             |  168 ++++
mm/memory.c                              |   18 +-
mm/mprotect.c                            |    3 +-
mm/mseal.c                               |    1 +
mm/pagewalk.c                            |  200 ++--
tools/testing/selftests/mm/.gitignore    |    1 +
tools/testing/selftests/mm/Makefile      |    1 +
tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
18 files changed, 1564 insertions(+), 66 deletions(-)
create mode 100644 tools/testing/selftests/mm/guard-pages.c
[PATCH 0/4] implement lightweight guard pages
Posted by Lorenzo Stoakes 1 month, 1 week ago
Userland library functions such as allocators and threading implementations
often require regions of memory to act as 'guard pages' - mappings which,
when accessed, result in a fatal signal being sent to the accessing
process.

The current means by which these are implemented is via a PROT_NONE mmap()
mapping, which provides the required semantics however incur an overhead of
a VMA for each such region.

With a great many processes and threads, this can rapidly add up and incur
a significant memory penalty. It also has the added problem of preventing
merges that might otherwise be permitted.

This series takes a different approach - an idea suggested by Vlasimil
Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
provenance becomes a little tricky to ascertain after this - please forgive
any omissions!)  - rather than locating the guard pages at the VMA layer,
instead placing them in page tables mapping the required ranges.

Early testing of the prototype version of this code suggests a 5 times
speed up in memory mapping invocations (in conjunction with use of
process_madvise()) and a 13% reduction in VMAs on an entirely idle android
system and unoptimised code.

We expect with optimisation and a loaded system with a larger number of
guard pages this could significantly increase, but in any case these
numbers are encouraging.

This way, rather than having separate VMAs specifying which parts of a
range are guard pages, instead we have a VMA spanning the entire range of
memory a user is permitted to access and including ranges which are to be
'guarded'.

After mapping this, a user can specify which parts of the range should
result in a fatal signal when accessed.

By restricting the ability to specify guard pages to memory mapped by
existing VMAs, we can rely on the mappings being torn down when the
mappings are ultimately unmapped and everything works simply as if the
memory were not faulted in, from the point of view of the containing VMAs.

This mechanism in effect poisons memory ranges similar to hardware memory
poisoning, only it is an entirely software-controlled form of poisoning.

Any poisoned region of memory is also able to 'unpoisoned', that is, to
have its poison markers removed.

The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON
which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this
poisoning.

Poisoning can be performed across multiple VMAs and any existing mappings
will be cleared, that is zapped, before installing the poisoned page table
mappings.

There is no concept of 'nested' poisoning, multiple attempts to poison a
range will, after the first poisoning, have no effect.

Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
memory, so a user can safely unpoison a range of memory and clear only
poison page table mappings leaving the rest intact.

The actual mechanism by which the page table entries are specified makes
use of existing logic - PTE markers, which are used for the userfaultfd
UFFDIO_POISON mechanism.

Unfortunately PTE_MARKER_POISONED is not suited for the guard page
mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
logic to handle it.

We also extend the generic page walk mechanism to allow for installation of
PTEs (carefully restricted to memory management logic only to prevent
unwanted abuse).

We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
specified for a VMA which implies a total removal of memory
characteristics).

It's important to note that the guard page implementation is emphatically
NOT a security feature, so a user can remove the poisoning if they wish. We
simply implement it in such a way as to provide the least surprising
behaviour.

An extensive set of self-tests are provided which ensure behaviour is as
expected and additionally self-documents expected behaviour of poisoned
ranges.

Suggested-by: Vlastimil Babka <vbabka@suze.cz>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>

v1
* Un-RFC'd as appears no major objections to approach but rather debate on
  implementation.
* Fixed issue with arches which need mmu_context.h and
  tlbfush.h. header imports in pagewalker logic to be able to use
  update_mmu_cache() as reported by the kernel test bot.
* Added comments in page walker logic to clarify who can use
  ops->install_pte and why as well as adding a check_ops_valid() helper
  function, as suggested by Christoph.
* Pass false in full parameter in pte_clear_not_present_full() as suggested
  by Jann.
* Stopped erroneously requiring a write lock for the poison operation as
  suggested by Jann and Suren.
* Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
  consistent with how this is used elsewhere in the kernel as suggested by
  Jann.
* Avoid returning -EAGAIN if we are raced on page faults, just keep looping
  and duck out if a fatal signal is pending or a conditional reschedule is
  needed, as suggested by Jann.
* Avoid needlessly splitting huge PUDs and PMDs by specifying
  ACTION_CONTINUE, as suggested by Jann.

RFC
https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (4):
  mm: pagewalk: add the ability to install PTEs
  mm: add PTE_MARKER_GUARD PTE marker
  mm: madvise: implement lightweight guard page mechanism
  selftests/mm: add self tests for guard page feature

 arch/alpha/include/uapi/asm/mman.h       |    3 +
 arch/mips/include/uapi/asm/mman.h        |    3 +
 arch/parisc/include/uapi/asm/mman.h      |    3 +
 arch/xtensa/include/uapi/asm/mman.h      |    3 +
 include/linux/mm_inline.h                |    2 +-
 include/linux/pagewalk.h                 |   18 +-
 include/linux/swapops.h                  |   26 +-
 include/uapi/asm-generic/mman-common.h   |    3 +
 mm/hugetlb.c                             |    3 +
 mm/internal.h                            |    6 +
 mm/madvise.c                             |  168 ++++
 mm/memory.c                              |   18 +-
 mm/mprotect.c                            |    3 +-
 mm/mseal.c                               |    1 +
 mm/pagewalk.c                            |  200 ++--
 tools/testing/selftests/mm/.gitignore    |    1 +
 tools/testing/selftests/mm/Makefile      |    1 +
 tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
 18 files changed, 1564 insertions(+), 66 deletions(-)
 create mode 100644 tools/testing/selftests/mm/guard-pages.c

--
2.46.2
Re: [PATCH 0/4] implement lightweight guard pages
Posted by Vlastimil Babka 1 month, 1 week ago
+CC linux-api (also should on future revisions)

On 10/17/24 22:42, Lorenzo Stoakes wrote:
> Userland library functions such as allocators and threading implementations
> often require regions of memory to act as 'guard pages' - mappings which,
> when accessed, result in a fatal signal being sent to the accessing
> process.
> 
> The current means by which these are implemented is via a PROT_NONE mmap()
> mapping, which provides the required semantics however incur an overhead of
> a VMA for each such region.
> 
> With a great many processes and threads, this can rapidly add up and incur
> a significant memory penalty. It also has the added problem of preventing
> merges that might otherwise be permitted.
> 
> This series takes a different approach - an idea suggested by Vlasimil
> Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
> provenance becomes a little tricky to ascertain after this - please forgive
> any omissions!)  - rather than locating the guard pages at the VMA layer,
> instead placing them in page tables mapping the required ranges.
> 
> Early testing of the prototype version of this code suggests a 5 times
> speed up in memory mapping invocations (in conjunction with use of
> process_madvise()) and a 13% reduction in VMAs on an entirely idle android
> system and unoptimised code.
> 
> We expect with optimisation and a loaded system with a larger number of
> guard pages this could significantly increase, but in any case these
> numbers are encouraging.
> 
> This way, rather than having separate VMAs specifying which parts of a
> range are guard pages, instead we have a VMA spanning the entire range of
> memory a user is permitted to access and including ranges which are to be
> 'guarded'.
> 
> After mapping this, a user can specify which parts of the range should
> result in a fatal signal when accessed.
> 
> By restricting the ability to specify guard pages to memory mapped by
> existing VMAs, we can rely on the mappings being torn down when the
> mappings are ultimately unmapped and everything works simply as if the
> memory were not faulted in, from the point of view of the containing VMAs.
> 
> This mechanism in effect poisons memory ranges similar to hardware memory
> poisoning, only it is an entirely software-controlled form of poisoning.
> 
> Any poisoned region of memory is also able to 'unpoisoned', that is, to
> have its poison markers removed.
> 
> The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON
> which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this
> poisoning.
> 
> Poisoning can be performed across multiple VMAs and any existing mappings
> will be cleared, that is zapped, before installing the poisoned page table
> mappings.
> 
> There is no concept of 'nested' poisoning, multiple attempts to poison a
> range will, after the first poisoning, have no effect.
> 
> Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
> memory, so a user can safely unpoison a range of memory and clear only
> poison page table mappings leaving the rest intact.
> 
> The actual mechanism by which the page table entries are specified makes
> use of existing logic - PTE markers, which are used for the userfaultfd
> UFFDIO_POISON mechanism.
> 
> Unfortunately PTE_MARKER_POISONED is not suited for the guard page
> mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
> handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
> logic to handle it.
> 
> We also extend the generic page walk mechanism to allow for installation of
> PTEs (carefully restricted to memory management logic only to prevent
> unwanted abuse).
> 
> We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
> remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
> specified for a VMA which implies a total removal of memory
> characteristics).
> 
> It's important to note that the guard page implementation is emphatically
> NOT a security feature, so a user can remove the poisoning if they wish. We
> simply implement it in such a way as to provide the least surprising
> behaviour.
> 
> An extensive set of self-tests are provided which ensure behaviour is as
> expected and additionally self-documents expected behaviour of poisoned
> ranges.
> 
> Suggested-by: Vlastimil Babka <vbabka@suze.cz>

Please fix the domain typo (also in patch 3 :)

Thanks for implementing this,
Vlastimil

> Suggested-by: Jann Horn <jannh@google.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> 
> v1
> * Un-RFC'd as appears no major objections to approach but rather debate on
>   implementation.
> * Fixed issue with arches which need mmu_context.h and
>   tlbfush.h. header imports in pagewalker logic to be able to use
>   update_mmu_cache() as reported by the kernel test bot.
> * Added comments in page walker logic to clarify who can use
>   ops->install_pte and why as well as adding a check_ops_valid() helper
>   function, as suggested by Christoph.
> * Pass false in full parameter in pte_clear_not_present_full() as suggested
>   by Jann.
> * Stopped erroneously requiring a write lock for the poison operation as
>   suggested by Jann and Suren.
> * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
>   consistent with how this is used elsewhere in the kernel as suggested by
>   Jann.
> * Avoid returning -EAGAIN if we are raced on page faults, just keep looping
>   and duck out if a fatal signal is pending or a conditional reschedule is
>   needed, as suggested by Jann.
> * Avoid needlessly splitting huge PUDs and PMDs by specifying
>   ACTION_CONTINUE, as suggested by Jann.
> 
> RFC
> https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
> 
> Lorenzo Stoakes (4):
>   mm: pagewalk: add the ability to install PTEs
>   mm: add PTE_MARKER_GUARD PTE marker
>   mm: madvise: implement lightweight guard page mechanism
>   selftests/mm: add self tests for guard page feature
> 
>  arch/alpha/include/uapi/asm/mman.h       |    3 +
>  arch/mips/include/uapi/asm/mman.h        |    3 +
>  arch/parisc/include/uapi/asm/mman.h      |    3 +
>  arch/xtensa/include/uapi/asm/mman.h      |    3 +
>  include/linux/mm_inline.h                |    2 +-
>  include/linux/pagewalk.h                 |   18 +-
>  include/linux/swapops.h                  |   26 +-
>  include/uapi/asm-generic/mman-common.h   |    3 +
>  mm/hugetlb.c                             |    3 +
>  mm/internal.h                            |    6 +
>  mm/madvise.c                             |  168 ++++
>  mm/memory.c                              |   18 +-
>  mm/mprotect.c                            |    3 +-
>  mm/mseal.c                               |    1 +
>  mm/pagewalk.c                            |  200 ++--
>  tools/testing/selftests/mm/.gitignore    |    1 +
>  tools/testing/selftests/mm/Makefile      |    1 +
>  tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
>  18 files changed, 1564 insertions(+), 66 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/guard-pages.c
> 
> --
> 2.46.2
Re: [PATCH 0/4] implement lightweight guard pages
Posted by Lorenzo Stoakes 1 month, 1 week ago
On Fri, Oct 18, 2024 at 06:10:37PM +0200, Vlastimil Babka wrote:
> +CC linux-api (also should on future revisions)
>

They're cc'd :) assuming Linux API <linux-api@vger.kernel.org> is correct
right?

> On 10/17/24 22:42, Lorenzo Stoakes wrote:
> > Userland library functions such as allocators and threading implementations
> > often require regions of memory to act as 'guard pages' - mappings which,
> > when accessed, result in a fatal signal being sent to the accessing
> > process.
> >
> > The current means by which these are implemented is via a PROT_NONE mmap()
> > mapping, which provides the required semantics however incur an overhead of
> > a VMA for each such region.
> >
> > With a great many processes and threads, this can rapidly add up and incur
> > a significant memory penalty. It also has the added problem of preventing
> > merges that might otherwise be permitted.
> >
> > This series takes a different approach - an idea suggested by Vlasimil
> > Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
> > provenance becomes a little tricky to ascertain after this - please forgive
> > any omissions!)  - rather than locating the guard pages at the VMA layer,
> > instead placing them in page tables mapping the required ranges.
> >
> > Early testing of the prototype version of this code suggests a 5 times
> > speed up in memory mapping invocations (in conjunction with use of
> > process_madvise()) and a 13% reduction in VMAs on an entirely idle android
> > system and unoptimised code.
> >
> > We expect with optimisation and a loaded system with a larger number of
> > guard pages this could significantly increase, but in any case these
> > numbers are encouraging.
> >
> > This way, rather than having separate VMAs specifying which parts of a
> > range are guard pages, instead we have a VMA spanning the entire range of
> > memory a user is permitted to access and including ranges which are to be
> > 'guarded'.
> >
> > After mapping this, a user can specify which parts of the range should
> > result in a fatal signal when accessed.
> >
> > By restricting the ability to specify guard pages to memory mapped by
> > existing VMAs, we can rely on the mappings being torn down when the
> > mappings are ultimately unmapped and everything works simply as if the
> > memory were not faulted in, from the point of view of the containing VMAs.
> >
> > This mechanism in effect poisons memory ranges similar to hardware memory
> > poisoning, only it is an entirely software-controlled form of poisoning.
> >
> > Any poisoned region of memory is also able to 'unpoisoned', that is, to
> > have its poison markers removed.
> >
> > The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON
> > which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this
> > poisoning.
> >
> > Poisoning can be performed across multiple VMAs and any existing mappings
> > will be cleared, that is zapped, before installing the poisoned page table
> > mappings.
> >
> > There is no concept of 'nested' poisoning, multiple attempts to poison a
> > range will, after the first poisoning, have no effect.
> >
> > Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
> > memory, so a user can safely unpoison a range of memory and clear only
> > poison page table mappings leaving the rest intact.
> >
> > The actual mechanism by which the page table entries are specified makes
> > use of existing logic - PTE markers, which are used for the userfaultfd
> > UFFDIO_POISON mechanism.
> >
> > Unfortunately PTE_MARKER_POISONED is not suited for the guard page
> > mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
> > handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
> > logic to handle it.
> >
> > We also extend the generic page walk mechanism to allow for installation of
> > PTEs (carefully restricted to memory management logic only to prevent
> > unwanted abuse).
> >
> > We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
> > remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
> > specified for a VMA which implies a total removal of memory
> > characteristics).
> >
> > It's important to note that the guard page implementation is emphatically
> > NOT a security feature, so a user can remove the poisoning if they wish. We
> > simply implement it in such a way as to provide the least surprising
> > behaviour.
> >
> > An extensive set of self-tests are provided which ensure behaviour is as
> > expected and additionally self-documents expected behaviour of poisoned
> > ranges.
> >
> > Suggested-by: Vlastimil Babka <vbabka@suze.cz>
>
> Please fix the domain typo (also in patch 3 :)
>

Damnnn it! I can't believe I left that in. Sorry about that! Will fix on
respin.

Hopefully not to suse.cs ;)

> Thanks for implementing this,
> Vlastimil

Thanks!

>
> > Suggested-by: Jann Horn <jannh@google.com>
> > Suggested-by: David Hildenbrand <david@redhat.com>
> >
> > v1
> > * Un-RFC'd as appears no major objections to approach but rather debate on
> >   implementation.
> > * Fixed issue with arches which need mmu_context.h and
> >   tlbfush.h. header imports in pagewalker logic to be able to use
> >   update_mmu_cache() as reported by the kernel test bot.
> > * Added comments in page walker logic to clarify who can use
> >   ops->install_pte and why as well as adding a check_ops_valid() helper
> >   function, as suggested by Christoph.
> > * Pass false in full parameter in pte_clear_not_present_full() as suggested
> >   by Jann.
> > * Stopped erroneously requiring a write lock for the poison operation as
> >   suggested by Jann and Suren.
> > * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
> >   consistent with how this is used elsewhere in the kernel as suggested by
> >   Jann.
> > * Avoid returning -EAGAIN if we are raced on page faults, just keep looping
> >   and duck out if a fatal signal is pending or a conditional reschedule is
> >   needed, as suggested by Jann.
> > * Avoid needlessly splitting huge PUDs and PMDs by specifying
> >   ACTION_CONTINUE, as suggested by Jann.
> >
> > RFC
> > https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
> >
> > Lorenzo Stoakes (4):
> >   mm: pagewalk: add the ability to install PTEs
> >   mm: add PTE_MARKER_GUARD PTE marker
> >   mm: madvise: implement lightweight guard page mechanism
> >   selftests/mm: add self tests for guard page feature
> >
> >  arch/alpha/include/uapi/asm/mman.h       |    3 +
> >  arch/mips/include/uapi/asm/mman.h        |    3 +
> >  arch/parisc/include/uapi/asm/mman.h      |    3 +
> >  arch/xtensa/include/uapi/asm/mman.h      |    3 +
> >  include/linux/mm_inline.h                |    2 +-
> >  include/linux/pagewalk.h                 |   18 +-
> >  include/linux/swapops.h                  |   26 +-
> >  include/uapi/asm-generic/mman-common.h   |    3 +
> >  mm/hugetlb.c                             |    3 +
> >  mm/internal.h                            |    6 +
> >  mm/madvise.c                             |  168 ++++
> >  mm/memory.c                              |   18 +-
> >  mm/mprotect.c                            |    3 +-
> >  mm/mseal.c                               |    1 +
> >  mm/pagewalk.c                            |  200 ++--
> >  tools/testing/selftests/mm/.gitignore    |    1 +
> >  tools/testing/selftests/mm/Makefile      |    1 +
> >  tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
> >  18 files changed, 1564 insertions(+), 66 deletions(-)
> >  create mode 100644 tools/testing/selftests/mm/guard-pages.c
> >
> > --
> > 2.46.2
>
Re: [PATCH 0/4] implement lightweight guard pages
Posted by Lorenzo Stoakes 1 month, 1 week ago
On Fri, Oct 18, 2024 at 05:17:56PM +0100, Lorenzo Stoakes wrote:
> On Fri, Oct 18, 2024 at 06:10:37PM +0200, Vlastimil Babka wrote:
> > +CC linux-api (also should on future revisions)
> >
>
> They're cc'd :) assuming Linux API <linux-api@vger.kernel.org> is correct
> right?

As discussed on IRC, no I was being a little slow here and hadn't realised
you'd added them, apologies!

Will add them on future respins, sorry guys :)

>
> > On 10/17/24 22:42, Lorenzo Stoakes wrote:
> > > Userland library functions such as allocators and threading implementations
> > > often require regions of memory to act as 'guard pages' - mappings which,
> > > when accessed, result in a fatal signal being sent to the accessing
> > > process.
> > >
> > > The current means by which these are implemented is via a PROT_NONE mmap()
> > > mapping, which provides the required semantics however incur an overhead of
> > > a VMA for each such region.
> > >
> > > With a great many processes and threads, this can rapidly add up and incur
> > > a significant memory penalty. It also has the added problem of preventing
> > > merges that might otherwise be permitted.
> > >
> > > This series takes a different approach - an idea suggested by Vlasimil
> > > Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
> > > provenance becomes a little tricky to ascertain after this - please forgive
> > > any omissions!)  - rather than locating the guard pages at the VMA layer,
> > > instead placing them in page tables mapping the required ranges.
> > >
> > > Early testing of the prototype version of this code suggests a 5 times
> > > speed up in memory mapping invocations (in conjunction with use of
> > > process_madvise()) and a 13% reduction in VMAs on an entirely idle android
> > > system and unoptimised code.
> > >
> > > We expect with optimisation and a loaded system with a larger number of
> > > guard pages this could significantly increase, but in any case these
> > > numbers are encouraging.
> > >
> > > This way, rather than having separate VMAs specifying which parts of a
> > > range are guard pages, instead we have a VMA spanning the entire range of
> > > memory a user is permitted to access and including ranges which are to be
> > > 'guarded'.
> > >
> > > After mapping this, a user can specify which parts of the range should
> > > result in a fatal signal when accessed.
> > >
> > > By restricting the ability to specify guard pages to memory mapped by
> > > existing VMAs, we can rely on the mappings being torn down when the
> > > mappings are ultimately unmapped and everything works simply as if the
> > > memory were not faulted in, from the point of view of the containing VMAs.
> > >
> > > This mechanism in effect poisons memory ranges similar to hardware memory
> > > poisoning, only it is an entirely software-controlled form of poisoning.
> > >
> > > Any poisoned region of memory is also able to 'unpoisoned', that is, to
> > > have its poison markers removed.
> > >
> > > The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON
> > > which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this
> > > poisoning.
> > >
> > > Poisoning can be performed across multiple VMAs and any existing mappings
> > > will be cleared, that is zapped, before installing the poisoned page table
> > > mappings.
> > >
> > > There is no concept of 'nested' poisoning, multiple attempts to poison a
> > > range will, after the first poisoning, have no effect.
> > >
> > > Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
> > > memory, so a user can safely unpoison a range of memory and clear only
> > > poison page table mappings leaving the rest intact.
> > >
> > > The actual mechanism by which the page table entries are specified makes
> > > use of existing logic - PTE markers, which are used for the userfaultfd
> > > UFFDIO_POISON mechanism.
> > >
> > > Unfortunately PTE_MARKER_POISONED is not suited for the guard page
> > > mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
> > > handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
> > > logic to handle it.
> > >
> > > We also extend the generic page walk mechanism to allow for installation of
> > > PTEs (carefully restricted to memory management logic only to prevent
> > > unwanted abuse).
> > >
> > > We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
> > > remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
> > > specified for a VMA which implies a total removal of memory
> > > characteristics).
> > >
> > > It's important to note that the guard page implementation is emphatically
> > > NOT a security feature, so a user can remove the poisoning if they wish. We
> > > simply implement it in such a way as to provide the least surprising
> > > behaviour.
> > >
> > > An extensive set of self-tests are provided which ensure behaviour is as
> > > expected and additionally self-documents expected behaviour of poisoned
> > > ranges.
> > >
> > > Suggested-by: Vlastimil Babka <vbabka@suze.cz>
> >
> > Please fix the domain typo (also in patch 3 :)
> >
>
> Damnnn it! I can't believe I left that in. Sorry about that! Will fix on
> respin.
>
> Hopefully not to suse.cs ;)
>
> > Thanks for implementing this,
> > Vlastimil
>
> Thanks!
>
> >
> > > Suggested-by: Jann Horn <jannh@google.com>
> > > Suggested-by: David Hildenbrand <david@redhat.com>
> > >
> > > v1
> > > * Un-RFC'd as appears no major objections to approach but rather debate on
> > >   implementation.
> > > * Fixed issue with arches which need mmu_context.h and
> > >   tlbfush.h. header imports in pagewalker logic to be able to use
> > >   update_mmu_cache() as reported by the kernel test bot.
> > > * Added comments in page walker logic to clarify who can use
> > >   ops->install_pte and why as well as adding a check_ops_valid() helper
> > >   function, as suggested by Christoph.
> > > * Pass false in full parameter in pte_clear_not_present_full() as suggested
> > >   by Jann.
> > > * Stopped erroneously requiring a write lock for the poison operation as
> > >   suggested by Jann and Suren.
> > > * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
> > >   consistent with how this is used elsewhere in the kernel as suggested by
> > >   Jann.
> > > * Avoid returning -EAGAIN if we are raced on page faults, just keep looping
> > >   and duck out if a fatal signal is pending or a conditional reschedule is
> > >   needed, as suggested by Jann.
> > > * Avoid needlessly splitting huge PUDs and PMDs by specifying
> > >   ACTION_CONTINUE, as suggested by Jann.
> > >
> > > RFC
> > > https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
> > >
> > > Lorenzo Stoakes (4):
> > >   mm: pagewalk: add the ability to install PTEs
> > >   mm: add PTE_MARKER_GUARD PTE marker
> > >   mm: madvise: implement lightweight guard page mechanism
> > >   selftests/mm: add self tests for guard page feature
> > >
> > >  arch/alpha/include/uapi/asm/mman.h       |    3 +
> > >  arch/mips/include/uapi/asm/mman.h        |    3 +
> > >  arch/parisc/include/uapi/asm/mman.h      |    3 +
> > >  arch/xtensa/include/uapi/asm/mman.h      |    3 +
> > >  include/linux/mm_inline.h                |    2 +-
> > >  include/linux/pagewalk.h                 |   18 +-
> > >  include/linux/swapops.h                  |   26 +-
> > >  include/uapi/asm-generic/mman-common.h   |    3 +
> > >  mm/hugetlb.c                             |    3 +
> > >  mm/internal.h                            |    6 +
> > >  mm/madvise.c                             |  168 ++++
> > >  mm/memory.c                              |   18 +-
> > >  mm/mprotect.c                            |    3 +-
> > >  mm/mseal.c                               |    1 +
> > >  mm/pagewalk.c                            |  200 ++--
> > >  tools/testing/selftests/mm/.gitignore    |    1 +
> > >  tools/testing/selftests/mm/Makefile      |    1 +
> > >  tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
> > >  18 files changed, 1564 insertions(+), 66 deletions(-)
> > >  create mode 100644 tools/testing/selftests/mm/guard-pages.c
> > >
> > > --
> > > 2.46.2
> >