arch/alpha/include/uapi/asm/mman.h | 3 + arch/mips/include/uapi/asm/mman.h | 3 + arch/parisc/include/uapi/asm/mman.h | 3 + arch/xtensa/include/uapi/asm/mman.h | 3 + include/linux/mm_inline.h | 2 +- include/linux/pagewalk.h | 18 +- include/linux/swapops.h | 26 +- include/uapi/asm-generic/mman-common.h | 3 + mm/hugetlb.c | 3 + mm/internal.h | 6 + mm/madvise.c | 168 +++ mm/memory.c | 18 +- mm/mprotect.c | 3 +- mm/mseal.c | 1 + mm/pagewalk.c | 200 ++- tools/include/uapi/asm-generic/mman-common.h | 3 + tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1228 ++++++++++++++++++ 19 files changed, 1627 insertions(+), 66 deletions(-) create mode 100644 tools/testing/selftests/mm/guard-pages.c
Userland library functions such as allocators and threading implementations often require regions of memory to act as 'guard pages' - mappings which, when accessed, result in a fatal signal being sent to the accessing process. The current means by which these are implemented is via a PROT_NONE mmap() mapping, which provides the required semantics however incur an overhead of a VMA for each such region. With a great many processes and threads, this can rapidly add up and incur a significant memory penalty. It also has the added problem of preventing merges that might otherwise be permitted. This series takes a different approach - an idea suggested by Vlasimil Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the provenance becomes a little tricky to ascertain after this - please forgive any omissions!) - rather than locating the guard pages at the VMA layer, instead placing them in page tables mapping the required ranges. Early testing of the prototype version of this code suggests a 5 times speed up in memory mapping invocations (in conjunction with use of process_madvise()) and a 13% reduction in VMAs on an entirely idle android system and unoptimised code. We expect with optimisation and a loaded system with a larger number of guard pages this could significantly increase, but in any case these numbers are encouraging. This way, rather than having separate VMAs specifying which parts of a range are guard pages, instead we have a VMA spanning the entire range of memory a user is permitted to access and including ranges which are to be 'guarded'. After mapping this, a user can specify which parts of the range should result in a fatal signal when accessed. By restricting the ability to specify guard pages to memory mapped by existing VMAs, we can rely on the mappings being torn down when the mappings are ultimately unmapped and everything works simply as if the memory were not faulted in, from the point of view of the containing VMAs. This mechanism in effect poisons memory ranges similar to hardware memory poisoning, only it is an entirely software-controlled form of poisoning. Any poisoned region of memory is also able to 'unpoisoned', that is, to have its poison markers removed. The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this poisoning. Poisoning can be performed across multiple VMAs and any existing mappings will be cleared, that is zapped, before installing the poisoned page table mappings. There is no concept of 'nested' poisoning, multiple attempts to poison a range will, after the first poisoning, have no effect. Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned memory, so a user can safely unpoison a range of memory and clear only poison page table mappings leaving the rest intact. The actual mechanism by which the page table entries are specified makes use of existing logic - PTE markers, which are used for the userfaultfd UFFDIO_POISON mechanism. Unfortunately PTE_MARKER_POISONED is not suited for the guard page mechanism as it results in VM_FAULT_HWPOISON semantics in the fault handler, so we add our own specific PTE_MARKER_GUARD and adapt existing logic to handle it. We also extend the generic page walk mechanism to allow for installation of PTEs (carefully restricted to memory management logic only to prevent unwanted abuse). We ensure that zapping performed by, for instance, MADV_DONTNEED, does not remove guard poison markers, nor does forking (except when VM_WIPEONFORK is specified for a VMA which implies a total removal of memory characteristics). It's important to note that the guard page implementation is emphatically NOT a security feature, so a user can remove the poisoning if they wish. We simply implement it in such a way as to provide the least surprising behaviour. An extensive set of self-tests are provided which ensure behaviour is as expected and additionally self-documents expected behaviour of poisoned ranges. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> v2 * The macros in kselftest_harness.h seem to be broken - __EXPECT() is terminated by '} while (0); OPTIONAL_HANDLER(_assert)' meaning it is not safe in single line if / else or for /which blocks, however working around this results in checkpatch producing invalid warnings, as reported by Shuah. * Fixing these macros is out of scope for this series, so compromise and instead rewrite test blocks so as to use multiple lines by separating out a decl in most cases. This has the side effect of, for the most part, making things more readable. * Heavily document the use of the volatile keyword - we can't avoid checkpatch complaining about this, so we explain it, as reported by Shuah. * Updated commit message to highlight that we skip tests we lack permissions for, as reported by Shuah. * Replaced a perror() with ksft_exit_fail_perror(), as reported by Shuah. * Added user friendly messages to cases where tests are skipped due to lack of permissions, as reported by Shuah. * Update the tool header to include the new MADV_GUARD_POISON/UNPOISON defines and directly include asm-generic/mman.h to get the platform-neutral versions to ensure we import them. * Finally fixed Vlastimil's email address in Suggested-by tags from suze to suse, as reported by Vlastimil. * Added linux-api to cc list, as reported by Vlastimil. v1 * Un-RFC'd as appears no major objections to approach but rather debate on implementation. * Fixed issue with arches which need mmu_context.h and tlbfush.h. header imports in pagewalker logic to be able to use update_mmu_cache() as reported by the kernel test bot. * Added comments in page walker logic to clarify who can use ops->install_pte and why as well as adding a check_ops_valid() helper function, as suggested by Christoph. * Pass false in full parameter in pte_clear_not_present_full() as suggested by Jann. * Stopped erroneously requiring a write lock for the poison operation as suggested by Jann and Suren. * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be consistent with how this is used elsewhere in the kernel as suggested by Jann. * Avoid returning -EAGAIN if we are raced on page faults, just keep looping and duck out if a fatal signal is pending or a conditional reschedule is needed, as suggested by Jann. * Avoid needlessly splitting huge PUDs and PMDs by specifying ACTION_CONTINUE, as suggested by Jann. https://lore.kernel.org/all/cover.1729196871.git.lorenzo.stoakes@oracle.com/ RFC https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/ Lorenzo Stoakes (5): mm: pagewalk: add the ability to install PTEs mm: add PTE_MARKER_GUARD PTE marker mm: madvise: implement lightweight guard page mechanism tools: testing: update tools UAPI header for mman-common.h selftests/mm: add self tests for guard page feature arch/alpha/include/uapi/asm/mman.h | 3 + arch/mips/include/uapi/asm/mman.h | 3 + arch/parisc/include/uapi/asm/mman.h | 3 + arch/xtensa/include/uapi/asm/mman.h | 3 + include/linux/mm_inline.h | 2 +- include/linux/pagewalk.h | 18 +- include/linux/swapops.h | 26 +- include/uapi/asm-generic/mman-common.h | 3 + mm/hugetlb.c | 3 + mm/internal.h | 6 + mm/madvise.c | 168 +++ mm/memory.c | 18 +- mm/mprotect.c | 3 +- mm/mseal.c | 1 + mm/pagewalk.c | 200 ++- tools/include/uapi/asm-generic/mman-common.h | 3 + tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1228 ++++++++++++++++++ 19 files changed, 1627 insertions(+), 66 deletions(-) create mode 100644 tools/testing/selftests/mm/guard-pages.c -- 2.47.0
* Lorenzo Stoakes: > Early testing of the prototype version of this code suggests a 5 times > speed up in memory mapping invocations (in conjunction with use of > process_madvise()) and a 13% reduction in VMAs on an entirely idle android > system and unoptimised code. > > We expect with optimisation and a loaded system with a larger number of > guard pages this could significantly increase, but in any case these > numbers are encouraging. > > This way, rather than having separate VMAs specifying which parts of a > range are guard pages, instead we have a VMA spanning the entire range of > memory a user is permitted to access and including ranges which are to be > 'guarded'. > > After mapping this, a user can specify which parts of the range should > result in a fatal signal when accessed. > > By restricting the ability to specify guard pages to memory mapped by > existing VMAs, we can rely on the mappings being torn down when the > mappings are ultimately unmapped and everything works simply as if the > memory were not faulted in, from the point of view of the containing VMAs. We have a glibc (so not Android) dynamic linker bug that asks us to remove PROT_NONE mappings in mapped shared objects: Extra struct vm_area_struct with ---p created when PAGE_SIZE < max-page-size <https://sourceware.org/bugzilla/show_bug.cgi?id=31076> It's slightly different from a guard page because our main goal is to avoid other mappings to end up in those gaps, which has been shown to cause odd application behavior in cases where it happens. If I understand the series correctly, the kernel would not automatically attribute those PROT_NONE gaps to the previous or subsequent mapping. We would have to extend one of the surrounding mapps and apply MADV_POISON to that over-mapped part. That doesn't seem too onerous. Could the ELF loader in the kernel do the same thing for the main executable and the program loader?
Hi Florian, Lorenzo, This looks great! What I am VERY interested in is if poisoned pages cause SIGSEGV even when the access happens in the kernel. Namely, the syscall still returns EFAULT, but also SIGSEGV is queued on return to user-space. Catching bad accesses in system calls is currently the weak spot for all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). It's almost possible with userfaultfd, but catching faults in the kernel requires admin capability, so not really an option for generic bug detection tools (+inconvinience of userfaultfd setup/handler). Intercepting all EFAULT from syscalls is not generally possible (w/o ptrace, usually not an option as well), and EFAULT does not always mean a bug. Triggering SIGSEGV even in syscalls would be not just a performance optimization, but a new useful capability that would allow it to catch more bugs. Thanks
On 23.10.24 08:24, Dmitry Vyukov wrote: > Hi Florian, Lorenzo, > > This looks great! > > What I am VERY interested in is if poisoned pages cause SIGSEGV even when > the access happens in the kernel. Namely, the syscall still returns EFAULT, > but also SIGSEGV is queued on return to user-space. > > Catching bad accesses in system calls is currently the weak spot for > all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). > It's almost possible with userfaultfd, but catching faults in the kernel > requires admin capability, so not really an option for generic bug > detection tools (+inconvinience of userfaultfd setup/handler). > Intercepting all EFAULT from syscalls is not generally possible > (w/o ptrace, usually not an option as well), and EFAULT does not always > mean a bug. > > Triggering SIGSEGV even in syscalls would be not just a performance > optimization, but a new useful capability that would allow it to catch > more bugs. Right, we discussed that offline also as a possible extension to the userfaultfd SIGBUS mode. I did not look into that yet, but I was wonder if there could be cases where a different process could trigger that SIGSEGV, and how to (and if to) handle that. For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I think with userfaultfd() we will currently return -EFAULT, because we call get_user_page_vma_remote() that is not prepared for dropping the mmap lock. Possibly that is the right thing to do, but not sure :) These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be able to distinguish them and perform different handling. -- Cheers, David / dhildenb
+cc Linus as reference a commit of his below... On Wed, Oct 23, 2024 at 09:19:03AM +0200, David Hildenbrand wrote: > On 23.10.24 08:24, Dmitry Vyukov wrote: > > Hi Florian, Lorenzo, > > > > This looks great! Thanks! > > > > What I am VERY interested in is if poisoned pages cause SIGSEGV even when > > the access happens in the kernel. Namely, the syscall still returns EFAULT, > > but also SIGSEGV is queued on return to user-space. Yeah we don't in any way. I think adding something like this would be a bit of its own project. The fault andler for this is in handle_pte_marker() in mm/memory.c, where we do the following: /* Hitting a guard page is always a fatal condition. */ if (marker & PTE_MARKER_GUARD) return VM_FAULT_SIGSEGV; So basically we pass this back to whoever invoked the fault. For uaccess we end up in arch-specific code that eventually checks exception tables etc. and for x86-64 that's kernelmode_fixup_or_oops(). There used to be a sig_on_uaccess_err in the x86-specific thread_struct that let you propagate it but Linus pulled it out in commit 02b670c1f88e ("x86/mm: Remove broken vsyscall emulation code from the page fault code") where it was presumably used for vsyscall. Of course we could just get something much higher up the stack to send the signal, but we'd need to be careful we weren't breaking anything doing it... I address GUP below. > > > > Catching bad accesses in system calls is currently the weak spot for > > all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). > > It's almost possible with userfaultfd, but catching faults in the kernel > > requires admin capability, so not really an option for generic bug > > detection tools (+inconvinience of userfaultfd setup/handler). > > Intercepting all EFAULT from syscalls is not generally possible > > (w/o ptrace, usually not an option as well), and EFAULT does not always > > mean a bug. > > > > Triggering SIGSEGV even in syscalls would be not just a performance > > optimization, but a new useful capability that would allow it to catch > > more bugs. > > Right, we discussed that offline also as a possible extension to the > userfaultfd SIGBUS mode. > > I did not look into that yet, but I was wonder if there could be cases where > a different process could trigger that SIGSEGV, and how to (and if to) > handle that. > > For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I > think with userfaultfd() we will currently return -EFAULT, because we call > get_user_page_vma_remote() that is not prepared for dropping the mmap lock. > Possibly that is the right thing to do, but not sure :) > > These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be > able to distinguish them and perform different handling. So all GUP will return -EFAULT when hitting guard pages unless we change something. In GUP we handle this in faultin_page(): if (ret & VM_FAULT_ERROR) { int err = vm_fault_to_errno(ret, flags); if (err) return err; BUG(); } And vm_fault_to_errno() is: static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) { if (vm_fault & VM_FAULT_OOM) return -ENOMEM; if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE)) return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT; if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV)) return -EFAULT; return 0; } Again, I think if we wanted special handling here we'd need to probably propagate that fault from higher up, but yes we'd need to for one definitely not do so if it's remote but I worry about other cases. > > -- > Cheers, > > David / dhildenb > Overall while I sympathise with this, it feels dangerous and a pretty major change, because there'll be something somewhere that will break because it expects faults to be swallowed that we no longer do swallow. So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know. But I definitely don't think a 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundmentally change how user access page faults are handled within the kernel :)
On Wed, 23 Oct 2024 at 10:12, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > +cc Linus as reference a commit of his below... > > On Wed, Oct 23, 2024 at 09:19:03AM +0200, David Hildenbrand wrote: > > On 23.10.24 08:24, Dmitry Vyukov wrote: > > > Hi Florian, Lorenzo, > > > > > > This looks great! > > Thanks! > > > > > > > What I am VERY interested in is if poisoned pages cause SIGSEGV even when > > > the access happens in the kernel. Namely, the syscall still returns EFAULT, > > > but also SIGSEGV is queued on return to user-space. > > Yeah we don't in any way. > > I think adding something like this would be a bit of its own project. I can totally understand this. > The fault andler for this is in handle_pte_marker() in mm/memory.c, where > we do the following: > > /* Hitting a guard page is always a fatal condition. */ > if (marker & PTE_MARKER_GUARD) > return VM_FAULT_SIGSEGV; > > So basically we pass this back to whoever invoked the fault. For uaccess we > end up in arch-specific code that eventually checks exception tables > etc. and for x86-64 that's kernelmode_fixup_or_oops(). > > There used to be a sig_on_uaccess_err in the x86-specific thread_struct > that let you propagate it but Linus pulled it out in commit 02b670c1f88e > ("x86/mm: Remove broken vsyscall emulation code from the page fault code") > where it was presumably used for vsyscall. > > Of course we could just get something much higher up the stack to send the > signal, but we'd need to be careful we weren't breaking anything doing > it... Can setting TIF_NOTIFY_RESUME and then doing the rest when returning to userspace help here? > I address GUP below. > > > > > > > Catching bad accesses in system calls is currently the weak spot for > > > all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). > > > It's almost possible with userfaultfd, but catching faults in the kernel > > > requires admin capability, so not really an option for generic bug > > > detection tools (+inconvinience of userfaultfd setup/handler). > > > Intercepting all EFAULT from syscalls is not generally possible > > > (w/o ptrace, usually not an option as well), and EFAULT does not always > > > mean a bug. > > > > > > Triggering SIGSEGV even in syscalls would be not just a performance > > > optimization, but a new useful capability that would allow it to catch > > > more bugs. > > > > Right, we discussed that offline also as a possible extension to the > > userfaultfd SIGBUS mode. > > > > I did not look into that yet, but I was wonder if there could be cases where > > a different process could trigger that SIGSEGV, and how to (and if to) > > handle that. > > > > For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I > > think with userfaultfd() we will currently return -EFAULT, because we call > > get_user_page_vma_remote() that is not prepared for dropping the mmap lock. > > Possibly that is the right thing to do, but not sure :) That's a good corner case. I guess also process_vm_readv/writev. Not triggering the signal in these cases looks like the right thing to do. > > These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be > > able to distinguish them and perform different handling. > > So all GUP will return -EFAULT when hitting guard pages unless we change > something. > > In GUP we handle this in faultin_page(): > > if (ret & VM_FAULT_ERROR) { > int err = vm_fault_to_errno(ret, flags); > > if (err) > return err; > BUG(); > } > > And vm_fault_to_errno() is: > > static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) > { > if (vm_fault & VM_FAULT_OOM) > return -ENOMEM; > if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE)) > return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT; > if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV)) > return -EFAULT; > return 0; > } > > Again, I think if we wanted special handling here we'd need to probably > propagate that fault from higher up, but yes we'd need to for one > definitely not do so if it's remote but I worry about other cases. > > > > > -- > > Cheers, > > > > David / dhildenb > > > > Overall while I sympathise with this, it feels dangerous and a pretty major > change, because there'll be something somewhere that will break because it > expects faults to be swallowed that we no longer do swallow. > > So I'd say it'd be something we should defer, but of course it's a highly > user-facing change so how easy that would be I don't know. > > But I definitely don't think a 'introduce the ability to do cheap PROT_NONE > guards' series is the place to also fundmentally change how user access > page faults are handled within the kernel :) Will delivering signals on kernel access be a backwards compatible change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? It's just somewhat painful to detect/update all userspace if we add this feature in future. Can we say signal delivery on kernel accesses is unspecified?
On 10/23/24 10:56, Dmitry Vyukov wrote: >> >> Overall while I sympathise with this, it feels dangerous and a pretty major >> change, because there'll be something somewhere that will break because it >> expects faults to be swallowed that we no longer do swallow. >> >> So I'd say it'd be something we should defer, but of course it's a highly >> user-facing change so how easy that would be I don't know. >> >> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE >> guards' series is the place to also fundmentally change how user access >> page faults are handled within the kernel :) > > Will delivering signals on kernel access be a backwards compatible > change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? > It's just somewhat painful to detect/update all userspace if we add > this feature in future. Can we say signal delivery on kernel accesses > is unspecified? Would adding signal delivery to guard PTEs only help enough the ASAN etc usecase? Wouldn't it be instead possible to add some prctl to opt-in the whole ASANized process to deliver all existing segfaults as signals instead of -EFAULT ?
On Wed, 23 Oct 2024 at 11:06, Vlastimil Babka <vbabka@suse.cz> wrote: > > On 10/23/24 10:56, Dmitry Vyukov wrote: > >> > >> Overall while I sympathise with this, it feels dangerous and a pretty major > >> change, because there'll be something somewhere that will break because it > >> expects faults to be swallowed that we no longer do swallow. > >> > >> So I'd say it'd be something we should defer, but of course it's a highly > >> user-facing change so how easy that would be I don't know. > >> > >> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE > >> guards' series is the place to also fundmentally change how user access > >> page faults are handled within the kernel :) > > > > Will delivering signals on kernel access be a backwards compatible > > change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? > > It's just somewhat painful to detect/update all userspace if we add > > this feature in future. Can we say signal delivery on kernel accesses > > is unspecified? > > Would adding signal delivery to guard PTEs only help enough the ASAN etc > usecase? Wouldn't it be instead possible to add some prctl to opt-in the > whole ASANized process to deliver all existing segfaults as signals instead > of -EFAULT ? ASAN per se does not need this (it does not use page protection). However, if you mean bug detection tools in general, then, yes, that's what I had in mind. There are also things like stack guard pages in libc that would benefit from that as well. But I observed that some libraries intentionally use EFAULT to probe for memory readability, i.e. use some cheap syscall to probe memory before reading it. So changing behavior globally may not work.
On 23.10.24 11:06, Vlastimil Babka wrote: > On 10/23/24 10:56, Dmitry Vyukov wrote: >>> >>> Overall while I sympathise with this, it feels dangerous and a pretty major >>> change, because there'll be something somewhere that will break because it >>> expects faults to be swallowed that we no longer do swallow. >>> >>> So I'd say it'd be something we should defer, but of course it's a highly >>> user-facing change so how easy that would be I don't know. >>> >>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE >>> guards' series is the place to also fundmentally change how user access >>> page faults are handled within the kernel :) >> >> Will delivering signals on kernel access be a backwards compatible >> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? >> It's just somewhat painful to detect/update all userspace if we add >> this feature in future. Can we say signal delivery on kernel accesses >> is unspecified? > > Would adding signal delivery to guard PTEs only help enough the ASAN etc > usecase? Wouldn't it be instead possible to add some prctl to opt-in the > whole ASANized process to deliver all existing segfaults as signals instead > of -EFAULT ? Not sure if it is an "instead", you might have to deliver the signal in addition to letting the syscall fail (not that I would be an expert on signal delivery :D ). prctl sounds better, or some way to configure the behavior on VMA ranges; otherwise we would need yet another marker, which is not the end of the world but would make it slightly more confusing. -- Cheers, David / dhildenb
On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote: > On 23.10.24 11:06, Vlastimil Babka wrote: > > On 10/23/24 10:56, Dmitry Vyukov wrote: > > > > > > > > Overall while I sympathise with this, it feels dangerous and a pretty major > > > > change, because there'll be something somewhere that will break because it > > > > expects faults to be swallowed that we no longer do swallow. > > > > > > > > So I'd say it'd be something we should defer, but of course it's a highly > > > > user-facing change so how easy that would be I don't know. > > > > > > > > But I definitely don't think a 'introduce the ability to do cheap PROT_NONE > > > > guards' series is the place to also fundmentally change how user access > > > > page faults are handled within the kernel :) > > > > > > Will delivering signals on kernel access be a backwards compatible > > > change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? > > > It's just somewhat painful to detect/update all userspace if we add > > > this feature in future. Can we say signal delivery on kernel accesses > > > is unspecified? > > > > Would adding signal delivery to guard PTEs only help enough the ASAN etc > > usecase? Wouldn't it be instead possible to add some prctl to opt-in the > > whole ASANized process to deliver all existing segfaults as signals instead > > of -EFAULT ? > > Not sure if it is an "instead", you might have to deliver the signal in > addition to letting the syscall fail (not that I would be an expert on > signal delivery :D ). > > prctl sounds better, or some way to configure the behavior on VMA ranges; > otherwise we would need yet another marker, which is not the end of the > world but would make it slightly more confusing. > Yeah prctl() sounds sensible, and since we are explicitly adding a marker for guard pages here we can do this as a follow up too without breaking any userland expectations, i.e. 'new feature to make guard pages signal' is not going to contradict the default behaviour. So all makes sense to me, but I do think best as a follow up! :) > -- > Cheers, > > David / dhildenb >
On 23.10.24 11:18, Lorenzo Stoakes wrote: > On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote: >> On 23.10.24 11:06, Vlastimil Babka wrote: >>> On 10/23/24 10:56, Dmitry Vyukov wrote: >>>>> >>>>> Overall while I sympathise with this, it feels dangerous and a pretty major >>>>> change, because there'll be something somewhere that will break because it >>>>> expects faults to be swallowed that we no longer do swallow. >>>>> >>>>> So I'd say it'd be something we should defer, but of course it's a highly >>>>> user-facing change so how easy that would be I don't know. >>>>> >>>>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE >>>>> guards' series is the place to also fundmentally change how user access >>>>> page faults are handled within the kernel :) >>>> >>>> Will delivering signals on kernel access be a backwards compatible >>>> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? >>>> It's just somewhat painful to detect/update all userspace if we add >>>> this feature in future. Can we say signal delivery on kernel accesses >>>> is unspecified? >>> >>> Would adding signal delivery to guard PTEs only help enough the ASAN etc >>> usecase? Wouldn't it be instead possible to add some prctl to opt-in the >>> whole ASANized process to deliver all existing segfaults as signals instead >>> of -EFAULT ? >> >> Not sure if it is an "instead", you might have to deliver the signal in >> addition to letting the syscall fail (not that I would be an expert on >> signal delivery :D ). >> >> prctl sounds better, or some way to configure the behavior on VMA ranges; >> otherwise we would need yet another marker, which is not the end of the >> world but would make it slightly more confusing. >> > > Yeah prctl() sounds sensible, and since we are explicitly adding a marker > for guard pages here we can do this as a follow up too without breaking any > userland expectations, i.e. 'new feature to make guard pages signal' is not > going to contradict the default behaviour. > > So all makes sense to me, but I do think best as a follow up! :) Yeah, fully agreed. And my gut feeling is that it might not be that easy ... :) In the end, what we want is *some* notification that a guard PTE was accessed. Likely the notification must not necessarily completely synchronous (although it would be ideal) and it must not be a signal. Maybe having a different way to obtain that information from user space would work. -- Cheers, David / dhildenb
On Wed, 23 Oct 2024 at 11:29, David Hildenbrand <david@redhat.com> wrote: > > On 23.10.24 11:18, Lorenzo Stoakes wrote: > > On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote: > >> On 23.10.24 11:06, Vlastimil Babka wrote: > >>> On 10/23/24 10:56, Dmitry Vyukov wrote: > >>>>> > >>>>> Overall while I sympathise with this, it feels dangerous and a pretty major > >>>>> change, because there'll be something somewhere that will break because it > >>>>> expects faults to be swallowed that we no longer do swallow. > >>>>> > >>>>> So I'd say it'd be something we should defer, but of course it's a highly > >>>>> user-facing change so how easy that would be I don't know. > >>>>> > >>>>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE > >>>>> guards' series is the place to also fundmentally change how user access > >>>>> page faults are handled within the kernel :) > >>>> > >>>> Will delivering signals on kernel access be a backwards compatible > >>>> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? > >>>> It's just somewhat painful to detect/update all userspace if we add > >>>> this feature in future. Can we say signal delivery on kernel accesses > >>>> is unspecified? > >>> > >>> Would adding signal delivery to guard PTEs only help enough the ASAN etc > >>> usecase? Wouldn't it be instead possible to add some prctl to opt-in the > >>> whole ASANized process to deliver all existing segfaults as signals instead > >>> of -EFAULT ? > >> > >> Not sure if it is an "instead", you might have to deliver the signal in > >> addition to letting the syscall fail (not that I would be an expert on > >> signal delivery :D ). > >> > >> prctl sounds better, or some way to configure the behavior on VMA ranges; > >> otherwise we would need yet another marker, which is not the end of the > >> world but would make it slightly more confusing. > >> > > > > Yeah prctl() sounds sensible, and since we are explicitly adding a marker > > for guard pages here we can do this as a follow up too without breaking any > > userland expectations, i.e. 'new feature to make guard pages signal' is not > > going to contradict the default behaviour. > > > > So all makes sense to me, but I do think best as a follow up! :) > > Yeah, fully agreed. And my gut feeling is that it might not be that easy > ... :) > > In the end, what we want is *some* notification that a guard PTE was > accessed. Likely the notification must not necessarily completely > synchronous (although it would be ideal) and it must not be a signal. > > Maybe having a different way to obtain that information from user space > would work. For bug detection tools (like GWP-ASan [1]) it's essential to have useful stack traces. As such, having this signal be synchronous would be more useful. I don't see how one could get a useful stack trace (or other information like what's stashed away in ucontext like CPU registers) if this were asynchronous. [1] https://arxiv.org/pdf/2311.09394
On 23.10.24 13:31, Marco Elver wrote: > On Wed, 23 Oct 2024 at 11:29, David Hildenbrand <david@redhat.com> wrote: >> >> On 23.10.24 11:18, Lorenzo Stoakes wrote: >>> On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote: >>>> On 23.10.24 11:06, Vlastimil Babka wrote: >>>>> On 10/23/24 10:56, Dmitry Vyukov wrote: >>>>>>> >>>>>>> Overall while I sympathise with this, it feels dangerous and a pretty major >>>>>>> change, because there'll be something somewhere that will break because it >>>>>>> expects faults to be swallowed that we no longer do swallow. >>>>>>> >>>>>>> So I'd say it'd be something we should defer, but of course it's a highly >>>>>>> user-facing change so how easy that would be I don't know. >>>>>>> >>>>>>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE >>>>>>> guards' series is the place to also fundmentally change how user access >>>>>>> page faults are handled within the kernel :) >>>>>> >>>>>> Will delivering signals on kernel access be a backwards compatible >>>>>> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? >>>>>> It's just somewhat painful to detect/update all userspace if we add >>>>>> this feature in future. Can we say signal delivery on kernel accesses >>>>>> is unspecified? >>>>> >>>>> Would adding signal delivery to guard PTEs only help enough the ASAN etc >>>>> usecase? Wouldn't it be instead possible to add some prctl to opt-in the >>>>> whole ASANized process to deliver all existing segfaults as signals instead >>>>> of -EFAULT ? >>>> >>>> Not sure if it is an "instead", you might have to deliver the signal in >>>> addition to letting the syscall fail (not that I would be an expert on >>>> signal delivery :D ). >>>> >>>> prctl sounds better, or some way to configure the behavior on VMA ranges; >>>> otherwise we would need yet another marker, which is not the end of the >>>> world but would make it slightly more confusing. >>>> >>> >>> Yeah prctl() sounds sensible, and since we are explicitly adding a marker >>> for guard pages here we can do this as a follow up too without breaking any >>> userland expectations, i.e. 'new feature to make guard pages signal' is not >>> going to contradict the default behaviour. >>> >>> So all makes sense to me, but I do think best as a follow up! :) >> >> Yeah, fully agreed. And my gut feeling is that it might not be that easy >> ... :) >> >> In the end, what we want is *some* notification that a guard PTE was >> accessed. Likely the notification must not necessarily completely >> synchronous (although it would be ideal) and it must not be a signal. >> >> Maybe having a different way to obtain that information from user space >> would work. > > For bug detection tools (like GWP-ASan [1]) it's essential to have > useful stack traces. As such, having this signal be synchronous would > be more useful. I don't see how one could get a useful stack trace (or > other information like what's stashed away in ucontext like CPU > registers) if this were asynchronous. Yes, I know. But it would be better than not getting *any* notification except of some syscalls simply failing with -EFAULT, and not having an idea which address was even accessed. Maybe the signal injection is easier than I think, but I somehow doubt it ... -- Cheers, David / dhildenb
On Wed, Oct 23, 2024 at 01:36:10PM +0200, David Hildenbrand wrote: > On 23.10.24 13:31, Marco Elver wrote: > > On Wed, 23 Oct 2024 at 11:29, David Hildenbrand <david@redhat.com> wrote: > > > > > > On 23.10.24 11:18, Lorenzo Stoakes wrote: > > > > On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote: > > > > > On 23.10.24 11:06, Vlastimil Babka wrote: > > > > > > On 10/23/24 10:56, Dmitry Vyukov wrote: > > > > > > > > > > > > > > > > Overall while I sympathise with this, it feels dangerous and a pretty major > > > > > > > > change, because there'll be something somewhere that will break because it > > > > > > > > expects faults to be swallowed that we no longer do swallow. > > > > > > > > > > > > > > > > So I'd say it'd be something we should defer, but of course it's a highly > > > > > > > > user-facing change so how easy that would be I don't know. > > > > > > > > > > > > > > > > But I definitely don't think a 'introduce the ability to do cheap PROT_NONE > > > > > > > > guards' series is the place to also fundmentally change how user access > > > > > > > > page faults are handled within the kernel :) > > > > > > > > > > > > > > Will delivering signals on kernel access be a backwards compatible > > > > > > > change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? > > > > > > > It's just somewhat painful to detect/update all userspace if we add > > > > > > > this feature in future. Can we say signal delivery on kernel accesses > > > > > > > is unspecified? > > > > > > > > > > > > Would adding signal delivery to guard PTEs only help enough the ASAN etc > > > > > > usecase? Wouldn't it be instead possible to add some prctl to opt-in the > > > > > > whole ASANized process to deliver all existing segfaults as signals instead > > > > > > of -EFAULT ? > > > > > > > > > > Not sure if it is an "instead", you might have to deliver the signal in > > > > > addition to letting the syscall fail (not that I would be an expert on > > > > > signal delivery :D ). > > > > > > > > > > prctl sounds better, or some way to configure the behavior on VMA ranges; > > > > > otherwise we would need yet another marker, which is not the end of the > > > > > world but would make it slightly more confusing. > > > > > > > > > > > > > Yeah prctl() sounds sensible, and since we are explicitly adding a marker > > > > for guard pages here we can do this as a follow up too without breaking any > > > > userland expectations, i.e. 'new feature to make guard pages signal' is not > > > > going to contradict the default behaviour. > > > > > > > > So all makes sense to me, but I do think best as a follow up! :) > > > > > > Yeah, fully agreed. And my gut feeling is that it might not be that easy > > > ... :) > > > > > > In the end, what we want is *some* notification that a guard PTE was > > > accessed. Likely the notification must not necessarily completely > > > synchronous (although it would be ideal) and it must not be a signal. > > > > > > Maybe having a different way to obtain that information from user space > > > would work. > > > > For bug detection tools (like GWP-ASan [1]) it's essential to have > > useful stack traces. As such, having this signal be synchronous would > > be more useful. I don't see how one could get a useful stack trace (or > > other information like what's stashed away in ucontext like CPU > > registers) if this were asynchronous. > > Yes, I know. But it would be better than not getting *any* notification > except of some syscalls simply failing with -EFAULT, and not having an idea > which address was even accessed. > > Maybe the signal injection is easier than I think, but I somehow doubt it > ... Yeah I'm afraid I don't think this series is a place where I can fundamentally change how something so sensitive works in the kernel. It's espeically super sensitive because this is a uAPI change and a wrong decision here could result in guard pages being broken out the gate and I really don't want to risk that. > > -- > Cheers, > > David / dhildenb >
On Sun, Oct 20, 2024 at 07:37:54PM +0200, Florian Weimer wrote: > * Lorenzo Stoakes: > > > Early testing of the prototype version of this code suggests a 5 times > > speed up in memory mapping invocations (in conjunction with use of > > process_madvise()) and a 13% reduction in VMAs on an entirely idle android > > system and unoptimised code. > > > > We expect with optimisation and a loaded system with a larger number of > > guard pages this could significantly increase, but in any case these > > numbers are encouraging. > > > > This way, rather than having separate VMAs specifying which parts of a > > range are guard pages, instead we have a VMA spanning the entire range of > > memory a user is permitted to access and including ranges which are to be > > 'guarded'. > > > > After mapping this, a user can specify which parts of the range should > > result in a fatal signal when accessed. > > > > By restricting the ability to specify guard pages to memory mapped by > > existing VMAs, we can rely on the mappings being torn down when the > > mappings are ultimately unmapped and everything works simply as if the > > memory were not faulted in, from the point of view of the containing VMAs. > > We have a glibc (so not Android) dynamic linker bug that asks us to > remove PROT_NONE mappings in mapped shared objects: > > Extra struct vm_area_struct with ---p created when PAGE_SIZE < max-page-size > <https://sourceware.org/bugzilla/show_bug.cgi?id=31076> > > It's slightly different from a guard page because our main goal is to > avoid other mappings to end up in those gaps, which has been shown to > cause odd application behavior in cases where it happens. If I > understand the series correctly, the kernel would not automatically > attribute those PROT_NONE gaps to the previous or subsequent mapping. > We would have to extend one of the surrounding mapps and apply > MADV_POISON to that over-mapped part. That doesn't seem too onerous. > > Could the ELF loader in the kernel do the same thing for the main > executable and the program loader? Currently this implementation is only available for private anonymous memory. We may look at extending it to shmem and file-backed memory in the future but there are a whole host of things to overcome to make that work so it's one step at a time! :)
© 2016 - 2024 Red Hat, Inc.