include/linux/mm.h | 9 +++ include/linux/shmem_fs.h | 14 ----- include/linux/userfaultfd_k.h | 98 +++++++++++++++++++---------- mm/hugetlb.c | 19 ++++++ mm/shmem.c | 28 ++++++++- mm/userfaultfd.c | 115 +++++++++++++++++++++++++--------- 6 files changed, 207 insertions(+), 76 deletions(-)
[based on latest akpm/mm-new of June 27th, commit 9be7387ae43f] v2 changelog: - Patch 1 - update English in commit log [David] - move vm_uffd_ops definition to userfaultfd_k.h [Mike] - Patch 4 - fix sparse warning on bitwise type conversions [syzbot] - Commit message updates on explanation of vma_can_userfault check [James] v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com This series is an alternative proposal of what Nikita proposed here on the initial three patches: https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com This is not yet relevant to any guest-memfd support, but paving way for it. Here, the major goal is to make kernel modules be able to opt-in with any form of userfaultfd supports, like guest-memfd. This alternative option should hopefully be cleaner, and avoid leaking userfault details into vm_ops.fault(). It also means this series does not depend on anything. It's a pure refactoring of userfaultfd internals to provide a generic API, so that other types of files, especially RAM based, can support userfaultfd without touching mm/ at all. To achieve it, this series introduced a file operation called vm_uffd_ops. The ops needs to be provided when a file type supports any of userfaultfd. With that, I moved both hugetlbfs and shmem over. Hugetlbfs is still very special that it will only use partial of the vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a BUG() and so far hard-coded into core mm. But this should still be better, because at least hugetlbfs is still always involved in feature probing (e.g. where it used to not support ZEROPAGE and we have a hard-coded line to fail that, and some more). Meanwhile after this series, shmem will be completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for shmem looks like this: static const vm_uffd_ops shmem_uffd_ops = { .uffd_features = __VM_UFFD_FLAGS, .uffd_ioctls = BIT(_UFFDIO_COPY) | BIT(_UFFDIO_ZEROPAGE) | BIT(_UFFDIO_WRITEPROTECT) | BIT(_UFFDIO_CONTINUE) | BIT(_UFFDIO_POISON), .uffd_get_folio = shmem_uffd_get_folio, .uffd_copy = shmem_mfill_atomic_pte, }; As I mentioned in one of my reply to Nikita, I don't like the current interface of uffd_copy(), but this will be the minimum change version of such API to support complete extrenal-module-ready userfaultfd. Here, very minimal change will be needed from shmem side to support that. Meanwhile, the vm_uffd_ops is also not the only place one will need to provide to support userfaultfd. Normally vm_ops.fault() will also need to be updated, but that's a generic function and it'll play together with the new vm_uffd_ops to make everything fly. No functional change expected at all after the whole series applied. There might be some slightly stricter check on uffd ops here and there in the last patch, but that really shouldn't stand out anywhere to anyone. For testing: besides the cross-compilation tests, I did also try with uffd-stress in a VM to measure any perf difference before/after the change; The static call becomes a pointer now. I really cannot measure anything different, which is more or less expected. Comments welcomed, thanks. Peter Xu (4): mm: Introduce vm_uffd_ops API mm/shmem: Support vm_uffd_ops API mm/hugetlb: Support vm_uffd_ops API mm: Apply vm_uffd_ops API to core mm include/linux/mm.h | 9 +++ include/linux/shmem_fs.h | 14 ----- include/linux/userfaultfd_k.h | 98 +++++++++++++++++++---------- mm/hugetlb.c | 19 ++++++ mm/shmem.c | 28 ++++++++- mm/userfaultfd.c | 115 +++++++++++++++++++++++++--------- 6 files changed, 207 insertions(+), 76 deletions(-) -- 2.49.0
On Fri, Jun 27, 2025 at 11:46:51AM -0400, Peter Xu wrote: > [based on latest akpm/mm-new of June 27th, commit 9be7387ae43f] > > v2 changelog: > - Patch 1 > - update English in commit log [David] > - move vm_uffd_ops definition to userfaultfd_k.h [Mike] > - Patch 4 > - fix sparse warning on bitwise type conversions [syzbot] > - Commit message updates on explanation of vma_can_userfault check [James] > > v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com > > This series is an alternative proposal of what Nikita proposed here on the > initial three patches: > > https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com > > This is not yet relevant to any guest-memfd support, but paving way for it. > Here, the major goal is to make kernel modules be able to opt-in with any > form of userfaultfd supports, like guest-memfd. This alternative option > should hopefully be cleaner, and avoid leaking userfault details into > vm_ops.fault(). > > It also means this series does not depend on anything. It's a pure > refactoring of userfaultfd internals to provide a generic API, so that > other types of files, especially RAM based, can support userfaultfd without > touching mm/ at all. I'm very concerned that this change will simply move core mm functionality out of mm and into drivers where it can bitrot and cause subtle bugs? You're proposing providing stuff like page table state and asking for a folio back from a driver etc. I absolutely am not in favour of us providing core mm internals like this to drivers, and I don't want to see us having to EXPORT() mm internals just to make module-ised uffd code work (I mean I just will flat out refuse to do that). I think we need to think _very_ carefully about how we do this. I also feel like this series is at a really basic level and you've not fully determined what API calls you need. I agree that it's sensible to be incremental, but I feel like you sort of need to somewhat prove the case that you can jump from 'incremental version where we only support code in mm/' to supporting arbitrary file system code that might be modules. Because otherwise you're basically _guessing_ that you can do this, possibly, in the future and maybe it's just not the right approach but that's not clear yet? > > To achieve it, this series introduced a file operation called vm_uffd_ops. > The ops needs to be provided when a file type supports any of userfaultfd. > > With that, I moved both hugetlbfs and shmem over. Well as you say below hugetlbfs is sort of a stub implementation, I wonder whether we'd need quite a bit more to make that work. One thing I'd _really_ like to avoid is us having to add a bunch of hook points into core mm code just for uffd that then call out to some driver. We've encountered such a total nightmare with .mmap() for instance in the past (including stuff that resulted in security issues) because we - simply cannot assume anything - about what the hook implementor might do with the passed parameters. This is really really problematic. I also absolutely hate the: if (uffd) do_something_weird(); Pattern, so hopefully this won't proliferate that. > > Hugetlbfs is still very special that it will only use partial of the > vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a > BUG() and so far hard-coded into core mm. But this should still be better, > because at least hugetlbfs is still always involved in feature probing > (e.g. where it used to not support ZEROPAGE and we have a hard-coded line > to fail that, and some more). Meanwhile after this series, shmem will be > completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for > shmem looks like this: > > static const vm_uffd_ops shmem_uffd_ops = { > .uffd_features = __VM_UFFD_FLAGS, > .uffd_ioctls = BIT(_UFFDIO_COPY) | > BIT(_UFFDIO_ZEROPAGE) | > BIT(_UFFDIO_WRITEPROTECT) | > BIT(_UFFDIO_CONTINUE) | > BIT(_UFFDIO_POISON), > .uffd_get_folio = shmem_uffd_get_folio, > .uffd_copy = shmem_mfill_atomic_pte, > }; > > As I mentioned in one of my reply to Nikita, I don't like the current > interface of uffd_copy(), but this will be the minimum change version of > such API to support complete extrenal-module-ready userfaultfd. Here, very > minimal change will be needed from shmem side to support that. Right, maybe a better version of this interface might address some of my concerns... :) > > Meanwhile, the vm_uffd_ops is also not the only place one will need to > provide to support userfaultfd. Normally vm_ops.fault() will also need to > be updated, but that's a generic function and it'll play together with the > new vm_uffd_ops to make everything fly. > > No functional change expected at all after the whole series applied. There > might be some slightly stricter check on uffd ops here and there in the > last patch, but that really shouldn't stand out anywhere to anyone. > > For testing: besides the cross-compilation tests, I did also try with > uffd-stress in a VM to measure any perf difference before/after the change; > The static call becomes a pointer now. I really cannot measure anything > different, which is more or less expected. > > Comments welcomed, thanks. > > Peter Xu (4): > mm: Introduce vm_uffd_ops API > mm/shmem: Support vm_uffd_ops API > mm/hugetlb: Support vm_uffd_ops API > mm: Apply vm_uffd_ops API to core mm > > include/linux/mm.h | 9 +++ > include/linux/shmem_fs.h | 14 ----- > include/linux/userfaultfd_k.h | 98 +++++++++++++++++++---------- > mm/hugetlb.c | 19 ++++++ > mm/shmem.c | 28 ++++++++- > mm/userfaultfd.c | 115 +++++++++++++++++++++++++--------- > 6 files changed, 207 insertions(+), 76 deletions(-) > > -- > 2.49.0 > Sorry to be critical, I just want to make sure we're not setting ourselves up for trouble here. I _very much_ support efforts to make uffd more generalised, and ideally to find a way to separate out shmem and hugetlbfs implementation bits, so I support the intent _fully_. I just want to make sure we do it in a safe way :) Cheers, Lorenzo
On Mon, Jun 30, 2025 at 11:29:30AM +0100, Lorenzo Stoakes wrote: > On Fri, Jun 27, 2025 at 11:46:51AM -0400, Peter Xu wrote: > > [based on latest akpm/mm-new of June 27th, commit 9be7387ae43f] > > > > v2 changelog: > > - Patch 1 > > - update English in commit log [David] > > - move vm_uffd_ops definition to userfaultfd_k.h [Mike] > > - Patch 4 > > - fix sparse warning on bitwise type conversions [syzbot] > > - Commit message updates on explanation of vma_can_userfault check [James] > > > > v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com > > > > This series is an alternative proposal of what Nikita proposed here on the > > initial three patches: > > > > https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com > > > > This is not yet relevant to any guest-memfd support, but paving way for it. > > Here, the major goal is to make kernel modules be able to opt-in with any > > form of userfaultfd supports, like guest-memfd. This alternative option > > should hopefully be cleaner, and avoid leaking userfault details into > > vm_ops.fault(). > > > > It also means this series does not depend on anything. It's a pure > > refactoring of userfaultfd internals to provide a generic API, so that > > other types of files, especially RAM based, can support userfaultfd without > > touching mm/ at all. > > I'm very concerned that this change will simply move core mm functionality out > of mm and into drivers where it can bitrot and cause subtle bugs? > > You're proposing providing stuff like page table state and asking for a folio > back from a driver etc. > > I absolutely am not in favour of us providing core mm internals like this to > drivers, and I don't want to see us having to EXPORT() mm internals just to make > module-ised uffd code work (I mean I just will flat out refuse to do that). > > I think we need to think _very_ carefully about how we do this. > > I also feel like this series is at a really basic level and you've not fully > determined what API calls you need. See: https://lore.kernel.org/all/aGWVIjmmsmskA4bp@x1.local/#t > > I agree that it's sensible to be incremental, but I feel like you sort of need > to somewhat prove the case that you can jump from 'incremental version where we > only support code in mm/' to supporting arbitrary file system code that might be > modules. > > Because otherwise you're basically _guessing_ that you can do this, possibly, in > the future and maybe it's just not the right approach but that's not clear yet? Did you follow up with the discussions in v1? I copied you too. https://lore.kernel.org/r/114133f5-0282-463d-9d65-3143aa658806@amazon.com Would Nikita's work help here? Could you explain what are you asking for to prove that this works for us? > > > > > To achieve it, this series introduced a file operation called vm_uffd_ops. > > The ops needs to be provided when a file type supports any of userfaultfd. > > > > With that, I moved both hugetlbfs and shmem over. > > Well as you say below hugetlbfs is sort of a stub implementation, I wonder > whether we'd need quite a bit more to make that work. > > One thing I'd _really_ like to avoid is us having to add a bunch of hook points > into core mm code just for uffd that then call out to some driver. > > We've encountered such a total nightmare with .mmap() for instance in the past > (including stuff that resulted in security issues) because we - simply cannot > assume anything - about what the hook implementor might do with the passed > parameters. > > This is really really problematic. > > I also absolutely hate the: > > if (uffd) > do_something_weird(); > > Pattern, so hopefully this won't proliferate that. > > > > > Hugetlbfs is still very special that it will only use partial of the > > vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a > > BUG() and so far hard-coded into core mm. But this should still be better, > > because at least hugetlbfs is still always involved in feature probing > > (e.g. where it used to not support ZEROPAGE and we have a hard-coded line > > to fail that, and some more). Meanwhile after this series, shmem will be > > completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for > > shmem looks like this: > > > > static const vm_uffd_ops shmem_uffd_ops = { > > .uffd_features = __VM_UFFD_FLAGS, > > .uffd_ioctls = BIT(_UFFDIO_COPY) | > > BIT(_UFFDIO_ZEROPAGE) | > > BIT(_UFFDIO_WRITEPROTECT) | > > BIT(_UFFDIO_CONTINUE) | > > BIT(_UFFDIO_POISON), > > .uffd_get_folio = shmem_uffd_get_folio, > > .uffd_copy = shmem_mfill_atomic_pte, > > }; > > > > As I mentioned in one of my reply to Nikita, I don't like the current > > interface of uffd_copy(), but this will be the minimum change version of > > such API to support complete extrenal-module-ready userfaultfd. Here, very > > minimal change will be needed from shmem side to support that. > > Right, maybe a better version of this interface might address some of my > concerns... :) > > > > > Meanwhile, the vm_uffd_ops is also not the only place one will need to > > provide to support userfaultfd. Normally vm_ops.fault() will also need to > > be updated, but that's a generic function and it'll play together with the > > new vm_uffd_ops to make everything fly. > > > > No functional change expected at all after the whole series applied. There > > might be some slightly stricter check on uffd ops here and there in the > > last patch, but that really shouldn't stand out anywhere to anyone. > > > > For testing: besides the cross-compilation tests, I did also try with > > uffd-stress in a VM to measure any perf difference before/after the change; > > The static call becomes a pointer now. I really cannot measure anything > > different, which is more or less expected. > > > > Comments welcomed, thanks. > > > > Peter Xu (4): > > mm: Introduce vm_uffd_ops API > > mm/shmem: Support vm_uffd_ops API > > mm/hugetlb: Support vm_uffd_ops API > > mm: Apply vm_uffd_ops API to core mm > > > > include/linux/mm.h | 9 +++ > > include/linux/shmem_fs.h | 14 ----- > > include/linux/userfaultfd_k.h | 98 +++++++++++++++++++---------- > > mm/hugetlb.c | 19 ++++++ > > mm/shmem.c | 28 ++++++++- > > mm/userfaultfd.c | 115 +++++++++++++++++++++++++--------- > > 6 files changed, 207 insertions(+), 76 deletions(-) > > > > -- > > 2.49.0 > > > > Sorry to be critical, I just want to make sure we're not setting ourselves up > for trouble here. > > I _very much_ support efforts to make uffd more generalised, and ideally to find > a way to separate out shmem and hugetlbfs implementation bits, so I support the > intent _fully_. > > I just want to make sure we do it in a safe way :) Any explicit suggestions (besides objections)? Thanks, -- Peter Xu
Peter you've ignored a large part of what I've said here, I'm not sure this is productive. On Wed, Jul 02, 2025 at 04:36:31PM -0400, Peter Xu wrote: > On Mon, Jun 30, 2025 at 11:29:30AM +0100, Lorenzo Stoakes wrote: > > On Fri, Jun 27, 2025 at 11:46:51AM -0400, Peter Xu wrote: > > > [based on latest akpm/mm-new of June 27th, commit 9be7387ae43f] > > > > > > v2 changelog: > > > - Patch 1 > > > - update English in commit log [David] > > > - move vm_uffd_ops definition to userfaultfd_k.h [Mike] > > > - Patch 4 > > > - fix sparse warning on bitwise type conversions [syzbot] > > > - Commit message updates on explanation of vma_can_userfault check [James] > > > > > > v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com > > > > > > This series is an alternative proposal of what Nikita proposed here on the > > > initial three patches: > > > > > > https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com > > > > > > This is not yet relevant to any guest-memfd support, but paving way for it. > > > Here, the major goal is to make kernel modules be able to opt-in with any > > > form of userfaultfd supports, like guest-memfd. This alternative option > > > should hopefully be cleaner, and avoid leaking userfault details into > > > vm_ops.fault(). > > > > > > It also means this series does not depend on anything. It's a pure > > > refactoring of userfaultfd internals to provide a generic API, so that > > > other types of files, especially RAM based, can support userfaultfd without > > > touching mm/ at all. > > > > I'm very concerned that this change will simply move core mm functionality out > > of mm and into drivers where it can bitrot and cause subtle bugs? > > > > You're proposing providing stuff like page table state and asking for a folio > > back from a driver etc. > > > > I absolutely am not in favour of us providing core mm internals like this to > > drivers, and I don't want to see us having to EXPORT() mm internals just to make > > module-ised uffd code work (I mean I just will flat out refuse to do that). > > > > I think we need to think _very_ carefully about how we do this. > > > > I also feel like this series is at a really basic level and you've not fully > > determined what API calls you need. > > See: > > https://lore.kernel.org/all/aGWVIjmmsmskA4bp@x1.local/#t OK so it seems the point you're making there is - there's a lot of complexity - and you'd rather not think about it for now? My concern is that in _actuality_ you'll need to do a _lot_ more and expose a _lot_ more mm internals to achieve what you need to. I am happy for the API to be internal-to-mm-only. My concerns are really simple: - exposing page table manipulation outside of mm is something I just cannot accept us doing for reasons I already mentioned and also Liam - we should absolutely minimise how much we do expose - we mustn't have hooks that don't allow us to make sensible decisions in core mm code. I think overall I'm just not in favour of us having uffd functionality in modules, I think it's an abstraction mismatch. Now if you instead had something like: enum uffd_minor_fault_handler_type { UFFD_MINOR_FAULT_DEFAULT_HANDLER, UFFD_MINOR_FAULT_FOO_HANDLER, UFFD_MINOR_FAULT_BAR_HANDLER, etc. }; And the module could _choose_ which should be used then that's far far better. Really - hermetically seal the sensitive parts but allow module choice. You could find ways to do this in a more sophisticated way too by e.g. having a driver callback that provides a config struct that sets things up. If we're going 'simple first' and can 'change what we want' why not do it like this to start? > > > > > I agree that it's sensible to be incremental, but I feel like you sort of need > > to somewhat prove the case that you can jump from 'incremental version where we > > only support code in mm/' to supporting arbitrary file system code that might be > > modules. > > > > Because otherwise you're basically _guessing_ that you can do this, possibly, in > > the future and maybe it's just not the right approach but that's not clear yet? > > Did you follow up with the discussions in v1? I copied you too. > > https://lore.kernel.org/r/114133f5-0282-463d-9d65-3143aa658806@amazon.com > > Would Nikita's work help here? Could you explain what are you asking for > to prove that this works for us? Yeah thanks this seems like it actually implements useful functionality. The point is the API seems currently to be a stub so doesn't really give us much to go on as to what might ultimately be required. You say yourself in the series that you'll likely need to expand things and you already have todos to this effect. > > > > > > > > > To achieve it, this series introduced a file operation called vm_uffd_ops. > > > The ops needs to be provided when a file type supports any of userfaultfd. > > > > > > With that, I moved both hugetlbfs and shmem over. > > > > Well as you say below hugetlbfs is sort of a stub implementation, I wonder > > whether we'd need quite a bit more to make that work. > > > > One thing I'd _really_ like to avoid is us having to add a bunch of hook points > > into core mm code just for uffd that then call out to some driver. > > > > We've encountered such a total nightmare with .mmap() for instance in the past > > (including stuff that resulted in security issues) because we - simply cannot > > assume anything - about what the hook implementor might do with the passed > > parameters. > > > > This is really really problematic. > > > > I also absolutely hate the: > > > > if (uffd) > > do_something_weird(); > > > > Pattern, so hopefully this won't proliferate that. ^ you didn't respond to this. > > > > > > > > Hugetlbfs is still very special that it will only use partial of the > > > vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a > > > BUG() and so far hard-coded into core mm. But this should still be better, > > > because at least hugetlbfs is still always involved in feature probing > > > (e.g. where it used to not support ZEROPAGE and we have a hard-coded line > > > to fail that, and some more). Meanwhile after this series, shmem will be > > > completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for > > > shmem looks like this: > > > > > > static const vm_uffd_ops shmem_uffd_ops = { > > > .uffd_features = __VM_UFFD_FLAGS, > > > .uffd_ioctls = BIT(_UFFDIO_COPY) | > > > BIT(_UFFDIO_ZEROPAGE) | > > > BIT(_UFFDIO_WRITEPROTECT) | > > > BIT(_UFFDIO_CONTINUE) | > > > BIT(_UFFDIO_POISON), > > > .uffd_get_folio = shmem_uffd_get_folio, > > > .uffd_copy = shmem_mfill_atomic_pte, > > > }; > > > > > > As I mentioned in one of my reply to Nikita, I don't like the current > > > interface of uffd_copy(), but this will be the minimum change version of > > > such API to support complete extrenal-module-ready userfaultfd. Here, very > > > minimal change will be needed from shmem side to support that. > > > > Right, maybe a better version of this interface might address some of my > > concerns... :) > > > > > > > > Meanwhile, the vm_uffd_ops is also not the only place one will need to > > > provide to support userfaultfd. Normally vm_ops.fault() will also need to > > > be updated, but that's a generic function and it'll play together with the > > > new vm_uffd_ops to make everything fly. > > > > > > No functional change expected at all after the whole series applied. There > > > might be some slightly stricter check on uffd ops here and there in the > > > last patch, but that really shouldn't stand out anywhere to anyone. > > > > > > For testing: besides the cross-compilation tests, I did also try with > > > uffd-stress in a VM to measure any perf difference before/after the change; > > > The static call becomes a pointer now. I really cannot measure anything > > > different, which is more or less expected. > > > > > > Comments welcomed, thanks. > > > > > > Peter Xu (4): > > > mm: Introduce vm_uffd_ops API > > > mm/shmem: Support vm_uffd_ops API > > > mm/hugetlb: Support vm_uffd_ops API > > > mm: Apply vm_uffd_ops API to core mm > > > > > > include/linux/mm.h | 9 +++ > > > include/linux/shmem_fs.h | 14 ----- > > > include/linux/userfaultfd_k.h | 98 +++++++++++++++++++---------- > > > mm/hugetlb.c | 19 ++++++ > > > mm/shmem.c | 28 ++++++++- > > > mm/userfaultfd.c | 115 +++++++++++++++++++++++++--------- > > > 6 files changed, 207 insertions(+), 76 deletions(-) > > > > > > -- > > > 2.49.0 > > > > > > > Sorry to be critical, I just want to make sure we're not setting ourselves up > > for trouble here. > > > > I _very much_ support efforts to make uffd more generalised, and ideally to find > > a way to separate out shmem and hugetlbfs implementation bits, so I support the > > intent _fully_. > > > > I just want to make sure we do it in a safe way :) > > Any explicit suggestions (besides objections)? I gave some above. I'm fundamentally opposed to us exposing page table manipulation or really any state subject to sensitive locks to modules. Cheers, Lorenzo
On Thu, Jul 03, 2025 at 04:55:01PM +0100, Lorenzo Stoakes wrote: > Peter you've ignored a large part of what I've said here, I'm not sure this is > productive. > > On Wed, Jul 02, 2025 at 04:36:31PM -0400, Peter Xu wrote: > > On Mon, Jun 30, 2025 at 11:29:30AM +0100, Lorenzo Stoakes wrote: > > > On Fri, Jun 27, 2025 at 11:46:51AM -0400, Peter Xu wrote: > > > > [based on latest akpm/mm-new of June 27th, commit 9be7387ae43f] > > > > > > > > v2 changelog: > > > > - Patch 1 > > > > - update English in commit log [David] > > > > - move vm_uffd_ops definition to userfaultfd_k.h [Mike] > > > > - Patch 4 > > > > - fix sparse warning on bitwise type conversions [syzbot] > > > > - Commit message updates on explanation of vma_can_userfault check [James] > > > > > > > > v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com > > > > > > > > This series is an alternative proposal of what Nikita proposed here on the > > > > initial three patches: > > > > > > > > https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com > > > > > > > > This is not yet relevant to any guest-memfd support, but paving way for it. > > > > Here, the major goal is to make kernel modules be able to opt-in with any > > > > form of userfaultfd supports, like guest-memfd. This alternative option > > > > should hopefully be cleaner, and avoid leaking userfault details into > > > > vm_ops.fault(). > > > > > > > > It also means this series does not depend on anything. It's a pure > > > > refactoring of userfaultfd internals to provide a generic API, so that > > > > other types of files, especially RAM based, can support userfaultfd without > > > > touching mm/ at all. > > > > > > I'm very concerned that this change will simply move core mm functionality out > > > of mm and into drivers where it can bitrot and cause subtle bugs? > > > > > > You're proposing providing stuff like page table state and asking for a folio > > > back from a driver etc. > > > > > > I absolutely am not in favour of us providing core mm internals like this to > > > drivers, and I don't want to see us having to EXPORT() mm internals just to make > > > module-ised uffd code work (I mean I just will flat out refuse to do that). > > > > > > I think we need to think _very_ carefully about how we do this. > > > > > > I also feel like this series is at a really basic level and you've not fully > > > determined what API calls you need. > > > > See: > > > > https://lore.kernel.org/all/aGWVIjmmsmskA4bp@x1.local/#t > > OK so it seems the point you're making there is - there's a lot of complexity - > and you'd rather not think about it for now? This is not a fair judgement. I think I have provided my thought process and how I made the decision to come up with this series. > > My concern is that in _actuality_ you'll need to do a _lot_ more and expose a > _lot_ more mm internals to achieve what you need to. > > I am happy for the API to be internal-to-mm-only. > > My concerns are really simple: > > - exposing page table manipulation outside of mm is something I just cannot > accept us doing for reasons I already mentioned and also Liam > > - we should absolutely minimise how much we do expose > > - we mustn't have hooks that don't allow us to make sensible decisions in core > mm code. > > I think overall I'm just not in favour of us having uffd functionality in > modules, I think it's an abstraction mismatch. > > Now if you instead had something like: > > enum uffd_minor_fault_handler_type { > UFFD_MINOR_FAULT_DEFAULT_HANDLER, > UFFD_MINOR_FAULT_FOO_HANDLER, > UFFD_MINOR_FAULT_BAR_HANDLER, > etc. > }; > > And the module could _choose_ which should be used then that's far far better. > > Really - hermetically seal the sensitive parts but allow module choice. > > You could find ways to do this in a more sophisticated way too by e.g. having a > driver callback that provides a config struct that sets things up. > > If we're going 'simple first' and can 'change what we want' why not do it like > this to start? Could you help to define UFFD_MINOR_FAULT_FOO_HANDLER for guest-memfd, and how that handler would be able to hook to the guest-memfd driver? The kernel needs to build with/without CONFIG_KVM. What about MISSING? Do you plan to support missing in the proposal you mentioned? > > > > > > > > > I agree that it's sensible to be incremental, but I feel like you sort of need > > > to somewhat prove the case that you can jump from 'incremental version where we > > > only support code in mm/' to supporting arbitrary file system code that might be > > > modules. > > > > > > Because otherwise you're basically _guessing_ that you can do this, possibly, in > > > the future and maybe it's just not the right approach but that's not clear yet? > > > > Did you follow up with the discussions in v1? I copied you too. > > > > https://lore.kernel.org/r/114133f5-0282-463d-9d65-3143aa658806@amazon.com > > > > Would Nikita's work help here? Could you explain what are you asking for > > to prove that this works for us? > > Yeah thanks this seems like it actually implements useful functionality. The > point is the API seems currently to be a stub so doesn't really give us much to > go on as to what might ultimately be required. > > You say yourself in the series that you'll likely need to expand things and you > already have todos to this effect. > > > > > > > > > > > > > > To achieve it, this series introduced a file operation called vm_uffd_ops. > > > > The ops needs to be provided when a file type supports any of userfaultfd. > > > > > > > > With that, I moved both hugetlbfs and shmem over. > > > > > > Well as you say below hugetlbfs is sort of a stub implementation, I wonder > > > whether we'd need quite a bit more to make that work. > > > > > > One thing I'd _really_ like to avoid is us having to add a bunch of hook points > > > into core mm code just for uffd that then call out to some driver. > > > > > > We've encountered such a total nightmare with .mmap() for instance in the past > > > (including stuff that resulted in security issues) because we - simply cannot > > > assume anything - about what the hook implementor might do with the passed > > > parameters. > > > > > > This is really really problematic. > > > > > > I also absolutely hate the: > > > > > > if (uffd) > > > do_something_weird(); > > > > > > Pattern, so hopefully this won't proliferate that. > > ^ you didn't respond to this. Can you elaborate? I don't understand how this is attached to this series. AFAIU, the series is trying to remove some "if()s", not adding. > > > > > > > > > > > > Hugetlbfs is still very special that it will only use partial of the > > > > vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a > > > > BUG() and so far hard-coded into core mm. But this should still be better, > > > > because at least hugetlbfs is still always involved in feature probing > > > > (e.g. where it used to not support ZEROPAGE and we have a hard-coded line > > > > to fail that, and some more). Meanwhile after this series, shmem will be > > > > completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for > > > > shmem looks like this: > > > > > > > > static const vm_uffd_ops shmem_uffd_ops = { > > > > .uffd_features = __VM_UFFD_FLAGS, > > > > .uffd_ioctls = BIT(_UFFDIO_COPY) | > > > > BIT(_UFFDIO_ZEROPAGE) | > > > > BIT(_UFFDIO_WRITEPROTECT) | > > > > BIT(_UFFDIO_CONTINUE) | > > > > BIT(_UFFDIO_POISON), > > > > .uffd_get_folio = shmem_uffd_get_folio, > > > > .uffd_copy = shmem_mfill_atomic_pte, > > > > }; > > > > > > > > As I mentioned in one of my reply to Nikita, I don't like the current > > > > interface of uffd_copy(), but this will be the minimum change version of > > > > such API to support complete extrenal-module-ready userfaultfd. Here, very > > > > minimal change will be needed from shmem side to support that. > > > > > > Right, maybe a better version of this interface might address some of my > > > concerns... :) > > > > > > > > > > > Meanwhile, the vm_uffd_ops is also not the only place one will need to > > > > provide to support userfaultfd. Normally vm_ops.fault() will also need to > > > > be updated, but that's a generic function and it'll play together with the > > > > new vm_uffd_ops to make everything fly. > > > > > > > > No functional change expected at all after the whole series applied. There > > > > might be some slightly stricter check on uffd ops here and there in the > > > > last patch, but that really shouldn't stand out anywhere to anyone. > > > > > > > > For testing: besides the cross-compilation tests, I did also try with > > > > uffd-stress in a VM to measure any perf difference before/after the change; > > > > The static call becomes a pointer now. I really cannot measure anything > > > > different, which is more or less expected. > > > > > > > > Comments welcomed, thanks. > > > > > > > > Peter Xu (4): > > > > mm: Introduce vm_uffd_ops API > > > > mm/shmem: Support vm_uffd_ops API > > > > mm/hugetlb: Support vm_uffd_ops API > > > > mm: Apply vm_uffd_ops API to core mm > > > > > > > > include/linux/mm.h | 9 +++ > > > > include/linux/shmem_fs.h | 14 ----- > > > > include/linux/userfaultfd_k.h | 98 +++++++++++++++++++---------- > > > > mm/hugetlb.c | 19 ++++++ > > > > mm/shmem.c | 28 ++++++++- > > > > mm/userfaultfd.c | 115 +++++++++++++++++++++++++--------- > > > > 6 files changed, 207 insertions(+), 76 deletions(-) > > > > > > > > -- > > > > 2.49.0 > > > > > > > > > > Sorry to be critical, I just want to make sure we're not setting ourselves up > > > for trouble here. > > > > > > I _very much_ support efforts to make uffd more generalised, and ideally to find > > > a way to separate out shmem and hugetlbfs implementation bits, so I support the > > > intent _fully_. > > > > > > I just want to make sure we do it in a safe way :) > > > > Any explicit suggestions (besides objections)? > > I gave some above. > > I'm fundamentally opposed to us exposing page table manipulation or really any > state subject to sensitive locks to modules. > > Cheers, Lorenzo > -- Peter Xu
On Thu, Jul 03, 2025 at 12:26:17PM -0400, Peter Xu wrote: > On Thu, Jul 03, 2025 at 04:55:01PM +0100, Lorenzo Stoakes wrote: > > > > I'm very concerned that this change will simply move core mm functionality out > > > > of mm and into drivers where it can bitrot and cause subtle bugs? > > > > > > > > You're proposing providing stuff like page table state and asking for a folio > > > > back from a driver etc. > > > > > > > > I absolutely am not in favour of us providing core mm internals like this to > > > > drivers, and I don't want to see us having to EXPORT() mm internals just to make > > > > module-ised uffd code work (I mean I just will flat out refuse to do that). > > > > > > > > I think we need to think _very_ carefully about how we do this. > > > > > > > > I also feel like this series is at a really basic level and you've not fully > > > > determined what API calls you need. > > > > > > See: > > > > > > https://lore.kernel.org/all/aGWVIjmmsmskA4bp@x1.local/#t > > > > OK so it seems the point you're making there is - there's a lot of complexity - > > and you'd rather not think about it for now? > > This is not a fair judgement. I think I have provided my thought process > and how I made the decision to come up with this series. Sorry Peter I am not trying to suggest you've not thought this through, that's not it at all. Maybe I'm phrasing this badly. What I mean to say is - you've implemented effectively a minimum viable interface, and my concern is that it doesn't fully express what you're going to need to actually implement this in reality. And my interpretation of what you say in the patches is that you are basically stating this. But happy to be corrected! > > > > > My concern is that in _actuality_ you'll need to do a _lot_ more and expose a > > _lot_ more mm internals to achieve what you need to. > > > > I am happy for the API to be internal-to-mm-only. > > > > My concerns are really simple: > > > > - exposing page table manipulation outside of mm is something I just cannot > > accept us doing for reasons I already mentioned and also Liam > > > > - we should absolutely minimise how much we do expose > > > > - we mustn't have hooks that don't allow us to make sensible decisions in core > > mm code. > > > > I think overall I'm just not in favour of us having uffd functionality in > > modules, I think it's an abstraction mismatch. > > > > Now if you instead had something like: > > > > enum uffd_minor_fault_handler_type { > > UFFD_MINOR_FAULT_DEFAULT_HANDLER, > > UFFD_MINOR_FAULT_FOO_HANDLER, > > UFFD_MINOR_FAULT_BAR_HANDLER, > > etc. > > }; > > > > And the module could _choose_ which should be used then that's far far better. > > > > Really - hermetically seal the sensitive parts but allow module choice. > > > > You could find ways to do this in a more sophisticated way too by e.g. having a > > driver callback that provides a config struct that sets things up. > > > > If we're going 'simple first' and can 'change what we want' why not do it like > > this to start? > > Could you help to define UFFD_MINOR_FAULT_FOO_HANDLER for guest-memfd, and > how that handler would be able to hook to the guest-memfd driver? The > kernel needs to build with/without CONFIG_KVM. > > What about MISSING? Do you plan to support missing in the proposal you > mentioned? I'm simply providing a vague hand wavey notion of something that doesn't expose mm internals. I would have thought you could figure out ways of doing this kind of thing, or providing some minimal and safely controlled hook for the different modes? It seems odd that on the one hand you're ok with providing something imcomplete - but which exposes mm internals here - but then require an entirely complete solution for an alternative proposal. Exposing mm internals to drivers is just a no-go. On the other hand, providing internals for -mm code only- is fine. If the purpose of the series was changed to expose a simplified interface, that could then _call into_ mm code that used an internal-mm API that'd be a way forward also. > > > > Well as you say below hugetlbfs is sort of a stub implementation, I wonder > > > > whether we'd need quite a bit more to make that work. > > > > > > > > One thing I'd _really_ like to avoid is us having to add a bunch of hook points > > > > into core mm code just for uffd that then call out to some driver. > > > > > > > > We've encountered such a total nightmare with .mmap() for instance in the past > > > > (including stuff that resulted in security issues) because we - simply cannot > > > > assume anything - about what the hook implementor might do with the passed > > > > parameters. > > > > > > > > This is really really problematic. > > > > > > > > I also absolutely hate the: > > > > > > > > if (uffd) > > > > do_something_weird(); > > > > > > > > Pattern, so hopefully this won't proliferate that. > > > > ^ you didn't respond to this. > > Can you elaborate? I don't understand how this is attached to this series. > AFAIU, the series is trying to remove some "if()s", not adding. What I'm saying is that is a _very good_ thing! I was just raising the point that us doing anything to address that is positive. But it can't be at the cost of the issues raised. Cheers, Lorenzo
On Mon, 30 Jun 2025 11:29:30 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > It also means this series does not depend on anything. It's a pure > > refactoring of userfaultfd internals to provide a generic API, so that > > other types of files, especially RAM based, can support userfaultfd without > > touching mm/ at all. > > I'm very concerned that this change will simply move core mm functionality out > of mm and into drivers where it can bitrot and cause subtle bugs? > > You're proposing providing stuff like page table state and asking for a folio > back from a driver etc. > > I absolutely am not in favour of us providing core mm internals like this to > drivers, and I don't want to see us having to EXPORT() mm internals just to make > module-ised uffd code work (I mean I just will flat out refuse to do that). > > I think we need to think _very_ carefully about how we do this. > > I also feel like this series is at a really basic level and you've not fully > determined what API calls you need. > > I agree that it's sensible to be incremental, but I feel like you sort of need > to somewhat prove the case that you can jump from 'incremental version where we > only support code in mm/' to supporting arbitrary file system code that might be > modules. > > Because otherwise you're basically _guessing_ that you can do this, possibly, in > the future and maybe it's just not the right approach but that's not clear yet? Thanks, this is pretty fundamental stuff so I'll push this series back into mm-new while we think about it. Soon, please - I don't want people to be basing new work on something which might go away,
© 2016 - 2025 Red Hat, Inc.