The new /dev/userfaultfd handle is superior to the system call with a
better permission control and also works for a restricted seccomp
environment.

The new device was only introduced in v6.1 so we need a header update.

Please have a look, thanks.

Peter Xu (3):
  linux-headers: Update to v6.1
  util/userfaultfd: Add uffd_open()
  util/userfaultfd: Support /dev/userfaultfd

 include/qemu/userfaultfd.h                   |   1 +
 include/standard-headers/drm/drm_fourcc.h    |  34 ++++-
 include/standard-headers/linux/ethtool.h     |  63 +++++++-
 include/standard-headers/linux/fuse.h        |   6 +-
 .../linux/input-event-codes.h                |   1 +
 include/standard-headers/linux/virtio_blk.h  |  19 +++
 linux-headers/asm-generic/hugetlb_encode.h   |  26 ++--
 linux-headers/asm-generic/mman-common.h      |   2 +
 linux-headers/asm-mips/mman.h                |   2 +
 linux-headers/asm-riscv/kvm.h                |   4 +
 linux-headers/linux/kvm.h                    |   1 +
 linux-headers/linux/psci.h                   |  14 ++
 linux-headers/linux/userfaultfd.h            |   4 +
 linux-headers/linux/vfio.h                   | 142 ++++++++++++++++++
 migration/postcopy-ram.c                     |  11 +-
 tests/qtest/migration-test.c                 |   3 +-
 util/trace-events                            |   1 +
 util/userfaultfd.c                           |  49 +++++-
 18 files changed, 354 insertions(+), 29 deletions(-)

--
2.37.3
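For readers following along, here is a minimal sketch of the idea the cover
letter describes: try the /dev/userfaultfd device first and fall back to the
system call. It is an illustration only, not the code from this series. The
name uffd_open_sketch() is hypothetical, and the fallback #define assumes the
v6.1 UAPI values for builds against older headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Headers older than v6.1 lack the ioctl; assumed to match the v6.1 UAPI. */
#ifndef USERFAULTFD_IOC_NEW
#define USERFAULTFD_IOC 0xAA
#define USERFAULTFD_IOC_NEW _IO(USERFAULTFD_IOC, 0x00)
#endif

/*
 * Hypothetical helper (not the QEMU implementation).
 * Returns a userfaultfd descriptor, or -1 with errno set.
 */
int uffd_open_sketch(int flags)
{
    int dev_fd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);

    if (dev_fd >= 0) {
        /* The ioctl creates a uffd bound to the calling process. */
        int uffd = ioctl(dev_fd, USERFAULTFD_IOC_NEW, (unsigned long)flags);
        int saved_errno = errno;

        close(dev_fd);          /* the device fd is no longer needed */
        errno = saved_errno;    /* keep the ioctl's error, not close()'s */
        return uffd;
    }

    /* Device missing or not accessible: fall back to the system call. */
    return (int)syscall(__NR_userfaultfd, flags);
}
```

A caller would pass the usual flags, for example
uffd_open_sketch(O_CLOEXEC | O_NONBLOCK), and check for a negative return.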
On 1/25/23 23:40, Peter Xu wrote:
> The new /dev/userfaultfd handle is superior to the system call with a
> better permission control and also works for a restricted seccomp
> environment.
>
> The new device was only introduced in v6.1 so we need a header update.
>
> Please have a look, thanks.

I was wondering whether it would make sense/be possible for mgmt app
(libvirt) to pass FD for /dev/userfaultfd instead of QEMU opening it
itself. But looking into the code, libvirt would need to do that when
spawning QEMU because that's when QEMU itself initializes internal state
and queries userfaultfd caps.

Michal
* Michal Prívozník (mprivozn@redhat.com) wrote:
> I was wondering whether it would make sense/be possible for mgmt app
> (libvirt) to pass FD for /dev/userfaultfd instead of QEMU opening it
> itself. But looking into the code, libvirt would need to do that when
> spawning QEMU because that's when QEMU itself initializes internal state
> and queries userfaultfd caps.

You also have to be careful about what the userfaultfd semantics are; I
can't remember them - but if you open it in one process and pass it to
another process, which process's address space are you trying to
monitor?

Dave

> Michal
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Thu, Jan 26, 2023 at 02:15:11PM +0000, Dr. David Alan Gilbert wrote:
> You also have to be careful about what the userfaultfd semantics are; I
> can't remember them - but if you open it in one process and pass it to
> another process, which process's address space are you trying to
> monitor?

Yes it's a problem. The kernel always fetches the current mm_struct* which
represents the current context of virtual address space when creating the
uffd handle (for either the syscall or the ioctl() approach).

It works only if Libvirt invokes QEMU as a thread, so that they share the
same address space.

Why would libvirt like to do so?

Thanks,

--
Peter Xu
On Thu, Jan 26, 2023 at 10:25:05AM -0500, Peter Xu wrote:
> Yes it's a problem. The kernel always fetches the current mm_struct* which
> represents the current context of virtual address space when creating the
> uffd handle (for either the syscall or the ioctl() approach).

At what point does the process address space get associated? When
/dev/userfaultfd is opened, or only when ioctl(USERFAULTFD_IOC_NEW)
is called? If it is the former, then we have no choice: QEMU must open
it. If it is the latter, then libvirt can open /dev/userfaultfd and pass
it to QEMU, which can then do the ioctl(USERFAULTFD_IOC_NEW).

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
On Thu, Jan 26, 2023 at 03:59:33PM +0000, Daniel P. Berrangé wrote:
> At what point does the process address space get associated? When
> /dev/userfaultfd is opened, or only when ioctl(USERFAULTFD_IOC_NEW)
> is called? If it is the former, then we have no choice: QEMU must open
> it. If it is the latter, then libvirt can open /dev/userfaultfd and pass
> it to QEMU, which can then do the ioctl(USERFAULTFD_IOC_NEW).

Good point. It should be the latter, so it should be doable.

What would be the best interface for QEMU to detect the fd passed over to
it? IIUC qemu_open() requires the name to be /dev/fdset/*, but there's no
existing cmdline option that tells QEMU which fd number to fetch from the
fdset to use as the /dev/userfaultfd descriptor.

monitor_get_fd() seems more suitable: we can define a unique string so
Libvirt can preset the descriptor with the same string attached to it, then
I can try monitor_get_fd() before trying to open() or doing the syscall.

Thanks,

--
Peter Xu
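To make the open-versus-ioctl split discussed above concrete, here is a
small sketch. It assumes, as the thread concludes, that the address space is
bound when ioctl(USERFAULTFD_IOC_NEW) runs rather than when the device is
opened. Both function names are hypothetical, and the fallback #define
assumes the v6.1 UAPI values for builds against older headers.

```c
#include <errno.h>      /* callers inspect errno on failure */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Headers older than v6.1 lack the ioctl; assumed to match the v6.1 UAPI. */
#ifndef USERFAULTFD_IOC_NEW
#define USERFAULTFD_IOC 0xAA
#define USERFAULTFD_IOC_NEW _IO(USERFAULTFD_IOC, 0x00)
#endif

/*
 * Manager side (hypothetical): open the device so that a child can
 * inherit the descriptor. O_CLOEXEC is deliberately omitted because the
 * descriptor has to survive exec().
 */
int manager_open_uffd_dev(void)
{
    return open("/dev/userfaultfd", O_RDWR);
}

/*
 * Child side (hypothetical): create the userfaultfd from the inherited
 * device descriptor. The ioctl runs in the process whose memory is to be
 * monitored. Returns the new uffd, or -1 with errno set.
 */
int child_create_uffd(int dev_fd, int flags)
{
    return ioctl(dev_fd, USERFAULTFD_IOC_NEW, (unsigned long)flags);
}
```

How the child learns the descriptor number is exactly the open interface
question raised in this message; the sketch leaves it to the caller.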
On Thu, Jan 26, 2023 at 12:26:45PM -0500, Peter Xu wrote:
> What would be the best interface for QEMU to detect the fd passed over to
> it? IIUC qemu_open() requires the name to be /dev/fdset/*, but there's no
> existing cmdline option that tells QEMU which fd number to fetch from the
> fdset to use as the /dev/userfaultfd descriptor.
>
> monitor_get_fd() seems more suitable: we can define a unique string so
> Libvirt can preset the descriptor with the same string attached to it, then
> I can try monitor_get_fd() before trying to open() or doing the syscall.

Daniel/Michal, any input here from the Libvirt side?

I just noticed that monitor_get_fd() is bound to a specific monitor, so it
is not clear which monitor is Libvirt's. If we use qemu_open() and add-fd
instead, I think we need another QEMU cmdline option to set the fd path,
IIUC.

I can also leave that for later if opening /dev/userfaultfd already
resolves the immediate problem in containers.

Thanks,

--
Peter Xu
On Tue, Jan 31, 2023 at 02:48:54PM -0500, Peter Xu wrote:
> Daniel/Michal, any input here from the Libvirt side?
>
> I just noticed that monitor_get_fd() is bound to a specific monitor, so it
> is not clear which monitor is Libvirt's. If we use qemu_open() and add-fd
> instead, I think we need another QEMU cmdline option to set the fd path,
> IIUC.
>
> I can also leave that for later if opening /dev/userfaultfd already
> resolves the immediate problem in containers.

I don't have any great ideas really. If we assume the /dev/userfaultfd
is accessible to QEMU we can ignore it.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
On Tue, Jan 31, 2023 at 08:06:55PM +0000, Daniel P. Berrangé wrote:
> I don't have any great ideas really. If we assume the /dev/userfaultfd
> is accessible to QEMU we can ignore it.

It's my understanding that the QEMU process will be invoked by a user or
group that has access to /dev/userfaultfd, probably in the same context as
what Libvirt specified. So hopefully everything will already work out
naturally.

There's one thing I'm unsure about regarding a new qemu cmdline option: I
can't remember where I got this from, but IIRC Paolo suggested at some
point that we reduce or forbid introducing new options to QEMU.

To work around that, we could instead add a migration parameter that points
to /dev/userfaultfd (Libvirt can set it to "/dev/fdset/N" via QMP in QEMU's
early stage). Considering that so far most of the uffd features are used by
the migration submodule, IMHO it's fine to do so.

That said, I think we can always work on top of this series if that'll be
useful to libvirt some day; the change should be trivial. So I can keep
this series simple.

I'll wait 1-2 more days to see whether Michal has anything to comment.

Thanks,

--
Peter Xu
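As an illustration of the "/dev/fdset/N" naming mentioned above, the sketch
below extracts the fdset ID from such a path. It is a standalone,
hypothetical helper written for this note; it is not QEMU's own parser and
makes no claim about how qemu_open() handles these names internally.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse "/dev/fdset/N" and store N in *id_out.
 * Returns 0 on success, -1 if the path is not a well-formed fdset name.
 */
int parse_fdset_id(const char *path, int *id_out)
{
    static const char prefix[] = "/dev/fdset/";
    const size_t prefix_len = sizeof(prefix) - 1;
    const char *num;
    char *end;
    long val;

    if (strncmp(path, prefix, prefix_len) != 0) {
        return -1;              /* some other path, e.g. a real device */
    }

    num = path + prefix_len;
    if (*num < '0' || *num > '9') {
        return -1;              /* rejects empty IDs, signs and spaces */
    }

    errno = 0;
    val = strtol(num, &end, 10);
    if (errno != 0 || *end != '\0' || val > INT_MAX) {
        return -1;              /* overflow or trailing garbage */
    }

    *id_out = (int)val;
    return 0;
}
```

A parameter value that fails this check would simply be treated as an
ordinary path such as /dev/userfaultfd.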
On 1/31/23 22:01, Peter Xu wrote:
> I'll wait 1-2 more days to see whether Michal has anything to comment.

Yeah, we can go with your patches and leave FD passing for future work.
It's orthogonal after all.

In the end we can have (in the order of precedence):

1) new cmd line argument, say:

   qemu-system-x86_64 -userfaultfd fd=5  # where FD 5 is passed by
   libvirt when exec()-ing qemu, just like other FDs, e.g. -chardev
   socket,fd=XXX

2) your patches, where qemu opens /dev/userfaultfd

3) current behavior, userfaultfd syscall

Michal
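The precedence order listed above can be sketched as a small selection
function. This is an illustration only: the enum, the function names and the
notion of a "passed fd" are hypothetical, and the -userfaultfd option in the
list is a proposal from this thread, not an existing QEMU option.

```c
#include <unistd.h>

/* Where the userfaultfd would come from, in order of precedence. */
enum uffd_source {
    UFFD_SRC_PASSED_FD,   /* 1) descriptor handed over by the mgmt app */
    UFFD_SRC_DEVICE,      /* 2) QEMU opens /dev/userfaultfd itself */
    UFFD_SRC_SYSCALL,     /* 3) current behavior: userfaultfd syscall */
};

/* Hypothetical check: can this process read and write the device node? */
int uffd_dev_usable(void)
{
    return access("/dev/userfaultfd", R_OK | W_OK) == 0;
}

/*
 * Hypothetical selector. passed_fd is negative when no descriptor was
 * passed; dev_usable is non-zero when the device node is accessible.
 */
enum uffd_source uffd_pick_source(int passed_fd, int dev_usable)
{
    if (passed_fd >= 0) {
        return UFFD_SRC_PASSED_FD;
    }
    if (dev_usable) {
        return UFFD_SRC_DEVICE;
    }
    return UFFD_SRC_SYSCALL;
}
```

A caller could combine the two as uffd_pick_source(fd, uffd_dev_usable()),
keeping the decision itself free of side effects.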
On Wed, Feb 01, 2023 at 08:55:01AM +0100, Michal Prívozník wrote:
> On 1/31/23 22:01, Peter Xu wrote:
> > I'll wait 1-2 more days to see whether Michal has anything to comment.
>
> Yeah, we can go with your patches and leave FD passing for future work.
> It's orthogonal after all.
>
> In the end we can have (in the order of precedence):
>
> 1) new cmd line argument, say:
>
>    qemu-system-x86_64 -userfaultfd fd=5  # where FD 5 is passed by
>    libvirt when exec()-ing qemu, just like other FDs, e.g. -chardev
>    socket,fd=XXX
>
> 2) your patches, where qemu opens /dev/userfaultfd
>
> 3) current behavior, userfaultfd syscall

Sounds good. Thanks.

--
Peter Xu
On 1/26/23 16:25, Peter Xu wrote:
> Yes it's a problem. The kernel always fetches the current mm_struct* which
> represents the current context of virtual address space when creating the
> uffd handle (for either the syscall or the ioctl() approach).

Ah, I did not realize that.

> It works only if Libvirt invokes QEMU as a thread, so that they share the
> same address space.
>
> Why would libvirt like to do so?

Well, we tend to pass files as FD more and more, because it allows us to
give access to "privileged" files to an unprivileged process. What I did
not realize is that userfaultfd is different, not yet another file.

Michal
On Thu, Jan 26, 2023 at 04:29:10PM +0100, Michal Prívozník wrote:
> Well, we tend to pass files as FD more and more, because it allows us to
> give access to "privileged" files to an unprivileged process. What I did
> not realize is that userfaultfd is different, not yet another file.

I see. Yes, uffd is special compared to most other fds, IMHO mainly
because it is not a public resource but is closely bound to the process
context of the mm. There used to be proposals to grant permission to open
a uffd handle for other processes, but the security implications were still
not fully clear and that discussion was discontinued.

Then the question is whether there is still any scenario where QEMU may
not have the privilege to use either /dev/userfaultfd or the syscall.

Thanks,

--
Peter Xu