These patches add the possibility to pass a VFIO device to QEMU using file
descriptors of the VFIO container/group, instead of having QEMU create those
itself. This allows taking away the permission to open /dev/vfio/* from QEMU
and delegating that to a management layer such as libvirt.

The VFIO API doesn't allow passing just the fd of the device, since we also
need to have the VFIO container and group. So these patches allow passing a
created VFIO container/group to QEMU via command line/QMP, e.g. like this:

 -object vfio-container,id=ct,fd=5 \
 -object vfio-group,id=grp,fd=6,container=ct \
 -device vfio-pci,host=05:00.0,group=grp

A slightly more detailed example can be found in the test:
tests/avocado/vfio.py

*Possible future steps*

These patches could also be a step towards local migration (within one host)
of QEMU with VFIO devices. I've built a prototype on top of these patches to
try that idea. In short, the scheme of such a migration is the following:
 - migrate the source VM to a file.
 - retrieve the fd numbers of the VFIO container/group/device via a new
   property and the qom-get command
 - get the actual file descriptor via SCM_RIGHTS using a new QMP command
   'returnfd', which sends an fd from QEMU by its number:
   { 'command': 'returnfd', 'data': {'fd': 'int'}}
 - shut down the source VM
 - launch the destination VM and plug the VFIO devices using the obtained
   file descriptors.
 - the PCI device reset during device plugging is avoided with the help of a
   new parameter on the vfio-pci device.

This is an alternative to the 'cpr-exec' migration scheme proposed here:
https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sistare@oracle.com/
Unlike cpr-exec it doesn't require the new kernel flags
VFIO_DMA_UNMAP_FLAG_VADDR/VFIO_DMA_MAP_FLAG_VADDR and doesn't require a new
migration mode, just some additional steps from the management layer.

Andrey Ryabinin (4):
  vfio: add vfio-container user createable object
  vfio: add vfio-group user createable object
  vfio: Add 'group' property to 'vfio-pci' device
  tests/avocado/vfio: add test for vfio devices

 hw/vfio/ap.c                  |   2 +-
 hw/vfio/ccw.c                 |   2 +-
 hw/vfio/common.c              | 471 +++++++++++++++++++++++-----------
 hw/vfio/pci.c                 |  10 +-
 hw/vfio/platform.c            |   2 +-
 hw/vfio/trace-events          |   4 +-
 include/hw/vfio/vfio-common.h |  10 +-
 qapi/qom.json                 |  29 +++
 tests/avocado/vfio.py         | 156 +++++++++++
 9 files changed, 525 insertions(+), 161 deletions(-)
 create mode 100644 tests/avocado/vfio.py

--
2.37.3
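For readers unfamiliar with the underlying VFIO userspace API, the container
and group fds in the example above come from the standard /dev/vfio character
devices. What follows is a minimal, hypothetical sketch of how a management
layer could obtain them; it is not taken from the patches, it assumes QEMU
still performs the group-to-container attachment and IOMMU setup on the fds
it receives, and the group number and device address are placeholders.

/*
 * Open a VFIO container and IOMMU group and hand the fds to QEMU.
 * Uses only the standard VFIO type1 userspace API (linux/vfio.h);
 * group "14" and the split of setup work between the management layer
 * and QEMU are assumptions for illustration only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
    /* Open a new, empty VFIO container. */
    int container = open("/dev/vfio/vfio", O_RDWR);
    if (container < 0 ||
        ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
        perror("container");
        exit(1);
    }

    /* Open the IOMMU group the device belongs to and check that it is
     * viable, i.e. all devices in the group are bound to vfio drivers. */
    int group = open("/dev/vfio/14", O_RDWR);
    struct vfio_group_status status = { .argsz = sizeof(status) };
    if (group < 0 || ioctl(group, VFIO_GROUP_GET_STATUS, &status) ||
        !(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
        perror("group");
        exit(1);
    }

    /*
     * Let QEMU inherit both fds (or send them over a socket), then start
     * it with something like:
     *   -object vfio-container,id=ct,fd=<container>
     *   -object vfio-group,id=grp,fd=<group>,container=ct
     *   -device vfio-pci,host=05:00.0,group=grp
     */
    printf("container fd=%d group fd=%d\n", container, group);
    return 0;
}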
On Mon, 17 Oct 2022 13:54:03 +0300
Andrey Ryabinin <arbn@yandex-team.com> wrote:

> These patches add the possibility to pass a VFIO device to QEMU using file
> descriptors of the VFIO container/group, instead of having QEMU create those
> itself. This allows taking away the permission to open /dev/vfio/* from QEMU
> and delegating that to a management layer such as libvirt.
>
> The VFIO API doesn't allow passing just the fd of the device, since we also
> need to have the VFIO container and group. So these patches allow passing a
> created VFIO container/group to QEMU via command line/QMP, e.g. like this:
>  -object vfio-container,id=ct,fd=5 \
>  -object vfio-group,id=grp,fd=6,container=ct \
>  -device vfio-pci,host=05:00.0,group=grp

This suggests that management tools need to become intimately familiar
with container and group association restrictions for implicit
dependencies, such as device AddressSpace. We had considered this
before and intentionally chose to let QEMU manage that relationship.
Things like the PCI bus type and the presence of a vIOMMU factor into
these relationships.

In the above example, what happens in a mixed environment, for example
if we then add '-device vfio-pci,host=06:00.0' to the command line?
Isn't QEMU still going to try to reuse the container if it exists in
the same address space? Potentially this device could also be a member
of the same group. How would the management tool know when to expect
the provided fds to be released?

We also have an outstanding RFC for iommufd that already proposes an fd
passing interface, where iommufd removes many of the issues of the vfio
container by supporting multiple address spaces within a single fd
context, avoiding the duplicate locked page accounting issues between
containers, and proposing a direct device fd interface for vfio. Why,
at this point in time, would we choose to expand the QEMU vfio
interface in this way? Thanks,

Alex
On 10/17/22 18:21, Alex Williamson wrote:
> On Mon, 17 Oct 2022 13:54:03 +0300
> Andrey Ryabinin <arbn@yandex-team.com> wrote:
>
>> These patches add the possibility to pass a VFIO device to QEMU using file
>> descriptors of the VFIO container/group, instead of having QEMU create those
>> itself. This allows taking away the permission to open /dev/vfio/* from QEMU
>> and delegating that to a management layer such as libvirt.
>>
>> The VFIO API doesn't allow passing just the fd of the device, since we also
>> need to have the VFIO container and group. So these patches allow passing a
>> created VFIO container/group to QEMU via command line/QMP, e.g. like this:
>>  -object vfio-container,id=ct,fd=5 \
>>  -object vfio-group,id=grp,fd=6,container=ct \
>>  -device vfio-pci,host=05:00.0,group=grp
>
> This suggests that management tools need to become intimately familiar
> with container and group association restrictions for implicit
> dependencies, such as device AddressSpace. We had considered this
> before and intentionally chose to let QEMU manage that relationship.
> Things like the PCI bus type and the presence of a vIOMMU factor into
> these relationships.

This is already the case; these patches don't change much here.
QEMU doesn't allow adding devices from one group to several address
spaces, so the management tool needs to know whether devices are in
the same group or not, and whether QEMU will create separate address
spaces for these devices or not.

E.g.
 qemu-system-x86_64 -nodefaults -M q35,accel=kvm,kernel-irqchip=split \
    -device intel-iommu,intremap=on,caching-mode=on \
    -device vfio-pci,host=00:1f.3 \
    -device vfio-pci,host=00:1f.4
 qemu-system-x86_64: -device vfio-pci,host=00:1f.4: vfio 0000:00:1f.4: group 14 used in multiple address spaces

> In the above example, what happens in a mixed environment, for example
> if we then add '-device vfio-pci,host=06:00.0' to the command line?
> Isn't QEMU still going to try to reuse the container if it exists in
> the same address space? Potentially this device could also be a member
> of the same group. How would the management tool know when to expect
> the provided fds to be released?

Valid point, the container will indeed be reused and the second device
will occupy it. But we could create a new container instead. Using
several containers in one address space won't be a problem, right?
Of course, several devices from the same group won't be allowed to be
added in a mixed way.

> We also have an outstanding RFC for iommufd that already proposes an fd
> passing interface, where iommufd removes many of the issues of the vfio
> container by supporting multiple address spaces within a single fd
> context, avoiding the duplicate locked page accounting issues between
> containers, and proposing a direct device fd interface for vfio. Why,
> at this point in time, would we choose to expand the QEMU vfio
> interface in this way? Thanks,

It sounds nice, but iommufd is a new API which doesn't exist in any
kernel yet. These patches are something that can be used on existing,
already deployed kernels.
On Wed, 26 Oct 2022 15:07:32 +0300
Andrey Ryabinin <arbn@yandex-team.com> wrote:

> On 10/17/22 18:21, Alex Williamson wrote:
> > On Mon, 17 Oct 2022 13:54:03 +0300
> > Andrey Ryabinin <arbn@yandex-team.com> wrote:
> >
> >> These patches add the possibility to pass a VFIO device to QEMU using file
> >> descriptors of the VFIO container/group, instead of having QEMU create those
> >> itself. This allows taking away the permission to open /dev/vfio/* from QEMU
> >> and delegating that to a management layer such as libvirt.
> >>
> >> The VFIO API doesn't allow passing just the fd of the device, since we also
> >> need to have the VFIO container and group. So these patches allow passing a
> >> created VFIO container/group to QEMU via command line/QMP, e.g. like this:
> >>  -object vfio-container,id=ct,fd=5 \
> >>  -object vfio-group,id=grp,fd=6,container=ct \
> >>  -device vfio-pci,host=05:00.0,group=grp
> >
> > This suggests that management tools need to become intimately familiar
> > with container and group association restrictions for implicit
> > dependencies, such as device AddressSpace. We had considered this
> > before and intentionally chose to let QEMU manage that relationship.
> > Things like the PCI bus type and the presence of a vIOMMU factor into
> > these relationships.
>
> This is already the case; these patches don't change much here.
> QEMU doesn't allow adding devices from one group to several address
> spaces, so the management tool needs to know whether devices are in
> the same group or not, and whether QEMU will create separate address
> spaces for these devices or not.
>
> E.g.
>  qemu-system-x86_64 -nodefaults -M q35,accel=kvm,kernel-irqchip=split \
>     -device intel-iommu,intremap=on,caching-mode=on \
>     -device vfio-pci,host=00:1f.3 \
>     -device vfio-pci,host=00:1f.4
>  qemu-system-x86_64: -device vfio-pci,host=00:1f.4: vfio 0000:00:1f.4: group 14 used in multiple address spaces

Obviously QEMU fails this configuration; it must. How does that
suggest that a management tool, like libvirt, is already aware of this
requirement? In fact, libvirt will happily validate XML creating such
a configuration. The point was that tools like libvirt would need to
provide these group and container file descriptors, and they currently
impose no restrictions on, and have no working knowledge of, the
relationship between devices, groups, containers, and address spaces.

> > In the above example, what happens in a mixed environment, for example
> > if we then add '-device vfio-pci,host=06:00.0' to the command line?
> > Isn't QEMU still going to try to reuse the container if it exists in
> > the same address space? Potentially this device could also be a member
> > of the same group. How would the management tool know when to expect
> > the provided fds to be released?
>
> Valid point, the container will indeed be reused and the second device
> will occupy it. But we could create a new container instead. Using
> several containers in one address space won't be a problem, right?
> Of course, several devices from the same group won't be allowed to be
> added in a mixed way.

Potentially, yes, that is a problem. Each container represents a
separate IOMMU context, separate DMA map and unmap operations, and
separate locked page accounting. So if libvirt chooses the more
trivial solution of imposing a new container for every group, that
translates to space, time, and process accounting overhead.

> > We also have an outstanding RFC for iommufd that already proposes an fd
> > passing interface, where iommufd removes many of the issues of the vfio
> > container by supporting multiple address spaces within a single fd
> > context, avoiding the duplicate locked page accounting issues between
> > containers, and proposing a direct device fd interface for vfio. Why,
> > at this point in time, would we choose to expand the QEMU vfio
> > interface in this way? Thanks,
>
> It sounds nice, but iommufd is a new API which doesn't exist in any
> kernel yet. These patches are something that can be used on existing,
> already deployed kernels.

OTOH, we expect iommufd in the near term; non-RFC patches have been
posted. The vfio kernel modules have undergone significant churn in
recent kernels to align with the development goals of iommufd. QEMU
support for accepting file descriptors for "legacy" implementations of
vfio is only the beginning; the next step would require the management
tools to be sufficiently enlightened to implement file descriptor
passing. All of that suggests development and maintenance effort for
something we're actively trying to replace. Thanks,

Alex
On Mon, Oct 17, 2022 at 01:54:03PM +0300, Andrey Ryabinin wrote:
> These patches add the possibility to pass a VFIO device to QEMU using file
> descriptors of the VFIO container/group, instead of having QEMU create those
> itself. This allows taking away the permission to open /dev/vfio/* from QEMU
> and delegating that to a management layer such as libvirt.
>
> The VFIO API doesn't allow passing just the fd of the device, since we also
> need to have the VFIO container and group. So these patches allow passing a
> created VFIO container/group to QEMU via command line/QMP, e.g. like this:
>  -object vfio-container,id=ct,fd=5 \
>  -object vfio-group,id=grp,fd=6,container=ct \
>  -device vfio-pci,host=05:00.0,group=grp
>
> A slightly more detailed example can be found in the test:
> tests/avocado/vfio.py
>
> *Possible future steps*
>
> These patches could also be a step towards local migration (within one host)
> of QEMU with VFIO devices. I've built a prototype on top of these patches to
> try that idea. In short, the scheme of such a migration is the following:
>  - migrate the source VM to a file.
>  - retrieve the fd numbers of the VFIO container/group/device via a new
>    property and the qom-get command
>  - get the actual file descriptor via SCM_RIGHTS using a new QMP command
>    'returnfd', which sends an fd from QEMU by its number:
>    { 'command': 'returnfd', 'data': {'fd': 'int'}}
>  - shut down the source VM
>  - launch the destination VM and plug the VFIO devices using the obtained
>    file descriptors.
>  - the PCI device reset during device plugging is avoided with the help of a
>    new parameter on the vfio-pci device.

Is there a restriction by VFIO on how many processes can have the FD
open concurrently? I guess it must be allowed, as with SCM_RIGHTS both
the src QEMU and libvirt will have the FD open concurrently for at
least a short period, since you can't atomically close the FD at the
exact same time as SCM_RIGHTS sends it.

With migration it is *highly* desirable to never stop the source VM's
QEMU until the new QEMU has completed migration and got its vCPUs
running, in order to have the best chance of a successful rollback upon
failure.

So, assuming both QEMUs can have the FD open, provided they don't both
operate on it concurrently, could the src QEMU just pass the FDs to the
target QEMU as part of the migration stream? E.g. use a UNIX socket
between the two QEMUs, and SCM_RIGHTS to pass the FDs across, avoiding
libvirt needing to be in the middle of the FD passing dance. Since the
target QEMU gets the FDs as part of the migration stream, it would
inherently know that it should skip the device reset in that flow,
without requiring any new parameter.

> This is an alternative to the 'cpr-exec' migration scheme proposed here:
> https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sistare@oracle.com/
> Unlike cpr-exec it doesn't require the new kernel flags
> VFIO_DMA_UNMAP_FLAG_VADDR/VFIO_DMA_MAP_FLAG_VADDR and doesn't require a new
> migration mode, just some additional steps from the management layer.

Avoiding creating a whole new set of mgmt commands in QMP does make
this appealing as an option instead of cpr-exec. If we can layer FD
passing into the migration stream too, that'd be even more compelling
IMHO.

With regards,
Daniel
--
|: https://berrange.com       -o-  https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org        -o-          https://fstop138.berrange.com :|
|: https://entangle-photo.org -o-  https://www.instagram.com/dberrange :|
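As a reference for the FD passing dance mentioned above, handing a
descriptor over an AF_UNIX socket with SCM_RIGHTS uses the standard
ancillary-data API. A minimal sketch of the sending side follows, with
'sock' (an already connected UNIX-domain socket) and 'fd' assumed to be
supplied by the caller; this is illustrative only, not code from the
patches.

/*
 * Send one file descriptor over a connected AF_UNIX socket using
 * SCM_RIGHTS ancillary data.  The receiver obtains a new fd number
 * referring to the same open file description, so both processes hold
 * the file open until one of them closes it.
 */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fd(int sock, int fd)
{
    char dummy = 'F';                 /* at least one byte of real data */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}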
On 10/17/22 14:05, Daniel P. Berrangé wrote:
> On Mon, Oct 17, 2022 at 01:54:03PM +0300, Andrey Ryabinin wrote:
>> These patches add the possibility to pass a VFIO device to QEMU using file
>> descriptors of the VFIO container/group, instead of having QEMU create those
>> itself. This allows taking away the permission to open /dev/vfio/* from QEMU
>> and delegating that to a management layer such as libvirt.
>>
>> The VFIO API doesn't allow passing just the fd of the device, since we also
>> need to have the VFIO container and group. So these patches allow passing a
>> created VFIO container/group to QEMU via command line/QMP, e.g. like this:
>>  -object vfio-container,id=ct,fd=5 \
>>  -object vfio-group,id=grp,fd=6,container=ct \
>>  -device vfio-pci,host=05:00.0,group=grp
>>
>> A slightly more detailed example can be found in the test:
>> tests/avocado/vfio.py
>>
>> *Possible future steps*
>>
>> These patches could also be a step towards local migration (within one host)
>> of QEMU with VFIO devices. I've built a prototype on top of these patches to
>> try that idea. In short, the scheme of such a migration is the following:
>>  - migrate the source VM to a file.
>>  - retrieve the fd numbers of the VFIO container/group/device via a new
>>    property and the qom-get command
>>  - get the actual file descriptor via SCM_RIGHTS using a new QMP command
>>    'returnfd', which sends an fd from QEMU by its number:
>>    { 'command': 'returnfd', 'data': {'fd': 'int'}}
>>  - shut down the source VM
>>  - launch the destination VM and plug the VFIO devices using the obtained
>>    file descriptors.
>>  - the PCI device reset during device plugging is avoided with the help of a
>>    new parameter on the vfio-pci device.
>
> Is there a restriction by VFIO on how many processes can have the FD
> open concurrently? I guess it must be allowed, as with SCM_RIGHTS both
> the src QEMU and libvirt will have the FD open concurrently for at
> least a short period, since you can't atomically close the FD at the
> exact same time as SCM_RIGHTS sends it.

There is no such restriction. Having several open descriptors is what
allows us to survive the PCI device reset: the kernel resets the device
on the first open. Obviously we shouldn't use these descriptors
concurrently, and in many cases we can't (ioctl()s will fail), but
there is no problem with just saving/passing an FD between processes.

> With migration it is *highly* desirable to never stop the source VM's
> QEMU until the new QEMU has completed migration and got its vCPUs
> running, in order to have the best chance of a successful rollback upon
> failure.
>
> So, assuming both QEMUs can have the FD open, provided they don't both
> operate on it concurrently, could the src QEMU just pass the FDs to the
> target QEMU as part of the migration stream? E.g. use a UNIX socket
> between the two QEMUs, and SCM_RIGHTS to pass the FDs across, avoiding
> libvirt needing to be in the middle of the FD passing dance. Since the
> target QEMU gets the FDs as part of the migration stream, it would
> inherently know that it should skip the device reset in that flow,
> without requiring any new parameter.

Yeah, I had a similar idea, but this would require a lot of rework of
the VFIO initialization phase in QEMU. The main problem here is that
all initialization happens on device addition, which will fail if the
device is already in use by another QEMU. I guess we would need to move
lots of the initialization to the ->post_load() hook.