[PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU
Posted by Andrey Ryabinin 1 year, 7 months ago
These patches add the possibility to pass a VFIO device to QEMU using file
descriptors of the VFIO container/group, instead of having QEMU create them.
This allows taking away the permission to open /dev/vfio/* from QEMU and
delegating that to a management layer like libvirt.

The VFIO API doesn't allow passing just the fd of the device, since we also
need the VFIO container and group. So these patches allow passing pre-created
VFIO container/group fds to QEMU via the command line/QMP, e.g. like this:
            -object vfio-container,id=ct,fd=5 \
            -object vfio-group,id=grp,fd=6,container=ct \
            -device vfio-pci,host=05:00.0,group=grp

A more detailed example can be found in the test:
tests/avocado/vfio.py
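
As a rough illustration, a management process might open the container and
group itself and hand the fds to QEMU like this (a minimal sketch only; the
group number, device address, and fd handling are illustrative, and
tests/avocado/vfio.py remains the authoritative example):

    import os
    import subprocess

    GROUP = 14  # hypothetical IOMMU group of host device 05:00.0

    container_fd = os.open('/dev/vfio/vfio', os.O_RDWR)
    group_fd = os.open('/dev/vfio/%d' % GROUP, os.O_RDWR)

    # pass_fds keeps exactly these fds open (and inheritable) in the
    # child, at the same numbers we put on the command line
    subprocess.Popen(
        ['qemu-system-x86_64',
         '-object', 'vfio-container,id=ct,fd=%d' % container_fd,
         '-object', 'vfio-group,id=grp,fd=%d,container=ct' % group_fd,
         '-device', 'vfio-pci,host=05:00.0,group=grp'],
        pass_fds=(container_fd, group_fd))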

 *Possible future steps*

These patches could also be a step toward local migration (within one host)
of QEMU with VFIO devices.
I've built a prototype on top of these patches to try this idea.
In short, the scheme of such a migration is as follows:
 - migrate the source VM to a file.
 - retrieve the fd numbers of the VFIO container/group/device via a new
   property and the qom-get command.
 - get the actual file descriptors via SCM_RIGHTS using a new QMP command
   'returnfd', which sends an fd from QEMU by its number:
   { 'command': 'returnfd', 'data': {'fd': 'int'}}
   (a sketch of the receiving side follows this list).
 - shut down the source VM.
 - launch the destination VM, plugging the VFIO devices using the obtained
   file descriptors.
 - the PCI device reset during device plugging is avoided with the help of a
   new parameter on the vfio-pci device.
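
A rough sketch of the receiving side of the 'returnfd' step (the socket path,
fd number, and abbreviated QMP handshake are illustrative; socket.recv_fds
requires Python 3.9+):

    import json
    import socket

    qmp = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    qmp.connect('/tmp/src-vm-qmp.sock')   # source VM's QMP socket
    qmp.recv(4096)                        # QMP greeting
    qmp.sendall(json.dumps({'execute': 'qmp_capabilities'}).encode())
    qmp.recv(4096)

    # ask QEMU to send fd 42 (a number obtained earlier via qom-get)
    qmp.sendall(json.dumps({'execute': 'returnfd',
                            'arguments': {'fd': 42}}).encode())

    # SCM_RIGHTS delivers a duplicate of the open descriptor
    msg, fds, flags, addr = socket.recv_fds(qmp, 4096, maxfds=1)
    container_fd = fds[0]
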
This is an alternative to the 'cpr-exec' migration scheme proposed here:
   https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sistare@oracle.com/
Unlike cpr-exec, it doesn't require the new kernel flags VFIO_DMA_UNMAP_FLAG_VADDR/VFIO_DMA_MAP_FLAG_VADDR
and doesn't require a new migration mode, just some additional steps from the management layer.


Andrey Ryabinin (4):
  vfio: add vfio-container user-creatable object
  vfio: add vfio-group user-creatable object
  vfio: Add 'group' property to 'vfio-pci' device
  tests/avocado/vfio: add test for vfio devices

 hw/vfio/ap.c                  |   2 +-
 hw/vfio/ccw.c                 |   2 +-
 hw/vfio/common.c              | 471 +++++++++++++++++++++++-----------
 hw/vfio/pci.c                 |  10 +-
 hw/vfio/platform.c            |   2 +-
 hw/vfio/trace-events          |   4 +-
 include/hw/vfio/vfio-common.h |  10 +-
 qapi/qom.json                 |  29 +++
 tests/avocado/vfio.py         | 156 +++++++++++
 9 files changed, 525 insertions(+), 161 deletions(-)
 create mode 100644 tests/avocado/vfio.py

-- 
2.37.3
Re: [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU
Posted by Alex Williamson 1 year, 7 months ago
On Mon, 17 Oct 2022 13:54:03 +0300
Andrey Ryabinin <arbn@yandex-team.com> wrote:

> These patches add the possibility to pass a VFIO device to QEMU using file
> descriptors of the VFIO container/group, instead of having QEMU create them.
> This allows taking away the permission to open /dev/vfio/* from QEMU and
> delegating that to a management layer like libvirt.
> 
> The VFIO API doesn't allow passing just the fd of the device, since we also
> need the VFIO container and group. So these patches allow passing pre-created
> VFIO container/group fds to QEMU via the command line/QMP, e.g. like this:
>             -object vfio-container,id=ct,fd=5 \
>             -object vfio-group,id=grp,fd=6,container=ct \
>             -device vfio-pci,host=05:00.0,group=grp

This suggests that management tools need to become intimately familiar
with container and group association restrictions for implicit
dependencies, such as device AddressSpace.  We had considered this
before and intentionally chosen to allow QEMU to manage that
relationship.  Things like PCI bus type and presence of a vIOMMU factor
into these relationships.

In the above example, what happens in a mixed environment, for example
if we then add '-device vfio-pci,host=06:00.0' to the command line?
Isn't QEMU still going to try to re-use the container if it exists in
the same address space?  Potentially this device could also be a member
of the same group.  How would the management tool know when to expect
the provided fds to be released?

We also have an outstanding RFC for iommufd that already proposes an fd
passing interface, where iommufd removes many of the issues of the vfio
container by supporting multiple address spaces within a single fd
context, avoiding the duplicate locked page accounting issues between
containers, and proposing a direct device fd interface for vfio.  Why at
this point in time would we choose to expand the QEMU vfio interface in
this way?  Thanks,

Alex

> A more detailed example can be found in the test:
> tests/avocado/vfio.py
> 
>  *Possible future steps*
> 
> These patches could also be a step toward local migration (within one host)
> of QEMU with VFIO devices.
> I've built a prototype on top of these patches to try this idea.
> In short, the scheme of such a migration is as follows:
>  - migrate the source VM to a file.
>  - retrieve the fd numbers of the VFIO container/group/device via a new
>    property and the qom-get command.
>  - get the actual file descriptors via SCM_RIGHTS using a new QMP command
>    'returnfd', which sends an fd from QEMU by its number:
>    { 'command': 'returnfd', 'data': {'fd': 'int'}}
>  - shut down the source VM.
>  - launch the destination VM, plugging the VFIO devices using the obtained
>    file descriptors.
>  - the PCI device reset during device plugging is avoided with the help of a
>    new parameter on the vfio-pci device.
> This is an alternative to the 'cpr-exec' migration scheme proposed here:
>    https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sistare@oracle.com/
> Unlike cpr-exec, it doesn't require the new kernel flags VFIO_DMA_UNMAP_FLAG_VADDR/VFIO_DMA_MAP_FLAG_VADDR
> and doesn't require a new migration mode, just some additional steps from the management layer.
> 
> 
> Andrey Ryabinin (4):
>   vfio: add vfio-container user-creatable object
>   vfio: add vfio-group user-creatable object
>   vfio: Add 'group' property to 'vfio-pci' device
>   tests/avocado/vfio: add test for vfio devices
> 
>  hw/vfio/ap.c                  |   2 +-
>  hw/vfio/ccw.c                 |   2 +-
>  hw/vfio/common.c              | 471 +++++++++++++++++++++++-----------
>  hw/vfio/pci.c                 |  10 +-
>  hw/vfio/platform.c            |   2 +-
>  hw/vfio/trace-events          |   4 +-
>  include/hw/vfio/vfio-common.h |  10 +-
>  qapi/qom.json                 |  29 +++
>  tests/avocado/vfio.py         | 156 +++++++++++
>  9 files changed, 525 insertions(+), 161 deletions(-)
>  create mode 100644 tests/avocado/vfio.py
>
Re: [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU
Posted by Andrey Ryabinin 1 year, 6 months ago

On 10/17/22 18:21, Alex Williamson wrote:
> On Mon, 17 Oct 2022 13:54:03 +0300
> Andrey Ryabinin <arbn@yandex-team.com> wrote:
> 
>> These patches add the possibility to pass a VFIO device to QEMU using file
>> descriptors of the VFIO container/group, instead of having QEMU create them.
>> This allows taking away the permission to open /dev/vfio/* from QEMU and
>> delegating that to a management layer like libvirt.
>>
>> The VFIO API doesn't allow passing just the fd of the device, since we also
>> need the VFIO container and group. So these patches allow passing pre-created
>> VFIO container/group fds to QEMU via the command line/QMP, e.g. like this:
>>             -object vfio-container,id=ct,fd=5 \
>>             -object vfio-group,id=grp,fd=6,container=ct \
>>             -device vfio-pci,host=05:00.0,group=grp
> 
> This suggests that management tools need to become intimately familiar
> with container and group association restrictions for implicit
> dependencies, such as device AddressSpace.  We had considered this
> before and intentionally chosen to allow QEMU to manage that
> relationship.  Things like PCI bus type and presence of a vIOMMU factor
> into these relationships.
> 

This is already the case; these patches don't change much.
QEMU doesn't allow adding devices from one group to several address spaces,
so the management tool already needs to know whether devices are in the same
group and whether QEMU will create separate address spaces for these devices.

E.g.
qemu-system-x86_64 -nodefaults -M q35,accel=kvm,kernel-irqchip=split \
        -device intel-iommu,intremap=on,caching-mode=on \
        -device vfio-pci,host=00:1f.3 \
        -device vfio-pci,host=00:1f.4 
qemu-system-x86_64: -device vfio-pci,host=00:1f.4: vfio 0000:00:1f.4: group 14 used in multiple address spaces

> In the above example, what happens in a mixed environment, for example
> if we then add '-device vfio-pci,host=06:00.0' to the command line?
> Isn't QEMU still going to try to re-use the container if it exists in
> the same address space? Potentially this device could also be a member
> of the same group.  How would the management tool know when to expect
> the provided fds to be released?
> 

Valid point, the container will indeed be reused and the second device will
occupy it. But we could create a new container instead. Using several
containers in one address space won't be a problem, right?
Of course, several devices from the same group won't be allowed to be added
in such a mixed way.


> We also have an outstanding RFC for iommufd that already proposes an fd
> passing interface, where iommufd removes many of the issues of the vfio
> container by supporting multiple address spaces within a single fd
> context, avoiding the duplicate locked page accounting issues between
> containers, and proposing a direct device fd interface for vfio.  Why at
> this point in time would we choose to expand the QEMU vfio interface in
> this way?  Thanks,
> 

It sounds nice, but iommufd is a new API which doesn't exist in any kernel yet.
These patches are something that can be used on existing, already-deployed kernels.
Re: [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU
Posted by Alex Williamson 1 year, 6 months ago
On Wed, 26 Oct 2022 15:07:32 +0300
Andrey Ryabinin <arbn@yandex-team.com> wrote:

> On 10/17/22 18:21, Alex Williamson wrote:
> > On Mon, 17 Oct 2022 13:54:03 +0300
> > Andrey Ryabinin <arbn@yandex-team.com> wrote:
> >   
> >> These patches add the possibility to pass a VFIO device to QEMU using file
> >> descriptors of the VFIO container/group, instead of having QEMU create them.
> >> This allows taking away the permission to open /dev/vfio/* from QEMU and
> >> delegating that to a management layer like libvirt.
> >>
> >> The VFIO API doesn't allow passing just the fd of the device, since we also
> >> need the VFIO container and group. So these patches allow passing pre-created
> >> VFIO container/group fds to QEMU via the command line/QMP, e.g. like this:
> >>             -object vfio-container,id=ct,fd=5 \
> >>             -object vfio-group,id=grp,fd=6,container=ct \
> >>             -device vfio-pci,host=05:00.0,group=grp  
> > 
> > This suggests that management tools need to become intimately familiar
> > with container and group association restrictions for implicit
> > dependencies, such as device AddressSpace.  We had considered this
> > before and intentionally chosen to allow QEMU to manage that
> > relationship.  Things like PCI bus type and presence of a vIOMMU factor
> > into these relationships.
> >   
> 
> This is already the case; these patches don't change much.
> QEMU doesn't allow adding devices from one group to several address spaces,
> so the management tool already needs to know whether devices are in the same
> group and whether QEMU will create separate address spaces for these devices.
> 
> E.g.
> qemu-system-x86_64 -nodefaults -M q35,accel=kvm,kernel-irqchip=split \
>         -device intel-iommu,intremap=on,caching-mode=on \
>         -device vfio-pci,host=00:1f.3 \
>         -device vfio-pci,host=00:1f.4 
> qemu-system-x86_64: -device vfio-pci,host=00:1f.4: vfio 0000:00:1f.4: group 14 used in multiple address spaces

Obviously QEMU fails this configuration.  It must.  How does that
suggest that a management tool, like libvirt, is already aware of this
requirement?  In fact, libvirt will happily validate XML creating such
a configuration.  The point was that tools like libvirt would need to
provide these group and container file descriptors, and they currently
impose no restrictions or working knowledge on the relationship between
devices, groups, containers, and address spaces.

> > In the above example, what happens in a mixed environment, for example
> > if we then add '-device vfio-pci,host=06:00.0' to the command line?
> > Isn't QEMU still going to try to re-use the container if it exists in
> > the same address space? Potentially this device could also be a member
> > of the same group.  How would the management tool know when to expect
> > the provided fds to be released?
> >   
> 
> Valid point, the container will indeed be reused and the second device will
> occupy it. But we could create a new container instead. Using several
> containers in one address space won't be a problem, right?
> Of course, several devices from the same group won't be allowed to be added
> in such a mixed way.

Potentially, yes, that is a problem.  Each container represents a
separate IOMMU context, separate DMA map and unmap operations, and
separate locked page accounting.  So if libvirt chooses the more
trivial solution to impose a new container for every group, that
translates to space, time, and process accounting overhead.

> > We also have an outstanding RFC for iommufd that already proposes an fd
> > passing interface, where iommufd removes many of the issues of the vfio
> > container by supporting multiple address spaces within a single fd
> > context, avoiding the duplicate locked page accounting issues between
> > containers, and proposing a direct device fd interface for vfio.  Why at
> > this point in time would we choose to expand the QEMU vfio interface in
> > this way?  Thanks,
> >   
> 
> It sounds nice, but iommufd is a new API which doesn't exist in any kernel yet.
> These patches are something that can be used on existing, already-deployed kernels.

OTOH, we expect iommufd in the near term; non-RFC patches are posted.
The vfio kernel modules have undergone significant churn in recent
kernels to align with the development goals of iommufd.  QEMU support to
accept file descriptors for "legacy" implementations of vfio is only
the beginning, where the next step would require the management tools
to be sufficiently enlightened to implement file descriptor passing.
All of that suggests development and maintenance effort for something
we're actively trying to replace.  Thanks,

Alex
Re: [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU
Posted by Daniel P. Berrangé 1 year, 7 months ago
On Mon, Oct 17, 2022 at 01:54:03PM +0300, Andrey Ryabinin wrote:
> These patches add the possibility to pass a VFIO device to QEMU using file
> descriptors of the VFIO container/group, instead of having QEMU create them.
> This allows taking away the permission to open /dev/vfio/* from QEMU and
> delegating that to a management layer like libvirt.
> 
> The VFIO API doesn't allow passing just the fd of the device, since we also
> need the VFIO container and group. So these patches allow passing pre-created
> VFIO container/group fds to QEMU via the command line/QMP, e.g. like this:
>             -object vfio-container,id=ct,fd=5 \
>             -object vfio-group,id=grp,fd=6,container=ct \
>             -device vfio-pci,host=05:00.0,group=grp
> 
> A more detailed example can be found in the test:
> tests/avocado/vfio.py
> 
>  *Possible future steps*
> 
> These patches could also be a step toward local migration (within one host)
> of QEMU with VFIO devices.
> I've built a prototype on top of these patches to try this idea.
> In short, the scheme of such a migration is as follows:
>  - migrate the source VM to a file.
>  - retrieve the fd numbers of the VFIO container/group/device via a new
>    property and the qom-get command.
>  - get the actual file descriptors via SCM_RIGHTS using a new QMP command
>    'returnfd', which sends an fd from QEMU by its number:
>    { 'command': 'returnfd', 'data': {'fd': 'int'}}
>  - shut down the source VM.
>  - launch the destination VM, plugging the VFIO devices using the obtained
>    file descriptors.
>  - the PCI device reset during device plugging is avoided with the help of a
>    new parameter on the vfio-pci device.

Is there a restriction in VFIO on how many processes can have the FD
open concurrently?  I guess multiple opens must be allowed, as with
SCM_RIGHTS both the src QEMU and libvirt will have the FD open
concurrently for at least a short period, since you can't atomically
close the FD at the exact same time as SCM_RIGHTS sends it.

With migration it is *highly* desirable to never stop the source VM's
QEMU until the new QEMU has completed migration and got its vCPUs
running, in order to have the best chance of a successful rollback upon
failure.

So assuming both QEMUs can have the FD open, provided they don't
both concurrently operate on it, could the src QEMU just pass the FDs
to the target QEMU as part of the migration stream?  E.g. use a UNIX
socket between the two QEMUs, and SCM_RIGHTS to pass the FDs across,
avoiding libvirt needing to be in the middle of the FD-passing
dance.  Since the target QEMU gets the FDs as part of the migration
stream, it would inherently know that it should skip device reset
in that flow, without requiring any new param.
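
For reference, the SCM_RIGHTS mechanics described here can be sketched like
this in Python (the socketpair and /dev/null fds are stand-ins for the
QEMU-to-QEMU migration socket and the real VFIO descriptors;
socket.send_fds/recv_fds require Python 3.9+):

    import os
    import socket

    # stand-ins for the real VFIO container/group descriptors
    container_fd = os.open('/dev/null', os.O_RDWR)
    group_fd = os.open('/dev/null', os.O_RDWR)

    # the two ends stand in for source and target QEMU
    src, dst = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # source side: SCM_RIGHTS attaches the fds to an ordinary message
    socket.send_fds(src, [b'vfio-fds'], [container_fd, group_fd])

    # target side: the kernel installs duplicates of the same open
    # file descriptions into the receiving process
    msg, fds, flags, addr = socket.recv_fds(dst, 4096, maxfds=2)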


> This is alternative to 'cpr-exec' migration scheme proposed here:
>    https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sistare@oracle.com/
> Unlike cpr-exec it doesn't require new kernel flags VFIO_DMA_UNMAP_FLAG_VADDR/VFIO_DMA_MAP_FLAG_VADDR
> And doesn't require new migration mode, just some additional steps from management layer.

Avoiding creating a whole new set of mgmt commands in QMP does
make this appealing as an option instead of cpr-exec. If we can
layer FD passing into the migration stream too, that'd be even
more compelling IMHO.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU
Posted by Andrey Ryabinin 1 year, 6 months ago
On 10/17/22 14:05, Daniel P. Berrangé wrote:
> On Mon, Oct 17, 2022 at 01:54:03PM +0300, Andrey Ryabinin wrote:
>> These patches add the possibility to pass a VFIO device to QEMU using file
>> descriptors of the VFIO container/group, instead of having QEMU create them.
>> This allows taking away the permission to open /dev/vfio/* from QEMU and
>> delegating that to a management layer like libvirt.
>>
>> The VFIO API doesn't allow passing just the fd of the device, since we also
>> need the VFIO container and group. So these patches allow passing pre-created
>> VFIO container/group fds to QEMU via the command line/QMP, e.g. like this:
>>             -object vfio-container,id=ct,fd=5 \
>>             -object vfio-group,id=grp,fd=6,container=ct \
>>             -device vfio-pci,host=05:00.0,group=grp
>>
>> A more detailed example can be found in the test:
>> tests/avocado/vfio.py
>>
>>  *Possible future steps*
>>
>> These patches could also be a step toward local migration (within one host)
>> of QEMU with VFIO devices.
>> I've built a prototype on top of these patches to try this idea.
>> In short, the scheme of such a migration is as follows:
>>  - migrate the source VM to a file.
>>  - retrieve the fd numbers of the VFIO container/group/device via a new
>>    property and the qom-get command.
>>  - get the actual file descriptors via SCM_RIGHTS using a new QMP command
>>    'returnfd', which sends an fd from QEMU by its number:
>>    { 'command': 'returnfd', 'data': {'fd': 'int'}}
>>  - shut down the source VM.
>>  - launch the destination VM, plugging the VFIO devices using the obtained
>>    file descriptors.
>>  - the PCI device reset during device plugging is avoided with the help of a
>>     new parameter on the vfio-pci device.
> 
> Is there a restriction in VFIO on how many processes can have the FD
> open concurrently?  I guess multiple opens must be allowed, as with
> SCM_RIGHTS both the src QEMU and libvirt will have the FD open
> concurrently for at least a short period, since you can't atomically
> close the FD at the exact same time as SCM_RIGHTS sends it.
> 

There is no such restriction. Having several open descriptors is what allows
us to survive the PCI device reset: the kernel resets the device on the first
open. Obviously we shouldn't use these descriptors concurrently, and in many
cases can't (ioctl()s will fail), but there is no problem with just
saving/passing an FD between processes.


> With migration it is *highly* desirable to never stop the source VM's
> QEMU until the new QEMU has completed migration and got its vCPUs
> running, in order to have the best chance of a successful rollback upon
> failure.
> 
> So assuming both QEMUs can have the FD open, provided they don't
> both concurrently operate on it, could the src QEMU just pass the FDs
> to the target QEMU as part of the migration stream?  E.g. use a UNIX
> socket between the two QEMUs, and SCM_RIGHTS to pass the FDs across,
> avoiding libvirt needing to be in the middle of the FD-passing
> dance.  Since the target QEMU gets the FDs as part of the migration
> stream, it would inherently know that it should skip device reset
> in that flow, without requiring any new param.
> 

Yeah, I had a similar idea, but this would require a lot of rework of the VFIO
initialization phase in QEMU. The main problem here is that all initialization
happens on device addition, which will fail if the device is already in use by
another QEMU. I guess we would need to move lots of the initialization to the
->post_load() hook.