[libvirt] [RFC PATCH 00/16] Introduce vGPU mdev framework to libvirt

Erik Skultety posted 16 patches 7 years, 1 month ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/libvirt tags/patchew/cover.1486383339.git.eskultet@redhat.com
There is a newer version of this series
[libvirt] [RFC PATCH 00/16] Introduce vGPU mdev framework to libvirt
Posted by Erik Skultety 7 years, 1 month ago
Finally. It's here. This is the initial suggestion on how libvirt might
interact with the mdev framework, currently focusing only on non-managed
devices, i.e. those pre-created by the user, since that will be revisited once
we've all settled on what the XML should look like, given we might not want to
use the sysfs path directly as an attribute in the domain XML. My proposal for
the XML is the following:

<hostdev mode='subsystem' type='mdev'>
    <source>
        <!-- this is the host's physical device address -->
        <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'/>
        <uuid>vGPU_UUID</uuid>
    </source>
    <!-- target PCI address can be omitted to assign it automatically -->
</hostdev>

So the mediated device is identified by the physical parent device visible on
the host and a UUID, which allows us to construct the sysfs path ourselves
and put it on QEMU's command line.
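
To make that concrete, a rough sketch of the construction (the parent PCI
address and the UUID below are just placeholders):

    # sysfs path built from the parent's PCI address and the mdev UUID
    /sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc

    # which then ends up on QEMU's command line along the lines of
    -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc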

A few remarks if you actually happen to have a machine to test this on:
- right now the mediated devices are one-time use only, i.e. they have to be
recreated before every machine boot (a rough sysfs sketch of that step is below)
- I wouldn't recommend assigning multiple vGPUs to a single domain
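
For reference, recreating such a device by hand via the standard mdev sysfs
interface looks roughly like this (a minimal sketch; the parent address, type
name and UUID are only examples):

    # pick one of the types offered by the parent device
    ls /sys/class/mdev_bus/0000:00:03.0/mdev_supported_types/

    # create the mediated device with a chosen UUID
    echo "53764d0e-85a0-42b4-af5c-2046b460b1dc" > \
        /sys/class/mdev_bus/0000:00:03.0/mdev_supported_types/nvidia-11/create

    # and remove it again once the domain has shut down
    echo 1 > /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc/remove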

Once this series is sorted out, we can then continue with 'managed=yes', where,
as Laine pointed out [1], we need to figure out how exactly the management
layer should hint libvirt about which vGPU type should be used for device
instantiation.

[1] https://www.redhat.com/archives/libvir-list/2017-January/msg00287.html  

#pleaseshareyourfeedback

Thanks,
Erik

Erik Skultety (16):
  util: Introduce new module virmdev
  conf: Introduce new hostdev device type mdev
  docs: Update RNG schema to reflect the new hostdev type mdev
  conf: Adjust the domain parser to work with mdevs
  Adjust the formatter to reflect the new hostdev type mdev
  security: dac: Enable labeling of vfio mediated devices
  security: selinux: Enable labeling of vfio mediated devices
  conf: Enable cold-plug of a mediated device
  qemu: Assign PCI addresses for mediated devices as well
  hostdev: Maintain a driver list of active mediated devices
  hostdev: Introduce a reattach method for mediated devices
  qemu: cgroup: Adjust cgroups' logic to allow mediated devices
  qemu: namespace: Hook up the discovery of mdevs into the namespace
    code
  qemu: Format mdevs on the qemu command line
  test: Add some test cases for our test suite regarding the mdevs
  docs: Document the new hostdev device type 'mdev'

 docs/formatdomain.html.in                          |  40 ++-
 docs/schemas/domaincommon.rng                      |  17 +
 po/POTFILES.in                                     |   1 +
 src/Makefile.am                                    |   1 +
 src/conf/domain_conf.c                             |  81 ++++-
 src/conf/domain_conf.h                             |  10 +
 src/libvirt_private.syms                           |  19 ++
 src/qemu/qemu_cgroup.c                             |  35 ++
 src/qemu/qemu_command.c                            |  49 +++
 src/qemu/qemu_command.h                            |   5 +
 src/qemu/qemu_domain.c                             |  13 +
 src/qemu/qemu_domain_address.c                     |  12 +-
 src/qemu/qemu_hostdev.c                            |  37 ++
 src/qemu/qemu_hostdev.h                            |   8 +
 src/qemu/qemu_hotplug.c                            |   2 +
 src/security/security_apparmor.c                   |   3 +
 src/security/security_dac.c                        |  56 +++
 src/security/security_selinux.c                    |  55 +++
 src/util/virhostdev.c                              | 179 +++++++++-
 src/util/virhostdev.h                              |  16 +
 src/util/virmdev.c                                 | 375 +++++++++++++++++++++
 src/util/virmdev.h                                 |  85 +++++
 tests/domaincapsschemadata/full.xml                |   1 +
 ...qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml |  37 ++
 .../qemuxml2argv-hostdev-mdev-unmanaged.args       |  25 ++
 .../qemuxml2argv-hostdev-mdev-unmanaged.xml        |  38 +++
 tests/qemuxml2argvtest.c                           |   6 +
 .../qemuxml2xmlout-hostdev-mdev-unmanaged.xml      |  41 +++
 tests/qemuxml2xmltest.c                            |   1 +
 29 files changed, 1239 insertions(+), 9 deletions(-)
 create mode 100644 src/util/virmdev.c
 create mode 100644 src/util/virmdev.h
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.args
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.xml
 create mode 100644 tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-mdev-unmanaged.xml

-- 
2.10.2

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [RFC PATCH 00/16] Introduce vGPU mdev framework to libvirt
Posted by Michal Privoznik 7 years, 1 month ago
On 06.02.2017 13:19, Erik Skultety wrote:
> Finally. It's here. This is the initial suggestion on how libvirt might
> interract with the mdev framework, currently only focussing on the non-managed
> devices, i.e. those pre-created by the user, since that will be revisited once
> we all settled on how the XML should look like, given we might not want to use
> the sysfs path directly as an attribute in the domain XML. My proposal on the
> XML is the following:
> 
> <hostdev mode='subsystem' type='mdev'>  
>     <source>
>         <!-- this is the host's physical device address -->
>         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
>         <uuid>vGPU_UUID<uuid>
>     <source>
>     <!-- target PCI address can be omitted to assign it automatically -->
> </hostdev>
> 
> So the mediated device is identified by the physical parent device visible on
> the host and a UUID which allows us to construct the sysfs path by ourselves,
> which we then put on the QEMU's command line.
> 
> A few remarks if you actually happen to have a machine to test this on:
> - right now the mediated devices are one-time use only, i.e. they have to be
> recreated before every machine boot
> - I wouldn't recommend assigning multiple vGPUs to a single domain
> 
> Once this series is sorted out, we can then continue with 'managed=yes' where
> as Laine pointed out [1], we need to figure out how exactly should the
> management layer hint libvirt which vGPU type should be used for device
> instantiation.
> 
> [1] https://www.redhat.com/archives/libvir-list/2017-January/msg00287.html  
> 
> #pleaseshareyourfeedback
> 
> Thanks,
> Erik
> 
> Erik Skultety (16):
>   util: Introduce new module virmdev
>   conf: Introduce new hostdev device type mdev
>   docs: Update RNG schema to reflect the new hostdev type mdev
>   conf: Adjust the domain parser to work with mdevs
>   Adjust the formatter to reflect the new hostdev type mdev
>   security: dac: Enable labeling of vfio mediated devices
>   security: selinux: Enable labeling of vfio mediated devices
>   conf: Enable cold-plug of a mediated device
>   qemu: Assign PCI addresses for mediated devices as well
>   hostdev: Maintain a driver list of active mediated devices
>   hostdev: Introduce a reattach method for mediated devices
>   qemu: cgroup: Adjust cgroups' logic to allow mediated devices
>   qemu: namespace: Hook up the discovery of mdevs into the namespace
>     code
>   qemu: Format mdevs on the qemu command line
>   test: Add some test cases for our test suite regarding the mdevs
>   docs: Document the new hostdev device type 'mdev'
> 
>  docs/formatdomain.html.in                          |  40 ++-
>  docs/schemas/domaincommon.rng                      |  17 +
>  po/POTFILES.in                                     |   1 +
>  src/Makefile.am                                    |   1 +
>  src/conf/domain_conf.c                             |  81 ++++-
>  src/conf/domain_conf.h                             |  10 +
>  src/libvirt_private.syms                           |  19 ++
>  src/qemu/qemu_cgroup.c                             |  35 ++
>  src/qemu/qemu_command.c                            |  49 +++
>  src/qemu/qemu_command.h                            |   5 +
>  src/qemu/qemu_domain.c                             |  13 +
>  src/qemu/qemu_domain_address.c                     |  12 +-
>  src/qemu/qemu_hostdev.c                            |  37 ++
>  src/qemu/qemu_hostdev.h                            |   8 +
>  src/qemu/qemu_hotplug.c                            |   2 +
>  src/security/security_apparmor.c                   |   3 +
>  src/security/security_dac.c                        |  56 +++
>  src/security/security_selinux.c                    |  55 +++
>  src/util/virhostdev.c                              | 179 +++++++++-
>  src/util/virhostdev.h                              |  16 +
>  src/util/virmdev.c                                 | 375 +++++++++++++++++++++
>  src/util/virmdev.h                                 |  85 +++++
>  tests/domaincapsschemadata/full.xml                |   1 +
>  ...qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml |  37 ++
>  .../qemuxml2argv-hostdev-mdev-unmanaged.args       |  25 ++
>  .../qemuxml2argv-hostdev-mdev-unmanaged.xml        |  38 +++
>  tests/qemuxml2argvtest.c                           |   6 +
>  .../qemuxml2xmlout-hostdev-mdev-unmanaged.xml      |  41 +++
>  tests/qemuxml2xmltest.c                            |   1 +
>  29 files changed, 1239 insertions(+), 9 deletions(-)
>  create mode 100644 src/util/virmdev.c
>  create mode 100644 src/util/virmdev.h
>  create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml
>  create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.args
>  create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.xml
>  create mode 100644 tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-mdev-unmanaged.xml
> 

I'm no expert in mdevs, but from a code POV these look solid.

Michal

Re: [libvirt] [RFC PATCH 00/16] Introduce vGPU mdev framework to libvirt
Posted by Alex Williamson 7 years, 1 month ago
On Mon,  6 Feb 2017 13:19:42 +0100
Erik Skultety <eskultet@redhat.com> wrote:

> Finally. It's here. This is the initial suggestion on how libvirt might
> interract with the mdev framework, currently only focussing on the non-managed
> devices, i.e. those pre-created by the user, since that will be revisited once
> we all settled on how the XML should look like, given we might not want to use
> the sysfs path directly as an attribute in the domain XML. My proposal on the
> XML is the following:
> 
> <hostdev mode='subsystem' type='mdev'>  
>     <source>
>         <!-- this is the host's physical device address -->
>         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
>         <uuid>vGPU_UUID<uuid>
>     <source>
>     <!-- target PCI address can be omitted to assign it automatically -->
> </hostdev>
> 
> So the mediated device is identified by the physical parent device visible on
> the host and a UUID which allows us to construct the sysfs path by ourselves,
> which we then put on the QEMU's command line.

Based on your test code, I think you're creating something like this:

-device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc

That would explain the need for the parent device address, but that's
an entirely self-inflicted requirement.  For managed="no" scenarios,
we shouldn't need the parent; we can get to the mdev device
via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc.  So it
seems that the UUID should be the only required source element for
managed="no".
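
(A minimal illustration with the same placeholder parent/UUID as above --
/sys/bus/mdev/devices/ holds a symlink per mdev device, so both of these
point at the same thing:

    /sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
    /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc
)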

For managed="yes", it seems like the parent device is still an optional
field.  The most important thing that libvirt needs to know when
creating a mdev device for a VM is the mdev type name.  The parent
device should be an optional field to help higher level management
tools deal with placement of the device for locality or load balancing.
Also, we can't assume that the parent device is a PCI device, the
sample mtty driver already breaks this assumption.

Also, grep'ing through the patches, I don't see that the "device_api"
file is being used to test that the mdev device actually exports the
vfio-pci API before making use of it with the QEMU vfio-pci driver.  We
don't yet have any examples to the contrary, but non vfio-pci mdev
devices are in development.  Just like we can't assume the parent
device type, we can't assume the API of an mdev device to the user.
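
(As a rough sketch of such a check -- assuming the mdev_type link that the
mdev core exposes for every instantiated device, with a placeholder UUID:

    cat /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc/mdev_type/device_api
    vfio-pci

only a device whose device_api reads "vfio-pci" should be handed to QEMU's
vfio-pci driver.)
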
Thanks,

Alex

> A few remarks if you actually happen to have a machine to test this on:
> - right now the mediated devices are one-time use only, i.e. they have to be
> recreated before every machine boot
> - I wouldn't recommend assigning multiple vGPUs to a single domain
> 
> Once this series is sorted out, we can then continue with 'managed=yes' where
> as Laine pointed out [1], we need to figure out how exactly should the
> management layer hint libvirt which vGPU type should be used for device
> instantiation.
> 
> [1] https://www.redhat.com/archives/libvir-list/2017-January/msg00287.html  
> 
> #pleaseshareyourfeedback
> 
> Thanks,
> Erik
> 
> Erik Skultety (16):
>   util: Introduce new module virmdev
>   conf: Introduce new hostdev device type mdev
>   docs: Update RNG schema to reflect the new hostdev type mdev
>   conf: Adjust the domain parser to work with mdevs
>   Adjust the formatter to reflect the new hostdev type mdev
>   security: dac: Enable labeling of vfio mediated devices
>   security: selinux: Enable labeling of vfio mediated devices
>   conf: Enable cold-plug of a mediated device
>   qemu: Assign PCI addresses for mediated devices as well
>   hostdev: Maintain a driver list of active mediated devices
>   hostdev: Introduce a reattach method for mediated devices
>   qemu: cgroup: Adjust cgroups' logic to allow mediated devices
>   qemu: namespace: Hook up the discovery of mdevs into the namespace
>     code
>   qemu: Format mdevs on the qemu command line
>   test: Add some test cases for our test suite regarding the mdevs
>   docs: Document the new hostdev device type 'mdev'
> 
>  docs/formatdomain.html.in                          |  40 ++-
>  docs/schemas/domaincommon.rng                      |  17 +
>  po/POTFILES.in                                     |   1 +
>  src/Makefile.am                                    |   1 +
>  src/conf/domain_conf.c                             |  81 ++++-
>  src/conf/domain_conf.h                             |  10 +
>  src/libvirt_private.syms                           |  19 ++
>  src/qemu/qemu_cgroup.c                             |  35 ++
>  src/qemu/qemu_command.c                            |  49 +++
>  src/qemu/qemu_command.h                            |   5 +
>  src/qemu/qemu_domain.c                             |  13 +
>  src/qemu/qemu_domain_address.c                     |  12 +-
>  src/qemu/qemu_hostdev.c                            |  37 ++
>  src/qemu/qemu_hostdev.h                            |   8 +
>  src/qemu/qemu_hotplug.c                            |   2 +
>  src/security/security_apparmor.c                   |   3 +
>  src/security/security_dac.c                        |  56 +++
>  src/security/security_selinux.c                    |  55 +++
>  src/util/virhostdev.c                              | 179 +++++++++-
>  src/util/virhostdev.h                              |  16 +
>  src/util/virmdev.c                                 | 375 +++++++++++++++++++++
>  src/util/virmdev.h                                 |  85 +++++
>  tests/domaincapsschemadata/full.xml                |   1 +
>  ...qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml |  37 ++
>  .../qemuxml2argv-hostdev-mdev-unmanaged.args       |  25 ++
>  .../qemuxml2argv-hostdev-mdev-unmanaged.xml        |  38 +++
>  tests/qemuxml2argvtest.c                           |   6 +
>  .../qemuxml2xmlout-hostdev-mdev-unmanaged.xml      |  41 +++
>  tests/qemuxml2xmltest.c                            |   1 +
>  29 files changed, 1239 insertions(+), 9 deletions(-)
>  create mode 100644 src/util/virmdev.c
>  create mode 100644 src/util/virmdev.h
>  create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml
>  create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.args
>  create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.xml
>  create mode 100644 tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-mdev-unmanaged.xml
> 

Re: [libvirt] [RFC PATCH 00/16] Introduce vGPU mdev framework to libvirt
Posted by Erik Skultety 7 years, 1 month ago
On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:
> On Mon,  6 Feb 2017 13:19:42 +0100
> Erik Skultety <eskultet@redhat.com> wrote:
> 
> > Finally. It's here. This is the initial suggestion on how libvirt might
> > interract with the mdev framework, currently only focussing on the non-managed
> > devices, i.e. those pre-created by the user, since that will be revisited once
> > we all settled on how the XML should look like, given we might not want to use
> > the sysfs path directly as an attribute in the domain XML. My proposal on the
> > XML is the following:
> > 
> > <hostdev mode='subsystem' type='mdev'>  
> >     <source>
> >         <!-- this is the host's physical device address -->
> >         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
> >         <uuid>vGPU_UUID<uuid>
> >     <source>
> >     <!-- target PCI address can be omitted to assign it automatically -->
> > </hostdev>
> > 
> > So the mediated device is identified by the physical parent device visible on
> > the host and a UUID which allows us to construct the sysfs path by ourselves,
> > which we then put on the QEMU's command line.
> 
> Based on your test code, I think you're creating something like this:
> 
> -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
> 
> That would explain the need for the parent device address, but that's
> an entirely self inflicted requirement.  For a managed="no" scenarios,
> we shouldn't need the parent, we can get to the mdev device
> via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc.  So it

True, for managed="no" this path would be a nice optimization.

> seems that the UUID should be the only required source element for
> managed="no".
> 
> For managed="yes", it seems like the parent device is still an optional

The reason I went with the parent address element (and purposely neglected the
sample mtty driver) was that I assumed any modern mdev-capable HW would be
accessible through the PCI bus on the host. Also, I wanted to explicitly hint
libvirt as much as possible about which parent device a vGPU device instance
should be created on in case there is more than one of them, rather than
scanning sysfs for a suitable parent which actually supports the given vGPU type.

> field.  The most important thing that libvirt needs to know when
> creating a mdev device for a VM is the mdev type name.  The parent
> device should be an optional field to help higher level management
> tools deal with placement of the device for locality or load balancing.
> Also, we can't assume that the parent device is a PCI device, the
> sample mtty driver already breaks this assumption.

Since we need to assume non-PCI devices and we still need to enable management
to hint libvirt about the parent to utilize load balancing and stuff, I've come
up with the following adjustments/ideas on how to reflect that in the XML:
- still use the address element but use it with the 'type' attribute [1] (still
  breaks the sample mtty driver though) while making the element truly optional
  if I'm going to be outvoted in favor of scanning the directory for a suitable
  parent device on our own, rather than requiring the user to provide that

- providing either an attribute or a standalone element for the parent device
  name, like a string version of the PCI address or whatever form the parent
  device comes in (doesn't break the mtty driver but I don't quite like this)

- providing a path element/attribute to sysfs pointing to the parent device
  which I'm afraid is what Daniel is not in favor of libvirt doing

So, this is what I've come up with so far in terms of hinting libvirt about the
parent device. Do you have any input on this, maybe some more ideas on how we
should identify the parent device?

> 
> Also, grep'ing through the patches, I don't see that the "device_api"

Yep, this was also on purpose since, as you write below, right now the only
functioning mdev devices we have to work with are vfio-pci capable, so
with this RFC I wanted to gather some feedback on whether I'm moving in the
right direction in the first place. So yeah, I thought this could be added at
any point later.

[1] http://libvirt.org/formatdomain.html#elementsAddress

Erik

> file is being used to test that the mdev device actually exports the
> vfio-pci API before making use of it with the QEMU vfio-pci driver.  We
> don't yet have any examples to the contrary, but non vfio-pci mdev
> devices are in development.  Just like we can't assume the parent
> device type, we can't assume the API of an mdev device to the user.
> Thanks,
> 
> Alex
> 
> > A few remarks if you actually happen to have a machine to test this on:
> > - right now the mediated devices are one-time use only, i.e. they have to be
> > recreated before every machine boot
> > - I wouldn't recommend assigning multiple vGPUs to a single domain
> > 
> > Once this series is sorted out, we can then continue with 'managed=yes' where
> > as Laine pointed out [1], we need to figure out how exactly should the
> > management layer hint libvirt which vGPU type should be used for device
> > instantiation.
> > 
> > [1] https://www.redhat.com/archives/libvir-list/2017-January/msg00287.html  
> > 

Re: [libvirt] [RFC PATCH 00/16] Introduce vGPU mdev framework to libvirt
Posted by Alex Williamson 7 years, 1 month ago
On Tue, 7 Feb 2017 17:26:51 +0100
Erik Skultety <eskultet@redhat.com> wrote:

> On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:
> > On Mon,  6 Feb 2017 13:19:42 +0100
> > Erik Skultety <eskultet@redhat.com> wrote:
> >   
> > > Finally. It's here. This is the initial suggestion on how libvirt might
> > > interract with the mdev framework, currently only focussing on the non-managed
> > > devices, i.e. those pre-created by the user, since that will be revisited once
> > > we all settled on how the XML should look like, given we might not want to use
> > > the sysfs path directly as an attribute in the domain XML. My proposal on the
> > > XML is the following:
> > > 
> > > <hostdev mode='subsystem' type='mdev'>  
> > >     <source>
> > >         <!-- this is the host's physical device address -->
> > >         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
> > >         <uuid>vGPU_UUID<uuid>
> > >     <source>
> > >     <!-- target PCI address can be omitted to assign it automatically -->
> > > </hostdev>
> > > 
> > > So the mediated device is identified by the physical parent device visible on
> > > the host and a UUID which allows us to construct the sysfs path by ourselves,
> > > which we then put on the QEMU's command line.  
> > 
> > Based on your test code, I think you're creating something like this:
> > 
> > -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
> > 
> > That would explain the need for the parent device address, but that's
> > an entirely self inflicted requirement.  For a managed="no" scenarios,
> > we shouldn't need the parent, we can get to the mdev device
> > via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc.  So it  
> 
> True, for managed="no" would this path be a nice optimization.
> 
> > seems that the UUID should be the only required source element for
> > managed="no".
> > 
> > For managed="yes", it seems like the parent device is still an optional  
> 
> The reason I went with the parent address element (and purposely neglecting the
> sample mtty driver) was that I assumed any modern mdev capable HW would be
> accessible through the PCI bus on the host. Also I wanted to explicitly hint
> libvirt as much as possible which parent device a vGPU device instance should
> be created on in case there are more than one of them, rather then scanning
> sysfs for a suitable parent which actually supports the given vGPU type.
> 
> > field.  The most important thing that libvirt needs to know when
> > creating a mdev device for a VM is the mdev type name.  The parent
> > device should be an optional field to help higher level management
> > tools deal with placement of the device for locality or load balancing.
> > Also, we can't assume that the parent device is a PCI device, the
> > sample mtty driver already breaks this assumption.  
> 
> Since we need to assume non-PCI devices and we still need to enable management
> to hint libvirt about the parent to utilize load balancing and stuff, I've come
> up with the following adjustments/ideas on how to reflect that in the XML:
> - still use the address element but use it with the 'type' attribute [1] (still
>   breaks the sample mtty driver though) while making the element truly optional
>   if I'm going to be outvoted in favor of scanning the directory for a suitable
>   parent device on our own, rather than requiring the user to provide that
> 
> - providing either an attribute or a standalone element for the parent device
>   name, like a string version of the PCI address or whatever form the parent
>   device comes in (doesn't break the mtty driver but I don't quite like this)
> 
> - providing a path element/attribute to sysfs pointing to the parent device
>   which I'm afraid is what Daniel is not in favor of libvirt doing
> 
> So, this is what I've so far come up with in terms of hinting libvirt about the
> parent device, do you have any input on this, maybe some more ideas on how we
> should identify the parent device?

IMO, if we cannot account for the mtty sample driver, we're doing it
wrong.  I suppose we can leave it unspecified how one selects a parent
device for the mtty driver, but it should be possible to expand the
syntax to include it.  So I think that means that when the parent
address is provided, the parent address type needs to be specified as
PCI.  So...

 <hostdev mode='subsystem' type='mdev'>

This needs to encompass the device API or else the optional VM address
cannot be resolved.  Perhaps model='vfio-pci' here?  Seems similar to
how we specify the device type for PCI controllers where we have
multiple options:

 <hostdev mode='subsystem' type='mdev' model='vfio-pci'>

   <source>

For managed='no', I don't see that anything other than the mdev UUID is
useful.

     <uuid>MDEV_UUID</uuid>

If libvirt gets into the business of creating mdev devices and we call
that managed='yes', then the mdev type to create is required.  I don't
know whether there's anything similar we can steal syntax from:

     <type>"nvidia-11"</type>

That's pretty horrible, needs some xml guru love.

We need to provide for specifying a parent, but we can't assume the
type and address format of the parent.  If we say the parent is a
string, then we don't care; libvirt simply matches
the /sys/class/mdev_bus/$STRING/ device.  If we want libvirt to
interpret the parent address (I can't figure out what value this adds, but
aiui raw strings are frowned upon), then the address type would need to
be specified. So instead of:

     <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'/>

Perhaps:

     <address type='pci' domain='0x0000' bus='0x00'...>

If we want to support mtty here, we'd need a different type, but at
least we have the possibility to define that (though if parent is
simply a string then even mtty works).

   </source>

The type of the optional <address> here would be based on the "model"
of the hostdev.

 </hostdev>

Clearly I'm less than a novice at defining xml tags, this is just what
I think needs to be there.
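
Putting those pieces together, the whole thing might read something like the
following (purely illustrative -- the type name and addresses are placeholders,
and it assumes creation would be requested via the usual managed='yes'
attribute on <hostdev>):

 <hostdev mode='subsystem' type='mdev' model='vfio-pci' managed='yes'>
   <source>
     <type>nvidia-11</type>
     <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
   </source>
   <!-- optional VM <address>, its format implied by model='vfio-pci' -->
 </hostdev>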

> > Also, grep'ing through the patches, I don't see that the "device_api"  
> 
> Yep, this was also on purpose since as you write below, right now the only
> functioning mdev devices we have to work with are vfio-pci capable only, so
> with this RFC I wanted to gather some feedback on whether I'm moving the right
> direction in the first place. So yeah, I thought this could be added at any point
> later.

Hmm, I think that might be painting us into a corner for the future.
 
> [1] http://libvirt.org/formatdomain.html#elementsAddress

I see here that we already do the <address type='pci'...> thing, so
using that as the parent source element fits.

The VM <address> though is optional, so we can't rely on cues from
that to tell us what mdev type to expect, which implies we do need
something like the model='vfio-pci' to determine the VM address format.
Thanks,

Alex

Re: [libvirt] Introduce vGPU mdev framework to libvirt
Posted by Martin Polednik 7 years, 1 month ago
On 07/02/17 12:29 -0700, Alex Williamson wrote:
>On Tue, 7 Feb 2017 17:26:51 +0100
>Erik Skultety <eskultet@redhat.com> wrote:
>
>> On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:
>> > On Mon,  6 Feb 2017 13:19:42 +0100
>> > Erik Skultety <eskultet@redhat.com> wrote:
>> >
>> > > Finally. It's here. This is the initial suggestion on how libvirt might
>> > > interract with the mdev framework, currently only focussing on the non-managed
>> > > devices, i.e. those pre-created by the user, since that will be revisited once
>> > > we all settled on how the XML should look like, given we might not want to use
>> > > the sysfs path directly as an attribute in the domain XML. My proposal on the
>> > > XML is the following:
>> > >
>> > > <hostdev mode='subsystem' type='mdev'>
>> > >     <source>
>> > >         <!-- this is the host's physical device address -->
>> > >         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
>> > >         <uuid>vGPU_UUID<uuid>
>> > >     <source>
>> > >     <!-- target PCI address can be omitted to assign it automatically -->
>> > > </hostdev>
>> > >
>> > > So the mediated device is identified by the physical parent device visible on
>> > > the host and a UUID which allows us to construct the sysfs path by ourselves,
>> > > which we then put on the QEMU's command line.
>> >
>> > Based on your test code, I think you're creating something like this:
>> >
>> > -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
>> >
>> > That would explain the need for the parent device address, but that's
>> > an entirely self inflicted requirement.  For a managed="no" scenarios,
>> > we shouldn't need the parent, we can get to the mdev device
>> > via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc.  So it
>>
>> True, for managed="no" would this path be a nice optimization.
>>
>> > seems that the UUID should be the only required source element for
>> > managed="no".
>> >
>> > For managed="yes", it seems like the parent device is still an optional
>>
>> The reason I went with the parent address element (and purposely neglecting the
>> sample mtty driver) was that I assumed any modern mdev capable HW would be
>> accessible through the PCI bus on the host. Also I wanted to explicitly hint
>> libvirt as much as possible which parent device a vGPU device instance should
>> be created on in case there are more than one of them, rather then scanning
>> sysfs for a suitable parent which actually supports the given vGPU type.
>>
>> > field.  The most important thing that libvirt needs to know when
>> > creating a mdev device for a VM is the mdev type name.  The parent
>> > device should be an optional field to help higher level management
>> > tools deal with placement of the device for locality or load balancing.
>> > Also, we can't assume that the parent device is a PCI device, the
>> > sample mtty driver already breaks this assumption.
>>
>> Since we need to assume non-PCI devices and we still need to enable management
>> to hint libvirt about the parent to utilize load balancing and stuff, I've come
>> up with the following adjustments/ideas on how to reflect that in the XML:
>> - still use the address element but use it with the 'type' attribute [1] (still
>>   breaks the sample mtty driver though) while making the element truly optional
>>   if I'm going to be outvoted in favor of scanning the directory for a suitable
>>   parent device on our own, rather than requiring the user to provide that
>>
>> - providing either an attribute or a standalone element for the parent device
>>   name, like a string version of the PCI address or whatever form the parent
>>   device comes in (doesn't break the mtty driver but I don't quite like this)
>>
>> - providing a path element/attribute to sysfs pointing to the parent device
>>   which I'm afraid is what Daniel is not in favor of libvirt doing
>>
>> So, this is what I've so far come up with in terms of hinting libvirt about the
>> parent device, do you have any input on this, maybe some more ideas on how we
>> should identify the parent device?
>
>IMO, if we cannot account for the mtty sample driver, we're doing it
>wrong.  I suppose we can leave it unspecified how one selects a parent
>device for the mtty driver, but it should be possible to expand the
>syntax to include it.  So I think that means that when the parent
>address is provided, the parent address type needs to be specified as
>PCI.  So...
>
> <hostdev mode='subsystem' type='mdev'>
>
>This needs to encompass the device API or else the optional VM address
>cannot be resolved.  Perhaps model='vfio-pci' here?  Seems similar to
>how we specify the device type for PCI controllers where we have
>multiple options:
>
> <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
>
>   <source>
>
>For managed='no', I don't see that anything other than the mdev UUID is
>useful.
>
>     <uuid>MDEV_UUID</uuid>
>
>If libvirt gets into the business of creating mdev devices and we call
>that managed='yes', then the mdev type to create is required.  I don't
>know whether there's anything similar we can steal syntax from:
>
>     <type>"nvidia-11"</type>
>
>That's pretty horrible, needs some xml guru love.
>
>We need to provide for specifying a parent, but we can't assume the

From a higher-level perspective, I believe it would be "good
enough" for most of the cases to only specify the type. Libvirt will
have to be able to enumerate the devices for listAllDevices
anyway, afaik.

My wish would be specifying
<hostdev mode='subsystem' type='mdev'>
    <type>nvidia-11</type>
</hostdev>
unless the user has specific requests or some other decision (mmio
numa placement) takes place.

We would additionally need (allocated instances/max instances of that
type) in listAllDevices to account for the specific assignment
possibility.

I'm not sure what the decision was wrt type naming; can 2 different
cards have a similarly named type with a different meaning?

Sorry if I've missed some obvious reason why this can't work.

mpolednik

>type and address format of the parent.  If we say the parent is a
>string, then we don't care, libvirt simply matches
>the /sys/clas/mdev_bus/$STRING/ device.  If we want libvirt to
>interpret the parent address (I can't figure what value this adds, but
>aiui raw strings are frowned upon), then the address type would need to
>be specified. So instead of:
>
>     <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
>
>Perhaps:
>
>     <address type='pci' domain='0x0000' bus='0x00'...>
>
>If we want to support mtty here, we'd need a different type, but at
>least we have the possibility to define that (though if parent is
>simply a string then even mtty works).
>
>   <source>
>
>The type of the optional <address> here would be based on the "model"
>of the hostdev.
>
> </hostdev>
>
>Clearly I'm less than a novice at defining xml tags, this is just what
>I think needs to be there.
>
>> > Also, grep'ing through the patches, I don't see that the "device_api"
>>
>> Yep, this was also on purpose since as you write below, right now the only
>> functioning mdev devices we have to work with are vfio-pci capable only, so
>> with this RFC I wanted to gather some feedback on whether I'm moving the right
>> direction in the first place. So yeah, I thought this could be added at any point
>> later.
>
>Hmm, I think that might be painting us into a corner for the future.
>
>> [1] http://libvirt.org/formatdomain.html#elementsAddress
>
>I see here that we already do the <address type='pci'...> thing, so
>using that as the parent source element fits.
>
>The VM <address> though is optional, so we can't rely on queues from
>that to tell us what mdev type to expect, which implies we do need
>something like the model='vfio-pci' to determine the VM address format.
>Thanks,
>
>Alex
>

Re: [libvirt] Introduce vGPU mdev framework to libvirt
Posted by Alex Williamson 7 years, 1 month ago
On Tue, 14 Feb 2017 16:50:14 +0100
Martin Polednik <mpolednik@redhat.com> wrote:

> On 07/02/17 12:29 -0700, Alex Williamson wrote:
> >On Tue, 7 Feb 2017 17:26:51 +0100
> >Erik Skultety <eskultet@redhat.com> wrote:
> >  
> >> On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:  
> >> > On Mon,  6 Feb 2017 13:19:42 +0100
> >> > Erik Skultety <eskultet@redhat.com> wrote:
> >> >  
> >> > > Finally. It's here. This is the initial suggestion on how libvirt might
> >> > > interract with the mdev framework, currently only focussing on the non-managed
> >> > > devices, i.e. those pre-created by the user, since that will be revisited once
> >> > > we all settled on how the XML should look like, given we might not want to use
> >> > > the sysfs path directly as an attribute in the domain XML. My proposal on the
> >> > > XML is the following:
> >> > >
> >> > > <hostdev mode='subsystem' type='mdev'>
> >> > >     <source>
> >> > >         <!-- this is the host's physical device address -->
> >> > >         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
> >> > >         <uuid>vGPU_UUID<uuid>
> >> > >     <source>
> >> > >     <!-- target PCI address can be omitted to assign it automatically -->
> >> > > </hostdev>
> >> > >
> >> > > So the mediated device is identified by the physical parent device visible on
> >> > > the host and a UUID which allows us to construct the sysfs path by ourselves,
> >> > > which we then put on the QEMU's command line.  
> >> >
> >> > Based on your test code, I think you're creating something like this:
> >> >
> >> > -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
> >> >
> >> > That would explain the need for the parent device address, but that's
> >> > an entirely self inflicted requirement.  For a managed="no" scenarios,
> >> > we shouldn't need the parent, we can get to the mdev device
> >> > via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc.  So it  
> >>
> >> True, for managed="no" would this path be a nice optimization.
> >>  
> >> > seems that the UUID should be the only required source element for
> >> > managed="no".
> >> >
> >> > For managed="yes", it seems like the parent device is still an optional  
> >>
> >> The reason I went with the parent address element (and purposely neglecting the
> >> sample mtty driver) was that I assumed any modern mdev capable HW would be
> >> accessible through the PCI bus on the host. Also I wanted to explicitly hint
> >> libvirt as much as possible which parent device a vGPU device instance should
> >> be created on in case there are more than one of them, rather then scanning
> >> sysfs for a suitable parent which actually supports the given vGPU type.
> >>  
> >> > field.  The most important thing that libvirt needs to know when
> >> > creating a mdev device for a VM is the mdev type name.  The parent
> >> > device should be an optional field to help higher level management
> >> > tools deal with placement of the device for locality or load balancing.
> >> > Also, we can't assume that the parent device is a PCI device, the
> >> > sample mtty driver already breaks this assumption.  
> >>
> >> Since we need to assume non-PCI devices and we still need to enable management
> >> to hint libvirt about the parent to utilize load balancing and stuff, I've come
> >> up with the following adjustments/ideas on how to reflect that in the XML:
> >> - still use the address element but use it with the 'type' attribute [1] (still
> >>   breaks the sample mtty driver though) while making the element truly optional
> >>   if I'm going to be outvoted in favor of scanning the directory for a suitable
> >>   parent device on our own, rather than requiring the user to provide that
> >>
> >> - providing either an attribute or a standalone element for the parent device
> >>   name, like a string version of the PCI address or whatever form the parent
> >>   device comes in (doesn't break the mtty driver but I don't quite like this)
> >>
> >> - providing a path element/attribute to sysfs pointing to the parent device
> >>   which I'm afraid is what Daniel is not in favor of libvirt doing
> >>
> >> So, this is what I've so far come up with in terms of hinting libvirt about the
> >> parent device, do you have any input on this, maybe some more ideas on how we
> >> should identify the parent device?  
> >
> >IMO, if we cannot account for the mtty sample driver, we're doing it
> >wrong.  I suppose we can leave it unspecified how one selects a parent
> >device for the mtty driver, but it should be possible to expand the
> >syntax to include it.  So I think that means that when the parent
> >address is provided, the parent address type needs to be specified as
> >PCI.  So...
> >
> > <hostdev mode='subsystem' type='mdev'>
> >
> >This needs to encompass the device API or else the optional VM address
> >cannot be resolved.  Perhaps model='vfio-pci' here?  Seems similar to
> >how we specify the device type for PCI controllers where we have
> >multiple options:
> >
> > <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
> >
> >   <source>
> >
> >For managed='no', I don't see that anything other than the mdev UUID is
> >useful.
> >
> >     <uuid>MDEV_UUID</uuid>
> >
> >If libvirt gets into the business of creating mdev devices and we call
> >that managed='yes', then the mdev type to create is required.  I don't
> >know whether there's anything similar we can steal syntax from:
> >
> >     <type>"nvidia-11"</type>
> >
> >That's pretty horrible, needs some xml guru love.
> >
> >We need to provide for specifying a parent, but we can't assume the  
> 
> From higher level perspective, I believe it would be "good
> enough" for most of the cases to only specify the type. Libvirt will
> anyway have to be able to enumerate the devices for listAllDevices
> afaik.
> 
> My wish would be specifying
> <hostdev mode='subsystem' type='mdev'>
>     <type>nvidia-11</type>
> </hostdev>
> unless the user has specific requests or some other decision (mmio
> numa placement) takes place.

Yes, the <type> is the minimum information necessary for libvirt to
create the mdev device itself.  A <source> section could add optional
placement information.  Note though that without an nvidia-11 type
device on the system to query, the xml doesn't tell us what sort of
device this creates in the VM.  We could assume that it's vfio-pci, but
designing in an assumption isn't a great idea.  So, as above, some
mechanism to make the XML self-contained, such as specifying the model
as vfio-pci, helps avoid that assumption and allows us to know the
format for expressing the VM <address>.
 
> We would additionally need (allocated instances/max instances of that
> type) in listAllDevices to account for the specific assignment
> possibility.

mdev devices support an available_instances per mdev type that is
dynamically updated as devices are created.  The interaction of
available_instances between different types is going to require some
heuristics to understand.  Some vendors may not support heterogeneous
types, others may pull from a common pool of resources, where each type
may consume resources from that pool at different rates.
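
(For reference, a minimal sketch of where that lives in sysfs; the parent
address and type name are placeholders:

    cat /sys/class/mdev_bus/0000:00:03.0/mdev_supported_types/nvidia-11/available_instances
    4

creating a device of one type may also lower the available_instances of the
other types, depending on how the vendor driver accounts its resources.)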
 
> I'm not sure what the decision was wrt type naming, can 2 different
> cards have similarly named type with different meaning?

We don't deal in similarities; each type ID is unique and it's up to
the mdev vendor driver to make sure that an "nvidia-11" on an M60 card
is software equivalent to an "nvidia-11" on an M10 card.  If they're
not equivalent, the type ID will be different.  Something we may want
to consider eventually is whether we want/need to deal with
compatibility strings.  For instance, NVIDIA seems to be tying the type
ID strongly to specific implementations, an nvidia-11 may only be
available on an M60 card.  An M10 card may offer an nvidia-21 type with
similar capabilities.  There may be a need to express an mdev device as
compatible with various type IDs for hardware availability, at the risk
of exposing slight variations to the VM.  This could also make
placement easier for vendor drivers that only support homogeneous mdev
devices, "I prefer an mdev ID of type 'nvidia-11', but will accept one
of type 'nvidia-12,nvidia-21'".  Thanks,

Alex

Re: [libvirt] Introduce vGPU mdev framework to libvirt
Posted by Martin Polednik 7 years, 1 month ago
On 14/02/17 09:58 -0700, Alex Williamson wrote:
>On Tue, 14 Feb 2017 16:50:14 +0100
>Martin Polednik <mpolednik@redhat.com> wrote:
>
>> On 07/02/17 12:29 -0700, Alex Williamson wrote:
>> >On Tue, 7 Feb 2017 17:26:51 +0100
>> >Erik Skultety <eskultet@redhat.com> wrote:
>> >
>> >> On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:
>> >> > On Mon,  6 Feb 2017 13:19:42 +0100
>> >> > Erik Skultety <eskultet@redhat.com> wrote:
>> >> >
>> >> > > Finally. It's here. This is the initial suggestion on how libvirt might
>> >> > > interract with the mdev framework, currently only focussing on the non-managed
>> >> > > devices, i.e. those pre-created by the user, since that will be revisited once
>> >> > > we all settled on how the XML should look like, given we might not want to use
>> >> > > the sysfs path directly as an attribute in the domain XML. My proposal on the
>> >> > > XML is the following:
>> >> > >
>> >> > > <hostdev mode='subsystem' type='mdev'>
>> >> > >     <source>
>> >> > >         <!-- this is the host's physical device address -->
>> >> > >         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
>> >> > >         <uuid>vGPU_UUID<uuid>
>> >> > >     <source>
>> >> > >     <!-- target PCI address can be omitted to assign it automatically -->
>> >> > > </hostdev>
>> >> > >
>> >> > > So the mediated device is identified by the physical parent device visible on
>> >> > > the host and a UUID which allows us to construct the sysfs path by ourselves,
>> >> > > which we then put on the QEMU's command line.
>> >> >
>> >> > Based on your test code, I think you're creating something like this:
>> >> >
>> >> > -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
>> >> >
>> >> > That would explain the need for the parent device address, but that's
>> >> > an entirely self inflicted requirement.  For a managed="no" scenarios,
>> >> > we shouldn't need the parent, we can get to the mdev device
>> >> > via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc.  So it
>> >>
>> >> True, for managed="no" would this path be a nice optimization.
>> >>
>> >> > seems that the UUID should be the only required source element for
>> >> > managed="no".
>> >> >
>> >> > For managed="yes", it seems like the parent device is still an optional
>> >>
>> >> The reason I went with the parent address element (and purposely neglecting the
>> >> sample mtty driver) was that I assumed any modern mdev capable HW would be
>> >> accessible through the PCI bus on the host. Also I wanted to explicitly hint
>> >> libvirt as much as possible which parent device a vGPU device instance should
>> >> be created on in case there are more than one of them, rather then scanning
>> >> sysfs for a suitable parent which actually supports the given vGPU type.
>> >>
>> >> > field.  The most important thing that libvirt needs to know when
>> >> > creating a mdev device for a VM is the mdev type name.  The parent
>> >> > device should be an optional field to help higher level management
>> >> > tools deal with placement of the device for locality or load balancing.
>> >> > Also, we can't assume that the parent device is a PCI device, the
>> >> > sample mtty driver already breaks this assumption.
>> >>
>> >> Since we need to assume non-PCI devices and we still need to enable management
>> >> to hint libvirt about the parent to utilize load balancing and stuff, I've come
>> >> up with the following adjustments/ideas on how to reflect that in the XML:
>> >> - still use the address element but use it with the 'type' attribute [1] (still
>> >>   breaks the sample mtty driver though) while making the element truly optional
>> >>   if I'm going to be outvoted in favor of scanning the directory for a suitable
>> >>   parent device on our own, rather than requiring the user to provide that
>> >>
>> >> - providing either an attribute or a standalone element for the parent device
>> >>   name, like a string version of the PCI address or whatever form the parent
>> >>   device comes in (doesn't break the mtty driver but I don't quite like this)
>> >>
>> >> - providing a path element/attribute to sysfs pointing to the parent device
>> >>   which I'm afraid is what Daniel is not in favor of libvirt doing
>> >>
>> >> So, this is what I've so far come up with in terms of hinting libvirt about the
>> >> parent device, do you have any input on this, maybe some more ideas on how we
>> >> should identify the parent device?
>> >
>> >IMO, if we cannot account for the mtty sample driver, we're doing it
>> >wrong.  I suppose we can leave it unspecified how one selects a parent
>> >device for the mtty driver, but it should be possible to expand the
>> >syntax to include it.  So I think that means that when the parent
>> >address is provided, the parent address type needs to be specified as
>> >PCI.  So...
>> >
>> > <hostdev mode='subsystem' type='mdev'>
>> >
>> >This needs to encompass the device API or else the optional VM address
>> >cannot be resolved.  Perhaps model='vfio-pci' here?  Seems similar to
>> >how we specify the device type for PCI controllers where we have
>> >multiple options:
>> >
>> > <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
>> >
>> >   <source>
>> >
>> >For managed='no', I don't see that anything other than the mdev UUID is
>> >useful.
>> >
>> >     <uuid>MDEV_UUID</uuid>
>> >
>> >If libvirt gets into the business of creating mdev devices and we call
>> >that managed='yes', then the mdev type to create is required.  I don't
>> >know whether there's anything similar we can steal syntax from:
>> >
>> >     <type>"nvidia-11"</type>
>> >
>> >That's pretty horrible, needs some xml guru love.
>> >
>> >We need to provide for specifying a parent, but we can't assume the
>>
>> From higher level perspective, I believe it would be "good
>> enough" for most of the cases to only specify the type. Libvirt will
>> anyway have to be able to enumerate the devices for listAllDevices
>> afaik.
>>
>> My wish would be specifying
>> <hostdev mode='subsystem' type='mdev'>
>>     <type>nvidia-11</type>
>> </hostdev>
>> unless the user has specific requests or some other decision (mmio
>> numa placement) takes place.
>
>Yes, the <type> is the minimum information necessary for libvirt to
>create the mdev device itself.  A <source> section could add optional
>placement information.  Note though that without an nvidia-11 type
>device on the system to query, the xml doesn't tell us what sort of
>device this creates in the VM.  We could assume that it's vfio-pci, but
>designing in an assumption isn't a great idea.  So, as above, some
>mechanism to make the xml self contained, such as specifying the model
>as vfio-pci, helps avoid that assumption and allows us to know the
>format for expressing the VM <address>

As long as libvirt provides means to determine the model via device
listing (listAllDevices), OK.

>> We would additionally need (allocated instances/max instances of that
>> type) in listAllDevices to account for the specific assignment
>> possibility.
>
>mdev devices support an available_instances per mdev type that is
>dynamically updated as devices are created.  The interaction of
>available_instances between different types is going to require some
>heuristics to understand.  Some vendors may not support heterogeneous
>types, others may pull from a common pool of resources, where each type
>may consume resources from that pool at different rates.

Given common pool semantics, will we be able to calculate how many of
each type will be available in the pool if we were to instantiate a
certain type? Example:

available types:
type_a: 4 devices (each consumes 1 "slot")
type_b: 1 device  (each consumes 4 "slots")
total "slots": 4

we know that creating a type_a device prevents any
more type_b devices from being created.

Does NVIDIA or AMD use a common pool?

>> I'm not sure what the decision was wrt type naming, can 2 different
>> cards have similarly named type with different meaning?
>
>We don't deal in similarities, each type ID is unique and it's up to
>the mdev vendor driver to make sure that an "nvidia-11" on and M60 card
>is software equivalent to an "nvidia-11" on an M10 card.  If they're
>not equivalent, the type ID will be different.  Something we may want
>to consider eventually is whether we want/need to deal with
>compatibility strings.  For instance, NVIDIA seems to be tying the type
>ID strongly to specific implementations, an nvidia-11 may only be
>available on an M60 card.  An M10 card may offer an nvidia-21 type with
>similar capabilities.  There may be a need to express an mdev device as
>compatible with various type IDs for hardware availability, at the risk
>of exposing slight variations to the VM.  This could also make
>placement easier for vendor drivers that only support homogeneous mdev
>devices, "I prefer an mdev ID of type 'nvidia-11', but will accept one
>of type 'nvidia-12,nvidia-21'".  Thanks,

I like the idea of libvirt being able to select one of the specified
types, but we have to bear in mind that it'll slightly complicate the XML:

    <mdev_types>
        <type>nvidia-11</type>
        <type>nvidia-21</type>
    </mdev_types>

That luckily shouldn't be a problem for libvirt or management software.
On the other hand, the type equivalence will require some kind of
labeling on the management side -- a user defines "mygpu" as "vgpu with
type nvidia-11 or nvidia-21" -- unless libvirt commits to maintaining a
database of capability-equivalent types for devices (which, given
the generic-ness of mdev, doesn't seem like a good idea).

>Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] Introduce vGPU mdev framework to libvirt
Posted by Erik Skultety 7 years, 1 month ago
On Wed, Feb 15, 2017 at 09:50:03AM +0100, Martin Polednik wrote:
> On 14/02/17 09:58 -0700, Alex Williamson wrote:
> > On Tue, 14 Feb 2017 16:50:14 +0100
> > Martin Polednik <mpolednik@redhat.com> wrote:
> > 
> > > On 07/02/17 12:29 -0700, Alex Williamson wrote:
> > > >On Tue, 7 Feb 2017 17:26:51 +0100
> > > >Erik Skultety <eskultet@redhat.com> wrote:
> > > >
> > > >> On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:
> > > >> > On Mon,  6 Feb 2017 13:19:42 +0100
> > > >> > Erik Skultety <eskultet@redhat.com> wrote:
> > > >> >
> > > >> > > Finally. It's here. This is the initial suggestion on how libvirt might
> > > >> > > interract with the mdev framework, currently only focussing on the non-managed
> > > >> > > devices, i.e. those pre-created by the user, since that will be revisited once
> > > >> > > we all settled on how the XML should look like, given we might not want to use
> > > >> > > the sysfs path directly as an attribute in the domain XML. My proposal on the
> > > >> > > XML is the following:
> > > >> > >
> > > >> > > <hostdev mode='subsystem' type='mdev'>
> > > >> > >     <source>
> > > >> > >         <!-- this is the host's physical device address -->
> > > >> > >         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
> > > >> > >         <uuid>vGPU_UUID<uuid>
> > > >> > >     <source>
> > > >> > >     <!-- target PCI address can be omitted to assign it automatically -->
> > > >> > > </hostdev>
> > > >> > >
> > > >> > > So the mediated device is identified by the physical parent device visible on
> > > >> > > the host and a UUID which allows us to construct the sysfs path by ourselves,
> > > >> > > which we then put on the QEMU's command line.
> > > >> >
> > > >> > Based on your test code, I think you're creating something like this:
> > > >> >
> > > >> > -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
> > > >> >
> > > >> > That would explain the need for the parent device address, but that's
> > > >> > an entirely self inflicted requirement.  For a managed="no" scenarios,
> > > >> > we shouldn't need the parent, we can get to the mdev device
> > > >> > via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc.  So it
> > > >>
> > > >> True, for managed="no" would this path be a nice optimization.
> > > >>
> > > >> > seems that the UUID should be the only required source element for
> > > >> > managed="no".
> > > >> >
> > > >> > For managed="yes", it seems like the parent device is still an optional
> > > >>
> > > >> The reason I went with the parent address element (and purposely neglecting the
> > > >> sample mtty driver) was that I assumed any modern mdev capable HW would be
> > > >> accessible through the PCI bus on the host. Also I wanted to explicitly hint
> > > >> libvirt as much as possible which parent device a vGPU device instance should
> > > >> be created on in case there are more than one of them, rather then scanning
> > > >> sysfs for a suitable parent which actually supports the given vGPU type.
> > > >>
> > > >> > field.  The most important thing that libvirt needs to know when
> > > >> > creating a mdev device for a VM is the mdev type name.  The parent
> > > >> > device should be an optional field to help higher level management
> > > >> > tools deal with placement of the device for locality or load balancing.
> > > >> > Also, we can't assume that the parent device is a PCI device, the
> > > >> > sample mtty driver already breaks this assumption.
> > > >>
> > > >> Since we need to assume non-PCI devices and we still need to enable management
> > > >> to hint libvirt about the parent to utilize load balancing and stuff, I've come
> > > >> up with the following adjustments/ideas on how to reflect that in the XML:
> > > >> - still use the address element but use it with the 'type' attribute [1] (still
> > > >>   breaks the sample mtty driver though) while making the element truly optional
> > > >>   if I'm going to be outvoted in favor of scanning the directory for a suitable
> > > >>   parent device on our own, rather than requiring the user to provide that
> > > >>
> > > >> - providing either an attribute or a standalone element for the parent device
> > > >>   name, like a string version of the PCI address or whatever form the parent
> > > >>   device comes in (doesn't break the mtty driver but I don't quite like this)
> > > >>
> > > >> - providing a path element/attribute to sysfs pointing to the parent device
> > > >>   which I'm afraid is what Daniel is not in favor of libvirt doing
> > > >>
> > > >> So, this is what I've so far come up with in terms of hinting libvirt about the
> > > >> parent device, do you have any input on this, maybe some more ideas on how we
> > > >> should identify the parent device?
> > > >
> > > >IMO, if we cannot account for the mtty sample driver, we're doing it
> > > >wrong.  I suppose we can leave it unspecified how one selects a parent
> > > >device for the mtty driver, but it should be possible to expand the
> > > >syntax to include it.  So I think that means that when the parent
> > > >address is provided, the parent address type needs to be specified as
> > > >PCI.  So...
> > > >
> > > > <hostdev mode='subsystem' type='mdev'>
> > > >
> > > >This needs to encompass the device API or else the optional VM address
> > > >cannot be resolved.  Perhaps model='vfio-pci' here?  Seems similar to
> > > >how we specify the device type for PCI controllers where we have
> > > >multiple options:
> > > >
> > > > <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
> > > >
> > > >   <source>
> > > >
> > > >For managed='no', I don't see that anything other than the mdev UUID is
> > > >useful.
> > > >
> > > >     <uuid>MDEV_UUID</uuid>
> > > >
> > > >If libvirt gets into the business of creating mdev devices and we call
> > > >that managed='yes', then the mdev type to create is required.  I don't
> > > >know whether there's anything similar we can steal syntax from:
> > > >
> > > >     <type>"nvidia-11"</type>
> > > >
> > > >That's pretty horrible, needs some xml guru love.
> > > >
> > > >We need to provide for specifying a parent, but we can't assume the
> > > 
> > > From higher level perspective, I believe it would be "good
> > > enough" for most of the cases to only specify the type. Libvirt will
> > > anyway have to be able to enumerate the devices for listAllDevices
> > > afaik.
> > > 
> > > My wish would be specifying
> > > <hostdev mode='subsystem' type='mdev'>
> > >     <type>nvidia-11</type>
> > > </hostdev>
> > > unless the user has specific requests or some other decision (mmio
> > > numa placement) takes place.
> > 
> > Yes, the <type> is the minimum information necessary for libvirt to
> > create the mdev device itself.  A <source> section could add optional
> > placement information.  Note though that without an nvidia-11 type
> > device on the system to query, the xml doesn't tell us what sort of
> > device this creates in the VM.  We could assume that it's vfio-pci, but
> > designing in an assumption isn't a great idea.  So, as above, some
> > mechanism to make the xml self contained, such as specifying the model
> > as vfio-pci, helps avoid that assumption and allows us to know the
> > format for expressing the VM <address>
> 
> As long as libvirt provides means to determine the model via device
> listing (listAllDevices), OK.
> 

Yes, libvirt will provide means to expose this information.

> > > We would additionally need (allocated instances/max instances of that
> > > type) in listAllDevices to account for the specific assignment
> > > possibility.
> > 
> > mdev devices support an available_instances per mdev type that is
> > dynamically updated as devices are created.  The interaction of
> > available_instances between different types is going to require some
> > heuristics to understand.  Some vendors may not support heterogeneous
> > types, others may pull from a common pool of resources, where each type
> > may consume resources from that pool at different rates.
> 
> Given common pool semantics, will we be able to calculate how many of
> each type will be available in the pool if we were to instantiate
> certain type? Example:
> 
> available types:
> type_a: 4 devices (each consumes 1 "slot")
> type_b: 1 device  (each consumes 4 "slots")
> total "slots": 4
> 

Well, if we could assume that the number of instances for a specific type is
always a power of 2 and the resources are distributed in that manner, then
it's simple: you're allocating the resources that a more resource-demanding
type would need to instantiate a single device, so you end up with one less
device for each more resource-demanding type, recursively. However, that is a
strong assumption to make, so I'm not sure; it may be that available_instances,
which only updates once you instantiate a specific type, is the only thing we
should rely on.
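
For completeness, a rough sketch of where that information lives in sysfs
(paths follow the kernel's mediated device documentation; the parent address,
type name and value below are just examples):

    # re-evaluated by the vendor driver as devices come and go
    $ cat /sys/class/mdev_bus/0000:00:03.0/mdev_supported_types/nvidia-11/available_instances
    16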

> we know that creating type_a device prevents any
> more type_b devices to be created.
> 
> Does NVIDIA or AMD use the common pool?
> 
> > > I'm not sure what the decision was wrt type naming, can 2 different
> > > cards have similarly named type with different meaning?
> > 
> > We don't deal in similarities, each type ID is unique and it's up to
> > the mdev vendor driver to make sure that an "nvidia-11" on and M60 card
> > is software equivalent to an "nvidia-11" on an M10 card.  If they're
> > not equivalent, the type ID will be different.  Something we may want
> > to consider eventually is whether we want/need to deal with
> > compatibility strings.  For instance, NVIDIA seems to be tying the type
> > ID strongly to specific implementations, an nvidia-11 may only be
> > available on an M60 card.  An M10 card may offer an nvidia-21 type with
> > similar capabilities.  There may be a need to express an mdev device as
> > compatible with various type IDs for hardware availability, at the risk
> > of exposing slight variations to the VM.  This could also make
> > placement easier for vendor drivers that only support homogeneous mdev
> > devices, "I prefer an mdev ID of type 'nvidia-11', but will accept one
> > of type 'nvidia-12,nvidia-21'".  Thanks,
> 
> I like the idea of libvirt being able to select one of specified
> types, we have to bear in mind that it'll slightly complicate the XML:
> 
>    <mdev_types>
>        <type>nvidia-11</type>
>        <type>nvidia-21</type>
>    </mdev_types>

^^ are you referring to the nodedev XML or the domain XML? Because in the case
of a domain, there should be only one type per <hostdev type='mdev'>. There is
also the ongoing question of what the best way is to approach creation of mdevs
with libvirt, and we have to be very careful with that so it won't bite us back
in the future.
However, for 7.4 the priority is to accept a pre-created device and to provide
means in the nodedev driver to list all existing mdev devices and their
corresponding parent devices.
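
As a rough sketch of what such a listing has to work with, every existing mdev
shows up under the mdev bus and its sysfs path reveals the parent (the UUID and
parent address below are just examples):

    $ ls /sys/bus/mdev/devices/
    53764d0e-85a0-42b4-af5c-2046b460b1dc
    $ readlink /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc
    ../../../devices/pci0000:00/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc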

> 
> That luckily shouldn't be problem for libvirt or management software.
> On the other hand, the type equivalence will require some kind of
> labeling on the management side -- user defines "mygpu" as "vgpu with
> type nvidia-11 or nvidia-21" unless libvirt commits to a maintaining a
> database with capability-equivalent types for devices (which, given
> the generic-ness of the mdev, doesn't seem like a good idea).
> 

Libvirt definitely shouldn't be handling type compatibility-related issues.
As Alex pointed out, this should be the vendor driver's responsibility. There's
also Intel's KVMGT, which has a different approach to its type IDs. IIUC they
based their type IDs on the fraction of actual resources used, i.e. type _1
consumes the whole HW, _2 consumes half, etc., but this is a question for Alex
as he's been playing with it for some time. Anyhow, from my understanding
Intel's types look more generic and thus more compatible across different HW
revisions; if so, then by dealing with type compatibility libvirt would be
tailoring its logic to a specific vendor's use, whereas I think libvirt should
only focus on interacting with the mdev framework using the data it's got from
the user. IOW, new mdev-capable HW will keep coming out, which would in turn
just bring more types to deal with. If the vendor driver isn't willing to
accept any type other than the set it's exporting, then I think the management
layer may want to compensate for this with the information it can query from
libvirt.

Erik

> > Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] Introduce vGPU mdev framework to libvirt
Posted by Alex Williamson 7 years, 1 month ago
On Wed, 15 Feb 2017 14:43:22 +0100
Erik Skultety <eskultet@redhat.com> wrote:

> On Wed, Feb 15, 2017 at 09:50:03AM +0100, Martin Polednik wrote:
> > On 14/02/17 09:58 -0700, Alex Williamson wrote:  
> > > On Tue, 14 Feb 2017 16:50:14 +0100
> > > Martin Polednik <mpolednik@redhat.com> wrote:
> > >   
> > > > On 07/02/17 12:29 -0700, Alex Williamson wrote:  
> > > > >On Tue, 7 Feb 2017 17:26:51 +0100
> > > > >Erik Skultety <eskultet@redhat.com> wrote:
> > > > >  
> > > > >> On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:  
> > > > >> > On Mon,  6 Feb 2017 13:19:42 +0100
> > > > >> > Erik Skultety <eskultet@redhat.com> wrote:
> > > > >> >  
> > > > >> > > Finally. It's here. This is the initial suggestion on how libvirt might
> > > > >> > > interract with the mdev framework, currently only focussing on the non-managed
> > > > >> > > devices, i.e. those pre-created by the user, since that will be revisited once
> > > > >> > > we all settled on how the XML should look like, given we might not want to use
> > > > >> > > the sysfs path directly as an attribute in the domain XML. My proposal on the
> > > > >> > > XML is the following:
> > > > >> > >
> > > > >> > > <hostdev mode='subsystem' type='mdev'>
> > > > >> > >     <source>
> > > > >> > >         <!-- this is the host's physical device address -->
> > > > >> > >         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
> > > > >> > >         <uuid>vGPU_UUID<uuid>
> > > > >> > >     <source>
> > > > >> > >     <!-- target PCI address can be omitted to assign it automatically -->
> > > > >> > > </hostdev>
> > > > >> > >
> > > > >> > > So the mediated device is identified by the physical parent device visible on
> > > > >> > > the host and a UUID which allows us to construct the sysfs path by ourselves,
> > > > >> > > which we then put on the QEMU's command line.  
> > > > >> >
> > > > >> > Based on your test code, I think you're creating something like this:
> > > > >> >
> > > > >> > -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
> > > > >> >
> > > > >> > That would explain the need for the parent device address, but that's
> > > > >> > an entirely self inflicted requirement.  For a managed="no" scenarios,
> > > > >> > we shouldn't need the parent, we can get to the mdev device
> > > > >> > via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc.  So it  
> > > > >>
> > > > >> True, for managed="no" would this path be a nice optimization.
> > > > >>  
> > > > >> > seems that the UUID should be the only required source element for
> > > > >> > managed="no".
> > > > >> >
> > > > >> > For managed="yes", it seems like the parent device is still an optional  
> > > > >>
> > > > >> The reason I went with the parent address element (and purposely neglecting the
> > > > >> sample mtty driver) was that I assumed any modern mdev capable HW would be
> > > > >> accessible through the PCI bus on the host. Also I wanted to explicitly hint
> > > > >> libvirt as much as possible which parent device a vGPU device instance should
> > > > >> be created on in case there are more than one of them, rather then scanning
> > > > >> sysfs for a suitable parent which actually supports the given vGPU type.
> > > > >>  
> > > > >> > field.  The most important thing that libvirt needs to know when
> > > > >> > creating a mdev device for a VM is the mdev type name.  The parent
> > > > >> > device should be an optional field to help higher level management
> > > > >> > tools deal with placement of the device for locality or load balancing.
> > > > >> > Also, we can't assume that the parent device is a PCI device, the
> > > > >> > sample mtty driver already breaks this assumption.  
> > > > >>
> > > > >> Since we need to assume non-PCI devices and we still need to enable management
> > > > >> to hint libvirt about the parent to utilize load balancing and stuff, I've come
> > > > >> up with the following adjustments/ideas on how to reflect that in the XML:
> > > > >> - still use the address element but use it with the 'type' attribute [1] (still
> > > > >>   breaks the sample mtty driver though) while making the element truly optional
> > > > >>   if I'm going to be outvoted in favor of scanning the directory for a suitable
> > > > >>   parent device on our own, rather than requiring the user to provide that
> > > > >>
> > > > >> - providing either an attribute or a standalone element for the parent device
> > > > >>   name, like a string version of the PCI address or whatever form the parent
> > > > >>   device comes in (doesn't break the mtty driver but I don't quite like this)
> > > > >>
> > > > >> - providing a path element/attribute to sysfs pointing to the parent device
> > > > >>   which I'm afraid is what Daniel is not in favor of libvirt doing
> > > > >>
> > > > >> So, this is what I've so far come up with in terms of hinting libvirt about the
> > > > >> parent device, do you have any input on this, maybe some more ideas on how we
> > > > >> should identify the parent device?  
> > > > >
> > > > >IMO, if we cannot account for the mtty sample driver, we're doing it
> > > > >wrong.  I suppose we can leave it unspecified how one selects a parent
> > > > >device for the mtty driver, but it should be possible to expand the
> > > > >syntax to include it.  So I think that means that when the parent
> > > > >address is provided, the parent address type needs to be specified as
> > > > >PCI.  So...
> > > > >
> > > > > <hostdev mode='subsystem' type='mdev'>
> > > > >
> > > > >This needs to encompass the device API or else the optional VM address
> > > > >cannot be resolved.  Perhaps model='vfio-pci' here?  Seems similar to
> > > > >how we specify the device type for PCI controllers where we have
> > > > >multiple options:
> > > > >
> > > > > <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
> > > > >
> > > > >   <source>
> > > > >
> > > > >For managed='no', I don't see that anything other than the mdev UUID is
> > > > >useful.
> > > > >
> > > > >     <uuid>MDEV_UUID</uuid>
> > > > >
> > > > >If libvirt gets into the business of creating mdev devices and we call
> > > > >that managed='yes', then the mdev type to create is required.  I don't
> > > > >know whether there's anything similar we can steal syntax from:
> > > > >
> > > > >     <type>"nvidia-11"</type>
> > > > >
> > > > >That's pretty horrible, needs some xml guru love.
> > > > >
> > > > >We need to provide for specifying a parent, but we can't assume the  
> > > > 
> > > > From higher level perspective, I believe it would be "good
> > > > enough" for most of the cases to only specify the type. Libvirt will
> > > > anyway have to be able to enumerate the devices for listAllDevices
> > > > afaik.
> > > > 
> > > > My wish would be specifying
> > > > <hostdev mode='subsystem' type='mdev'>
> > > >     <type>nvidia-11</type>
> > > > </hostdev>
> > > > unless the user has specific requests or some other decision (mmio
> > > > numa placement) takes place.  
> > > 
> > > Yes, the <type> is the minimum information necessary for libvirt to
> > > create the mdev device itself.  A <source> section could add optional
> > > placement information.  Note though that without an nvidia-11 type
> > > device on the system to query, the xml doesn't tell us what sort of
> > > device this creates in the VM.  We could assume that it's vfio-pci, but
> > > designing in an assumption isn't a great idea.  So, as above, some
> > > mechanism to make the xml self contained, such as specifying the model
> > > as vfio-pci, helps avoid that assumption and allows us to know the
> > > format for expressing the VM <address>  
> > 
> > As long as libvirt provides means to determine the model via device
> > listing (listAllDevices), OK.
> >   
> 
> Yes, libvirt will provide means expose this information.
> 
> > > > We would additionally need (allocated instances/max instances of that
> > > > type) in listAllDevices to account for the specific assignment
> > > > possibility.  
> > > 
> > > mdev devices support an available_instances per mdev type that is
> > > dynamically updated as devices are created.  The interaction of
> > > available_instances between different types is going to require some
> > > heuristics to understand.  Some vendors may not support heterogeneous
> > > types, others may pull from a common pool of resources, where each type
> > > may consume resources from that pool at different rates.  
> > 
> > Given common pool semantics, will we be able to calculate how many of
> > each type will be available in the pool if we were to instantiate
> > certain type? Example:
> > 
> > available types:
> > type_a: 4 devices (each consumes 1 "slot")
> > type_b: 1 device  (each consumes 4 "slots")
> > total "slots": 4
> >   
> 
> Well, if we could assume that the number of instances for a specific type would
> always be a power of 2 and the resources are distributed in that manner, then
> it's simple, you're allocating a resources that a more resource-demanding type
> would need to instantiate a single device, so you'll end up with one less
> device for each more resource-demanding type recursively. However, that is a
> strong assumption to make, so I'm not sure, it's possible that available
> instances, which only updates once you instantiated a specific type, is the
> only thing we should rely on.

Agree, and vendors can change how they manage this at any time.  For
instance, if I boot one version of the kernel, i915 gives me:

i915-GVTg_V4_1
i915-GVTg_V4_2
i915-GVTg_V4_4

If I boot another, I get:

i915-GVTg_V4_1
i915-GVTg_V4_2
i915-GVTg_V4_5
i915-GVTg_V4_7

Now we don't have evenly divisible numbers.  If I create a type _1
device, available_instances still says I can create one type _5 or _7.
It's perhaps best for libvirt to just look at the current state and not
try to predict the future.

> > we know that creating type_a device prevents any
> > more type_b devices to be created.
> > 
> > Does NVIDIA or AMD use the common pool?

AMD isn't a player here yet; Intel and NVIDIA have vGPUs, and IBM has a
model under development for S390 channel I/O.  The only thing you can
rely on is available_instances per mdev type at a given point in time.
How available_instances changes when we start creating devices is
vendor specific and may change at any time.

> > > > I'm not sure what the decision was wrt type naming, can 2 different
> > > > cards have similarly named type with different meaning?  
> > > 
> > > We don't deal in similarities, each type ID is unique and it's up to
> > > the mdev vendor driver to make sure that an "nvidia-11" on and M60 card
> > > is software equivalent to an "nvidia-11" on an M10 card.  If they're
> > > not equivalent, the type ID will be different.  Something we may want
> > > to consider eventually is whether we want/need to deal with
> > > compatibility strings.  For instance, NVIDIA seems to be tying the type
> > > ID strongly to specific implementations, an nvidia-11 may only be
> > > available on an M60 card.  An M10 card may offer an nvidia-21 type with
> > > similar capabilities.  There may be a need to express an mdev device as
> > > compatible with various type IDs for hardware availability, at the risk
> > > of exposing slight variations to the VM.  This could also make
> > > placement easier for vendor drivers that only support homogeneous mdev
> > > devices, "I prefer an mdev ID of type 'nvidia-11', but will accept one
> > > of type 'nvidia-12,nvidia-21'".  Thanks,  
> > 
> > I like the idea of libvirt being able to select one of specified
> > types, we have to bear in mind that it'll slightly complicate the XML:
> > 
> >    <mdev_types>
> >        <type>nvidia-11</type>
> >        <type>nvidia-21</type>
> >    </mdev_types>  
> 
> ^^ are you referring to nodedev XML or domain XML, because in case of a domain,
> there should be only one type per <hostdev type='mdev'>. There is also the
> ongoing question what's the best way to approach creation of mdev with libvirt
> and we have to be very careful with that so it won't bite us back in the
> future.
> However, for 7.4 the priority is to accept a pre-created device and to provide
> means in the nodedev driver to list all existing mdev devices and their
> corresponding parent devices.

Sorry if I confused the topic with some sort of compatibility listing.
I agree that for a <hostdev> there needs to be a single type if
libvirt is to create an instance of that type.  Any notion of
compatible or secondary acceptable types is lower priority than the
necessary basic behaviors; I'm just trying to plan ahead for later
extensions that might be useful.  Perhaps any notion of compatibility
lives in user-defined lists above libvirt.

> > That luckily shouldn't be problem for libvirt or management software.
> > On the other hand, the type equivalence will require some kind of
> > labeling on the management side -- user defines "mygpu" as "vgpu with
> > type nvidia-11 or nvidia-21" unless libvirt commits to a maintaining a
> > database with capability-equivalent types for devices (which, given
> > the generic-ness of the mdev, doesn't seem like a good idea).
> >   
> 
> Libvirt definitely shouldn't be handling type compatibility-related issues.
> As Alex pointed out, this should be vendor driver's responsibility. There's
> also Intel's KVMGT which has a different approach to it's type IDs. IIUC they
> based their type IDs on the fraction of actual resources used, i.e. type _1
> consumes the whole HW _2 consumes half, etc. but this is a question for Alex as
> he's been playing with it for some time. Anyhow, from my understanding Intel's
> types look more generic, thus more compatible with different HW revisions, if
> so, then in that case by dealing with the type compatibility, libvirt would be
> tailoring its logic to a specific vendor's use whereas I think libvirt
> should only focus on interacting with the mdev framework using the data it's
> got from the user. IOW new mdev-capable HW will be coming out which would in
> turn just bring more types to deal with. If the vendor driver won't be willing
> to accept any other type than just the set it's exporting, then I think the
> management may want to try to compensate for this with the information it can
> query from libvirt.

AIUI, looking at the example Intel mdev types I list above, that "V4"
indicates a Broadwell class GPU.  So don't be fooled into thinking
Intel is making some sort of generic device that it can represent on
any platform.  I'd expect a Skylake system to export similar types, but
with a "V5" component to the name.  NVIDIA naming is just more opaque,
possibly changing not only across generations but across
implementations.  I fully agree that it's the vendor's responsibility
to maintain that a given type is compatible wherever it is exposed and
libvirt's first priority is to focus on specifying a single type in the
xml and working towards instantiating an mdev of that type.  Libvirt
should never assume that anything other than that single, exact type is
compatible or sufficient for the VM.

I did plant the seed above about whether user-defined compatibility
lists might be useful; it seems like something we should keep in mind,
but at a lower priority than any sort of initial support.  As Erik
suggested in a separate discussion, perhaps any notion of user-defined
compatibility happens at a management layer above libvirt.  Migration
also needs to be considered when we think about compatible devices.
Compatibility likely only refers to the point at which we instantiate
the VM; if we were ever to support migration of an mdev device, the
target and source would need to be identical.  Yet more reasons for
libvirt to leave compatibility to higher layers of management tools.
Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [RFC PATCH 00/16] Introduce vGPU mdev framework to libvirt
Posted by Daniel P. Berrange 7 years, 1 month ago
On Mon, Feb 06, 2017 at 01:19:42PM +0100, Erik Skultety wrote:
> Finally. It's here. This is the initial suggestion on how libvirt might
> interract with the mdev framework, currently only focussing on the non-managed
> devices, i.e. those pre-created by the user, since that will be revisited once
> we all settled on how the XML should look like, given we might not want to use
> the sysfs path directly as an attribute in the domain XML. My proposal on the
> XML is the following:
> 
> <hostdev mode='subsystem' type='mdev'>  
>     <source>
>         <!-- this is the host's physical device address -->
>         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
>         <uuid>vGPU_UUID<uuid>
>     <source>
>     <!-- target PCI address can be omitted to assign it automatically -->
> </hostdev>
> 
> So the mediated device is identified by the physical parent device visible on
> the host and a UUID which allows us to construct the sysfs path by ourselves,
> which we then put on the QEMU's command line.
> 
> A few remarks if you actually happen to have a machine to test this on:
> - right now the mediated devices are one-time use only, i.e. they have to be
> recreated before every machine boot
> - I wouldn't recommend assigning multiple vGPUs to a single domain
> 
> Once this series is sorted out, we can then continue with 'managed=yes' where
> as Laine pointed out [1], we need to figure out how exactly should the
> management layer hint libvirt which vGPU type should be used for device
> instantiation.

You seem to be suggesting that managed=yes with mdev devices would
cause create / delete of a mdev device from a specified parent.

This is rather different semantics from what managed=yes does with
PCI device assignment today.  There the managed=yes flag is just
about controlling host device driver attachment, i.e. whether libvirt
will manually bind the device to vfio.ko, or expect the admin to have
bound it to vfio.ko beforehand. I think it is important to keep that
concept as is for mdev too.

While we're thinking of mdev purely in terms of KVM + vfio usage,
it wouldn't surprise me if there ended up being non-KVM based
use cases for mdev.

It isn't clear to me that auto-creation of mdev devices as a concept
even belongs in the domain XML necessarily.

Looking at two similar areas: for SR-IOV NICs, in the domain XML
you either specify an explicit VF to use, or you reference a
libvirt virtual network.  The latter takes care of dynamically
providing VFs to VMs.  For NPIV, IIRC, the domain XML works
similarly, either taking an explicit vHBA or referencing a
storage pool to get one dynamically.

Before we even consider auto-creation though, I think we need
to have manual creation designed & integrated in the node device
APIs.

So in terms of the domain XML, I think the only thing we need
to provide is the address of the pre-existing mdev device
to be used. In this case "address" means the UUID. We should
not need anything about the parent device AFAICT.
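
For illustration, one possible minimal shape of that, reusing the example UUID
quoted earlier in the thread and borrowing the model attribute suggested
elsewhere (the exact element names are of course still up for discussion):

    <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
        <source>
            <uuid>53764d0e-85a0-42b4-af5c-2046b460b1dc</uuid>
        </source>
    </hostdev>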

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [RFC PATCH 00/16] Introduce vGPU mdev framework to libvirt
Posted by Alex Williamson 7 years, 1 month ago
On Mon, 6 Feb 2017 16:44:37 +0000
"Daniel P. Berrange" <berrange@redhat.com> wrote:

> On Mon, Feb 06, 2017 at 01:19:42PM +0100, Erik Skultety wrote:
> > Finally. It's here. This is the initial suggestion on how libvirt might
> > interract with the mdev framework, currently only focussing on the non-managed
> > devices, i.e. those pre-created by the user, since that will be revisited once
> > we all settled on how the XML should look like, given we might not want to use
> > the sysfs path directly as an attribute in the domain XML. My proposal on the
> > XML is the following:
> > 
> > <hostdev mode='subsystem' type='mdev'>  
> >     <source>
> >         <!-- this is the host's physical device address -->
> >         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
> >         <uuid>vGPU_UUID<uuid>
> >     <source>
> >     <!-- target PCI address can be omitted to assign it automatically -->
> > </hostdev>
> > 
> > So the mediated device is identified by the physical parent device visible on
> > the host and a UUID which allows us to construct the sysfs path by ourselves,
> > which we then put on the QEMU's command line.
> > 
> > A few remarks if you actually happen to have a machine to test this on:
> > - right now the mediated devices are one-time use only, i.e. they have to be
> > recreated before every machine boot
> > - I wouldn't recommend assigning multiple vGPUs to a single domain
> > 
> > Once this series is sorted out, we can then continue with 'managed=yes' where
> > as Laine pointed out [1], we need to figure out how exactly should the
> > management layer hint libvirt which vGPU type should be used for device
> > instantiation.  
> 
> You seem to be suggesting that managed=yes with mdev devices would
> cause create / delete of a mdev device from a specified parent.
> 
> This is rather different semantics from what managed=yes does with
> PCI device assignment today.  There the managed=yes flag is just
> about controlling host device driver attachment. ie whether libvirt
> will manually bind to vfio.ko, or expect the admin to have bound
> it to vfio.ko before hand. I think it is important to keep that
> concept as is for mdev too.
> 
> While we're thinking of mdev purely in terms of KVM + vfio usage,
> it wouldn't suprise me if there ended up being non-KVM based
> use cases for mdev.
> 
> It isn't clear to me that auto-creation of mdev devices as a concept
> even belongs in the domain XML neccessarily.
> 
> Looking at two similar areas. For SRIOV NICs, in the domain XML
> you either specify an explicit VF to use, or you reference a
> libvirt virtual network. The latter takes care of dynamically
> providing VFs to VMs.  For NPIV, IIRC, the domain XML works
> similarly either taking an explicit vHBA, or referencing a
> storage pool to get one more dynamically.

Nit: there are other constraints of SR-IOV which I think this analogy
oversimplifies.  With SR-IOV, we can't dynamically
instantiate new VFs individually.  The process there requires that we
set the number of VFs we need and enable them.  Changing that number
of VFs requires that all existing VFs on that PF are removed and
recreated.  So, does libvirt work the way it does with SR-IOV devices
because that's the optimal way for users to make use of those VFs, or
does it behave that way because it must follow the constraints of
the device?  I think libvirt handles VFs much like it does PFs because
it has no other choice.  Here we do have a choice.  Individual mdev
devices can be created and destroyed.  The only dependency between mdev
devices is how creating one affects the availability of mdev types
remaining on the parent device.  It would really be a shame not to take
advantage of the fact that the underlying device creation has advanced
so far from SR-IOV and to lump it into the same sort of management.  My
impression is that user management of creating SR-IOV VFs via module
options or self-defined scripts is a stumbling point that libvirt could
help to address here.
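
To make the contrast concrete, a rough sketch of the two sysfs interfaces
(the PF address, parent address, type name and UUID are just examples):

    # SR-IOV: VFs are enabled in bulk on the PF, not one by one
    $ echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

    # mdev: devices are created and removed individually
    $ echo "53764d0e-85a0-42b4-af5c-2046b460b1dc" > \
        /sys/class/mdev_bus/0000:00:03.0/mdev_supported_types/nvidia-11/create
    $ echo 1 > /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc/remove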
 
> Before we even consider auto-creation though, I think we need
> to have manual creation designed & integrated in the node device
> APIs.
> 
> So in terms of the domain XML, I think the only think we need
> to provide is the address of the pre-existing mdev device
> to be used. In this case "address" means the UUID. We should
> not need anything about the parent device AFAICT.

Yep, agree.  Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [RFC PATCH 00/16] Introduce vGPU mdev framework to libvirt
Posted by Erik Skultety 7 years, 1 month ago
On Mon, Feb 06, 2017 at 04:44:37PM +0000, Daniel P. Berrange wrote:
> On Mon, Feb 06, 2017 at 01:19:42PM +0100, Erik Skultety wrote:
> > Finally. It's here. This is the initial suggestion on how libvirt might
> > interract with the mdev framework, currently only focussing on the non-managed
> > devices, i.e. those pre-created by the user, since that will be revisited once
> > we all settled on how the XML should look like, given we might not want to use
> > the sysfs path directly as an attribute in the domain XML. My proposal on the
> > XML is the following:
> > 
> > <hostdev mode='subsystem' type='mdev'>  
> >     <source>
> >         <!-- this is the host's physical device address -->
> >         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
> >         <uuid>vGPU_UUID<uuid>
> >     <source>
> >     <!-- target PCI address can be omitted to assign it automatically -->
> > </hostdev>
> > 
> > So the mediated device is identified by the physical parent device visible on
> > the host and a UUID which allows us to construct the sysfs path by ourselves,
> > which we then put on the QEMU's command line.
> > 
> > A few remarks if you actually happen to have a machine to test this on:
> > - right now the mediated devices are one-time use only, i.e. they have to be
> > recreated before every machine boot
> > - I wouldn't recommend assigning multiple vGPUs to a single domain
> > 
> > Once this series is sorted out, we can then continue with 'managed=yes' where
> > as Laine pointed out [1], we need to figure out how exactly should the
> > management layer hint libvirt which vGPU type should be used for device
> > instantiation.
> 
> You seem to be suggesting that managed=yes with mdev devices would
> cause create / delete of a mdev device from a specified parent.
> 
> This is rather different semantics from what managed=yes does with
> PCI device assignment today.  There the managed=yes flag is just
> about controlling host device driver attachment. ie whether libvirt
> will manually bind to vfio.ko, or expect the admin to have bound
> it to vfio.ko before hand. I think it is important to keep that
> concept as is for mdev too.

If the managed attribute were used with other devices besides PCI, then sure,
we should keep the concept. However, since only PCI devices support it and we
now have another device type that could potentially make use of such an
attribute, I think it's perfectly reasonable to alter the logic behind that
attribute in favor of the new possibilities for device management that the
mdev framework provides us with, which in this case means dynamic creation
and removal of a mediated device.

> 
> While we're thinking of mdev purely in terms of KVM + vfio usage,
> it wouldn't suprise me if there ended up being non-KVM based
> use cases for mdev.
> 
> It isn't clear to me that auto-creation of mdev devices as a concept
> even belongs in the domain XML neccessarily.
> 
> Looking at two similar areas. For SRIOV NICs, in the domain XML
> you either specify an explicit VF to use, or you reference a
> libvirt virtual network. The latter takes care of dynamically
> providing VFs to VMs.  For NPIV, IIRC, the domain XML works
> similarly either taking an explicit vHBA, or referencing a
> storage pool to get one more dynamically.
> 
> Before we even consider auto-creation though, I think we need
> to have manual creation designed & integrated in the node device
> APIs.
> 

Yes, integrating mdev into the nodedev driver this way ^^ is definitely planned.

> So in terms of the domain XML, I think the only think we need
> to provide is the address of the pre-existing mdev device
> to be used. In this case "address" means the UUID. We should
> not need anything about the parent device AFAICT.
> 
> Regards,
> Daniel
> -- 
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [RFC PATCH 00/16] Introduce vGPU mdev framework to libvirt
Posted by Daniel P. Berrange 7 years, 1 month ago
On Tue, Feb 07, 2017 at 05:48:23PM +0100, Erik Skultety wrote:
> On Mon, Feb 06, 2017 at 04:44:37PM +0000, Daniel P. Berrange wrote:
> > On Mon, Feb 06, 2017 at 01:19:42PM +0100, Erik Skultety wrote:
> > > Finally. It's here. This is the initial suggestion on how libvirt might
> > > interract with the mdev framework, currently only focussing on the non-managed
> > > devices, i.e. those pre-created by the user, since that will be revisited once
> > > we all settled on how the XML should look like, given we might not want to use
> > > the sysfs path directly as an attribute in the domain XML. My proposal on the
> > > XML is the following:
> > > 
> > > <hostdev mode='subsystem' type='mdev'>  
> > >     <source>
> > >         <!-- this is the host's physical device address -->
> > >         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
> > >         <uuid>vGPU_UUID<uuid>
> > >     <source>
> > >     <!-- target PCI address can be omitted to assign it automatically -->
> > > </hostdev>
> > > 
> > > So the mediated device is identified by the physical parent device visible on
> > > the host and a UUID which allows us to construct the sysfs path by ourselves,
> > > which we then put on the QEMU's command line.
> > > 
> > > A few remarks if you actually happen to have a machine to test this on:
> > > - right now the mediated devices are one-time use only, i.e. they have to be
> > > recreated before every machine boot
> > > - I wouldn't recommend assigning multiple vGPUs to a single domain
> > > 
> > > Once this series is sorted out, we can then continue with 'managed=yes' where
> > > as Laine pointed out [1], we need to figure out how exactly should the
> > > management layer hint libvirt which vGPU type should be used for device
> > > instantiation.
> > 
> > You seem to be suggesting that managed=yes with mdev devices would
> > cause create / delete of a mdev device from a specified parent.
> > 
> > This is rather different semantics from what managed=yes does with
> > PCI device assignment today.  There the managed=yes flag is just
> > about controlling host device driver attachment. ie whether libvirt
> > will manually bind to vfio.ko, or expect the admin to have bound
> > it to vfio.ko before hand. I think it is important to keep that
> > concept as is for mdev too.
> 
> If the managed attribute was used with other devices beside PCI, then sure, we
> should keep the concept, however, since only PCI devices support it and now we
> have another device type that potentially might have a use for such an
> attribute I think it's perfectly reasonable to alter the logic behind that
> attribute in favor of the new possibilities to device management which mdev
> framework is providing us with which in this case is dynamic creation and
> removal of a mediated device.

No, we really shouldn't use one attribute to overload completely
different semantics. As I say, we may well find we want to implement
the existing PCI semantics for mdev devices too.

If we want to auto-create, that should be a different attribute,
e.g. 'autocreate=yes|no'.
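
Purely as an illustration of keeping the two concepts separate (the attribute
name and exact spelling are obviously up for debate):

    <hostdev mode='subsystem' type='mdev' autocreate='yes'>
        ...
    </hostdev>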

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list