[libvirt] [PATCH REPOST v2 0/3] Add callback mech for qemu and nodedev

John Ferlan posted 3 patches 7 years, 1 month ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/libvirt tags/patchew/20170221163328.7070-1-jferlan@redhat.com
[libvirt] [PATCH REPOST v2 0/3] Add callback mech for qemu and nodedev
Posted by John Ferlan 7 years, 1 month ago
Repost: http://www.redhat.com/archives/libvir-list/2017-February/msg00501.html

to update to top of branch as of commit id '5ad03b9db2'


John Ferlan (3):
  nodedev: Add driver callback mechanism to add/remove devices
  qemu: Use nodedev callback mechanism to get nodedev data
  qemu: Add configuration variable to control nodedev enumeration

 src/check-aclrules.pl              |   3 +-
 src/check-driverimpls.pl           |   1 +
 src/conf/node_device_conf.c        | 112 ++++++++++++++++++++++++++++++++++++-
 src/conf/node_device_conf.h        |  18 ++++++
 src/libvirt_private.syms           |   3 +
 src/node_device/node_device_udev.c |  29 ++++++++++
 src/qemu/libvirtd_qemu.aug         |   1 +
 src/qemu/qemu.conf                 |  19 +++++++
 src/qemu/qemu_conf.c               |  53 ++++++++++++++++++
 src/qemu/qemu_conf.h               |  17 ++++++
 src/qemu/qemu_driver.c             |  46 +++++++++++++++
 src/qemu/test_libvirtd_qemu.aug.in |   1 +
 12 files changed, 301 insertions(+), 2 deletions(-)

-- 
2.9.3

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH REPOST v2 0/3] Add callback mech for qemu and nodedev
Posted by Daniel P. Berrange 7 years, 1 month ago
On Tue, Feb 21, 2017 at 11:33:25AM -0500, John Ferlan wrote:
> Repost: http://www.redhat.com/archives/libvir-list/2017-February/msg00501.html
> 
> to update to top of branch as of commit id '5ad03b9db2'

BTW, could you include the full cover letter in each new version rather
than making people follow links all the way back to v1 to find info
about the patch series goals.

IIUC, the intention here is that we automatically create NPIV devices
when starting guests and delete them when stopping guests. I can see
some appeal in this, but at the same time I'm not convinced we should
add such a feature.

AFAICT, the node device APIs already allow a management application to
achieve the same end goal without needing this integration. Yes, it
would simplify usage of NPIV on the surface, but the cost of doing this
is that it embeds a specific usage policy for NPIV in the libvirt code and
makes error handling harder. In particular it is possible to get into a
situation where a VM fails to start and we're also unable to clear up
the NPIV device we just auto-created. Now this could be said to apply
to pretty much everything we do during guest startup, but in most cases
the failure is harmless or gets auto-cleaned up by the kernel (i.e. the
tap devices get auto-deleted when the FD is closed, or SELinux labels
get reset next time a VM wants that file, locks are released when we
close the virtlockd file handle, etc.). NPIV is significantly more
complicated and more likely to hit failure scenarios due to the fact that
it involves interactions with off-node hardware resources.

Is there some aspect of NPIV mgmt that can only be achieved if libvirt
is explicitly managing the device lifecycle during VM start/stop, as
opposed to having the mgmt app manage it ?

If OpenStack were to provide NPIV support I think it'd probably end
up dealing with device setup explicitly via the node device APIs
rather than relying on libvirt to create/delete them. That way it
can track the lifecycle of NPIV devices explicitly, and if it is not
possible to delete them at time of QEMU shutdown for some reason, it
can easily arrange to delete them later.
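
For reference, that application-side flow maps onto existing public
libvirt node device APIs. A minimal sketch, assuming a parent HBA named
scsi_host5 and placeholder WWNN/WWPN values, with error handling mostly
omitted:

  #include <stdio.h>
  #include <libvirt/libvirt.h>

  /* Create a vHBA node device from the management application and tear
   * it down again later.  The parent HBA name and the WWNN/WWPN values
   * below are placeholders. */
  int main(void)
  {
      const char *xml =
          "<device>"
          "  <parent>scsi_host5</parent>"
          "  <capability type='scsi_host'>"
          "    <capability type='fc_host'>"
          "      <wwnn>20000000c9831b4b</wwnn>"
          "      <wwpn>10000000c9831b4b</wwpn>"
          "    </capability>"
          "  </capability>"
          "</device>";
      virConnectPtr conn = virConnectOpen("qemu:///system");
      virNodeDevicePtr vhba;

      if (!conn)
          return 1;

      vhba = virNodeDeviceCreateXML(conn, xml, 0);
      if (vhba) {
          /* e.g. "scsi_host6"; the app records this name so it can
           * delete the vHBA later even if starting the guest fails. */
          printf("created vHBA %s\n", virNodeDeviceGetName(vhba));

          /* ... start the guest, attach storage, etc. ... */

          virNodeDeviceDestroy(vhba);      /* remove the vHBA again */
          virNodeDeviceFree(vhba);
      }
      virConnectClose(conn);
      return 0;
  }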


Overall I think one of the more successful aspects of libvirt's design
has been the way we minimise the addition of usage policy decisions, in
favour of providing mechanisms that applications can use to implement
policies. This has had a cost in that applications need to do more work
themselves, but on balance I still think it is a win to avoid adding
policy driven features to libvirt.

A key question is just where "auto-creation/deletion of NPIV devices" falls
on the line between mechanism & policy, since the line is not entirely
black & white. I tend towards it being policy though, since it is just
providing a less general-purpose way to do something that can be achieved
already via the node device APIs.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

Re: [libvirt] [PATCH REPOST v2 0/3] Add callback mech for qemu and nodedev
Posted by John Ferlan 7 years, 1 month ago

On 02/21/2017 12:03 PM, Daniel P. Berrange wrote:
> On Tue, Feb 21, 2017 at 11:33:25AM -0500, John Ferlan wrote:
>> Repost: http://www.redhat.com/archives/libvir-list/2017-February/msg00501.html
>>
>> to update to top of branch as of commit id '5ad03b9db2'
> 
> BTW, could you include the full cover letter in each new version rather
> than making people follow links all the way back to v1 to find info
> about the patch series goals.

OK - I'll try to remember.

> 
> IIUC, the intention here is that we automatically create NPIV devices
> when starting guests and delete them when stopping guests. I can see
> some appeal in this, but at the same time I'm not convinced we should
> add such a feature.

A bit more than that - create the vHBA and assign the LUNs to the guest
as they are discovered and remove them as they are removed (events from
udev). This is a mechanism/idea from Paolo. The RHV team would be the
primary consumer and IIRC they don't use storage pools.
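
For illustration, assigning one discovered LUN to a running guest with
the existing SCSI passthrough support would look roughly like the sketch
below; the adapter name 'scsi_host6' and the bus/target/unit address are
placeholders:

  #include <libvirt/libvirt.h>

  /* Hot-add one LUN that showed up under the vHBA (assumed here to be
   * scsi_host6, SCSI address 0:0:3) to a running guest; detaching on a
   * removal event uses the same XML with virDomainDetachDeviceFlags(). */
  static int
  attach_lun(virDomainPtr dom)
  {
      const char *xml =
          "<hostdev mode='subsystem' type='scsi'>"
          "  <source>"
          "    <adapter name='scsi_host6'/>"
          "    <address bus='0' target='0' unit='3'/>"
          "  </source>"
          "</hostdev>";

      return virDomainAttachDeviceFlags(dom, xml, VIR_DOMAIN_AFFECT_LIVE);
  }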

> 
> AFAICT, the node device APIs already allow a management application to
> achieve the same end goal without needing this integration. Yes, it
> would simplify usage of NPIV on the surface, but the cost of doing this
> is that it embeds a specific usage policy for NPIV in the libvirt code and
> makes error handling harder. In particular it is possible to get into a
> situation where a VM fails to start and we're also unable to clear up
> the NPIV device we just auto-created. Now this could be said to apply
> to pretty much everything we do during guest startup, but in most cases
> the failure is harmless or gets auto-cleaned up by the kernel (i.e. the
> tap devices get auto-deleted when the FD is closed, or SELinux labels
> get reset next time a VM wants that file, locks are released when we
> close the virtlockd file handle, etc.). NPIV is significantly more
> complicated and more likely to hit failure scenarios due to the fact that
> it involves interactions with off-node hardware resources.

I agree with your points. The "purpose" of libvirt taking care of it
would be to let libvirt handle all those nasty and odd failure or
integration issues - including migration.  Of course from a libvirt
perspective, I'd rather take the 'scsi_hostX' vHBA and just pass that
through to QEMU directly to allow it (or the guest) to find the LUNs,
but that just pushes the problem the other way.

I said early on that this is something that could be done by the upper
layers, which would be able to receive the add/remove LUN events whether
they created a storage pool just for that purpose or created the
vHBA themselves. It's probably even in the BZs on this.

> 
> Is there some aspect of NPIV mgmt that can only be achieved if libvirt
> is explicitly managing the device lifecycle during VM start/stop, as
> opposed to having the mgmt app manage it ?
> 

Only that the upper layers wouldn't need to handle anything other than
creating the vHBA for the domain, with libvirt handling the rest.

> If OpenStack were to provide NPIV support I think it'd probably end
> up dealing with device setup explicitly via the node device APIs
> rather than relying on libvirt to create/delete them. That way it
> can track the lifecycle of NPIV devices explicitly, and if it is not
> possible to delete them at time of QEMU shutdown for some reason, it
> can easily arrange to delete them later.
> 
> 
> Overall I think one of the more successful aspects of libvirt's design
> has been the way we minimise the addition of usage policy decisions, in
> favour of providing mechanisms that applications can use to implement
> policies. This has had a cost in that applications need to do more work
> themselves, but on balance I still think it is a win to avoid adding
> policy driven features to libvirt.
> 
> A key question is just where "auto-creation/deletion of NPIV devices" falls
> on the line between mechanism & policy, since the line is not entirely
> black & white. I tend towards it being policy though, since it is just
> providing a less general-purpose way to do something that can be achieved
> already via the node device APIs.
> 
> Regards,
> Daniel
> 

I understand - to a degree I guess I had assumed some of these types of
discussions had already taken place among those that wanted the feature added.

One other good thing that's come out of these changes is a bit more
testing for vHBA creation via nodedev/storage pool and quite a bit of
code cleanup once/if most of the patches I posted earlier in the week
are accepted.

John

(FWIW: I'll have limited access to email over the next couple of days...)

Re: [libvirt] [PATCH REPOST v2 0/3] Add callback mech for qemu and nodedev
Posted by Daniel P. Berrange 7 years, 1 month ago
On Tue, Feb 21, 2017 at 10:45:05PM -0500, John Ferlan wrote:
> 
> 
> On 02/21/2017 12:03 PM, Daniel P. Berrange wrote:
> > On Tue, Feb 21, 2017 at 11:33:25AM -0500, John Ferlan wrote:
> >> Repost: http://www.redhat.com/archives/libvir-list/2017-February/msg00501.html
> >>
> >> to update to top of branch as of commit id '5ad03b9db2'
> > 
> > BTW, could you include the full cover letter in each new version rather
> > than making people follow links all the way back to v1 to find info
> > about the patch series goals.
> 
> OK - I'll try to remember.
> 
> > 
> > IIUC, the intention here is that we automatically create NPIV devices
> > when starting guests and delete them when stopping guests. I can see
> > some appeal in this, but at the same time I'm not convinced we should
> > add such a feature.
> 
> A bit more than that - create the vHBA and assign the LUNs to the guest
> as they are discovered and remove them as they are removed (events from
> udev). This is a mechanism/idea from Paolo. The RHV team would be the
> primary consumer and IIRC they don't use storage pools.
> 
> > 
> > AFAICT, the node device APIs already allow a management application to
> > achieve the same end goal without needing this integration. Yes, it
> > would simplify usage of NPIV on the surface, but the cost of doing this
> > is that it embeds a specific usage policy for NPIV in the libvirt code and
> > makes error handling harder. In particular it is possible to get into a
> > situation where a VM fails to start and we're also unable to clear up
> > the NPIV device we just auto-created. Now this could be said to apply
> > to pretty much everything we do during guest startup, but in most cases
> > the failure is harmless or gets auto-cleaned up by the kernel (i.e. the
> > tap devices get auto-deleted when the FD is closed, or SELinux labels
> > get reset next time a VM wants that file, locks are released when we
> > close the virtlockd file handle, etc.). NPIV is significantly more
> > complicated and more likely to hit failure scenarios due to the fact that
> > it involves interactions with off-node hardware resources.
> 
> I agree with your points. The "purpose" of libvirt taking care of it
> would be to let libvirt handle all those nasty and odd failure or
> integration issues - including migration.  Of course from a libvirt
> perspective, I'd rather take the 'scsi_hostX' vHBA and just pass that
> through to QEMU directly to allow it (or the guest) to find the LUNs,
> but that just pushes the problem the other way.
> 
> I said early on that this is something that could be done by the upper
> layers, which would be able to receive the add/remove LUN events whether
> they created a storage pool just for that purpose or created the
> vHBA themselves. It's probably even in the BZs on this.

Yeah, I'm curious where the desire to assign individual LUNs comes
from - that makes me even less comfortable with this idea. It means
we are taking a single device from libvirt's POV - an HBA - and turning
it into many devices from the guest's POV, which means for a single
device we'd surely need to track multiple SCSI addresses, and we'd
not know how many addresses we need to track up front either. I'd
really strongly prefer we track individual devices given to the guest
explicitly.

I'd also suggest just assigning the whole vHBA, but we can still make
it possible for a mgmt app to dynamically add/remove LUNs from a vHBA.
Our node device APIs can report events when devices appear/disappear
so an app can create a vHBA and monitor for LUN changes & update the
guest. That said, I'm curious how you'd handle removal, because surely
by the time you see a LUN disappear at the vHBA, the guest is already
broken as it's got an orphaned device. This seems like another reason
to just assign the vHBA itself to let the guest deal with LUN removal
more directly.
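
A rough sketch of that app-side monitoring, using the existing node
device lifecycle events (the callback below just prints; a real app
would filter for the LUNs under its vHBA and update the guest):

  #include <stdio.h>
  #include <libvirt/libvirt.h>

  /* Print node devices as they come and go.  A real app would filter
   * for the LUNs belonging to its vHBA and attach/detach the matching
   * <hostdev> on the guest in response. */
  static void
  lifecycle_cb(virConnectPtr conn, virNodeDevicePtr dev,
               int event, int detail, void *opaque)
  {
      const char *name = virNodeDeviceGetName(dev);

      if (event == VIR_NODE_DEVICE_EVENT_CREATED)
          printf("device added:   %s\n", name);
      else if (event == VIR_NODE_DEVICE_EVENT_DELETED)
          printf("device removed: %s\n", name);
  }

  int main(void)
  {
      virConnectPtr conn;

      virEventRegisterDefaultImpl();
      if (!(conn = virConnectOpen("qemu:///system")))
          return 1;

      virConnectNodeDeviceEventRegisterAny(
          conn, NULL, VIR_NODE_DEVICE_EVENT_ID_LIFECYCLE,
          VIR_NODE_DEVICE_EVENT_CALLBACK(lifecycle_cb), NULL, NULL);

      while (virEventRunDefaultImpl() == 0)
          ;                                /* dispatch events forever */

      virConnectClose(conn);
      return 0;
  }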


Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

Re: [libvirt] [PATCH REPOST v2 0/3] Add callback mech for qemu and nodedev
Posted by Paolo Bonzini 7 years, 1 month ago

On 22/02/2017 11:05, Daniel P. Berrange wrote:
> I'd also suggest just assigning the whole vHBA

What do you mean by "assigning the whole vHBA"?  There is no concept of
a vHBA as e.g. a PCI virtual function.

> , but we can still make
> it possible for a mgmt app to dynamically add/remove LUNs from a vHBA.


> Our node device APIs can report events when devices appear/disappear
> so an app can create a vHBA and monitor for LUN changes & update the
> guest. That said, I'm curious how you'd handle removal, because surely
> by the time you see a LUN disappear at the vHBA, the guest is already
> broken as it's got an orphaned device. This seems like another reason
> to just assign the vHBA itself to let the guest deal with LUN removal
> more directly.

This is the same on bare metal: the device is orphaned at the time you
unplug it.  If the guest is using it and doesn't have a fallback, bad
things happen.  If it is not using it, or if it's using multipath,
things ought to work just fine.

Paolo

Re: [libvirt] [PATCH REPOST v2 0/3] Add callback mech for qemu and nodedev
Posted by Daniel P. Berrange 7 years, 1 month ago
On Wed, Feb 22, 2017 at 02:38:32PM +0100, Paolo Bonzini wrote:
> 
> 
> On 22/02/2017 11:05, Daniel P. Berrange wrote:
> > I'd also suggest just assigning the whole vHBA
> 
> What do you mean by "assigning the whole vHBA"?  There is no concept of
> a vHBA as e.g. a PCI virtual function.

I thought we had a generic SCSI passthrough capability that could
attach to the SCSI vHBA in the host, or am I mis-remembering ?

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

Re: [libvirt] [PATCH REPOST v2 0/3] Add callback mech for qemu and nodedev
Posted by Paolo Bonzini 7 years, 1 month ago

On 22/02/2017 14:46, Daniel P. Berrange wrote:
> On Wed, Feb 22, 2017 at 02:38:32PM +0100, Paolo Bonzini wrote:
>>
>>
>> On 22/02/2017 11:05, Daniel P. Berrange wrote:
>>> I'd also suggest just assigning the whole vHBA
>>
>> What do you mean by "assigning the whole vHBA"?  There is no concept of
>> a vHBA as e.g. a PCI virtual function.
> 
> I thought we had a generic SCSI passthrough capability that could
> attach to the SCSI vHBA in the host, or am I mis-remembering ?

No, the SCSI passthrough capability attaches to a single device.

With Fibre Channel, zoning allows you to configure the set of visible
disks for each machine (or more accurately for each WWPN and/or WWNN) at
the storage fabric level.  So the idea is to just configure the
WWPN/WWNN and let libvirt handle the discovery of disks underneath.  The
alternative is to do it in both oVirt and Nova, though, as mentioned
elsewhere in the thread, libvirt could still provide a udev wrapper.

Paolo

Re: [libvirt] [PATCH REPOST v2 0/3] Add callback mech for qemu and nodedev
Posted by Daniel P. Berrange 7 years, 1 month ago
On Wed, Feb 22, 2017 at 03:10:00PM +0100, Paolo Bonzini wrote:
> 
> 
> On 22/02/2017 14:46, Daniel P. Berrange wrote:
> > On Wed, Feb 22, 2017 at 02:38:32PM +0100, Paolo Bonzini wrote:
> >>
> >>
> >> On 22/02/2017 11:05, Daniel P. Berrange wrote:
> >>> I'd also suggest just assigning the whole vHBA
> >>
> >> What do you mean by "assigning the whole vHBA"?  There is no concept of
> >> a vHBA as e.g. a PCI virtual function.
> > 
> > I thought we had a generic SCSI passthrough capability that could
> > attach to the SCSI vHBA in the host, or am I mis-remembering ?
> 
> No, the SCSI passthrough capability attaches to a single device.
> 
> With Fibre Channel, zoning allows you to configure the set of visible
> disks for each machine (or more accurately for each WWPN and/or WWNN) at
> the storage fabric level.  So the idea is to just configure the
> WWPN/WWNN and let libvirt handle the discovery of disks underneath.  The
> alternative is to do it in both oVirt and Nova though, as mentioned
> elsewhere in the thread, libvirt could still provide a udev wrapper.

What do you mean by "provide a udev wrapper" ?

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

Re: [libvirt] [PATCH REPOST v2 0/3] Add callback mech for qemu and nodedev
Posted by Paolo Bonzini 7 years, 1 month ago

On 22/02/2017 15:11, Daniel P. Berrange wrote:
> On Wed, Feb 22, 2017 at 03:10:00PM +0100, Paolo Bonzini wrote:
>>
>>
>> On 22/02/2017 14:46, Daniel P. Berrange wrote:
>>> On Wed, Feb 22, 2017 at 02:38:32PM +0100, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 22/02/2017 11:05, Daniel P. Berrange wrote:
>>>>> I'd also suggest just assigning the whole vHBA
>>>>
>>>> What do you mean by "assigning the whole vHBA"?  There is no concept of
>>>> a vHBA as e.g. a PCI virtual function.
>>>
>>> I thought we had a generic SCSI passthrough capability that could
>>> attach to the SCSI vHBA in the host, or am I mis-remembering ?
>>
>> No, the SCSI passthrough capability attaches to a single device.
>>
>> With Fibre Channel, zoning allows you to configure the set of visible
>> disks for each machine (or more accurately for each WWPN and/or WWNN) at
>> the storage fabric level.  So the idea is to just configure the
>> WWPN/WWNN and let libvirt handle the discovery of disks underneath.  The
>> alternative is to do it in both oVirt and Nova though, as mentioned
>> elsewhere in the thread, libvirt could still provide a udev wrapper.
> 
> What do you mean by "provide a udev wrapper" ?

Expose udev events in a way that is friendly to someone who is consuming
the nodedev API.

Thanks,

Paolo

Re: [libvirt] [PATCH REPOST v2 0/3] Add callback mech for qemu and nodedev
Posted by Daniel P. Berrange 7 years, 1 month ago
On Wed, Feb 22, 2017 at 03:13:05PM +0100, Paolo Bonzini wrote:
> 
> 
> On 22/02/2017 15:11, Daniel P. Berrange wrote:
> > On Wed, Feb 22, 2017 at 03:10:00PM +0100, Paolo Bonzini wrote:
> >>
> >>
> >> On 22/02/2017 14:46, Daniel P. Berrange wrote:
> >>> On Wed, Feb 22, 2017 at 02:38:32PM +0100, Paolo Bonzini wrote:
> >>>>
> >>>>
> >>>> On 22/02/2017 11:05, Daniel P. Berrange wrote:
> >>>>> I'd also suggest just assigning the whole vHBA
> >>>>
> >>>> What do you mean by "assigning the whole vHBA"?  There is no concept of
> >>>> a vHBA as e.g. a PCI virtual function.
> >>>
> >>> I thought we had a generic SCSI passthrough capability that could
> >>> attach to the SCSI vHBA in the host, or am I mis-remembering ?
> >>
> >> No, the SCSI passthrough capability attaches to a single device.
> >>
> >> With Fibre Channel, zoning allows you to configure the set of visible
> >> disks for each machine (or more accurately for each WWPN and/or WWNN) at
> >> the storage fabric level.  So the idea is to just configure the
> >> WWPN/WWNN and let libvirt handle the discovery of disks underneath.  The
> >> alternative is to do it in both oVirt and Nova though, as mentioned
> >> elsewhere in the thread, libvirt could still provide a udev wrapper.
> > 
> > What do you mean by "provide a udev wrapper" ?
> 
> Expose udev events in a way that is friendly to someone who is consuming
> the nodedev API.

We've got that support already. Apps can register to receive events
whenever a virNodeDevice is added or removed from a host, as well as
if its config changes.  

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|
