[Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Stefan Hajnoczi 6 years, 3 months ago
These patches implement the virtio-vhost-user device design that I have
described here:
https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2830007

The goal is to let the guest act as the vhost device backend for other guests.
This allows virtual networking and storage appliances to run inside guests.
This device is particularly interesting for poll mode drivers where exitless
VM-to-VM communication is possible, completely bypassing the hypervisor in the
data path.

The DPDK driver is here:
https://github.com/stefanha/dpdk/tree/virtio-vhost-user

For more information, see
https://wiki.qemu.org/Features/VirtioVhostUser.

virtio-vhost-user is inspired by Wei Wang and Zhiyong Yang's vhost-pci work.
It differs from vhost-pci in that it has:
1. Vhost-user protocol message tunneling, allowing existing vhost-user
   slave software to be reused inside the guest.
2. Support for all vhost device types.
3. Disconnected operation and reconnection support.
4. Asynchronous vhost-user socket implementation that avoids blocking.

I have written this code to demonstrate how the virtio-vhost-user approach
works and that it is more maintainable than vhost-pci because vhost-user slave
software can use both AF_UNIX and virtio-vhost-user without significant code
changes to the vhost device backends.

One of the main concerns about virtio-vhost-user was that the QEMU
virtio-vhost-user device implementation could be complex because it needs to
parse all messages.  I hope this patch series shows that it's actually very
simple because most messages are passed through.
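
To make this concrete, here is a rough sketch of the dispatch logic.  It is
not the actual hw/virtio/virtio-vhost-user.c code and the vvu_* helpers are
invented names; it only illustrates why most messages need no parsing:

  #include "qemu/osdep.h"
  #include "hw/virtio/vhost-user.h"        /* VhostUserMsg, message types */
  #include "hw/virtio/virtio-vhost-user.h" /* VirtIOVhostUser */

  /* Invented helpers, named here only for illustration */
  static void vvu_map_mem_regions(VirtIOVhostUser *s, VhostUserMsg *msg);
  static void vvu_setup_doorbell(VirtIOVhostUser *s, VhostUserMsg *msg);
  static void vvu_forward_to_guest(VirtIOVhostUser *s, VhostUserMsg *msg);

  static void vvu_handle_master_msg(VirtIOVhostUser *s, VhostUserMsg *msg)
  {
      switch (msg->request) {
      case VHOST_USER_SET_MEM_TABLE:
          /* Map the passed memory region fds into the device BAR */
          vvu_map_mem_regions(s, msg);
          break;
      case VHOST_USER_SET_VRING_KICK:
      case VHOST_USER_SET_VRING_CALL:
          /* Turn the passed eventfds into doorbells/notifications */
          vvu_setup_doorbell(s, msg);
          break;
      default:
          /* Everything else needs no interpretation at all */
          break;
      }

      /* In every case the message is relayed to the guest slave driver */
      vvu_forward_to_guest(s, msg);
  }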

After this patch series has been reviewed, we need to decide whether to follow
the original vhost-pci approach or to use this one.  Either way, both patch
series still require improvements before they can be merged.  Here are my todos
for this series:

 * Implement "Additional Device Resources over PCI" for shared memory,
   doorbells, and notifications instead of hardcoding a BAR with magic
   offsets into virtio-vhost-user:
   https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2920007
 * Implement the VRING_KICK eventfd - currently vhost-user slaves must be poll
   mode drivers.
 * Optimize VRING_CALL doorbell with ioeventfd to avoid QEMU exit.
 * vhost-user log feature
 * UUID config field for stable device identification regardless of PCI
   bus addresses.
 * vhost-user IOMMU and SLAVE_REQ_FD feature
 * VhostUserMsg little-endian conversion for cross-endian support (see the
   sketch after this list)
 * Chardev disconnect using qemu_chr_fe_set_watch() since CHR_CLOSED is
   only emitted while a read callback is registered.  We don't keep a
   read callback registered all the time.
 * Drain txq on reconnection to discard stale messages.
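
The VhostUserMsg conversion sketched below is what the cross-endian todo item
refers to; it assumes the flat request/flags/size header layout shared in
include/hw/virtio/vhost-user.h by patch 1 and is not code from this series:

  #include "qemu/osdep.h"
  #include "qemu/bswap.h"
  #include "hw/virtio/vhost-user.h"

  /* The vhost-user wire format is little-endian; convert the fixed header
   * fields after receiving a message (and mirror this before sending). */
  static void vhost_user_msg_to_cpu(VhostUserMsg *msg)
  {
      msg->request = le32_to_cpu(msg->request);
      msg->flags   = le32_to_cpu(msg->flags);
      msg->size    = le32_to_cpu(msg->size);
      /* Payload conversion depends on msg->request and is omitted here */
  }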

Stefan Hajnoczi (1):
  virtio-vhost-user: add virtio-vhost-user device

Wei Wang (1):
  vhost-user: share the vhost-user protocol related structures

 configure                                   |   18 +
 hw/virtio/Makefile.objs                     |    1 +
 hw/virtio/virtio-pci.h                      |   21 +
 include/hw/pci/pci.h                        |    1 +
 include/hw/virtio/vhost-user.h              |  106 +++
 include/hw/virtio/virtio-vhost-user.h       |   88 +++
 include/standard-headers/linux/virtio_ids.h |    1 +
 hw/virtio/vhost-user.c                      |  100 +--
 hw/virtio/virtio-pci.c                      |   61 ++
 hw/virtio/virtio-vhost-user.c               | 1047 +++++++++++++++++++++++++++
 hw/virtio/trace-events                      |   22 +
 11 files changed, 1367 insertions(+), 99 deletions(-)
 create mode 100644 include/hw/virtio/vhost-user.h
 create mode 100644 include/hw/virtio/virtio-vhost-user.h
 create mode 100644 hw/virtio/virtio-vhost-user.c

-- 
2.14.3


Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Stefan Hajnoczi 6 years, 3 months ago
The DPDK patch series is here:
http://dpdk.org/ml/archives/dev/2018-January/088155.html

Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Jason Wang 6 years, 3 months ago

On 2018年01月19日 21:06, Stefan Hajnoczi wrote:
> These patches implement the virtio-vhost-user device design that I have
> described here:
> https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2830007

Thanks for the patches, looks rather interesting and similar to the split
device model used by Xen.

>
> The goal is to let the guest act as the vhost device backend for other guests.
> This allows virtual networking and storage appliances to run inside guests.

So the question still stands: what kind of protocol do you want to run on
top? If it is ethernet based, virtio-net works pretty well and it can even
do migration.

> This device is particularly interesting for poll mode drivers where exitless
> VM-to-VM communication is possible, completely bypassing the hypervisor in the
> data path.

It's better to clarify the reason for bypassing the hypervisor (performance,
security or scalability).

Probably not for the following cases:

1) kick/call
2) device IOTLB / IOMMU transaction (or any other case where the backend
needs metadata from qemu).

>
> The DPDK driver is here:
> https://github.com/stefanha/dpdk/tree/virtio-vhost-user
>
> For more information, see
> https://wiki.qemu.org/Features/VirtioVhostUser.
>
> virtio-vhost-user is inspired by Wei Wang and Zhiyong Yang's vhost-pci work.
> It differs from vhost-pci in that it has:
> 1. Vhost-user protocol message tunneling, allowing existing vhost-user
>     slave software to be reused inside the guest.
> 2. Support for all vhost device types.
> 3. Disconnected operation and reconnection support.
> 4. Asynchronous vhost-user socket implementation that avoids blocking.
>
> I have written this code to demonstrate how the virtio-vhost-user approach
> works and that it is more maintainable than vhost-pci because vhost-user slave
> software can use both AF_UNIX and virtio-vhost-user without significant code
> changes to the vhost device backends.

Yes, this looks cleaner than vhost-pci.

>
> One of the main concerns about virtio-vhost-user was that the QEMU
> virtio-vhost-user device implementation could be complex because it needs to
> parse all messages.  I hope this patch series shows that it's actually very
> simple because most messages are passed through.
>
> After this patch series has been reviewed, we need to decide whether to follow
> the original vhost-pci approach or to use this one.  Either way, both patch
> series still require improvements before they can be merged.  Here are my todos
> for this series:
>
>   * Implement "Additional Device Resources over PCI" for shared memory,
>     doorbells, and notifications instead of hardcoding a BAR with magic
>     offsets into virtio-vhost-user:
>     https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2920007

Does this mean we need to standardize the vhost-user protocol first?

>   * Implement the VRING_KICK eventfd - currently vhost-user slaves must be poll
>     mode drivers.
>   * Optimize VRING_CALL doorbell with ioeventfd to avoid QEMU exit.

The performance implication needs to be measured. It looks to me like both
kick and call will introduce more latency from the guest's point of view.

>   * vhost-user log feature
>   * UUID config field for stable device identification regardless of PCI
>     bus addresses.
>   * vhost-user IOMMU and SLAVE_REQ_FD feature

So an assumption is that the VM that implements the vhost backend should be
at least as secure as a vhost-user backend process on the host. Could we
draw this conclusion?

>   * VhostUserMsg little-endian conversion for cross-endian support
>   * Chardev disconnect using qemu_chr_fe_set_watch() since CHR_CLOSED is
>     only emitted while a read callback is registered.  We don't keep a
>     read callback registered all the time.
>   * Drain txq on reconnection to discard stale messages.
>
> Stefan Hajnoczi (1):
>    virtio-vhost-user: add virtio-vhost-user device
>
> Wei Wang (1):
>    vhost-user: share the vhost-user protocol related structures
>
>   configure                                   |   18 +
>   hw/virtio/Makefile.objs                     |    1 +
>   hw/virtio/virtio-pci.h                      |   21 +
>   include/hw/pci/pci.h                        |    1 +
>   include/hw/virtio/vhost-user.h              |  106 +++
>   include/hw/virtio/virtio-vhost-user.h       |   88 +++
>   include/standard-headers/linux/virtio_ids.h |    1 +
>   hw/virtio/vhost-user.c                      |  100 +--
>   hw/virtio/virtio-pci.c                      |   61 ++
>   hw/virtio/virtio-vhost-user.c               | 1047 +++++++++++++++++++++++++++
>   hw/virtio/trace-events                      |   22 +
>   11 files changed, 1367 insertions(+), 99 deletions(-)
>   create mode 100644 include/hw/virtio/vhost-user.h
>   create mode 100644 include/hw/virtio/virtio-vhost-user.h
>   create mode 100644 hw/virtio/virtio-vhost-user.c
>

Btw, it's better to have some early numbers, e.g. what testpmd reports
during forwarding.

Thanks

Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Stefan Hajnoczi 6 years, 3 months ago
On Mon, Jan 22, 2018 at 11:33:46AM +0800, Jason Wang wrote:
> On 2018年01月19日 21:06, Stefan Hajnoczi wrote:
> > These patches implement the virtio-vhost-user device design that I have
> > described here:
> > https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2830007
> 
> Thanks for the patches, looks rather interesting and similar to split device
> model used by Xen.
> 
> > 
> > The goal is to let the guest act as the vhost device backend for other guests.
> > This allows virtual networking and storage appliances to run inside guests.
> 
> So the question still, what kind of protocol do you want to run on top? If
> it was ethernet based, virtio-net work pretty well and it can even do
> migration.
> 
> > This device is particularly interesting for poll mode drivers where exitless
> > VM-to-VM communication is possible, completely bypassing the hypervisor in the
> > data path.
> 
> It's better to clarify the reason of hypervisor bypassing. (performance,
> security or scalability).

Performance - yes, definitely.  Exitless VM-to-VM is the fastest
possible way to communicate between VMs.  Today it can only be done
using ivshmem.  This patch series allows virtio devices to take
advantage of it and will encourage people to use virtio instead of
non-standard ivshmem devices.

Security - I don't think this feature is a security improvement.  It
reduces isolation because VM1 has full shared memory access to VM2.  In
fact, this is a reason for users to consider carefully whether they
even want to use this feature.

Scalability - much for the same reasons as the Performance section
above.  Bypassing the hypervisor eliminates scalability bottlenecks
(e.g. host network stack and bridge).

> Probably not for the following cases:
> 
> 1) kick/call

I disagree here because kick/call is actually very efficient!

VM1's irqfd is the ioeventfd for VM2.  When VM2 writes to the ioeventfd
there is a single lightweight vmexit which injects an interrupt into
VM1.  QEMU is not involved and the host kernel scheduler is not involved
so this is a low-latency operation.

I haven't tested this yet but the ioeventfd code looks like this will
work.
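
To sketch one direction concretely (the VRING_CALL doorbell on the slave
VM's QEMU side; the doorbell layout is made up and this is not code from
this series):

  #include "qemu/osdep.h"
  #include "qemu/event_notifier.h"
  #include "exec/memory.h"

  /*
   * Wire the callfd received via VHOST_USER_SET_VRING_CALL to an ioeventfd
   * on the doorbell MMIO region of the virtio-vhost-user BAR.  A doorbell
   * write by the slave guest is then consumed by KVM, and because the
   * master's QEMU registered the same fd as an irqfd, the interrupt is
   * injected into the master guest without either QEMU on the data path.
   */
  static void vvu_wire_call_doorbell(MemoryRegion *doorbells,
                                     EventNotifier *e,
                                     unsigned vq_idx, int callfd)
  {
      event_notifier_init_fd(e, callfd);
      /* One 2-byte doorbell per virtqueue at offset vq_idx * 2 (made up) */
      memory_region_add_eventfd(doorbells, vq_idx * 2, 2, false, 0, e);
  }

The VRING_KICK direction would be the mirror image: wrap the kickfd in an
irqfd for the slave guest so it does not have to poll.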

> 2) device IOTLB / IOMMU transaction (or any other case that backends needs
> metadata from qemu).

Yes, this is the big weakness of vhost-user in general.  The IOMMU
feature doesn't offer good isolation and even when it does, performance
will be an issue.

> >   * Implement "Additional Device Resources over PCI" for shared memory,
> >     doorbells, and notifications instead of hardcoding a BAR with magic
> >     offsets into virtio-vhost-user:
> >     https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2920007
> 
> Does this mean we need to standardize vhost-user protocol first?

Currently the draft spec says:

  This section relies on definitions from the Vhost-user Protocol [1].

  [1] https://git.qemu.org/?p=qemu.git;a=blob_plain;f=docs/interop/vhost-user.txt;hb=HEAD

Michael: Is it okay to simply include this link?

> >   * Implement the VRING_KICK eventfd - currently vhost-user slaves must be poll
> >     mode drivers.
> >   * Optimize VRING_CALL doorbell with ioeventfd to avoid QEMU exit.
> 
> The performance implication needs to be measured. It looks to me both kick
> and call will introduce more latency form the point of guest.

I described the irqfd + ioeventfd approach above.  It should be faster
than virtio-net + bridge today.

> >   * vhost-user log feature
> >   * UUID config field for stable device identification regardless of PCI
> >     bus addresses.
> >   * vhost-user IOMMU and SLAVE_REQ_FD feature
> 
> So an assumption is the VM that implements vhost backends should be at least
> as secure as vhost-user backend process on host. Could we have this
> conclusion?

Yes.

Sadly the vhost-user IOMMU protocol feature does not provide isolation.
At the moment IOMMU is basically a layer of indirection (mapping) but
the vhost-user backend process still has full access to guest RAM :(.

> Btw, it's better to have some early numbers, e.g what testpmd reports during
> forwarding.

I need to rely on others to do this (and many other things!) because
virtio-vhost-user isn't the focus of my work.

These patches were written to demonstrate my suggestions for vhost-pci.
They were written at work but also on weekends, early mornings, and late
nights to avoid delaying Wei and Zhiyong's vhost-pci work too much.

If this approach has merit then I hope others will take over and I'll
play a smaller role addressing some of the todo items and cleanups.

Stefan
Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Michael S. Tsirkin 6 years, 3 months ago
On Mon, Jan 22, 2018 at 12:17:51PM +0000, Stefan Hajnoczi wrote:
> On Mon, Jan 22, 2018 at 11:33:46AM +0800, Jason Wang wrote:
> > On 2018年01月19日 21:06, Stefan Hajnoczi wrote:
> > > These patches implement the virtio-vhost-user device design that I have
> > > described here:
> > > https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2830007
> > 
> > Thanks for the patches, looks rather interesting and similar to split device
> > model used by Xen.
> > 
> > > 
> > > The goal is to let the guest act as the vhost device backend for other guests.
> > > This allows virtual networking and storage appliances to run inside guests.
> > 
> > So the question still, what kind of protocol do you want to run on top? If
> > it was ethernet based, virtio-net work pretty well and it can even do
> > migration.
> > 
> > > This device is particularly interesting for poll mode drivers where exitless
> > > VM-to-VM communication is possible, completely bypassing the hypervisor in the
> > > data path.
> > 
> > It's better to clarify the reason of hypervisor bypassing. (performance,
> > security or scalability).
> 
> Performance - yes, definitely.  Exitless VM-to-VM is the fastest
> possible way to communicate between VMs.  Today it can only be done
> using ivshmem.  This patch series allows virtio devices to take
> advantage of it and will encourage people to use virtio instead of
> non-standard ivshmem devices.
> 
> Security - I don't think this feature is a security improvement.  It
> reduces isolation because VM1 has full shared memory access to VM2.  In
> fact, this is a reason for users to consider carefully whether they
> even want to use this feature.

True without an IOMMU; however, using a vIOMMU within VM2
can protect VM2, can't it?

> Scalability - much for the same reasons as the Performance section
> above.  Bypassing the hypervisor eliminates scalability bottlenecks
> (e.g. host network stack and bridge).
> 
> > Probably not for the following cases:
> > 
> > 1) kick/call
> 
> I disagree here because kick/call is actually very efficient!
> 
> VM1's irqfd is the ioeventfd for VM2.  When VM2 writes to the ioeventfd
> there is a single lightweight vmexit which injects an interrupt into
> VM1.  QEMU is not involved and the host kernel scheduler is not involved
> so this is a low-latency operation.
> 
> I haven't tested this yet but the ioeventfd code looks like this will
> work.
> 
> > 2) device IOTLB / IOMMU transaction (or any other case that backends needs
> > metadata from qemu).
> 
> Yes, this is the big weakness of vhost-user in general.  The IOMMU
> feature doesn't offer good isolation

I think that's an implementation issue, not a protocol issue.


> and even when it does, performance
> will be an issue.

If the IOMMU mappings are dynamic - but they are mostly
static with e.g. dpdk, right?


> > >   * Implement "Additional Device Resources over PCI" for shared memory,
> > >     doorbells, and notifications instead of hardcoding a BAR with magic
> > >     offsets into virtio-vhost-user:
> > >     https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2920007
> > 
> > Does this mean we need to standardize vhost-user protocol first?
> 
> Currently the draft spec says:
> 
>   This section relies on definitions from the Vhost-user Protocol [1].
> 
>   [1] https://git.qemu.org/?p=qemu.git;a=blob_plain;f=docs/interop/vhost-user.txt;hb=HEAD
> 
> Michael: Is it okay to simply include this link?


It is OK to include normative and non-normative references,
they go in the introduction and then you refer to them
anywhere in the document.


I'm still reviewing the draft.  At some level, this is a general tunnel
feature: it can tunnel any protocol. That would be one way to
isolate it.

> > >   * Implement the VRING_KICK eventfd - currently vhost-user slaves must be poll
> > >     mode drivers.
> > >   * Optimize VRING_CALL doorbell with ioeventfd to avoid QEMU exit.
> > 
> > The performance implication needs to be measured. It looks to me both kick
> > and call will introduce more latency form the point of guest.
> 
> I described the irqfd + ioeventfd approach above.  It should be faster
> than virtio-net + bridge today.
> 
> > >   * vhost-user log feature
> > >   * UUID config field for stable device identification regardless of PCI
> > >     bus addresses.
> > >   * vhost-user IOMMU and SLAVE_REQ_FD feature
> > 
> > So an assumption is the VM that implements vhost backends should be at least
> > as secure as vhost-user backend process on host. Could we have this
> > conclusion?
> 
> Yes.
> 
> Sadly the vhost-user IOMMU protocol feature does not provide isolation.
> At the moment IOMMU is basically a layer of indirection (mapping) but
> the vhost-user backend process still has full access to guest RAM :(.

An important feature would be to do the isolation in the qemu.
So trust the qemu running VM2 but not VM2 itself.


> > Btw, it's better to have some early numbers, e.g what testpmd reports during
> > forwarding.
> 
> I need to rely on others to do this (and many other things!) because
> virtio-vhost-user isn't the focus of my work.
> 
> These patches were written to demonstrate my suggestions for vhost-pci.
> They were written at work but also on weekends, early mornings, and late
> nights to avoid delaying Wei and Zhiyong's vhost-pci work too much.
> 
> If this approach has merit then I hope others will take over and I'll
> play a smaller role addressing some of the todo items and cleanups.
> 
> Stefan



Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Jason Wang 6 years, 3 months ago

On 2018年01月23日 04:04, Michael S. Tsirkin wrote:
> On Mon, Jan 22, 2018 at 12:17:51PM +0000, Stefan Hajnoczi wrote:
>> On Mon, Jan 22, 2018 at 11:33:46AM +0800, Jason Wang wrote:
>>> On 2018年01月19日 21:06, Stefan Hajnoczi wrote:
>>>> These patches implement the virtio-vhost-user device design that I have
>>>> described here:
>>>> https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2830007
>>> Thanks for the patches, looks rather interesting and similar to split device
>>> model used by Xen.
>>>
>>>> The goal is to let the guest act as the vhost device backend for other guests.
>>>> This allows virtual networking and storage appliances to run inside guests.
>>> So the question still, what kind of protocol do you want to run on top? If
>>> it was ethernet based, virtio-net work pretty well and it can even do
>>> migration.
>>>
>>>> This device is particularly interesting for poll mode drivers where exitless
>>>> VM-to-VM communication is possible, completely bypassing the hypervisor in the
>>>> data path.
>>> It's better to clarify the reason of hypervisor bypassing. (performance,
>>> security or scalability).
>> Performance - yes, definitely.  Exitless VM-to-VM is the fastest
>> possible way to communicate between VMs.  Today it can only be done
>> using ivshmem.  This patch series allows virtio devices to take
>> advantage of it and will encourage people to use virtio instead of
>> non-standard ivshmem devices.
>>
>> Security - I don't think this feature is a security improvement.  It
>> reduces isolation because VM1 has full shared memory access to VM2.  In
>> fact, this is a reason for users to consider carefully whether they
>> even want to use this feature.
> True without an IOMMU, however using a vIOMMU within VM2
> can protect the VM2, can't it?

It's not clear to me how to do this. E.g. we need a way to report failure to
VM2, or a #PF?

>
>> Scalability - much for the same reasons as the Performance section
>> above.  Bypassing the hypervisor eliminates scalability bottlenecks
>> (e.g. host network stack and bridge).
>>
>>> Probably not for the following cases:
>>>
>>> 1) kick/call
>> I disagree here because kick/call is actually very efficient!
>>
>> VM1's irqfd is the ioeventfd for VM2.  When VM2 writes to the ioeventfd
>> there is a single lightweight vmexit which injects an interrupt into
>> VM1.  QEMU is not involved and the host kernel scheduler is not involved
>> so this is a low-latency operation.

Right, looks like I was wrong. But consider that irqfd may do a wakeup,
which means the scheduler is still needed.

>> I haven't tested this yet but the ioeventfd code looks like this will
>> work.
>>
>>> 2) device IOTLB / IOMMU transaction (or any other case that backends needs
>>> metadata from qemu).
>> Yes, this is the big weakness of vhost-user in general.  The IOMMU
>> feature doesn't offer good isolation
> I think that's an implementation issue, not a protocol issue.
>
>
>> and even when it does, performance
>> will be an issue.
> If the IOMMU mappings are dynamic - but they are mostly
> static with e.g. dpdk, right?
>
>
>>>>    * Implement "Additional Device Resources over PCI" for shared memory,
>>>>      doorbells, and notifications instead of hardcoding a BAR with magic
>>>>      offsets into virtio-vhost-user:
>>>>      https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2920007
>>> Does this mean we need to standardize vhost-user protocol first?
>> Currently the draft spec says:
>>
>>    This section relies on definitions from the Vhost-user Protocol [1].
>>
>>    [1] https://git.qemu.org/?p=qemu.git;a=blob_plain;f=docs/interop/vhost-user.txt;hb=HEAD
>>
>> Michael: Is it okay to simply include this link?
>
> It is OK to include normative and non-normative references,
> they go in the introduction and then you refer to them
> anywhere in the document.
>
>
> I'm still reviewing the draft.  At some level, this is a general tunnel
> feature, it can tunnel any protocol. That would be one way to
> isolate it.

Right, but it should not be the main motivation, considering we can tunnel
any protocol on top of ethernet too.

>
>>>>    * Implement the VRING_KICK eventfd - currently vhost-user slaves must be poll
>>>>      mode drivers.
>>>>    * Optimize VRING_CALL doorbell with ioeventfd to avoid QEMU exit.
>>> The performance implication needs to be measured. It looks to me both kick
>>> and call will introduce more latency form the point of guest.
>> I described the irqfd + ioeventfd approach above.  It should be faster
>> than virtio-net + bridge today.
>>
>>>>    * vhost-user log feature
>>>>    * UUID config field for stable device identification regardless of PCI
>>>>      bus addresses.
>>>>    * vhost-user IOMMU and SLAVE_REQ_FD feature
>>> So an assumption is the VM that implements vhost backends should be at least
>>> as secure as vhost-user backend process on host. Could we have this
>>> conclusion?
>> Yes.
>>
>> Sadly the vhost-user IOMMU protocol feature does not provide isolation.
>> At the moment IOMMU is basically a layer of indirection (mapping) but
>> the vhost-user backend process still has full access to guest RAM :(.
> An important feature would be to do the isolation in the qemu.
> So trust the qemu running VM2 but not VM2 itself.

Agree, we'd better not consider the VM to be as secure as qemu.

>
>
>>> Btw, it's better to have some early numbers, e.g what testpmd reports during
>>> forwarding.
>> I need to rely on others to do this (and many other things!) because
>> virtio-vhost-user isn't the focus of my work.
>>
>> These patches were written to demonstrate my suggestions for vhost-pci.
>> They were written at work but also on weekends, early mornings, and late
>> nights to avoid delaying Wei and Zhiyong's vhost-pci work too much.

Thanks a lot for the effort! If anyone wants to benchmark, I would expect a
comparison of the following three solutions:

1) vhost-pci
2) virtio-vhost-user
3) testpmd with two vhost-user ports


Performance number is really important to show the advantages of new ideas.

>>
>> If this approach has merit then I hope others will take over and I'll
>> play a smaller role addressing some of the todo items and cleanups.

It looks to me the advantages are 1) a generic virtio layer (vhost-pci can
achieve this too if necessary) and 2) some code reuse (the vhost pmd).
I'd expect them to have similar performance results, considering there are
no major differences between them.

Thanks

>> Stefan
>


Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Michael S. Tsirkin 6 years, 3 months ago
On Tue, Jan 23, 2018 at 06:01:15PM +0800, Jason Wang wrote:
> 
> 
> On 2018年01月23日 04:04, Michael S. Tsirkin wrote:
> > On Mon, Jan 22, 2018 at 12:17:51PM +0000, Stefan Hajnoczi wrote:
> > > On Mon, Jan 22, 2018 at 11:33:46AM +0800, Jason Wang wrote:
> > > > On 2018年01月19日 21:06, Stefan Hajnoczi wrote:
> > > > > These patches implement the virtio-vhost-user device design that I have
> > > > > described here:
> > > > > https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2830007
> > > > Thanks for the patches, looks rather interesting and similar to split device
> > > > model used by Xen.
> > > > 
> > > > > The goal is to let the guest act as the vhost device backend for other guests.
> > > > > This allows virtual networking and storage appliances to run inside guests.
> > > > So the question still, what kind of protocol do you want to run on top? If
> > > > it was ethernet based, virtio-net work pretty well and it can even do
> > > > migration.
> > > > 
> > > > > This device is particularly interesting for poll mode drivers where exitless
> > > > > VM-to-VM communication is possible, completely bypassing the hypervisor in the
> > > > > data path.
> > > > It's better to clarify the reason of hypervisor bypassing. (performance,
> > > > security or scalability).
> > > Performance - yes, definitely.  Exitless VM-to-VM is the fastest
> > > possible way to communicate between VMs.  Today it can only be done
> > > using ivshmem.  This patch series allows virtio devices to take
> > > advantage of it and will encourage people to use virtio instead of
> > > non-standard ivshmem devices.
> > > 
> > > Security - I don't think this feature is a security improvement.  It
> > > reduces isolation because VM1 has full shared memory access to VM2.  In
> > > fact, this is a reason for users to consider carefully whether they
> > > even want to use this feature.
> > True without an IOMMU, however using a vIOMMU within VM2
> > can protect the VM2, can't it?
> 
> It's not clear to me how to do this. E.g need a way to report failure to VM2
> or #PF?

Why would there be a failure? qemu running vm1 would be responsible for
preventing access to vm2's memory not mapped through an IOMMU.
Basically munmap these.

> > 
> > > Scalability - much for the same reasons as the Performance section
> > > above.  Bypassing the hypervisor eliminates scalability bottlenecks
> > > (e.g. host network stack and bridge).
> > > 
> > > > Probably not for the following cases:
> > > > 
> > > > 1) kick/call
> > > I disagree here because kick/call is actually very efficient!
> > > 
> > > VM1's irqfd is the ioeventfd for VM2.  When VM2 writes to the ioeventfd
> > > there is a single lightweight vmexit which injects an interrupt into
> > > VM1.  QEMU is not involved and the host kernel scheduler is not involved
> > > so this is a low-latency operation.
> 
> Right, looks like I was wrong. But consider irqfd may do wakup which means
> scheduler is still needed.
> 
> > > I haven't tested this yet but the ioeventfd code looks like this will
> > > work.
> > > 
> > > > 2) device IOTLB / IOMMU transaction (or any other case that backends needs
> > > > metadata from qemu).
> > > Yes, this is the big weakness of vhost-user in general.  The IOMMU
> > > feature doesn't offer good isolation
> > I think that's an implementation issue, not a protocol issue.
> > 
> > 
> > > and even when it does, performance
> > > will be an issue.
> > If the IOMMU mappings are dynamic - but they are mostly
> > static with e.g. dpdk, right?
> > 
> > 
> > > > >    * Implement "Additional Device Resources over PCI" for shared memory,
> > > > >      doorbells, and notifications instead of hardcoding a BAR with magic
> > > > >      offsets into virtio-vhost-user:
> > > > >      https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2920007
> > > > Does this mean we need to standardize vhost-user protocol first?
> > > Currently the draft spec says:
> > > 
> > >    This section relies on definitions from the Vhost-user Protocol [1].
> > > 
> > >    [1] https://git.qemu.org/?p=qemu.git;a=blob_plain;f=docs/interop/vhost-user.txt;hb=HEAD
> > > 
> > > Michael: Is it okay to simply include this link?
> > 
> > It is OK to include normative and non-normative references,
> > they go in the introduction and then you refer to them
> > anywhere in the document.
> > 
> > 
> > I'm still reviewing the draft.  At some level, this is a general tunnel
> > feature, it can tunnel any protocol. That would be one way to
> > isolate it.
> 
> Right, but it should not be the main motivation, consider we can tunnel any
> protocol on top of ethernet too.
> 
> > 
> > > > >    * Implement the VRING_KICK eventfd - currently vhost-user slaves must be poll
> > > > >      mode drivers.
> > > > >    * Optimize VRING_CALL doorbell with ioeventfd to avoid QEMU exit.
> > > > The performance implication needs to be measured. It looks to me both kick
> > > > and call will introduce more latency form the point of guest.
> > > I described the irqfd + ioeventfd approach above.  It should be faster
> > > than virtio-net + bridge today.
> > > 
> > > > >    * vhost-user log feature
> > > > >    * UUID config field for stable device identification regardless of PCI
> > > > >      bus addresses.
> > > > >    * vhost-user IOMMU and SLAVE_REQ_FD feature
> > > > So an assumption is the VM that implements vhost backends should be at least
> > > > as secure as vhost-user backend process on host. Could we have this
> > > > conclusion?
> > > Yes.
> > > 
> > > Sadly the vhost-user IOMMU protocol feature does not provide isolation.
> > > At the moment IOMMU is basically a layer of indirection (mapping) but
> > > the vhost-user backend process still has full access to guest RAM :(.
> > An important feature would be to do the isolation in the qemu.
> > So trust the qemu running VM2 but not VM2 itself.
> 
> Agree, we'd better not consider VM is as secure as qemu.
> 
> > 
> > 
> > > > Btw, it's better to have some early numbers, e.g what testpmd reports during
> > > > forwarding.
> > > I need to rely on others to do this (and many other things!) because
> > > virtio-vhost-user isn't the focus of my work.
> > > 
> > > These patches were written to demonstrate my suggestions for vhost-pci.
> > > They were written at work but also on weekends, early mornings, and late
> > > nights to avoid delaying Wei and Zhiyong's vhost-pci work too much.
> 
> Thanks a lot for the effort! If anyone want to benchmark, I would expect
> compare the following three solutions:
> 
> 1) vhost-pci
> 2) virtio-vhost-user
> 3) testpmd with two vhost-user ports
> 
> 
> Performance number is really important to show the advantages of new ideas.
> 
> > > 
> > > If this approach has merit then I hope others will take over and I'll
> > > play a smaller role addressing some of the todo items and cleanups.
> 
> It looks to me the advantages are 1) generic virtio layer (vhost-pci can
> achieve this too if necessary) 2) some kind of code reusing (vhost pmd). And
> I'd expect they have similar performance result consider no major
> differences between them.
> 
> Thanks
> 
> > > Stefan
> > 

Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Paolo Bonzini 6 years, 3 months ago
On 23/01/2018 17:07, Michael S. Tsirkin wrote:
>> It's not clear to me how to do this. E.g need a way to report failure to VM2
>> or #PF?
> 
> Why would there be a failure? qemu running vm1 would be responsible for
> preventing access to vm2's memory not mapped through an IOMMU.
> Basically munmap these.

Access to VM2's memory would use VM2's configured IOVAs for VM1's
requester id.  VM2's QEMU would send device IOTLB messages to VM1's QEMU,
which would remap VM2's memory on the fly into VM1's BAR2.

It's not trivial to do it efficiently, but it's possible.  The important
thing is that, if VM2 has an IOMMU, QEMU must *not* connect to a
virtio-vhost-user device that lacks IOTLB support.  But that would be a
vhost-user bug, not a virtio-vhost-user bug---and that's the beauty of
Stefan's approach. :)

Paolo

Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Michael S. Tsirkin 6 years, 3 months ago
On Thu, Jan 25, 2018 at 03:07:23PM +0100, Paolo Bonzini wrote:
> On 23/01/2018 17:07, Michael S. Tsirkin wrote:
> >> It's not clear to me how to do this. E.g need a way to report failure to VM2
> >> or #PF?
> > 
> > Why would there be a failure? qemu running vm1 would be responsible for
> > preventing access to vm2's memory not mapped through an IOMMU.
> > Basically munmap these.
> 
> Access to VM2's memory would use VM2's configured IOVAs for VM1's
> requester id.  VM2's QEMU send device IOTLB messages to VM1's QEMU,
> which would remap VM2's memory on the fly into VM1's BAR2.

Right. Almost.

One problem is that IOVA range is bigger than RAM range,
so you will have trouble making arbitrary virtual addresses
fit in a BAR.

This is why I suggested a hybrid approach where
translation happens within guest, qemu only does protection.

Another problem with it is that IOMMU has page granularity
while with hugetlbfs we might not be able to remap at that
granularity.

Not sure what to do about it - teach host to break
up pages? Pass limitation to guest through virtio-iommu?

Ideas?

> It's not trivial to do it efficiently, but it's possible.  The important
> thing is that, if VM2 has an IOMMU, QEMU must *not* connect to a
> virtio-vhost-user device that lacks IOTLB support.  But that would be a
> vhost-user bug, not a virtio-vhost-user bug---and that's the beauty of
> Stefan's approach. :)
> 
> Paolo

Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Jason Wang 6 years, 3 months ago

On 2018年01月25日 22:48, Michael S. Tsirkin wrote:
> On Thu, Jan 25, 2018 at 03:07:23PM +0100, Paolo Bonzini wrote:
>> On 23/01/2018 17:07, Michael S. Tsirkin wrote:
>>>> It's not clear to me how to do this. E.g need a way to report failure to VM2
>>>> or #PF?
>>> Why would there be a failure? qemu running vm1 would be responsible for
>>> preventing access to vm2's memory not mapped through an IOMMU.
>>> Basically munmap these.
>> Access to VM2's memory would use VM2's configured IOVAs for VM1's
>> requester id.  VM2's QEMU send device IOTLB messages to VM1's QEMU,
>> which would remap VM2's memory on the fly into VM1's BAR2.
> Right. Almost.

That would be extremely slow for dynamic mappings.

>
> One problem is that IOVA range is bigger than RAM range,
> so you will have trouble making arbitrary virtual addresses
> fit in a BAR.
>
> This is why I suggested a hybrid approach where
> translation happens within guest, qemu only does protection.
>
> Another problem with it is that IOMMU has page granularity
> while with hugetlbfs we might not be able to remap at that
> granularity.
>
> Not sure what to do about it - teach host to break
> up pages? Pass limitation to guest through virtio-iommu?

If we decide to go with virtio-iommu, maybe the device can limit its IOVA
range too.

Thanks

>
> Ideas?
>
>> It's not trivial to do it efficiently, but it's possible.  The important
>> thing is that, if VM2 has an IOMMU, QEMU must *not* connect to a
>> virtio-vhost-user device that lacks IOTLB support.  But that would be a
>> vhost-user bug, not a virtio-vhost-user bug---and that's the beauty of
>> Stefan's approach. :)
>>
>> Paolo


Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Stefan Hajnoczi 6 years, 3 months ago
On Mon, Jan 22, 2018 at 10:04:18PM +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 22, 2018 at 12:17:51PM +0000, Stefan Hajnoczi wrote:
> > On Mon, Jan 22, 2018 at 11:33:46AM +0800, Jason Wang wrote:
> > > On 2018年01月19日 21:06, Stefan Hajnoczi wrote:
> > > > These patches implement the virtio-vhost-user device design that I have
> > > > described here:
> > > > https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2830007
> > > 
> > > Thanks for the patches, looks rather interesting and similar to split device
> > > model used by Xen.
> > > 
> > > > 
> > > > The goal is to let the guest act as the vhost device backend for other guests.
> > > > This allows virtual networking and storage appliances to run inside guests.
> > > 
> > > So the question still, what kind of protocol do you want to run on top? If
> > > it was ethernet based, virtio-net work pretty well and it can even do
> > > migration.
> > > 
> > > > This device is particularly interesting for poll mode drivers where exitless
> > > > VM-to-VM communication is possible, completely bypassing the hypervisor in the
> > > > data path.
> > > 
> > > It's better to clarify the reason of hypervisor bypassing. (performance,
> > > security or scalability).
> > 
> > Performance - yes, definitely.  Exitless VM-to-VM is the fastest
> > possible way to communicate between VMs.  Today it can only be done
> > using ivshmem.  This patch series allows virtio devices to take
> > advantage of it and will encourage people to use virtio instead of
> > non-standard ivshmem devices.
> > 
> > Security - I don't think this feature is a security improvement.  It
> > reduces isolation because VM1 has full shared memory access to VM2.  In
> > fact, this is a reason for users to consider carefully whether they
> > even want to use this feature.
> 
> True without an IOMMU, however using a vIOMMU within VM2
> can protect the VM2, can't it?
[...]
> > Yes, this is the big weakness of vhost-user in general.  The IOMMU
> > feature doesn't offer good isolation
> 
> I think that's an implementation issue, not a protocol issue.

The IOMMU feature in the vhost-user protocol adds address translation
and permissions but the underlying memory region file descriptors still
expose all of guest RAM.

An implementation that offers true isolation has to enforce the IOMMU
state on the shared memory (e.g. read & write access).
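
To illustrate the kind of enforcement meant here (nothing implements this
today and the types below are made up): the trusted side that exposes the
mapping (e.g. the slave VM's QEMU in the virtio-vhost-user case) would
restrict it to the IOMMU-granted permissions, roughly:

  #include <sys/mman.h>
  #include <stddef.h>

  /* Made-up types, for illustration only */
  enum iommu_perm {
      IOMMU_PERM_NONE  = 0,
      IOMMU_PERM_READ  = 1,
      IOMMU_PERM_WRITE = 2,
  };

  struct mapped_region {
      void   *mmap_addr;   /* page-aligned mapping of the region fd */
      size_t  mmap_size;
  };

  /* Restrict the local mapping to what the IOMMU actually grants */
  static int enforce_iommu_perm(struct mapped_region *r, enum iommu_perm perm)
  {
      int prot = PROT_NONE;

      if (perm & IOMMU_PERM_READ) {
          prot |= PROT_READ;
      }
      if (perm & IOMMU_PERM_WRITE) {
          prot |= PROT_WRITE;
      }
      return mprotect(r->mmap_addr, r->mmap_size, prot);
  }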

> > and even when it does, performance
> > will be an issue.
> 
> If the IOMMU mappings are dynamic - but they are mostly
> static with e.g. dpdk, right?

Excellent idea!  My understanding is that DPDK's memory area for mbufs
is static.

The question is whether this area contains only packet payload buffers.
If so, it would be feasible to map only this memory (plus the vrings) and
achieve isolation.

If other data is intermingled with the packet payload buffers then it's
difficult to guarantee security since an evil VM can modify the
non-payload data to trigger race conditions, out-of-bounds memory
accesses, and other bugs.

Anyway, what you've suggested sounds like the most realistic option for
achieving isolation without losing performance.

> > > >   * Implement the VRING_KICK eventfd - currently vhost-user slaves must be poll
> > > >     mode drivers.
> > > >   * Optimize VRING_CALL doorbell with ioeventfd to avoid QEMU exit.
> > > 
> > > The performance implication needs to be measured. It looks to me both kick
> > > and call will introduce more latency form the point of guest.
> > 
> > I described the irqfd + ioeventfd approach above.  It should be faster
> > than virtio-net + bridge today.
> > 
> > > >   * vhost-user log feature
> > > >   * UUID config field for stable device identification regardless of PCI
> > > >     bus addresses.
> > > >   * vhost-user IOMMU and SLAVE_REQ_FD feature
> > > 
> > > So an assumption is the VM that implements vhost backends should be at least
> > > as secure as vhost-user backend process on host. Could we have this
> > > conclusion?
> > 
> > Yes.
> > 
> > Sadly the vhost-user IOMMU protocol feature does not provide isolation.
> > At the moment IOMMU is basically a layer of indirection (mapping) but
> > the vhost-user backend process still has full access to guest RAM :(.
> 
> An important feature would be to do the isolation in the qemu.
> So trust the qemu running VM2 but not VM2 itself.

I think what you suggested above about DPDK's static mappings is even
better.  It does not require trusting VM2's QEMU.  It's also simpler
than implementing isolation in QEMU.
Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Wei Wang 6 years, 3 months ago
On 01/22/2018 08:17 PM, Stefan Hajnoczi wrote:
> On Mon, Jan 22, 2018 at 11:33:46AM +0800, Jason Wang wrote:
>> On 2018年01月19日 21:06, Stefan Hajnoczi wrote:
>>
>> Probably not for the following cases:
>>
>> 1) kick/call
> I disagree here because kick/call is actually very efficient!
>
> VM1's irqfd is the ioeventfd for VM2.  When VM2 writes to the ioeventfd
> there is a single lightweight vmexit which injects an interrupt into
> VM1.  QEMU is not involved and the host kernel scheduler is not involved
> so this is a low-latency operation.
>
> I haven't tested this yet but the ioeventfd code looks like this will
> work.


This has been tested in the vhost-pci v2 patches, which worked with a
kernel driver. It worked pretty well.

>> Btw, it's better to have some early numbers, e.g what testpmd reports during
>> forwarding.
> I need to rely on others to do this (and many other things!) because
> virtio-vhost-user isn't the focus of my work.
>
> These patches were written to demonstrate my suggestions for vhost-pci.
> They were written at work but also on weekends, early mornings, and late
> nights to avoid delaying Wei and Zhiyong's vhost-pci work too much.
>
> If this approach has merit then I hope others will take over and I'll
> play a smaller role addressing some of the todo items and cleanups.

Thanks again for the great effort, your implementation looks nice.

If we finally decide to go with the virtio-vhost-user approach, I think 
zhiyong and I can help take over the work to continue, too.

I'm still thinking about solutions to the two issues that I shared
yesterday - it should behave like a normal PCI device, and if we unbind its
driver and bind it back, it should also work.


Best,
Wei




Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Wei Wang 6 years, 3 months ago
On 01/19/2018 09:06 PM, Stefan Hajnoczi wrote:
> These patches implement the virtio-vhost-user device design that I have
> described here:
> https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2830007
>
>
>   configure                                   |   18 +
>   hw/virtio/Makefile.objs                     |    1 +
>   hw/virtio/virtio-pci.h                      |   21 +
>   include/hw/pci/pci.h                        |    1 +
>   include/hw/virtio/vhost-user.h              |  106 +++
>   include/hw/virtio/virtio-vhost-user.h       |   88 +++
>   include/standard-headers/linux/virtio_ids.h |    1 +
>   hw/virtio/vhost-user.c                      |  100 +--
>   hw/virtio/virtio-pci.c                      |   61 ++
>   hw/virtio/virtio-vhost-user.c               | 1047 +++++++++++++++++++++++++++
>   hw/virtio/trace-events                      |   22 +
>   11 files changed, 1367 insertions(+), 99 deletions(-)
>   create mode 100644 include/hw/virtio/vhost-user.h
>   create mode 100644 include/hw/virtio/virtio-vhost-user.h
>   create mode 100644 hw/virtio/virtio-vhost-user.c
>

Thanks for the quick implementation. Not sure if the following issues
could be solved with this approach:
  - After we boot the slave VM, if we don't run the virtio-vhost-user
driver (i.e. testpmd), then the master VM can't boot, because the
booting of the virtio-net device relies on a negotiation with the
virtio-vhost-user driver.
  - Suppose in the future there is also a kernel virtio-vhost-user
driver, as for other PCI devices: can we unbind the kernel driver first, and
then bind the device to the dpdk driver? A normal PCI device should be
able to smoothly switch between the kernel driver and the dpdk driver.

Best,
Wei


Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Stefan Hajnoczi 6 years, 3 months ago
On Mon, Jan 22, 2018 at 07:09:06PM +0800, Wei Wang wrote:
> On 01/19/2018 09:06 PM, Stefan Hajnoczi wrote:
> Thanks for the quick implementation. Not sure if the following issues could
> be solved with this approach:
>  - After we boot the slave VM, if we don't run the virtio-vhost-user driver
> (i.e. testpmd), then the master VM couldn't boot, because the booting of the
> virtio-net device relies on a negotiation with the virtio-vhost-user driver.

This is a limitation of QEMU's vhost-user master implementation which
also affects AF_UNIX.  It cannot be solved by this patch series since
this is the slave side.

Here is what I suggest.  Introduce a new VIRTIO feature bit:

  #define VIRTIO_F_DEVICE_READY 34

When the device supports this bit we extend the device initialization
procedure.  If the device is not yet ready, the FEATURES_OK status bit
is not accepted by the device.  Device initialization fails temporarily
but the device may raise the configuration change interrupt to indicate
that device initialization should be retried.

Using a feature bit guarantees that existing device and driver behavior
remains unchanged.

On the QEMU side the changes are:

1. Virtio hardware registers and configuration space are available even
   when vhost-user is disconnected.  This shouldn't be difficult to
   implement because QEMU always has a virtio device model for each
   vhost-user device.  We just need to put dummy values in the
   registers/configuration space until vhost-user has connected.

2. When vhost-user connects, raise the configuration change interrupt
   and allow vhost-user to process.

On the guest side the changes are:

1. virtio_pci.ko must set VIRTIO_F_DEVICE_READY and handle !FEATURES_OK
   by waiting for the configuration change interrupt and retrying.

This doesn't fully solve the problem because it assumes that a connected
slave always responds to VIRTIO hardware register accesses (e.g. get/set
features).  If the slave crashes but leaves the virtio-vhost-user PCI
adapter connected then vhost-user requests from the master go
unanswered and cause hangs...
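
A rough sketch of the guest-side retry loop (the VIRTIO_F_DEVICE_READY bit
number is taken from the proposal above; the vdev_* helpers are invented
stand-ins for the virtio transport accessors):

  #include <stdbool.h>
  #include <stdint.h>

  #define VIRTIO_F_DEVICE_READY      34
  #define VIRTIO_STATUS_FEATURES_OK  0x08

  /* Invented stand-ins for the transport accessors */
  bool vdev_has_feature(uint64_t device_features, unsigned bit);
  uint8_t vdev_get_status(void);
  void vdev_set_status(uint8_t status);
  void vdev_wait_for_config_change(void);

  /* Returns true once the device (i.e. the vhost-user master behind it)
   * accepts FEATURES_OK; false means plain old initialization failure. */
  static bool negotiate_features_ok(uint64_t device_features)
  {
      for (;;) {
          vdev_set_status(vdev_get_status() | VIRTIO_STATUS_FEATURES_OK);
          if (vdev_get_status() & VIRTIO_STATUS_FEATURES_OK) {
              return true;
          }
          if (!vdev_has_feature(device_features, VIRTIO_F_DEVICE_READY)) {
              return false;
          }
          /* Not ready yet: wait for the configuration change interrupt */
          vdev_wait_for_config_change();
      }
  }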

>  - Suppose in the future there is also a kernel virtio-vhost-user driver as
> other PCI devices, can we unbind the kernel driver first, and then bind the
> device to the dpdk driver? A normal PCI device should be able to smoothly
> switch between the kernel driver and dpdk driver.

It depends what you mean by "smoothly switch".

If you mean whether it's possible to go from a kernel driver to
vfio-pci, then the answer is yes.

But if the kernel driver has an established vhost-user connection then
it will be closed.  This is the same as reconnecting with AF_UNIX
vhost-user.

Stefan
Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Wei Wang 6 years, 3 months ago
On 01/23/2018 07:12 PM, Stefan Hajnoczi wrote:
> On Mon, Jan 22, 2018 at 07:09:06PM +0800, Wei Wang wrote:
>> On 01/19/2018 09:06 PM, Stefan Hajnoczi wrote:
>>
>>
>>   - Suppose in the future there is also a kernel virtio-vhost-user driver as
>> other PCI devices, can we unbind the kernel driver first, and then bind the
>> device to the dpdk driver? A normal PCI device should be able to smoothly
>> switch between the kernel driver and dpdk driver.
> It depends what you mean by "smoothly switch".
>
> If you mean whether it's possible to go from a kernel driver to
> vfio-pci, then the answer is yes.
>
> But if the kernel driver has an established vhost-user connection then
> it will be closed.  This is the same as reconnecting with AF_UNIX
> vhost-user.
>

Actually this is not only the case when switching to testpmd after the
kernel establishes the connection, but also for several runs of testpmd.
That is, if we run testpmd and then exit testpmd, I think the second run
of testpmd won't work. I'm thinking about caching the received master msgs
in QEMU when virtio_vhost_user_parse_m2s().

Btw, I'm trying to run the code, but couldn't bind the virtio-vhost-user
device to vfio-pci (reports Unknown device), not sure if it is because
the device type is "Unclassified device".

Best,
Wei



Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Stefan Hajnoczi 6 years, 3 months ago
On Tue, Jan 23, 2018 at 09:06:49PM +0800, Wei Wang wrote:
> On 01/23/2018 07:12 PM, Stefan Hajnoczi wrote:
> > On Mon, Jan 22, 2018 at 07:09:06PM +0800, Wei Wang wrote:
> > > On 01/19/2018 09:06 PM, Stefan Hajnoczi wrote:
> > > 
> > > 
> > >   - Suppose in the future there is also a kernel virtio-vhost-user driver as
> > > other PCI devices, can we unbind the kernel driver first, and then bind the
> > > device to the dpdk driver? A normal PCI device should be able to smoothly
> > > switch between the kernel driver and dpdk driver.
> > It depends what you mean by "smoothly switch".
> > 
> > If you mean whether it's possible to go from a kernel driver to
> > vfio-pci, then the answer is yes.
> > 
> > But if the kernel driver has an established vhost-user connection then
> > it will be closed.  This is the same as reconnecting with AF_UNIX
> > vhost-user.
> > 
> 
> Actually not only the case of switching to testpmd after kernel establishes
> the connection, but also for several runs of testpmd. That is, if we run
> testpmd, then exit testpmd. I think the second run of testpmd won't work.

The vhost-user master must reconnect and initialize again (SET_FEATURES,
SET_MEM_TABLE, etc).  Is your master reconnecting after the AF_UNIX
connection is closed?

> I'm thinking about caching the received master msgs in QEMU when
> virtio_vhost_user_parse_m2s().

Why is that necessary and how does QEMU know they are still up-to-date
when a new connection is made?

> Btw, I'm trying to run the code, but couldn't bind the virito-vhost-user
> device to vfio-pci (reports Unknown device), not sure if it is because the
> device type is "Unclassified device".

You need to use the modified usertools/dpdk-devbind.py from my patch
series inside the guest.  Please see:
https://dpdk.org/ml/archives/dev/2018-January/088177.html

Stefan
Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Wei Wang 6 years, 3 months ago
On 01/24/2018 07:40 PM, Stefan Hajnoczi wrote:
> On Tue, Jan 23, 2018 at 09:06:49PM +0800, Wei Wang wrote:
>> On 01/23/2018 07:12 PM, Stefan Hajnoczi wrote:
>>> On Mon, Jan 22, 2018 at 07:09:06PM +0800, Wei Wang wrote:
>>>> On 01/19/2018 09:06 PM, Stefan Hajnoczi wrote:
>>>>
>>>>
>>>>    - Suppose in the future there is also a kernel virtio-vhost-user driver as
>>>> other PCI devices, can we unbind the kernel driver first, and then bind the
>>>> device to the dpdk driver? A normal PCI device should be able to smoothly
>>>> switch between the kernel driver and dpdk driver.
>>> It depends what you mean by "smoothly switch".
>>>
>>> If you mean whether it's possible to go from a kernel driver to
>>> vfio-pci, then the answer is yes.
>>>
>>> But if the kernel driver has an established vhost-user connection then
>>> it will be closed.  This is the same as reconnecting with AF_UNIX
>>> vhost-user.
>>>
>> Actually not only the case of switching to testpmd after kernel establishes
>> the connection, but also for several runs of testpmd. That is, if we run
>> testpmd, then exit testpmd. I think the second run of testpmd won't work.
> The vhost-user master must reconnect and initialize again (SET_FEATURES,
> SET_MEM_TABLE, etc).  Is your master reconnecting after the AF_UNIX
> connection is closed?

Is this done via an explicit QMP operation to make the master reconnect?

>
>> I'm thinking about caching the received master msgs in QEMU when
>> virtio_vhost_user_parse_m2s().
> Why is that necessary and how does QEMU know they are still up-to-date
> when a new connection is made?

OK. I think that's not needed if the master can reconnect.

>
>> Btw, I'm trying to run the code, but couldn't bind the virito-vhost-user
>> device to vfio-pci (reports Unknown device), not sure if it is because the
>> device type is "Unclassified device".
> You need to use the modified usertools/dpdk-devbind.py from my patch
> series inside the guest.  Please see:
> https://dpdk.org/ml/archives/dev/2018-January/088177.html

Thanks. I haven't had time for this yet due to some more urgent things.
I'll try.

Best,
Wei


Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Stefan Hajnoczi 6 years, 3 months ago
On Thu, Jan 25, 2018 at 06:19:13PM +0800, Wei Wang wrote:
> On 01/24/2018 07:40 PM, Stefan Hajnoczi wrote:
> > On Tue, Jan 23, 2018 at 09:06:49PM +0800, Wei Wang wrote:
> > > On 01/23/2018 07:12 PM, Stefan Hajnoczi wrote:
> > > > On Mon, Jan 22, 2018 at 07:09:06PM +0800, Wei Wang wrote:
> > > > > On 01/19/2018 09:06 PM, Stefan Hajnoczi wrote:
> > > > > 
> > > > > 
> > > > >    - Suppose in the future there is also a kernel virtio-vhost-user driver as
> > > > > other PCI devices, can we unbind the kernel driver first, and then bind the
> > > > > device to the dpdk driver? A normal PCI device should be able to smoothly
> > > > > switch between the kernel driver and dpdk driver.
> > > > It depends what you mean by "smoothly switch".
> > > > 
> > > > If you mean whether it's possible to go from a kernel driver to
> > > > vfio-pci, then the answer is yes.
> > > > 
> > > > But if the kernel driver has an established vhost-user connection then
> > > > it will be closed.  This is the same as reconnecting with AF_UNIX
> > > > vhost-user.
> > > > 
> > > Actually not only the case of switching to testpmd after kernel establishes
> > > the connection, but also for several runs of testpmd. That is, if we run
> > > testpmd, then exit testpmd. I think the second run of testpmd won't work.
> > The vhost-user master must reconnect and initialize again (SET_FEATURES,
> > SET_MEM_TABLE, etc).  Is your master reconnecting after the AF_UNIX
> > connection is closed?
> 
> Is this an explicit qmp operation to make the master re-connect?

I haven't tested it myself but I'm aware of two modes of operation:

1. -chardev socket,id=chardev0,...,server
   -netdev vhost-user,chardev=chardev0

   When the vhost-user socket is disconnected the peer needs to
   reconnect.  In this case no special commands are necessary.

   Here we're relying on DPDK librte_vhost's reconnection behavior.

Or

2. -chardev socket,id=chardev0,...,reconnect=3
   -netdev vhost-user,chardev=chardev0

   When the vhost-user socket is disconnected a new connection attempt
   will be made after 3 seconds.

In both cases vhost-user negotiation will resume when the new connection
is established.

Stefan
Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Wei Wang 6 years, 2 months ago
On 01/26/2018 10:44 PM, Stefan Hajnoczi wrote:
> On Thu, Jan 25, 2018 at 06:19:13PM +0800, Wei Wang wrote:
>> On 01/24/2018 07:40 PM, Stefan Hajnoczi wrote:
>>> On Tue, Jan 23, 2018 at 09:06:49PM +0800, Wei Wang wrote:
>>>> On 01/23/2018 07:12 PM, Stefan Hajnoczi wrote:
>>>>> On Mon, Jan 22, 2018 at 07:09:06PM +0800, Wei Wang wrote:
>>>>>> On 01/19/2018 09:06 PM, Stefan Hajnoczi wrote:
>>>>>>
>>>>>>
>>>>>>     - Suppose in the future there is also a kernel virtio-vhost-user driver as
>>>>>> other PCI devices, can we unbind the kernel driver first, and then bind the
>>>>>> device to the dpdk driver? A normal PCI device should be able to smoothly
>>>>>> switch between the kernel driver and dpdk driver.
>>>>> It depends what you mean by "smoothly switch".
>>>>>
>>>>> If you mean whether it's possible to go from a kernel driver to
>>>>> vfio-pci, then the answer is yes.
>>>>>
>>>>> But if the kernel driver has an established vhost-user connection then
>>>>> it will be closed.  This is the same as reconnecting with AF_UNIX
>>>>> vhost-user.
>>>>>
>>>> Actually not only the case of switching to testpmd after kernel establishes
>>>> the connection, but also for several runs of testpmd. That is, if we run
>>>> testpmd, then exit testpmd. I think the second run of testpmd won't work.
>>> The vhost-user master must reconnect and initialize again (SET_FEATURES,
>>> SET_MEM_TABLE, etc).  Is your master reconnecting after the AF_UNIX
>>> connection is closed?
>> Is this an explicit qmp operation to make the master re-connect?
> I haven't tested it myself but I'm aware of two modes of operation:
>
> 1. -chardev socket,id=chardev0,...,server
>     -netdev vhost-user,chardev=chardev0
>
>     When the vhost-user socket is disconnected the peer needs to
>     reconnect.  In this case no special commands are necessary.
>
>     Here we're relying on DPDK librte_vhost's reconnection behavior.
>
> Or
>
> 2. -chardev socket,id=chardev0,...,reconnect=3
>     -netdev vhost-user,chardev=chardev0
>
>     When the vhost-user socket is disconnected a new connection attempt
>     will be made after 3 seconds.
>
> In both cases vhost-user negotiation will resume when the new connection
> is established.
>
> Stefan

I've been thinking about the issues, and it looks like vhost-pci does
better in this respect.
Vhost-pci is like using a mailbox: messages are simply dropped into the
box, and whenever the vhost-pci pmd gets booted it can always fetch the
messages from the box, so the negotiation between the vhost-pci pmd and
virtio-net is asynchronous.
Virtio-vhost-user is like a phone call, which is synchronous
communication. If one side is absent, the other side either hangs on
without knowing when it will get connected, or hangs up with messages
left undelivered (lost).

I also think the above solutions won't help. Please see below:

Background:
The vhost-user negotiation is currently split into two phases. The 1st
phase happens when the connection is established; what it does can be
seen in vhost_user_init(). The 2nd phase happens when the master driver
is loaded (e.g. when the virtio-net pmd runs) and sets the status on the
device; what it does can be seen in vhost_dev_start(), which includes
sending the memory info and virtqueue info. The socket stays connected
until one of the QEMU devices exits, so a pmd exiting doesn't end the
QEMU-side socket connection.
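
For concreteness, here is a minimal, hedged sketch of that split. It is
not QEMU or DPDK code: send_msg() and the phase functions are made-up
stand-ins, and the exact message set and ordering can differ, but it
shows which messages depend only on the socket connecting and which
depend on the guest driver (re)loading.

/* Standalone toy model of the two vhost-user negotiation phases. */
#include <stdio.h>

static void send_msg(const char *name)
{
    /* A real master sends a VhostUserMsg over the AF_UNIX socket. */
    printf("master -> slave: %s\n", name);
}

/* Phase 1: when the socket connects (roughly vhost_user_init()). */
static void phase1_on_connect(void)
{
    send_msg("VHOST_USER_GET_FEATURES");
    send_msg("VHOST_USER_GET_PROTOCOL_FEATURES");
    send_msg("VHOST_USER_SET_PROTOCOL_FEATURES");
    send_msg("VHOST_USER_GET_QUEUE_NUM");
    send_msg("VHOST_USER_SET_OWNER");
}

/* Phase 2: when the master driver loads (roughly vhost_dev_start()). */
static void phase2_on_driver_start(int nvqs)
{
    send_msg("VHOST_USER_SET_FEATURES");
    send_msg("VHOST_USER_SET_MEM_TABLE");
    for (int i = 0; i < nvqs; i++) {
        send_msg("VHOST_USER_SET_VRING_NUM");
        send_msg("VHOST_USER_SET_VRING_BASE");
        send_msg("VHOST_USER_SET_VRING_ADDR");
        send_msg("VHOST_USER_SET_VRING_KICK");
        send_msg("VHOST_USER_SET_VRING_CALL");
    }
}

int main(void)
{
    phase1_on_connect();        /* repeated on every socket (re)connect */
    phase2_on_driver_start(2);  /* only when the master driver (re)loads */
    return 0;
}

If only the socket reconnects, only phase1_on_connect() runs again,
which is exactly the problem described below.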

Issues:
Suppose we have both the vhost and virtio-net sides set up, and vhost
pmd <-> virtio-net pmd communication works well. Now the vhost pmd exits
(the virtio-net pmd is still there). Some time later we re-run the vhost
pmd, but it doesn't know the virtqueue addresses of the virtio-net pmd
unless the virtio-net pmd reloads to start the 2nd phase of the
vhost-user protocol. So the second run of the vhost pmd won't work.

Any thoughts?

Best,
Wei

Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Michael S. Tsirkin 6 years, 2 months ago
On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
> Issues:
> Suppose we have both the vhost and virtio-net set up, and vhost pmd <->
> virtio-net pmd communication works well. Now, vhost pmd exits (virtio-net
> pmd is still there). Some time later, we re-run vhost pmd, the vhost pmd
> doesn't know the virtqueue addresses of the virtio-net pmd, unless the
> virtio-net pmd reloads to start the 2nd phase of the vhost-user protocol. So
> the second run of the vhost pmd won't work.
> 
> Any thoughts?
> 
> Best,
> Wei

So vhost in qemu must resend all configuration on reconnect.
Does this address the issues?

-- 
MST

Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Wei Wang 6 years, 2 months ago
On 02/02/2018 01:08 AM, Michael S. Tsirkin wrote:
> On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
>> Issues:
>> Suppose we have both the vhost and virtio-net set up, and vhost pmd <->
>> virtio-net pmd communication works well. Now, vhost pmd exits (virtio-net
>> pmd is still there). Some time later, we re-run vhost pmd, the vhost pmd
>> doesn't know the virtqueue addresses of the virtio-net pmd, unless the
>> virtio-net pmd reloads to start the 2nd phase of the vhost-user protocol. So
>> the second run of the vhost pmd won't work.
>>
>> Any thoughts?
>>
>> Best,
>> Wei
> So vhost in qemu must resend all configuration on reconnect.
> Does this address the issues?
>

Yes, but the issues are
1) there is no reconnection when a pmd exits (the socket connection 
seems to stay up at the device layer);
2) even if we find a way to break the QEMU-layer socket connection when 
the pmd exits and get it to reconnect, the virtio-net device still won't 
send all the configuration when reconnecting, because connecting the 
socket only triggers phase 1 of the vhost-user negotiation (i.e. 
vhost_user_init). Phase 2 is triggered after the driver loads (i.e. 
vhost_net_start). If the virtio-net pmd doesn't reload, there are no 
phase 2 messages (such as the virtqueue addresses, which are allocated 
by the pmd). I think we need to think more about this before moving 
forward.

Best,
Wei

Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Stefan Hajnoczi 6 years, 2 months ago
On Fri, Feb 02, 2018 at 09:08:44PM +0800, Wei Wang wrote:
> On 02/02/2018 01:08 AM, Michael S. Tsirkin wrote:
> > On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
> > > Issues:
> > > Suppose we have both the vhost and virtio-net set up, and vhost pmd <->
> > > virtio-net pmd communication works well. Now, vhost pmd exits (virtio-net
> > > pmd is still there). Some time later, we re-run vhost pmd, the vhost pmd
> > > doesn't know the virtqueue addresses of the virtio-net pmd, unless the
> > > virtio-net pmd reloads to start the 2nd phase of the vhost-user protocol. So
> > > the second run of the vhost pmd won't work.
> > > 
> > > Any thoughts?
> > > 
> > > Best,
> > > Wei
> > So vhost in qemu must resend all configuration on reconnect.
> > Does this address the issues?
> > 
> 
> Yes, but the issues are
> 1) there is no reconnecting when a pmd exits (the socket connection seems
> still on at the device layer);

This is how real hardware works too.  If the driver suddenly stops
running then the device remains operational.  When the driver is started
again it resets the device and initializes it.

> 2) If we find a way to break the QEMU layer socket connection when pmd exits
> and get it reconnect, virtio-net device still won't send all the configure
> when reconnecting, because socket connecting only triggers phase 1 of
> vhost-user negotiation (i.e. vhost_user_init). Phase 2 is triggered after
> the driver loads (i.e. vhost_net_start). If the virtio-net pmd doesn't
> reload, there are no phase 2 messages (like virtqueue addresses which are
> allocated by the pmd). I think we need to think more about this before
> moving forward.

Marc-André: How does vhost-user reconnect work when the master goes away
and a new master comes online?  Wei found that the QEMU slave
implementation only does partial vhost-user initialization upon
reconnect, so the new master doesn't get the virtqueue address and
related information.  Is this a QEMU bug?

Stefan
Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Wang, Wei W 6 years, 2 months ago
On Tuesday, February 6, 2018 12:26 AM, Stefan Hajnoczi wrote:
> On Fri, Feb 02, 2018 at 09:08:44PM +0800, Wei Wang wrote:
> > On 02/02/2018 01:08 AM, Michael S. Tsirkin wrote:
> > > On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
> > > > Issues:
> > > > Suppose we have both the vhost and virtio-net set up, and vhost
> > > > pmd <-> virtio-net pmd communication works well. Now, vhost pmd
> > > > exits (virtio-net pmd is still there). Some time later, we re-run
> > > > vhost pmd, the vhost pmd doesn't know the virtqueue addresses of
> > > > the virtio-net pmd, unless the virtio-net pmd reloads to start the
> > > > 2nd phase of the vhost-user protocol. So the second run of the vhost
> pmd won't work.
> > > >
> > > > Any thoughts?
> > > >
> > > > Best,
> > > > Wei
> > > So vhost in qemu must resend all configuration on reconnect.
> > > Does this address the issues?
> > >
> >
> > Yes, but the issues are
> > 1) there is no reconnecting when a pmd exits (the socket connection
> > seems still on at the device layer);
> 
> This is how real hardware works too.  If the driver suddenly stops running
> then the device remains operational.  When the driver is started again it
> resets the device and initializes it.
> 
> > 2) If we find a way to break the QEMU layer socket connection when pmd
> > exits and get it reconnect, virtio-net device still won't send all the
> > configure when reconnecting, because socket connecting only triggers
> > phase 1 of vhost-user negotiation (i.e. vhost_user_init). Phase 2 is
> > triggered after the driver loads (i.e. vhost_net_start). If the
> > virtio-net pmd doesn't reload, there are no phase 2 messages (like
> > virtqueue addresses which are allocated by the pmd). I think we need
> > to think more about this before moving forward.
> 
> Marc-André: How does vhost-user reconnect work when the master goes
> away and a new master comes online?  Wei found that the QEMU slave
> implementation only does partial vhost-user initialization upon reconnect, so
> the new master doesn't get the virtqueue address and related information.
> Is this a QEMU bug?

Actually we are discussing the slave (vhost is the slave, right?) going away. When a slave exits and some time later a new slave runs, the master (virtio-net) won't send the virtqueue addresses to the new vhost slave.

Best,
Wei

Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Stefan Hajnoczi 6 years, 2 months ago
On Tue, Feb 06, 2018 at 01:28:25AM +0000, Wang, Wei W wrote:
> On Tuesday, February 6, 2018 12:26 AM, Stefan Hajnoczi wrote:
> > On Fri, Feb 02, 2018 at 09:08:44PM +0800, Wei Wang wrote:
> > > On 02/02/2018 01:08 AM, Michael S. Tsirkin wrote:
> > > > On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
> > > > > Issues:
> > > > > Suppose we have both the vhost and virtio-net set up, and vhost
> > > > > pmd <-> virtio-net pmd communication works well. Now, vhost pmd
> > > > > exits (virtio-net pmd is still there). Some time later, we re-run
> > > > > vhost pmd, the vhost pmd doesn't know the virtqueue addresses of
> > > > > the virtio-net pmd, unless the virtio-net pmd reloads to start the
> > > > > 2nd phase of the vhost-user protocol. So the second run of the vhost
> > pmd won't work.
> > > > >
> > > > > Any thoughts?
> > > > >
> > > > > Best,
> > > > > Wei
> > > > So vhost in qemu must resend all configuration on reconnect.
> > > > Does this address the issues?
> > > >
> > >
> > > Yes, but the issues are
> > > 1) there is no reconnecting when a pmd exits (the socket connection
> > > seems still on at the device layer);
> > 
> > This is how real hardware works too.  If the driver suddenly stops running
> > then the device remains operational.  When the driver is started again it
> > resets the device and initializes it.
> > 
> > > 2) If we find a way to break the QEMU layer socket connection when pmd
> > > exits and get it reconnect, virtio-net device still won't send all the
> > > configure when reconnecting, because socket connecting only triggers
> > > phase 1 of vhost-user negotiation (i.e. vhost_user_init). Phase 2 is
> > > triggered after the driver loads (i.e. vhost_net_start). If the
> > > virtio-net pmd doesn't reload, there are no phase 2 messages (like
> > > virtqueue addresses which are allocated by the pmd). I think we need
> > > to think more about this before moving forward.
> > 
> > Marc-André: How does vhost-user reconnect work when the master goes
> > away and a new master comes online?  Wei found that the QEMU slave
> > implementation only does partial vhost-user initialization upon reconnect, so
> > the new master doesn't get the virtqueue address and related information.
> > Is this a QEMU bug?
> 
> Actually we are discussing the slave (vhost is the slave, right?) going away. When a slave exits and some moment later a new slave runs, the master (virtio-net) won't send the virtqueue addresses to the new vhost slave.

Yes, apologies for the typo.  s/QEMU slave/QEMU master/

Yesterday I asked Marc-André for help on IRC and we found the code path
where the QEMU master performs phase 2 negotiation upon reconnect.  It's
not obvious but the qmp_set_link() calls in net_vhost_user_event() will
do it.
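
Here is a standalone toy model of that code path as I understand it.
None of this is actual QEMU code: qmp_set_link() and
net_vhost_user_event() are the real function names, but the *_model()
helpers below are made up purely to show the flow.

#include <stdio.h>
#include <stdbool.h>

enum chr_event { EVENT_CLOSED, EVENT_OPENED };

static void vhost_net_start_model(void)
{
    printf("redoing phase 2: mem table, vring addresses, ...\n");
}

/* Stands in for qmp_set_link(): bringing the link back up restarts vhost. */
static void set_link_model(bool up)
{
    if (up) {
        vhost_net_start_model();
    } else {
        printf("link down, vhost stopped\n");
    }
}

/* Stands in for net_vhost_user_event(): chardev events drive the link. */
static void net_vhost_user_event_model(enum chr_event e)
{
    set_link_model(e == EVENT_OPENED);
}

int main(void)
{
    net_vhost_user_event_model(EVENT_CLOSED);  /* the slave went away */
    net_vhost_user_event_model(EVENT_OPENED);  /* new connection: phase 2 again */
    return 0;
}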

I'm going to try to reproduce the issue you're seeing now.  Will let you
know what I find.

Stefan
Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Wang, Wei W 6 years, 2 months ago
On Tuesday, February 6, 2018 5:32 PM, Stefan Hajnoczi wrote:
> On Tue, Feb 06, 2018 at 01:28:25AM +0000, Wang, Wei W wrote:
> > On Tuesday, February 6, 2018 12:26 AM, Stefan Hajnoczi wrote:
> > > On Fri, Feb 02, 2018 at 09:08:44PM +0800, Wei Wang wrote:
> > > > On 02/02/2018 01:08 AM, Michael S. Tsirkin wrote:
> > > > > On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
> > > > > > Issues:
> > > > > > Suppose we have both the vhost and virtio-net set up, and 
> > > > > > vhost pmd <-> virtio-net pmd communication works well. Now, 
> > > > > > vhost pmd exits (virtio-net pmd is still there). Some time 
> > > > > > later, we re-run vhost pmd, the vhost pmd doesn't know the 
> > > > > > virtqueue addresses of the virtio-net pmd, unless the 
> > > > > > virtio-net pmd reloads to start the 2nd phase of the 
> > > > > > vhost-user protocol. So the second run of the vhost
> > > pmd won't work.
> > > > > >
> > > > > > Any thoughts?
> > > > > >
> > > > > > Best,
> > > > > > Wei
> > > > > So vhost in qemu must resend all configuration on reconnect.
> > > > > Does this address the issues?
> > > > >
> > > >
> > > > Yes, but the issues are
> > > > 1) there is no reconnecting when a pmd exits (the socket 
> > > > connection seems still on at the device layer);
> > >
> > > This is how real hardware works too.  If the driver suddenly stops 
> > > running then the device remains operational.  When the driver is 
> > > started again it resets the device and initializes it.
> > >
> > > > 2) If we find a way to break the QEMU layer socket connection 
> > > > when pmd exits and get it reconnect, virtio-net device still 
> > > > won't send all the configure when reconnecting, because socket 
> > > > connecting only triggers phase 1 of vhost-user negotiation (i.e.
> > > > vhost_user_init). Phase 2 is triggered after the driver loads 
> > > > (i.e. vhost_net_start). If the virtio-net pmd doesn't reload, 
> > > > there are no phase 2 messages (like virtqueue addresses which 
> > > > are allocated by the pmd). I think we need to think more about 
> > > > this before
> moving forward.
> > >
> > > Marc-André: How does vhost-user reconnect work when the master 
> > > goes away and a new master comes online?  Wei found that the QEMU 
> > > slave implementation only does partial vhost-user initialization 
> > > upon reconnect, so the new master doesn't get the virtqueue 
> > > address and
> related information.
> > > Is this a QEMU bug?
> >
> > Actually we are discussing the slave (vhost is the slave, right?) going away.
> When a slave exits and some moment later a new slave runs, the master
> (virtio-net) won't send the virtqueue addresses to the new vhost slave.
> 
> Yes, apologies for the typo.  s/QEMU slave/QEMU master/
> 
> Yesterday I asked Marc-André for help on IRC and we found the code 
> path where the QEMU master performs phase 2 negotiation upon 
> reconnect.  It's not obvious but the qmp_set_link() calls in net_vhost_user_event() will do it.
> 
> I'm going to try to reproduce the issue you're seeing now.  Will let 
> you know what I find.
> 

OK, thanks. I observed no messages after re-running the virtio-vhost-user pmd, and found that no re-connection event happens on the device side.

I also tried to switch the client/server roles: virtio-net runs the server socket and virtio-vhost-user runs the client, and it seems the current code fails to run that way. The reason is that the virtio-net side vhost_user_get_features() doesn't return. On the vhost side, I don't see virtio_vhost_user_deliver_m2s being invoked to deliver the GET_FEATURES message. I'll come back to continue later.

Best,
Wei

Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Stefan Hajnoczi 6 years, 2 months ago
On Tue, Feb 06, 2018 at 12:42:36PM +0000, Wang, Wei W wrote:
> On Tuesday, February 6, 2018 5:32 PM, Stefan Hajnoczi wrote:
> > On Tue, Feb 06, 2018 at 01:28:25AM +0000, Wang, Wei W wrote:
> > > On Tuesday, February 6, 2018 12:26 AM, Stefan Hajnoczi wrote:
> > > > On Fri, Feb 02, 2018 at 09:08:44PM +0800, Wei Wang wrote:
> > > > > On 02/02/2018 01:08 AM, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
> > > > > > > Issues:
> > > > > > > Suppose we have both the vhost and virtio-net set up, and 
> > > > > > > vhost pmd <-> virtio-net pmd communication works well. Now, 
> > > > > > > vhost pmd exits (virtio-net pmd is still there). Some time 
> > > > > > > later, we re-run vhost pmd, the vhost pmd doesn't know the 
> > > > > > > virtqueue addresses of the virtio-net pmd, unless the 
> > > > > > > virtio-net pmd reloads to start the 2nd phase of the 
> > > > > > > vhost-user protocol. So the second run of the vhost
> > > > pmd won't work.
> > > > > > >
> > > > > > > Any thoughts?
> > > > > > >
> > > > > > > Best,
> > > > > > > Wei
> > > > > > So vhost in qemu must resend all configuration on reconnect.
> > > > > > Does this address the issues?
> > > > > >
> > > > >
> > > > > Yes, but the issues are
> > > > > 1) there is no reconnecting when a pmd exits (the socket 
> > > > > connection seems still on at the device layer);
> > > >
> > > > This is how real hardware works too.  If the driver suddenly stops 
> > > > running then the device remains operational.  When the driver is 
> > > > started again it resets the device and initializes it.
> > > >
> > > > > 2) If we find a way to break the QEMU layer socket connection 
> > > > > when pmd exits and get it reconnect, virtio-net device still 
> > > > > won't send all the configure when reconnecting, because socket 
> > > > > connecting only triggers phase 1 of vhost-user negotiation (i.e.
> > > > > vhost_user_init). Phase 2 is triggered after the driver loads 
> > > > > (i.e. vhost_net_start). If the virtio-net pmd doesn't reload, 
> > > > > there are no phase 2 messages (like virtqueue addresses which 
> > > > > are allocated by the pmd). I think we need to think more about 
> > > > > this before
> > moving forward.
> > > >
> > > > Marc-André: How does vhost-user reconnect work when the master 
> > > > goes away and a new master comes online?  Wei found that the QEMU 
> > > > slave implementation only does partial vhost-user initialization 
> > > > upon reconnect, so the new master doesn't get the virtqueue 
> > > > address and
> > related information.
> > > > Is this a QEMU bug?
> > >
> > > Actually we are discussing the slave (vhost is the slave, right?) going away.
> > When a slave exits and some moment later a new slave runs, the master
> > (virtio-net) won't send the virtqueue addresses to the new vhost slave.
> > 
> > Yes, apologies for the typo.  s/QEMU slave/QEMU master/
> > 
> > Yesterday I asked Marc-André for help on IRC and we found the code 
> > path where the QEMU master performs phase 2 negotiation upon 
> > reconnect.  It's not obvious but the qmp_set_link() calls in net_vhost_user_event() will do it.
> > 
> > I'm going to try to reproduce the issue you're seeing now.  Will let 
> > you know what I find.
> > 
> 
> OK. Thanks. I observed no messages after re-run virtio-vhost-user pmd, and found there is no re-connection event happening in the device side. 
> 
> I also tried to switch the role of client/server - virtio-net to run a server socket, and virtio-vhost-user to run the client, and it seems the current code fails to run that way. The reason is the virtio-net side vhost_user_get_features() doesn't return. On the vhost side, I don't see virtio_vhost_user_deliver_m2s being invoked to deliver the GET_FEATURES message. I'll come back to continue later.

This morning I reached the conclusion that reconnection is currently
broken in the QEMU vhost-user master.  It's a bug in the QEMU vhost-user
master implementation, not a design or protocol problem.

On my machine the following QEMU command-line does not launch because
vhost-user.c gets stuck while trying to connect/negotiate:

  qemu -M accel=kvm -cpu host -m 1G \
       -object memory-backend-file,id=mem0,mem-path=/var/tmp/foo,size=1G,share=on \
       -numa node,memdev=mem0 \
       -drive if=virtio,file=test.img,format=raw \
       -chardev socket,id=chardev0,path=vhost-user.sock,reconnect=1 \
       -netdev vhost-user,chardev=chardev0,id=netdev0 \
       -device virtio-net-pci,netdev=netdev0

Commit c89804d674e4e3804bd3ac1fe79650896044b4e8 ("vhost-user: wait until
backend init is completed") broke reconnect by introducing a call to
qemu_chr_fe_wait_connected().

qemu_chr_fe_wait_connected() doesn't work together with -chardev
...,reconnect=1.  This is because reconnect=1 connects asynchronously
and then qemu_chr_fe_wait_connected() connects synchronously (if the async
connect hasn't completed yet).  This means there will be 2 sockets
connecting to the vhost-user slave!

The virtio-vhost-user slave accepts the first connection but never
receives any data because the QEMU master is trying to use the 2nd
socket instead.

Reconnection probably worked when Marc-André implemented it since QEMU
wasn't using qemu_chr_fe_wait_connected().
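
A standalone toy model of the race, under my reading of the above (the
chardev and vhost-user internals are reduced to a counter; only the
QEMU function names mentioned earlier are real):

#include <stdio.h>
#include <stdbool.h>

static int slave_connections;   /* sockets that reach the vhost-user slave */
static bool async_done;         /* has the reconnect=N attempt completed?  */

/* Models the background connect scheduled by -chardev ...,reconnect=1. */
static void async_reconnect_completes(void)
{
    slave_connections++;
    async_done = true;
}

/* Models qemu_chr_fe_wait_connected(): it connects synchronously if the
 * asynchronous attempt has not finished yet, creating a second socket. */
static void wait_connected_model(void)
{
    if (!async_done) {
        slave_connections++;
    }
}

int main(void)
{
    wait_connected_model();       /* vhost-user init waits for a connection */
    async_reconnect_completes();  /* the async connect lands afterwards     */
    printf("sockets seen by the slave: %d\n", slave_connections); /* 2 */
    return 0;
}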

Marc-André: How do you think this should be fixed?

Stefan
Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Stefan Hajnoczi 6 years, 2 months ago
On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
> On 01/26/2018 10:44 PM, Stefan Hajnoczi wrote:
> > On Thu, Jan 25, 2018 at 06:19:13PM +0800, Wei Wang wrote:
> > > On 01/24/2018 07:40 PM, Stefan Hajnoczi wrote:
> > > > On Tue, Jan 23, 2018 at 09:06:49PM +0800, Wei Wang wrote:
> > > > > On 01/23/2018 07:12 PM, Stefan Hajnoczi wrote:
> > > > > > On Mon, Jan 22, 2018 at 07:09:06PM +0800, Wei Wang wrote:
> > > > > > > On 01/19/2018 09:06 PM, Stefan Hajnoczi wrote:
> > > > > > > 
> > > > > > > 
> > > > > > >     - Suppose in the future there is also a kernel virtio-vhost-user driver as
> > > > > > > other PCI devices, can we unbind the kernel driver first, and then bind the
> > > > > > > device to the dpdk driver? A normal PCI device should be able to smoothly
> > > > > > > switch between the kernel driver and dpdk driver.
> > > > > > It depends what you mean by "smoothly switch".
> > > > > > 
> > > > > > If you mean whether it's possible to go from a kernel driver to
> > > > > > vfio-pci, then the answer is yes.
> > > > > > 
> > > > > > But if the kernel driver has an established vhost-user connection then
> > > > > > it will be closed.  This is the same as reconnecting with AF_UNIX
> > > > > > vhost-user.
> > > > > > 
> > > > > Actually not only the case of switching to testpmd after kernel establishes
> > > > > the connection, but also for several runs of testpmd. That is, if we run
> > > > > testpmd, then exit testpmd. I think the second run of testpmd won't work.
> > > > The vhost-user master must reconnect and initialize again (SET_FEATURES,
> > > > SET_MEM_TABLE, etc).  Is your master reconnecting after the AF_UNIX
> > > > connection is closed?
> > > Is this an explicit qmp operation to make the master re-connect?
> > I haven't tested it myself but I'm aware of two modes of operation:
> > 
> > 1. -chardev socket,id=chardev0,...,server
> >     -netdev vhost-user,chardev=chardev0
> > 
> >     When the vhost-user socket is disconnected the peer needs to
> >     reconnect.  In this case no special commands are necessary.
> > 
> >     Here we're relying on DPDK librte_vhost's reconnection behavior.
> > 
> > Or
> > 
> > 2. -chardev socket,id=chardev0,...,reconnect=3
> >     -netdev vhost-user,chardev=chardev0
> > 
> >     When the vhost-user socket is disconnected a new connection attempt
> >     will be made after 3 seconds.
> > 
> > In both cases vhost-user negotiation will resume when the new connection
> > is established.
> > 
> > Stefan
> 
> I've been thinking about the issues, and it looks vhost-pci outperforms in
> this aspect.
> Vhost-pci is like using a mail box. messages are just dropped into the box,
> and whenever vhost-pci pmd gets booted, it can always get the messages from
> the box, the negotiation between vhost-pci pmd and virtio-net is
> asynchronous.
> Virtio-vhost-user is like a phone call, which is a synchronous
> communication. If one side is absent, then the other side will hang on
> without knowing when it could get connected or hang up with messages not
> passed (lost).
> 
> I also think the above solutions won't help. Please see below:
> 
> Background:
> The vhost-user negotiation is split into 2 phases currently. The 1st phase
> happens when the connection is established, and we can find what's done in
> the 1st phase in vhost_user_init(). The 2nd phase happens when the master
> driver is loaded (e.g. run of virtio-net pmd) and set status to the device,
> and we can find what's done in the 2nd phase in vhost_dev_start(), which
> includes sending the memory info and virtqueue info. The socket is
> connected, till one of the QEMU devices exits, so pmd exiting won't end the
> QEMU side socket connection.
>
> Issues:
> Suppose we have both the vhost and virtio-net set up, and vhost pmd <->
> virtio-net pmd communication works well. Now, vhost pmd exits (virtio-net
> pmd is still there). Some time later, we re-run vhost pmd, the vhost pmd
> doesn't know the virtqueue addresses of the virtio-net pmd, unless the
> virtio-net pmd reloads to start the 2nd phase of the vhost-user protocol. So
> the second run of the vhost pmd won't work.

This isn't a problem for virtio-vhost-user since the vhost-pmd resets
the virtio-vhost-user device when it restarts.  The vhost-user AF_UNIX
socket reconnects and negotiation restarts.

Stefan
Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Wang, Wei W 6 years, 2 months ago
On Friday, February 2, 2018 11:26 PM, Stefan Hajnoczi wrote:
> On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
> > Background:
> > The vhost-user negotiation is split into 2 phases currently. The 1st
> > phase happens when the connection is established, and we can find
> > what's done in the 1st phase in vhost_user_init(). The 2nd phase
> > happens when the master driver is loaded (e.g. run of virtio-net pmd)
> > and set status to the device, and we can find what's done in the 2nd
> > phase in vhost_dev_start(), which includes sending the memory info and
> > virtqueue info. The socket is connected, till one of the QEMU devices
> > exits, so pmd exiting won't end the QEMU side socket connection.
> >
> > Issues:
> > Suppose we have both the vhost and virtio-net set up, and vhost pmd
> > <-> virtio-net pmd communication works well. Now, vhost pmd exits
> > (virtio-net pmd is still there). Some time later, we re-run vhost pmd,
> > the vhost pmd doesn't know the virtqueue addresses of the virtio-net
> > pmd, unless the virtio-net pmd reloads to start the 2nd phase of the
> > vhost-user protocol. So the second run of the vhost pmd won't work.
> 
> This isn't a problem for virtio-vhost-user since the vhost-pmd resets the
> virtio-vhost-user device when it restarts.  The vhost-user AF_UNIX socket
> reconnects and negotiation restarts.

I'm not sure if you've agreed that vhost-user negotiation is split into two phases as described above. If not, it's also not difficult to check, thanks to the RTE_LOG call in vhost_user_msg_handler():
RTE_LOG(INFO, VHOST_CONFIG, "read message %s\n",
                        vhost_message_str[msg.request.master]);
It tells us what messages are received and when they are received. 

Before trying the virtio-vhost-user setup, please make sure the virtio-net side VM doesn't have the virtio-net kernel driver loaded (blacklist the module or disable it in .config).
VM1: the VM with virtio-vhost-user
VM2: the VM with virtio-net

1) After we boot VM1 and run the virtio-vhost-user pmd, we boot VM2, and we will see the following log in VM1:
VHOST_CONFIG: new device, handle is 0
VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
VHOST_CONFIG: read message VHOST_USER_GET_PROTOCOL_FEATURES
VHOST_CONFIG: read message VHOST_USER_SET_PROTOCOL_FEATURES
VHOST_CONFIG: read message VHOST_USER_GET_QUEUE_NUM
VHOST_CONFIG: read message VHOST_USER_SET_OWNER
VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
VHOST_CONFIG: vring call idx:0 file:-1
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
VHOST_CONFIG: vring call idx:1 file:-1

Those messages are what I called phase 1 negotiation. They are exchanged when the AF_UNIX socket connects.

2) Then, when we load the virtio-net pmd in VM2, we will see phase 2 messages showing up in VM1, like:
VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
...
Those messages are sent to virtio-vhost-user only when VM2's virtio-net pmd loads.

AF_UNIX socket reconnection only triggers phase 1 negotiation. If the virtio-net pmd in VM2 doesn't reload, the virtio-vhost-user pmd won't get the above phase 2 messages. Do you agree this is an issue?

Best,
Wei 

Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device
Posted by Stefan Hajnoczi 6 years, 2 months ago
On Mon, Feb 05, 2018 at 09:57:23AM +0000, Wang, Wei W wrote:
> On Friday, February 2, 2018 11:26 PM, Stefan Hajnoczi wrote:
> > On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
> > > Background:
> > > The vhost-user negotiation is split into 2 phases currently. The 1st
> > > phase happens when the connection is established, and we can find
> > > what's done in the 1st phase in vhost_user_init(). The 2nd phase
> > > happens when the master driver is loaded (e.g. run of virtio-net pmd)
> > > and set status to the device, and we can find what's done in the 2nd
> > > phase in vhost_dev_start(), which includes sending the memory info and
> > > virtqueue info. The socket is connected, till one of the QEMU devices
> > > exits, so pmd exiting won't end the QEMU side socket connection.
> > >
> > > Issues:
> > > Suppose we have both the vhost and virtio-net set up, and vhost pmd
> > > <-> virtio-net pmd communication works well. Now, vhost pmd exits
> > > (virtio-net pmd is still there). Some time later, we re-run vhost pmd,
> > > the vhost pmd doesn't know the virtqueue addresses of the virtio-net
> > > pmd, unless the virtio-net pmd reloads to start the 2nd phase of the
> > > vhost-user protocol. So the second run of the vhost pmd won't work.
> > 
> > This isn't a problem for virtio-vhost-user since the vhost-pmd resets the
> > virtio-vhost-user device when it restarts.  The vhost-user AF_UNIX socket
> > reconnects and negotiation restarts.
> 
> I'm not sure if you've agreed that vhost-user negotiation is split into two phases as described above. If not, it's also not difficult to check, thanks to the RTE_LOG put at the vhost_user_msg_handler:
> RTE_LOG(INFO, VHOST_CONFIG, "read message %s\n",
>                         vhost_message_str[msg.request.master]);
> It tells us what messages are received and when they are received. 
> 
> Before trying the virtio-vhost-user setup, please make sure the virtio-net side VM doesn't have the virtio-net kernel driver loaded (blacklist the module or disable it in .config).
> VM1: the VM with virtio-vhost-user
> VM2: the VM with virtio-net
> 
> 1) After we boot VM1 and the virtio-vhost-user pmd, we boot VM2, and we will see the following log in VM1:
> VHOST_CONFIG: new device, handle is 0
> VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
> VHOST_CONFIG: read message VHOST_USER_GET_PROTOCOL_FEATURES
> VHOST_CONFIG: read message VHOST_USER_SET_PROTOCOL_FEATURES
> VHOST_CONFIG: read message VHOST_USER_GET_QUEUE_NUM
> VHOST_CONFIG: read message VHOST_USER_SET_OWNER
> VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
> VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
> VHOST_CONFIG: vring call idx:0 file:-1
> VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
> VHOST_CONFIG: vring call idx:1 file:-1
> 
> Those messages are what I called phase1 negotiation. They are negotiated when the AF_UNIX socket connects.
> 
> 2) Then in VM2 we load the virtio-net pmd, we will see phase2 messages showing up in VM1, like:
> VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
> VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
> VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
> ...
> Those messages are sent to virtio-vhost-user only when the VM2's virtio-net pmd loads.
> 
> AF_UNIX socket reconnection only triggers phase1 negotiation. If virtio-net pmd in VM2 doesn't reload, virtio-vhost-user pmd won't get the above phase2 messages. Do you agree with the issue?

I'll reply on the subthread with Michael where you posted a similar
response.

Stefan