[Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation

Marcel Apfelbaum posted 5 patches 7 years, 10 months ago
[Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Posted by Marcel Apfelbaum 7 years, 10 months ago
RFC -> V2:
 - Full implementation of the pvrdma device
 - Backend is an ibdevice interface, no need for the KDBR module


General description
===================
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver AS IS; no special guest
modifications are needed.

While it is compatible with the VMware device, it can also communicate with
bare-metal RDMA-enabled machines and does not require an RDMA HCA in the host:
it can work with Soft-RoCE (rxe).

It does not require the whole of guest RAM to be pinned, which allows memory
over-commit, and, although not implemented yet, migration support will be
possible with some HW assistance.


 Design
 ======
 - Follows the behavior of VMware's pvrdma device, however it is not tightly
   coupled with it and most of the code can be reused if we decide to
   move on to a virtio-based RDMA device.

 - It exposes 3 BARs:
    BAR 0 - MSIX, utilizes 3 vectors for the command ring, async events and
            completions
    BAR 1 - Configuration of registers
    BAR 2 - UAR, used to pass HW commands from the driver.

 - The device performs internal management of the RDMA
   resources (PDs, CQs, QPs, ...), meaning the objects
   are not directly coupled to a physical RDMA device's resources.

The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SR-IOV function (VF/PF).
Note that an ibdevice interface can't be shared between pvrdma devices,
each one requiring a separate instance (rxe or SR-IOV VF).
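
As an illustration of the backend-selection step, here is a minimal
libibverbs sketch (not the actual pvrdma_backend.c code; the lookup-by-name
helper is just an example) that enumerates the host ibdevices and opens the
one configured for a pvrdma instance:

    #include <string.h>
    #include <infiniband/verbs.h>

    /* Open the host ibdevice (e.g. an rxe instance or an SR-IOV VF)
     * whose name matches the one given to the pvrdma device. */
    static struct ibv_context *open_backend(const char *wanted)
    {
        int i, num;
        struct ibv_context *ctx = NULL;
        struct ibv_device **list = ibv_get_device_list(&num);

        if (!list) {
            return NULL;
        }
        for (i = 0; i < num; i++) {
            if (!strcmp(ibv_get_device_name(list[i]), wanted)) {
                ctx = ibv_open_device(list[i]);
                break;
            }
        }
        ibv_free_device_list(list);
        return ctx; /* NULL if the requested ibdevice was not found */
    }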


Tests and performance
=====================
Tested with a Soft-RoCE (rxe) backend and with Mellanox ConnectX3
and ConnectX4 HCAs, covering:
  - VMs in the same host
  - VMs in different hosts
  - VMs to bare metal.

The best performance was achieved with ConnectX HCAs and buffer sizes
larger than 1MB, reaching line rate (~50Gb/s).
The conclusion is that with the PVRDMA device there is no real
performance penalty compared to bare metal for large enough
buffers (which is quite common when using RDMA), while still allowing
memory overcommit.

Marcel Apfelbaum (3):
  mem: add share parameter to memory-backend-ram
  docs: add pvrdma device documentation.
  MAINTAINERS: add entry for hw/net/pvrdma

Yuval Shaia (2):
  pci/shpc: Move function to generic header file
  pvrdma: initial implementation

 MAINTAINERS                         |   7 +
 Makefile.objs                       |   1 +
 backends/hostmem-file.c             |  25 +-
 backends/hostmem-ram.c              |   4 +-
 backends/hostmem.c                  |  21 +
 configure                           |   9 +-
 default-configs/arm-softmmu.mak     |   2 +
 default-configs/i386-softmmu.mak    |   1 +
 default-configs/x86_64-softmmu.mak  |   1 +
 docs/pvrdma.txt                     | 145 ++++++
 exec.c                              |  26 +-
 hw/net/Makefile.objs                |   7 +
 hw/net/pvrdma/pvrdma.h              | 179 +++++++
 hw/net/pvrdma/pvrdma_backend.c      | 986 ++++++++++++++++++++++++++++++++++++
 hw/net/pvrdma/pvrdma_backend.h      |  74 +++
 hw/net/pvrdma/pvrdma_backend_defs.h |  68 +++
 hw/net/pvrdma/pvrdma_cmd.c          | 338 ++++++++++++
 hw/net/pvrdma/pvrdma_defs.h         | 121 +++++
 hw/net/pvrdma/pvrdma_dev_api.h      | 580 +++++++++++++++++++++
 hw/net/pvrdma/pvrdma_dev_ring.c     | 138 +++++
 hw/net/pvrdma/pvrdma_dev_ring.h     |  42 ++
 hw/net/pvrdma/pvrdma_ib_verbs.h     | 399 +++++++++++++++
 hw/net/pvrdma/pvrdma_main.c         | 664 ++++++++++++++++++++++++
 hw/net/pvrdma/pvrdma_qp_ops.c       | 187 +++++++
 hw/net/pvrdma/pvrdma_qp_ops.h       |  26 +
 hw/net/pvrdma/pvrdma_ring.h         | 134 +++++
 hw/net/pvrdma/pvrdma_rm.c           | 791 +++++++++++++++++++++++++++++
 hw/net/pvrdma/pvrdma_rm.h           |  54 ++
 hw/net/pvrdma/pvrdma_rm_defs.h      | 111 ++++
 hw/net/pvrdma/pvrdma_types.h        |  37 ++
 hw/net/pvrdma/pvrdma_utils.c        | 133 +++++
 hw/net/pvrdma/pvrdma_utils.h        |  41 ++
 hw/net/pvrdma/trace-events          |   9 +
 hw/pci/shpc.c                       |  11 +-
 include/exec/memory.h               |  23 +
 include/exec/ram_addr.h             |   3 +-
 include/hw/pci/pci_ids.h            |   3 +
 include/qemu/cutils.h               |  10 +
 include/qemu/osdep.h                |   2 +-
 include/sysemu/hostmem.h            |   2 +-
 include/sysemu/kvm.h                |   2 +-
 memory.c                            |  16 +-
 util/oslib-posix.c                  |   4 +-
 util/oslib-win32.c                  |   2 +-
 44 files changed, 5378 insertions(+), 61 deletions(-)
 create mode 100644 docs/pvrdma.txt
 create mode 100644 hw/net/pvrdma/pvrdma.h
 create mode 100644 hw/net/pvrdma/pvrdma_backend.c
 create mode 100644 hw/net/pvrdma/pvrdma_backend.h
 create mode 100644 hw/net/pvrdma/pvrdma_backend_defs.h
 create mode 100644 hw/net/pvrdma/pvrdma_cmd.c
 create mode 100644 hw/net/pvrdma/pvrdma_defs.h
 create mode 100644 hw/net/pvrdma/pvrdma_dev_api.h
 create mode 100644 hw/net/pvrdma/pvrdma_dev_ring.c
 create mode 100644 hw/net/pvrdma/pvrdma_dev_ring.h
 create mode 100644 hw/net/pvrdma/pvrdma_ib_verbs.h
 create mode 100644 hw/net/pvrdma/pvrdma_main.c
 create mode 100644 hw/net/pvrdma/pvrdma_qp_ops.c
 create mode 100644 hw/net/pvrdma/pvrdma_qp_ops.h
 create mode 100644 hw/net/pvrdma/pvrdma_ring.h
 create mode 100644 hw/net/pvrdma/pvrdma_rm.c
 create mode 100644 hw/net/pvrdma/pvrdma_rm.h
 create mode 100644 hw/net/pvrdma/pvrdma_rm_defs.h
 create mode 100644 hw/net/pvrdma/pvrdma_types.h
 create mode 100644 hw/net/pvrdma/pvrdma_utils.c
 create mode 100644 hw/net/pvrdma/pvrdma_utils.h
 create mode 100644 hw/net/pvrdma/trace-events

-- 
2.13.5


Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Posted by Michael S. Tsirkin 7 years, 10 months ago
On Sun, Dec 17, 2017 at 02:54:52PM +0200, Marcel Apfelbaum wrote:
> RFC -> V2:
>  - Full implementation of the pvrdma device
>  - Backend is an ibdevice interface, no need for the KDBR module
> 
> General description
> ===================
> PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
> It works with its Linux Kernel driver AS IS, no need for any special guest
> modifications.
> 
> While it complies with the VMware device, it can also communicate with bare
> metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
> can work with Soft-RoCE (rxe).
> 
> It does not require the whole guest RAM to be pinned

What happens if guest attempts to register all its memory?

> allowing memory
> over-commit
> and, even if not implemented yet, migration support will be
> possible with some HW assistance.

What does "HW assistance" mean here?
Can it work with any existing hardware?

> 
>  Design
>  ======
>  - Follows the behavior of VMware's pvrdma device, however is not tightly
>    coupled with it

Everything seems to be in pvrdma. Since it's not coupled, could you
split code to pvrdma specific and generic parts?

> and most of the code can be reused if we decide to
>    continue to a Virtio based RDMA device.

I suspect that without virtio we won't be able to do any future
extensions.

>  - It exposes 3 BARs:
>     BAR 0 - MSIX, utilize 3 vectors for command ring, async events and
>             completions
>     BAR 1 - Configuration of registers

What does this mean?

>     BAR 2 - UAR, used to pass HW commands from driver.

A detailed description of above belongs in documentation.

>  - The device performs internal management of the RDMA
>    resources (PDs, CQs, QPs, ...), meaning the objects
>    are not directly coupled to a physical RDMA device resources.

I am wondering how do you make connections? QP#s are exposed on
the wire during connection management.

> The pvrdma backend is an ibdevice interface that can be exposed
> either by a Soft-RoCE(rxe) device on machines with no RDMA device,
> or an HCA SRIOV function(VF/PF).
> Note that ibdevice interfaces can't be shared between pvrdma devices,
> each one requiring a separate instance (rxe or SRIOV VF).

So what's the advantage of this over pass-through then?


> 
> Tests and performance
> =====================
> Tested with SoftRoCE backend (rxe)/Mellanox ConnectX3,
> and Mellanox ConnectX4 HCAs with:
>   - VMs in the same host
>   - VMs in different hosts 
>   - VMs to bare metal.
> 
> The best performance achieved with ConnectX HCAs and buffer size
> bigger than 1MB which was the line rate ~ 50Gb/s.
> The conclusion is that using the PVRDMA device there are no
> actual performance penalties compared to bare metal for big enough
> buffers (which is quite common when using RDMA), while allowing
> memory overcommit.
> 
> Marcel Apfelbaum (3):
>   mem: add share parameter to memory-backend-ram
>   docs: add pvrdma device documentation.
>   MAINTAINERS: add entry for hw/net/pvrdma
> 
> Yuval Shaia (2):
>   pci/shpc: Move function to generic header file
>   pvrdma: initial implementation
> 
> [...]

Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Posted by Marcel Apfelbaum 7 years, 10 months ago
On 19/12/2017 20:05, Michael S. Tsirkin wrote:
> On Sun, Dec 17, 2017 at 02:54:52PM +0200, Marcel Apfelbaum wrote:
>> RFC -> V2:
>>   - Full implementation of the pvrdma device
>>   - Backend is an ibdevice interface, no need for the KDBR module
>>
>> General description
>> ===================
>> PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
>> It works with its Linux Kernel driver AS IS, no need for any special guest
>> modifications.
>>
>> While it complies with the VMware device, it can also communicate with bare
>> metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
>> can work with Soft-RoCE (rxe).
>>
>> It does not require the whole guest RAM to be pinned
> 

Hi Michael,

> What happens if guest attempts to register all its memory?
> 

Then we lose; it is no different from bare metal, reg_mr will pin all the RAM.
However this is only one scenario, and hopefully not one used much
with RoCE. (I know IPoIB does that, but it doesn't make sense to use it with RoCE.)
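
To make the pinning point concrete, here is a minimal libibverbs sketch
(illustration only, not the pvrdma code; it assumes an already-opened
backend context) showing that registering a region keeps its host pages
pinned for as long as the MR exists:

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* Register 'len' bytes starting at 'buf' on the backend context.
     * ibv_reg_mr() pins the underlying host pages until ibv_dereg_mr()
     * is called, which is why a guest that registers all of its RAM
     * defeats memory over-commit. */
    static struct ibv_mr *pin_region(struct ibv_context *ctx,
                                     void *buf, size_t len)
    {
        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        struct ibv_mr *mr;

        if (!pd) {
            return NULL;
        }
        mr = ibv_reg_mr(pd, buf, len,
                        IBV_ACCESS_LOCAL_WRITE |
                        IBV_ACCESS_REMOTE_READ |
                        IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            ibv_dealloc_pd(pd);
        }
        return mr;
    }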

>> allowing memory
>> over-commit
>> and, even if not implemented yet, migration support will be
>> possible with some HW assistance.
> 
> What does "HW assistance" mean here?

Several things:
1. We need to be able to pass resource numbers when we create
them on the destination machine.
2. We also need a way to stall the previous connections while starting the new ones.
3. Last, we need the HW to pass resource states.

> Can it work with any existing hardware?
> 

Sadly no. However, we talked with Mellanox at last year's
Plumbers Conference and all of the above is on their plans.
We hope this submission will help, since now we will have
a fast way to test and use it.

For the Soft-RoCE backend it is doable, but it is best to wait first to
see how HCAs are going to expose the changes.

>>
>>   Design
>>   ======
>>   - Follows the behavior of VMware's pvrdma device, however is not tightly
>>     coupled with it
> 
> Everything seems to be in pvrdma. Since it's not coupled, could you
> split code to pvrdma specific and generic parts?
> 
>> and most of the code can be reused if we decide to
>>     continue to a Virtio based RDMA device.
> 
> I suspect that without virtio we won't be able to do any future
> extensions.
> 

While I do agree it is harder to work with a 3rd-party spec, their
Linux driver is open source and we may be able to make sane
modifications.

>>   - It exposes 3 BARs:
>>      BAR 0 - MSIX, utilize 3 vectors for command ring, async events and
>>              completions
>>      BAR 1 - Configuration of registers

[...]

>> The pvrdma backend is an ibdevice interface that can be exposed
>> either by a Soft-RoCE(rxe) device on machines with no RDMA device,
>> or an HCA SRIOV function(VF/PF).
>> Note that ibdevice interfaces can't be shared between pvrdma devices,
>> each one requiring a separate instance (rxe or SRIOV VF).
> 
> So what's the advantage of this over pass-through then?
> 

1. We can also work with the same ibdevice for multiple pvrdma
devices using multiple GIDs; it works (tested).
The problem begins when we think about migration: the way
HCAs work today is one resource namespace per ibdevice,
not per GID. I emphasize that this can be changed, however
we don't have a timeline for it.

2. We do have advantages:
- Guest-agnostic device (we can change the host HCA)
- Memory over-commit (unless the guest registers all its memory)
- Future migration support
- A friendly migration path for VMware RDMA guests to QEMU.

3. In case live migration is not a must we can
    use multiple GIDs of the same port, so we do not
    depend on SR-IOV (see the GID sketch after this list).

4. We support the Soft-RoCE backend, so people can test their
    software in a guest without RDMA hw.
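
For reference, the GIDs available on a port can be enumerated with plain
libibverbs; a minimal sketch (illustration only, assuming an already-opened
context and port 1):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Dump the GID table of port 1; each populated entry is a GID that a
     * pvrdma instance could use without needing a separate ibdevice.
     * Unused entries print as all zeros. */
    static void list_gids(struct ibv_context *ctx)
    {
        int i, j;
        struct ibv_port_attr attr;
        union ibv_gid gid;

        if (ibv_query_port(ctx, 1, &attr)) {
            return;
        }
        for (i = 0; i < attr.gid_tbl_len; i++) {
            if (ibv_query_gid(ctx, 1, i, &gid) == 0) {
                printf("gid[%d]: ", i);
                for (j = 0; j < 16; j++) {
                    printf("%02x", gid.raw[j]);
                }
                printf("\n");
            }
        }
    }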


Thanks,
Marcel

> 
>>
>> Tests and performance
>> =====================
>> Tested with SoftRoCE backend (rxe)/Mellanox ConnectX3,
>> and Mellanox ConnectX4 HCAs with:
>>    - VMs in the same host
>>    - VMs in different hosts
>>    - VMs to bare metal.
>>
>> The best performance achieved with ConnectX HCAs and buffer size
>> bigger than 1MB which was the line rate ~ 50Gb/s.
>> The conclusion is that using the PVRDMA device there are no
>> actual performance penalties compared to bare metal for big enough
>> buffers (which is quite common when using RDMA), while allowing
>> memory overcommit.
>>
>> Marcel Apfelbaum (3):
>>    mem: add share parameter to memory-backend-ram
>>    docs: add pvrdma device documentation.
>>    MAINTAINERS: add entry for hw/net/pvrdma
>>
>> Yuval Shaia (2):
>>    pci/shpc: Move function to generic header file
>>    pvrdma: initial implementation
>>

[...]

Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Posted by Michael S. Tsirkin 7 years, 10 months ago
On Wed, Dec 20, 2017 at 05:07:38PM +0200, Marcel Apfelbaum wrote:
> On 19/12/2017 20:05, Michael S. Tsirkin wrote:
> > On Sun, Dec 17, 2017 at 02:54:52PM +0200, Marcel Apfelbaum wrote:
> > > RFC -> V2:
> > >   - Full implementation of the pvrdma device
> > >   - Backend is an ibdevice interface, no need for the KDBR module
> > > 
> > > General description
> > > ===================
> > > PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
> > > It works with its Linux Kernel driver AS IS, no need for any special guest
> > > modifications.
> > > 
> > > While it complies with the VMware device, it can also communicate with bare
> > > metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
> > > can work with Soft-RoCE (rxe).
> > > 
> > > It does not require the whole guest RAM to be pinned
> > 
> 
> Hi Michael,
> 
> > What happens if guest attempts to register all its memory?
> > 
> 
> Then we loose, is not different from bare metal, reg_mr will pin all the RAM.

We need to find a way to communicate to guests the amount
of memory they can pin.

> However this is only one scenario, and hopefully not much used
> for RoCE. (I know IPoIB does that, but it doesn't make sense to use it with RoCE).

SRP does it too AFAIK.

> > > allowing memory
> > > over-commit
> > > and, even if not implemented yet, migration support will be
> > > possible with some HW assistance.
> > 
> > What does "HW assistance" mean here?
> 
> Several things:
> 1. We need to be able to pass resource numbers when we create
> them on the destination machine.

These resources are mostly managed by software.

> 2. We also need a way to stall prev connections while starting the new ones.

Look at what hardware can do.

> 3. Last, we need the HW to pass resources states.

Look at the spec, some of this can be done.

> > Can it work with any existing hardware?
> > 
> 
> Sadly no,

Above can be done. What's needed is host kernel work to support it.

> however we talked with Mellanox at the last year
> Plumbers Conference and all the above are on their plans.
> We hope this submission will help, since now we will have
> a fast way to test and use it.

I'm doubtful it'll help.

> For Soft-RoCE backend is doable, but is best to wait first to
> see how HCAs are going to expose the changes.
> 
> > > 
> > >   Design
> > >   ======
> > >   - Follows the behavior of VMware's pvrdma device, however is not tightly
> > >     coupled with it
> > 
> > Everything seems to be in pvrdma. Since it's not coupled, could you
> > split code to pvrdma specific and generic parts?
> > 
> > > and most of the code can be reused if we decide to
> > >     continue to a Virtio based RDMA device.
> > 
> > I suspect that without virtio we won't be able to do any future
> > extensions.
> > 
> 
> While I do agree is harder to work with a 3rd party spec, their
> Linux driver is open source and we may be able to do sane
> modifications.

I am sceptical. The ARM guys did not want to add a single bit to their IOMMU
spec. You want an open spec that everyone can contribute to.

> > >   - It exposes 3 BARs:
> > >      BAR 0 - MSIX, utilize 3 vectors for command ring, async events and
> > >              completions
> > >      BAR 1 - Configuration of registers
> 
> [...]
> 
> > > The pvrdma backend is an ibdevice interface that can be exposed
> > > either by a Soft-RoCE(rxe) device on machines with no RDMA device,
> > > or an HCA SRIOV function(VF/PF).
> > > Note that ibdevice interfaces can't be shared between pvrdma devices,
> > > each one requiring a separate instance (rxe or SRIOV VF).
> > 
> > So what's the advantage of this over pass-through then?
> > 
> 
> 1. We can work also with the same ibdevice for multiple pvrdma
> devices using multiple GIDs; it works (tested).
> The problem begins when we think about migration, the way
> HCAs work today is one resource namespace per ibdevice,
> not per GID. I emphasize that this can be changed,  however
> we don't have a timeline for it.
> 
> 2. We do have advantages:
> - Guest agnostic device (we can change host HCA)
> - Memory over commit (unless the guest registers all the memory)

Not just all of it. You trust the guest and this is a problem.  If you do try to
overcommit, at any point the guest can try to register too much and the host
will stall.

> - Future migration support

So there are lots of difficult problems to solve for this.  E.g. any MR
that is hardware-writeable can be changed and the hypervisor won't know. All
of this may be solvable, but it might be solvable with passthrough
too.

> - A friendly migration of RDMA VMWare guests to QEMU.

Why do we need to emulate their device for this?  A reboot is required
anyway, so you can switch to passthrough easily.

> 3. In case when live migration is not a must we can
>    use multiple GIDs of the same port, so we do not
>    depend on SRIOV.
> 
> 4. We support Soft RoCE backend, people can test their
>    software on guest without RDMA hw.
> 
> 
> Thanks,
> Marcel

These two are nice, if very niche, features.

> > 
> > > 
> > > Tests and performance
> > > =====================
> > > Tested with SoftRoCE backend (rxe)/Mellanox ConnectX3,
> > > and Mellanox ConnectX4 HCAs with:
> > >    - VMs in the same host
> > >    - VMs in different hosts
> > >    - VMs to bare metal.
> > > 
> > > The best performance achieved with ConnectX HCAs and buffer size
> > > bigger than 1MB which was the line rate ~ 50Gb/s.
> > > The conclusion is that using the PVRDMA device there are no
> > > actual performance penalties compared to bare metal for big enough
> > > buffers (which is quite common when using RDMA), while allowing
> > > memory overcommit.
> > > 
> > > Marcel Apfelbaum (3):
> > >    mem: add share parameter to memory-backend-ram
> > >    docs: add pvrdma device documentation.
> > >    MAINTAINERS: add entry for hw/net/pvrdma
> > > 
> > > Yuval Shaia (2):
> > >    pci/shpc: Move function to generic header file
> > >    pvrdma: initial implementation
> > > 
> 
> [...]

Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Posted by Yuval Shaia 7 years, 10 months ago
> > 
> > > What happens if guest attempts to register all its memory?
> > > 
> > 
> > Then we loose, is not different from bare metal, reg_mr will pin all the RAM.
> 
> We need to find a way to communicate to guests about amount
> of memory they can pin.

dev_caps.max_mr_size is the way the device limits the guest driver.
This value is controlled by the command-line argument dev-caps-max-mr-size,
so we should be fine (btw, the default value is 1<<32).

> 
> > However this is only one scenario, and hopefully not much used
> > for RoCE. (I know IPoIB does that, but it doesn't make sense to use it with RoCE).
> 
> SRP does it too AFAIK.
> 

Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Posted by Michael S. Tsirkin 7 years, 10 months ago
On Thu, Dec 21, 2017 at 09:27:51AM +0200, Yuval Shaia wrote:
> > > 
> > > > What happens if guest attempts to register all its memory?
> > > > 
> > > 
> > > Then we loose, is not different from bare metal, reg_mr will pin all the RAM.
> > 
> > We need to find a way to communicate to guests about amount
> > of memory they can pin.
> 
> dev_caps.max_mr_size is the way device limits guest driver.
> This value is controlled by the command line argument dev-caps-max-mr-size
> so we should be fine (btw, default value is 1<<32).

Isn't that still leaving the option for the guest to register all memory,
just in chunks?

> > 
> > > However this is only one scenario, and hopefully not much used
> > > for RoCE. (I know IPoIB does that, but it doesn't make sense to use it with RoCE).
> > 
> > SRP does it too AFAIK.
> > 

Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Posted by Marcel Apfelbaum 7 years, 10 months ago
On 21/12/2017 16:22, Michael S. Tsirkin wrote:
> On Thu, Dec 21, 2017 at 09:27:51AM +0200, Yuval Shaia wrote:
>>>>
>>>>> What happens if guest attempts to register all its memory?
>>>>>
>>>>
>>>> Then we loose, is not different from bare metal, reg_mr will pin all the RAM.
>>>
>>> We need to find a way to communicate to guests about amount
>>> of memory they can pin.
>>
>> dev_caps.max_mr_size is the way device limits guest driver.
>> This value is controlled by the command line argument dev-caps-max-mr-size
>> so we should be fine (btw, default value is 1<<32).
> 
> Isn't that still leaving the option for guest to register all memory,
> just in chunks?
> 

We also have a parameter limiting the number of MRs (dev-caps-max-mr);
together with dev-caps-max-mr-size it lets us limit the memory the guests can pin.
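
(For reference, these two knobs are ordinary qdev properties; a minimal
sketch, with a made-up device struct and using the defaults mentioned in
this thread, of how such capability properties are typically declared in
QEMU:)

    #include "qemu/osdep.h"
    #include "hw/pci/pci.h"
    #include "hw/qdev-properties.h"

    typedef struct DemoRdmaDev {
        PCIDevice parent_obj;
        uint64_t max_mr_size;   /* advertised to the guest in dev_caps */
        uint32_t max_mr;        /* maximum number of MRs               */
    } DemoRdmaDev;

    static Property demo_props[] = {
        DEFINE_PROP_UINT64("dev-caps-max-mr-size", DemoRdmaDev,
                           max_mr_size, 1ULL << 32),
        DEFINE_PROP_UINT32("dev-caps-max-mr", DemoRdmaDev, max_mr, 2048),
        DEFINE_PROP_END_OF_LIST(),
    };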

Thanks,
Marcel

>>>
>>>> However this is only one scenario, and hopefully not much used
>>>> for RoCE. (I know IPoIB does that, but it doesn't make sense to use it with RoCE).
>>>
>>> SRP does it too AFAIK.
>>>


Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Posted by Michael S. Tsirkin 7 years, 10 months ago
On Thu, Dec 21, 2017 at 05:59:38PM +0200, Marcel Apfelbaum wrote:
> On 21/12/2017 16:22, Michael S. Tsirkin wrote:
> > On Thu, Dec 21, 2017 at 09:27:51AM +0200, Yuval Shaia wrote:
> > > > > 
> > > > > > What happens if guest attempts to register all its memory?
> > > > > > 
> > > > > 
> > > > > Then we loose, is not different from bare metal, reg_mr will pin all the RAM.
> > > > 
> > > > We need to find a way to communicate to guests about amount
> > > > of memory they can pin.
> > > 
> > > dev_caps.max_mr_size is the way device limits guest driver.
> > > This value is controlled by the command line argument dev-caps-max-mr-size
> > > so we should be fine (btw, default value is 1<<32).
> > 
> > Isn't that still leaving the option for guest to register all memory,
> > just in chunks?
> > 
> 
> We also have a parameter limiting the number of mrs (dev-caps-max-mr),
> together with dev-caps-max-mr-size we can limit the memory the guests can pin.
> 
> Thanks,
> Marcel

You might want to limit the default values then.

Right now:

+#define MAX_MR_SIZE           (1UL << 32)
+#define MAX_MR                2048

Which is IIUC 8TB.

That's pretty close to unlimited, and so far overcommit seems to be the
main feature for users.
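
(For reference, the worst-case pinnable memory implied by those defaults
works out as:

    MAX_MR * MAX_MR_SIZE = 2048 * (1UL << 32) bytes
                         = 2^11 * 2^32 bytes
                         = 2^43 bytes
                         = 8 TiB

which matches the 8TB figure above.)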


> > > > 
> > > > > However this is only one scenario, and hopefully not much used
> > > > > for RoCE. (I know IPoIB does that, but it doesn't make sense to use it with RoCE).
> > > > 
> > > > SRP does it too AFAIK.
> > > > 

Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Posted by Yuval Shaia 7 years, 10 months ago
On Thu, Dec 21, 2017 at 10:46:35PM +0200, Michael S. Tsirkin wrote:
> On Thu, Dec 21, 2017 at 05:59:38PM +0200, Marcel Apfelbaum wrote:
> > On 21/12/2017 16:22, Michael S. Tsirkin wrote:
> > > On Thu, Dec 21, 2017 at 09:27:51AM +0200, Yuval Shaia wrote:
> > > > > > 
> > > > > > > What happens if guest attempts to register all its memory?
> > > > > > > 
> > > > > > 
> > > > > > Then we loose, is not different from bare metal, reg_mr will pin all the RAM.
> > > > > 
> > > > > We need to find a way to communicate to guests about amount
> > > > > of memory they can pin.
> > > > 
> > > > dev_caps.max_mr_size is the way device limits guest driver.
> > > > This value is controlled by the command line argument dev-caps-max-mr-size
> > > > so we should be fine (btw, default value is 1<<32).
> > > 
> > > Isn't that still leaving the option for guest to register all memory,
> > > just in chunks?
> > > 
> > 
> > We also have a parameter limiting the number of mrs (dev-caps-max-mr),
> > together with dev-caps-max-mr-size we can limit the memory the guests can pin.
> > 
> > Thanks,
> > Marcel
> 
> You might want to limit the default values then.
> 
> Right now:
> 
> +#define MAX_MR_SIZE           (1UL << 32)
> +#define MAX_MR                2048

Maybe limiting by a constant number is not a good approach; it looks odd if
one guest with 16G RAM and a second with 32G RAM end up with the same settings,
right?
So how about limiting to a specific percentage of total memory?
In that case, what would this percentage be? 100%? 80%?

> 
> Which is IIUC 8TB.
> 
> That's pretty close to unlimited, and so far overcommit seems to be the
> main feature for users.
> 
> 
> > > > > 
> > > > > > However this is only one scenario, and hopefully not much used
> > > > > > for RoCE. (I know IPoIB does that, but it doesn't make sense to use it with RoCE).
> > > > > 
> > > > > SRP does it too AFAIK.
> > > > > 

Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Posted by Marcel Apfelbaum 7 years, 10 months ago
On 22/12/2017 0:30, Yuval Shaia wrote:
> On Thu, Dec 21, 2017 at 10:46:35PM +0200, Michael S. Tsirkin wrote:
>> On Thu, Dec 21, 2017 at 05:59:38PM +0200, Marcel Apfelbaum wrote:
>>> On 21/12/2017 16:22, Michael S. Tsirkin wrote:
>>>> On Thu, Dec 21, 2017 at 09:27:51AM +0200, Yuval Shaia wrote:
>>>>>>>
>>>>>>>> What happens if guest attempts to register all its memory?
>>>>>>>>
>>>>>>>
>>>>>>> Then we loose, is not different from bare metal, reg_mr will pin all the RAM.
>>>>>>
>>>>>> We need to find a way to communicate to guests about amount
>>>>>> of memory they can pin.
>>>>>
>>>>> dev_caps.max_mr_size is the way device limits guest driver.
>>>>> This value is controlled by the command line argument dev-caps-max-mr-size
>>>>> so we should be fine (btw, default value is 1<<32).
>>>>
>>>> Isn't that still leaving the option for guest to register all memory,
>>>> just in chunks?
>>>>
>>>
>>> We also have a parameter limiting the number of mrs (dev-caps-max-mr),
>>> together with dev-caps-max-mr-size we can limit the memory the guests can pin.
>>>
>>> Thanks,
>>> Marcel
>>
>> You might want to limit the default values then.
>>

Hi Yuval,

>> Right now:
>>
>> +#define MAX_MR_SIZE           (1UL << 32)
>> +#define MAX_MR                2048
> 
> Maybe limiting by constant number is not a good approach, it looks odd if
> one guest with 16G ram and second with 32G ram will have the same settings,
> right?
> So how about limiting by a specific percentage of total memory?
> In that case, what would be this percentage? 100%? 80%?
> 

I think that is too complicated. Maybe we can limit the max pinned memory
to 2G, assuming the RDMA guests have a lot of RAM, and let the
users fine-tune the parameters.

Thanks,
Marcel

>>
>> Which is IIUC 8TB.
>>
>> That's pretty close to unlimited, and so far overcommit seems to be the
>> main feature for users.
>>
>>
>>>>>>
>>>>>>> However this is only one scenario, and hopefully not much used
>>>>>>> for RoCE. (I know IPoIB does that, but it doesn't make sense to use it with RoCE).
>>>>>>
>>>>>> SRP does it too AFAIK.
>>>>>>


Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Posted by Yuval Shaia 7 years, 10 months ago
On Tue, Dec 19, 2017 at 08:05:18PM +0200, Michael S. Tsirkin wrote:
> On Sun, Dec 17, 2017 at 02:54:52PM +0200, Marcel Apfelbaum wrote:
> > RFC -> V2:
> >  - Full implementation of the pvrdma device
> >  - Backend is an ibdevice interface, no need for the KDBR module
> > 
> > General description
> > ===================
> > PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
> > It works with its Linux Kernel driver AS IS, no need for any special guest
> > modifications.
> > 
> > While it complies with the VMware device, it can also communicate with bare
> > metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
> > can work with Soft-RoCE (rxe).
> > 
> > It does not require the whole guest RAM to be pinned
> 
> What happens if guest attempts to register all its memory?
> 
> > allowing memory
> > over-commit
> > and, even if not implemented yet, migration support will be
> > possible with some HW assistance.
> 
> What does "HW assistance" mean here?
> Can it work with any existing hardware?
> 
> > 
> >  Design
> >  ======
> >  - Follows the behavior of VMware's pvrdma device, however is not tightly
> >    coupled with it
> 
> Everything seems to be in pvrdma. Since it's not coupled, could you
> split code to pvrdma specific and generic parts?
> 
> > and most of the code can be reused if we decide to
> >    continue to a Virtio based RDMA device.

The current design takes future code reuse with a virtio-rdma device
into account, although we are not sure it is 100% there.

We divided it into four software layers:
- Front-end interface with PCI:
	- pvrdma_main.c
- Front-end interface with pvrdma driver:
	- pvrdma_cmd.c
	- pvrdma_qp_ops.c
	- pvrdma_dev_ring.c
	- pvrdma_utils.c
- Device emulation:
	- pvrdma_rm.c
- Back-end interface:
	- pvrdma_backend.c

So in the future, when we start to work on a virtio-rdma device, we will move
the generic code to a generic directory.

Is there any reason why we would want to split it now, when we have only one device?
> 
> I suspect that without virtio we won't be able to do any future
> extensions.

As I see it these are two different issues: a virtio RDMA device is on our
plate, but the contribution of the VMware pvrdma device to QEMU is no doubt a
real advantage that will allow customers that run ESX to easily move to QEMU.

> 
> >  - It exposes 3 BARs:
> >     BAR 0 - MSIX, utilize 3 vectors for command ring, async events and
> >             completions
> >     BAR 1 - Configuration of registers
> 
> What does this mean?

Device control operations:
	- Setting of interrupt mask.
	- Setup of Device/Driver shared configuration area.
	- Reset device, activate device etc.
	- Device commands such as create QP, create MR etc.
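
As a rough illustration of how such a register BAR is typically wired up in
QEMU (a generic sketch only, with made-up register offsets and struct names,
not the actual pvrdma register layout):

    #include "qemu/osdep.h"
    #include "hw/pci/pci.h"

    /* Hypothetical register offsets, for illustration only. */
    #define REG_INTR_MASK  0x00   /* interrupt mask            */
    #define REG_DSR_LOW    0x08   /* shared config area, low   */
    #define REG_DSR_HIGH   0x0c   /* shared config area, high  */
    #define REG_CTL        0x10   /* reset/activate commands   */

    typedef struct DemoDev {
        PCIDevice parent_obj;
        MemoryRegion regs;
        uint32_t intr_mask;
    } DemoDev;

    static uint64_t regs_read(void *opaque, hwaddr addr, unsigned size)
    {
        DemoDev *d = opaque;

        return addr == REG_INTR_MASK ? d->intr_mask : 0;
    }

    static void regs_write(void *opaque, hwaddr addr, uint64_t val,
                           unsigned size)
    {
        DemoDev *d = opaque;

        switch (addr) {
        case REG_INTR_MASK:
            d->intr_mask = val;   /* mask/unmask interrupts             */
            break;
        case REG_CTL:
            /* reset / activate the device, kick device commands here */
            break;
        default:
            break;
        }
    }

    static const MemoryRegionOps regs_ops = {
        .read = regs_read,
        .write = regs_write,
        .endianness = DEVICE_LITTLE_ENDIAN,
    };

    /* Called from the device's realize function. */
    static void init_regs_bar(DemoDev *d)
    {
        memory_region_init_io(&d->regs, OBJECT(d), &regs_ops, d,
                              "demo-regs", 4096);
        pci_register_bar(&d->parent_obj, 1,
                         PCI_BASE_ADDRESS_SPACE_MEMORY, &d->regs);
    }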

> 
> >     BAR 2 - UAR, used to pass HW commands from driver.
> 
> A detailed description of above belongs in documentation.

Will do.

> 
> >  - The device performs internal management of the RDMA
> >    resources (PDs, CQs, QPs, ...), meaning the objects
> >    are not directly coupled to a physical RDMA device resources.
> 
> I am wondering how do you make connections? QP#s are exposed on
> the wire during connection management.

The QP#s that the guest sees are the QP#s used on the wire.
The meaning of "internal management of the RDMA resources" is that we keep
the context of internal QPs in the device (e.g. rings).

> 
> > The pvrdma backend is an ibdevice interface that can be exposed
> > either by a Soft-RoCE(rxe) device on machines with no RDMA device,
> > or an HCA SRIOV function(VF/PF).
> > Note that ibdevice interfaces can't be shared between pvrdma devices,
> > each one requiring a separate instance (rxe or SRIOV VF).
> 
> So what's the advantage of this over pass-through then?
> 
> 
> > 
> > Tests and performance
> > =====================
> > Tested with SoftRoCE backend (rxe)/Mellanox ConnectX3,
> > and Mellanox ConnectX4 HCAs with:
> >   - VMs in the same host
> >   - VMs in different hosts 
> >   - VMs to bare metal.
> > 
> > The best performance achieved with ConnectX HCAs and buffer size
> > bigger than 1MB which was the line rate ~ 50Gb/s.
> > The conclusion is that using the PVRDMA device there are no
> > actual performance penalties compared to bare metal for big enough
> > buffers (which is quite common when using RDMA), while allowing
> > memory overcommit.
> > 
> > Marcel Apfelbaum (3):
> >   mem: add share parameter to memory-backend-ram
> >   docs: add pvrdma device documentation.
> >   MAINTAINERS: add entry for hw/net/pvrdma
> > 
> > Yuval Shaia (2):
> >   pci/shpc: Move function to generic header file
> >   pvrdma: initial implementation
> > 
> > [...]

Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Posted by Michael S. Tsirkin 7 years, 10 months ago
On Wed, Dec 20, 2017 at 07:56:47PM +0200, Yuval Shaia wrote:
> On Tue, Dec 19, 2017 at 08:05:18PM +0200, Michael S. Tsirkin wrote:
> > On Sun, Dec 17, 2017 at 02:54:52PM +0200, Marcel Apfelbaum wrote:
> > > RFC -> V2:
> > >  - Full implementation of the pvrdma device
> > >  - Backend is an ibdevice interface, no need for the KDBR module
> > > 
> > > General description
> > > ===================
> > > PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
> > > It works with its Linux Kernel driver AS IS, no need for any special guest
> > > modifications.
> > > 
> > > While it complies with the VMware device, it can also communicate with bare
> > > metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
> > > can work with Soft-RoCE (rxe).
> > > 
> > > It does not require the whole guest RAM to be pinned
> > 
> > What happens if guest attempts to register all its memory?
> > 
> > > allowing memory
> > > over-commit
> > > and, even if not implemented yet, migration support will be
> > > possible with some HW assistance.
> > 
> > What does "HW assistance" mean here?
> > Can it work with any existing hardware?
> > 
> > > 
> > >  Design
> > >  ======
> > >  - Follows the behavior of VMware's pvrdma device, however is not tightly
> > >    coupled with it
> > 
> > Everything seems to be in pvrdma. Since it's not coupled, could you
> > split code to pvrdma specific and generic parts?
> > 
> > > and most of the code can be reused if we decide to
> > >    continue to a Virtio based RDMA device.
> 
> The current design takes into account a future code reuse with virtio-rdma
> device although not sure it is 100%.
> 
> We divided it to four software layers:
> - Front-end interface with PCI:
> 	- pvrdma_main.c
> - Front-end interface with pvrdma driver:
> 	- pvrdma_cmd.c
> 	- pvrdma_qp_ops.c
> 	- pvrdma_dev_ring.c
> 	- pvrdma_utils.c
> - Device emulation:
> 	- pvrdma_rm.c
> - Back-end interface:
> 	- pvrdma_backend.c
> 
> So in the future, when starting to work on virtio-rdma device we will move
> the generic code to generic directory.
> 
> Any reason why we want to split it now, when we have only one device?

To make it easier for me to ignore pvrdma stuff and review the generic stuff.

> > 
> > I suspect that without virtio we won't be able to do any future
> > extensions.
> 
> As i see it these are two different issues, virtio RDMA device is on our
> plate but the contribution of VMWare pvrdma device to QEMU is no doubt a
> real advantage that will allow customers that runs ESX to easy move to QEMU.

I don't have anything against it but I'm not really interested in
reviewing it either.

> > 
> > >  - It exposes 3 BARs:
> > >     BAR 0 - MSIX, utilize 3 vectors for command ring, async events and
> > >             completions
> > >     BAR 1 - Configuration of registers
> > 
> > What does this mean?
> 
> Device control operations:
> 	- Setting of interrupt mask.
> 	- Setup of Device/Driver shared configuration area.
> 	- Reset device, activate device etc.
> 	- Device commands such as create QP, create MR etc.
> 
> > 
> > >     BAR 2 - UAR, used to pass HW commands from driver.
> > 
> > A detailed description of above belongs in documentation.
> 
> Will do.
> 
> > 
> > >  - The device performs internal management of the RDMA
> > >    resources (PDs, CQs, QPs, ...), meaning the objects
> > >    are not directly coupled to a physical RDMA device resources.
> > 
> > I am wondering how do you make connections? QP#s are exposed on
> > the wire during connection management.
> 
> QP#s that guest sees are the QP#s that are used on the wire.
> The meaning of "internal management of the RDMA resources" is that we keep
> context of internal QP in device (ex rings).

The point of the question was that you need to parse CM/MAD etc. messages if you
need to change QP#s on the fly, and that code does not seem
to be there. I guess the answer is that
a bunch of this stuff is just broken or non-spec compliant.


> > 
> > > The pvrdma backend is an ibdevice interface that can be exposed
> > > either by a Soft-RoCE(rxe) device on machines with no RDMA device,
> > > or an HCA SRIOV function(VF/PF).
> > > Note that ibdevice interfaces can't be shared between pvrdma devices,
> > > each one requiring a separate instance (rxe or SRIOV VF).
> > 
> > So what's the advantage of this over pass-through then?
> > 
> > 
> > > 
> > > Tests and performance
> > > =====================
> > > Tested with SoftRoCE backend (rxe)/Mellanox ConnectX3,
> > > and Mellanox ConnectX4 HCAs with:
> > >   - VMs in the same host
> > >   - VMs in different hosts 
> > >   - VMs to bare metal.
> > > 
> > > The best performance achieved with ConnectX HCAs and buffer size
> > > bigger than 1MB which was the line rate ~ 50Gb/s.
> > > The conclusion is that using the PVRDMA device there are no
> > > actual performance penalties compared to bare metal for big enough
> > > buffers (which is quite common when using RDMA), while allowing
> > > memory overcommit.
> > > 
> > > Marcel Apfelbaum (3):
> > >   mem: add share parameter to memory-backend-ram
> > >   docs: add pvrdma device documentation.
> > >   MAINTAINERS: add entry for hw/net/pvrdma
> > > 
> > > Yuval Shaia (2):
> > >   pci/shpc: Move function to generic header file
> > >   pvrdma: initial implementation
> > > 
> > > [...]