hw/pvrdma: Proposal of a new pvrdma device

[Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Marcel Apfelbaum 8 years, 10 months ago

From: Yuval Shaia <yuval.shaia@oracle.com>

 Hi,

 General description
 ===================
 This is a very early RFC of a new RoCE emulated device
 that enables guests to use the RDMA stack without having
 a real hardware in the host.
 
 The current implementation supports only VM to VM communication
 on the same host.
 Down the road we plan to make possible to be able to support
 inter-machine communication by utilizing physical RoCE devices
 or Soft RoCE.

 The goals are:
 - Reach fast and secure loos-less Inter-VM data exchange.
 - Support remote VMs or bare metal machines.
 - Allow VMs migration.
 - Do not require to pin all VM memory. 


 Objective
 =========
 Have a QEMU implementation of the PVRDMA device. We aim to do so without
 any change in the PVRDMA guest driver which is already merged into the
 upstream kernel.


 RFC status
 ===========
 The project is in early development stages and supports
 only basic send/receive operations.

 We present it so we can get feedbacks on design,
 feature demands and to receive comments from the
 community pointing us to the "right" direction.

 What does work:
  - Tested with a basic unit-test:
    - https://github.com/yuvalshaia/kibpingpong .
  It works fine with two devices on a single VM, has
  some issue between two VMs in the same host.


 Design
 ======
 - Follows the behavior of VMware's pvrdma device, however is not tightly
   coupled with it and most of the code can be reused if we decide to
   continue to a Virtio based RDMA device.

 - It exposes 3 BARs:
    BAR 0 - MSIX, utilize 3 vectors for command ring, async events and
            completions
    BAR 1 - Configuration of registers
    BAR 2 - UAR, used to pass HW commands from driver.
 
 - The device performs internal management of the RDMA
   resources (PDs, CQs, QPs, ...), meaning the objects
   are not directly coupled to a physical RDMA device resources.
 
 - As backend, the pvrdma device uses KDBR, a new kernel module which
   is also in RFC phase, read more on the linux-rdma list:
     - https://www.spinics.net/lists/linux-rdma/msg47951.html

 - All RDMA operations are converted to KDBR module calls which performs
   the actual transfer between VMs, or, in the future,
   will utilize a RoCE device (either physical or soft) to be able
   to communicate with another host.


Roadmap (out of order)
======================
 - Utilize the RoCE host driver in order to support peers on external hosts.
 - Re-use the code for a virtio based device.

Any ideas, comments or suggestions would be highly appreciated.

Thanks,
Yuval Shaia & Marcel Apfelbaum

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
(Mainly design, coding was done by Yuval)
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>

---
 hw/net/Makefile.objs            |   5 +
 hw/net/pvrdma/kdbr.h            | 104 +++++++
 hw/net/pvrdma/pvrdma-uapi.h     | 261 ++++++++++++++++
 hw/net/pvrdma/pvrdma.h          | 155 ++++++++++
 hw/net/pvrdma/pvrdma_cmd.c      | 322 +++++++++++++++++++
 hw/net/pvrdma/pvrdma_defs.h     | 301 ++++++++++++++++++
 hw/net/pvrdma/pvrdma_dev_api.h  | 342 ++++++++++++++++++++
 hw/net/pvrdma/pvrdma_ib_verbs.h | 469 ++++++++++++++++++++++++++++
 hw/net/pvrdma/pvrdma_kdbr.c     | 395 ++++++++++++++++++++++++
 hw/net/pvrdma/pvrdma_kdbr.h     |  53 ++++
 hw/net/pvrdma/pvrdma_main.c     | 667 ++++++++++++++++++++++++++++++++++++++++
 hw/net/pvrdma/pvrdma_qp_ops.c   | 174 +++++++++++
 hw/net/pvrdma/pvrdma_qp_ops.h   |  25 ++
 hw/net/pvrdma/pvrdma_ring.c     | 127 ++++++++
 hw/net/pvrdma/pvrdma_ring.h     |  43 +++
 hw/net/pvrdma/pvrdma_rm.c       | 529 +++++++++++++++++++++++++++++++
 hw/net/pvrdma/pvrdma_rm.h       | 214 +++++++++++++
 hw/net/pvrdma/pvrdma_types.h    |  37 +++
 hw/net/pvrdma/pvrdma_utils.c    |  36 +++
 hw/net/pvrdma/pvrdma_utils.h    |  49 +++
 include/hw/pci/pci_ids.h        |   3 +
 21 files changed, 4311 insertions(+)
 create mode 100644 hw/net/pvrdma/kdbr.h
 create mode 100644 hw/net/pvrdma/pvrdma-uapi.h
 create mode 100644 hw/net/pvrdma/pvrdma.h
 create mode 100644 hw/net/pvrdma/pvrdma_cmd.c
 create mode 100644 hw/net/pvrdma/pvrdma_defs.h
 create mode 100644 hw/net/pvrdma/pvrdma_dev_api.h
 create mode 100644 hw/net/pvrdma/pvrdma_ib_verbs.h
 create mode 100644 hw/net/pvrdma/pvrdma_kdbr.c
 create mode 100644 hw/net/pvrdma/pvrdma_kdbr.h
 create mode 100644 hw/net/pvrdma/pvrdma_main.c
 create mode 100644 hw/net/pvrdma/pvrdma_qp_ops.c
 create mode 100644 hw/net/pvrdma/pvrdma_qp_ops.h
 create mode 100644 hw/net/pvrdma/pvrdma_ring.c
 create mode 100644 hw/net/pvrdma/pvrdma_ring.h
 create mode 100644 hw/net/pvrdma/pvrdma_rm.c
 create mode 100644 hw/net/pvrdma/pvrdma_rm.h
 create mode 100644 hw/net/pvrdma/pvrdma_types.h
 create mode 100644 hw/net/pvrdma/pvrdma_utils.c
 create mode 100644 hw/net/pvrdma/pvrdma_utils.h

diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
index 610ed3e..a962347 100644
--- a/hw/net/Makefile.objs
+++ b/hw/net/Makefile.objs
@@ -43,3 +43,8 @@ common-obj-$(CONFIG_ROCKER) += rocker/rocker.o rocker/rocker_fp.o \
                                rocker/rocker_desc.o rocker/rocker_world.o \
                                rocker/rocker_of_dpa.o
 obj-$(call lnot,$(CONFIG_ROCKER)) += rocker/qmp-norocker.o
+
+obj-$(CONFIG_PCI) += pvrdma/pvrdma_ring.o pvrdma/pvrdma_rm.o \
+		     pvrdma/pvrdma_utils.o pvrdma/pvrdma_qp_ops.o \
+		     pvrdma/pvrdma_kdbr.o pvrdma/pvrdma_cmd.o \
+		     pvrdma/pvrdma_main.o
diff --git a/hw/net/pvrdma/kdbr.h b/hw/net/pvrdma/kdbr.h
new file mode 100644
index 0000000..97cb93c
--- /dev/null
+++ b/hw/net/pvrdma/kdbr.h
@@ -0,0 +1,104 @@
+/*
+ * Kernel Data Bridge driver - API
+ *
+ * Copyright 2016 Red Hat, Inc.
+ * Copyright 2016 Oracle
+ *
+ * Authors:
+ *   Marcel Apfelbaum <marcel@redhat.com>
+ *   Yuval Shaia <yuval.shaia@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef _KDBR_H
+#define _KDBR_H
+
+#ifdef __KERNEL__
+#include <linux/uio.h>
+#define KDBR_MAX_IOVEC_LEN    UIO_FASTIOV
+#else
+#include <sys/uio.h>
+#define KDBR_MAX_IOVEC_LEN    8
+#endif
+
+#define KDBR_FILE_NAME "/dev/kdbr"
+#define KDBR_MAX_PORTS 255
+
+#define KDBR_IOC_MAGIC 0xBA
+
+#define KDBR_REGISTER_PORT    _IOWR(KDBR_IOC_MAGIC, 0, struct kdbr_reg)
+#define KDBR_UNREGISTER_PORT    _IOW(KDBR_IOC_MAGIC, 1, int)
+#define KDBR_IOC_MAX        2
+
+
+enum kdbr_ack_type {
+    KDBR_ACK_IMMEDIATE,
+    KDBR_ACK_DELAYED,
+};
+
+struct kdbr_gid {
+    unsigned long net_id;
+    unsigned long id;
+};
+
+struct kdbr_peer {
+    struct kdbr_gid rgid;
+    unsigned long rqueue;
+};
+
+struct list_head;
+struct mutex;
+struct kdbr_connection {
+    unsigned long queue_id;
+    struct kdbr_peer peer;
+    enum kdbr_ack_type ack_type;
+    /* TODO: hide the below fields in the .c file */
+    struct list_head *sg_vecs_list;
+    struct mutex *sg_vecs_mutex;
+};
+
+struct kdbr_reg {
+    struct kdbr_gid gid; /* in */
+    int port; /* out */
+};
+
+#define KDBR_REQ_SIGNATURE    0x000000AB
+#define KDBR_REQ_POST_RECV    0x00000100
+#define KDBR_REQ_POST_SEND    0x00000200
+#define KDBR_REQ_POST_MREG    0x00000300
+#define KDBR_REQ_POST_RDMA    0x00000400
+
+struct kdbr_req {
+    unsigned int flags; /* 8 bits signature, 8 bits msg_type */
+    struct iovec vec[KDBR_MAX_IOVEC_LEN];
+    int vlen; /* <= KDBR_MAX_IOVEC_LEN */
+    int connection_id;
+    struct kdbr_peer peer;
+    unsigned long req_id;
+};
+
+#define KDBR_ERR_CODE_EMPTY_VEC           0x101
+#define KDBR_ERR_CODE_NO_MORE_RECV_BUF    0x102
+#define KDBR_ERR_CODE_RECV_BUF_PROT       0x103
+#define KDBR_ERR_CODE_INV_ADDR            0x104
+#define KDBR_ERR_CODE_INV_CONN_ID         0x105
+#define KDBR_ERR_CODE_NO_PEER             0x106
+
+struct kdbr_completion {
+    int connection_id;
+    unsigned long req_id;
+    int status; /* 0 = Success */
+};
+
+#define KDBR_PORT_IOC_MAGIC    0xBB
+
+#define KDBR_PORT_OPEN_CONN    _IOR(KDBR_PORT_IOC_MAGIC, 0, \
+                     struct kdbr_connection)
+#define KDBR_PORT_CLOSE_CONN    _IOR(KDBR_PORT_IOC_MAGIC, 1, int)
+#define KDBR_PORT_IOC_MAX    4
+
+#endif
+
diff --git a/hw/net/pvrdma/pvrdma-uapi.h b/hw/net/pvrdma/pvrdma-uapi.h
new file mode 100644
index 0000000..0045776
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma-uapi.h
@@ -0,0 +1,261 @@
+/*
+ * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of EITHER the GNU General Public License
+ * version 2 as published by the Free Software Foundation or the BSD
+ * 2-Clause License. This program is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
+ * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License version 2 for more details at
+ * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program available in the file COPYING in the main
+ * directory of this source tree.
+ *
+ * The BSD 2-Clause License
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef PVRDMA_UAPI_H
+#define PVRDMA_UAPI_H
+
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include <hw/net/pvrdma/pvrdma_types.h>
+#include <qemu/compiler.h>
+#include <qemu/atomic.h>
+
+#define PVRDMA_VERSION 17
+
+#define PVRDMA_UAR_HANDLE_MASK    0x00FFFFFF    /* Bottom 24 bits. */
+#define PVRDMA_UAR_QP_OFFSET    0        /* Offset of QP doorbell. */
+#define PVRDMA_UAR_QP_SEND    BIT(30)        /* Send bit. */
+#define PVRDMA_UAR_QP_RECV    BIT(31)        /* Recv bit. */
+#define PVRDMA_UAR_CQ_OFFSET    4        /* Offset of CQ doorbell. */
+#define PVRDMA_UAR_CQ_ARM_SOL    BIT(29)        /* Arm solicited bit. */
+#define PVRDMA_UAR_CQ_ARM    BIT(30)        /* Arm bit. */
+#define PVRDMA_UAR_CQ_POLL    BIT(31)        /* Poll bit. */
+#define PVRDMA_INVALID_IDX    -1        /* Invalid index. */
+
+/* PVRDMA atomic compare and swap */
+struct pvrdma_exp_cmp_swap {
+    __u64 swap_val;
+    __u64 compare_val;
+    __u64 swap_mask;
+    __u64 compare_mask;
+};
+
+/* PVRDMA atomic fetch and add */
+struct pvrdma_exp_fetch_add {
+    __u64 add_val;
+    __u64 field_boundary;
+};
+
+/* PVRDMA address vector. */
+struct pvrdma_av {
+    __u32 port_pd;
+    __u32 sl_tclass_flowlabel;
+    __u8 dgid[16];
+    __u8 src_path_bits;
+    __u8 gid_index;
+    __u8 stat_rate;
+    __u8 hop_limit;
+    __u8 dmac[6];
+    __u8 reserved[6];
+};
+
+/* PVRDMA scatter/gather entry */
+struct pvrdma_sge {
+    __u64   addr;
+    __u32   length;
+    __u32   lkey;
+};
+
+/* PVRDMA receive queue work request */
+struct pvrdma_rq_wqe_hdr {
+    __u64 wr_id;        /* wr id */
+    __u32 num_sge;        /* size of s/g array */
+    __u32 total_len;    /* reserved */
+};
+/* Use pvrdma_sge (ib_sge) for receive queue s/g array elements. */
+
+/* PVRDMA send queue work request */
+struct pvrdma_sq_wqe_hdr {
+    __u64 wr_id;        /* wr id */
+    __u32 num_sge;        /* size of s/g array */
+    __u32 total_len;    /* reserved */
+    __u32 opcode;        /* operation type */
+    __u32 send_flags;    /* wr flags */
+    union {
+        __u32 imm_data;
+        __u32 invalidate_rkey;
+    } ex;
+    __u32 reserved;
+    union {
+        struct {
+            __u64 remote_addr;
+            __u32 rkey;
+            __u8 reserved[4];
+        } rdma;
+        struct {
+            __u64 remote_addr;
+            __u64 compare_add;
+            __u64 swap;
+            __u32 rkey;
+            __u32 reserved;
+        } atomic;
+        struct {
+            __u64 remote_addr;
+            __u32 log_arg_sz;
+            __u32 rkey;
+            union {
+                struct pvrdma_exp_cmp_swap  cmp_swap;
+                struct pvrdma_exp_fetch_add fetch_add;
+            } wr_data;
+        } masked_atomics;
+        struct {
+            __u64 iova_start;
+            __u64 pl_pdir_dma;
+            __u32 page_shift;
+            __u32 page_list_len;
+            __u32 length;
+            __u32 access_flags;
+            __u32 rkey;
+        } fast_reg;
+        struct {
+            __u32 remote_qpn;
+            __u32 remote_qkey;
+            struct pvrdma_av av;
+        } ud;
+    } wr;
+};
+/* Use pvrdma_sge (ib_sge) for send queue s/g array elements. */
+
+/* Completion queue element. */
+struct pvrdma_cqe {
+    __u64 wr_id;
+    __u64 qp;
+    __u32 opcode;
+    __u32 status;
+    __u32 byte_len;
+    __u32 imm_data;
+    __u32 src_qp;
+    __u32 wc_flags;
+    __u32 vendor_err;
+    __u16 pkey_index;
+    __u16 slid;
+    __u8 sl;
+    __u8 dlid_path_bits;
+    __u8 port_num;
+    __u8 smac[6];
+    __u8 reserved2[7]; /* Pad to next power of 2 (64). */
+};
+
+struct pvrdma_ring {
+    int prod_tail;    /* Producer tail. */
+    int cons_head;    /* Consumer head. */
+};
+
+struct pvrdma_ring_state {
+    struct pvrdma_ring tx;    /* Tx ring. */
+    struct pvrdma_ring rx;    /* Rx ring. */
+};
+
+static inline int pvrdma_idx_valid(__u32 idx, __u32 max_elems)
+{
+    /* Generates fewer instructions than a less-than. */
+    return (idx & ~((max_elems << 1) - 1)) == 0;
+}
+
+static inline __s32 pvrdma_idx(int *var, __u32 max_elems)
+{
+    unsigned int idx = atomic_read(var);
+
+    if (pvrdma_idx_valid(idx, max_elems)) {
+        return idx & (max_elems - 1);
+    }
+    return PVRDMA_INVALID_IDX;
+}
+
+static inline void pvrdma_idx_ring_inc(int *var, __u32 max_elems)
+{
+    __u32 idx = atomic_read(var) + 1;    /* Increment. */
+
+    idx &= (max_elems << 1) - 1;        /* Modulo size, flip gen. */
+    atomic_set(var, idx);
+}
+
+static inline __s32 pvrdma_idx_ring_has_space(const struct pvrdma_ring *r,
+                          __u32 max_elems, __u32 *out_tail)
+{
+    const __u32 tail = atomic_read(&r->prod_tail);
+    const __u32 head = atomic_read(&r->cons_head);
+
+    if (pvrdma_idx_valid(tail, max_elems) &&
+        pvrdma_idx_valid(head, max_elems)) {
+        *out_tail = tail & (max_elems - 1);
+        return tail != (head ^ max_elems);
+    }
+    return PVRDMA_INVALID_IDX;
+}
+
+static inline __s32 pvrdma_idx_ring_has_data(const struct pvrdma_ring *r,
+                         __u32 max_elems, __u32 *out_head)
+{
+    const __u32 tail = atomic_read(&r->prod_tail);
+    const __u32 head = atomic_read(&r->cons_head);
+
+    if (pvrdma_idx_valid(tail, max_elems) &&
+        pvrdma_idx_valid(head, max_elems)) {
+        *out_head = head & (max_elems - 1);
+        return tail != head;
+    }
+    return PVRDMA_INVALID_IDX;
+}
+
+static inline bool pvrdma_idx_ring_is_valid_idx(const struct pvrdma_ring *r,
+                        __u32 max_elems, __u32 *idx)
+{
+    const __u32 tail = atomic_read(&r->prod_tail);
+    const __u32 head = atomic_read(&r->cons_head);
+
+    if (pvrdma_idx_valid(tail, max_elems) &&
+        pvrdma_idx_valid(head, max_elems) &&
+        pvrdma_idx_valid(*idx, max_elems)) {
+        if (tail > head && (*idx < tail && *idx >= head)) {
+            return true;
+        } else if (head > tail && (*idx >= head || *idx < tail)) {
+            return true;
+        }
+    }
+    return false;
+}
+
+#endif /* PVRDMA_UAPI_H */
diff --git a/hw/net/pvrdma/pvrdma.h b/hw/net/pvrdma/pvrdma.h
new file mode 100644
index 0000000..d6349d4
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma.h
@@ -0,0 +1,155 @@
+/*
+ * QEMU VMWARE paravirtual RDMA interface definitions
+ *
+ * Developed by Oracle & Redhat
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PVRDMA_PVRDMA_H
+#define PVRDMA_PVRDMA_H
+
+#include <qemu/osdep.h>
+#include <hw/pci/pci.h>
+#include <hw/pci/msix.h>
+#include <hw/net/pvrdma/pvrdma_kdbr.h>
+#include <hw/net/pvrdma/pvrdma_rm.h>
+#include <hw/net/pvrdma/pvrdma_defs.h>
+#include <hw/net/pvrdma/pvrdma_dev_api.h>
+#include <hw/net/pvrdma/pvrdma_ring.h>
+
+/* BARs */
+#define RDMA_MSIX_BAR_IDX    0
+#define RDMA_REG_BAR_IDX     1
+#define RDMA_UAR_BAR_IDX     2
+#define RDMA_BAR0_MSIX_SIZE  (16 * 1024)
+#define RDMA_BAR1_REGS_SIZE  256
+#define RDMA_BAR2_UAR_SIZE   (16 * 1024)
+
+/* MSIX */
+#define RDMA_MAX_INTRS       3
+#define RDMA_MSIX_TABLE      0x0000
+#define RDMA_MSIX_PBA        0x2000
+
+/* Interrupts Vectors */
+#define INTR_VEC_CMD_RING            0
+#define INTR_VEC_CMD_ASYNC_EVENTS    1
+#define INTR_VEC_CMD_COMPLETION_Q    2
+
+/* HW attributes */
+#define PVRDMA_HW_NAME       "pvrdma"
+#define PVRDMA_HW_VERSION    17
+#define PVRDMA_FW_VERSION    14
+
+/* Vendor Errors, codes 100 to FFF kept for kdbr */
+#define VENDOR_ERR_TOO_MANY_SGES    0x201
+#define VENDOR_ERR_NOMEM            0x202
+#define VENDOR_ERR_FAIL_KDBR        0x203
+
+typedef struct HWResourceIDs {
+    unsigned long *local_bitmap;
+    __u32 *hw_map;
+} HWResourceIDs;
+
+typedef struct DSRInfo {
+    dma_addr_t dma;
+    struct pvrdma_device_shared_region *dsr;
+
+    union pvrdma_cmd_req *req;
+    union pvrdma_cmd_resp *rsp;
+
+    struct pvrdma_ring *async_ring_state;
+    Ring async;
+
+    struct pvrdma_ring *cq_ring_state;
+    Ring cq;
+} DSRInfo;
+
+typedef struct PVRDMADev {
+    PCIDevice parent_obj;
+    MemoryRegion msix;
+    MemoryRegion regs;
+    __u32 regs_data[RDMA_BAR1_REGS_SIZE];
+    MemoryRegion uar;
+    __u32 uar_data[RDMA_BAR2_UAR_SIZE];
+    DSRInfo dsr_info;
+    int interrupt_mask;
+    RmPort ports[MAX_PORTS];
+    u64 sys_image_guid;
+    u64 node_guid;
+    u64 network_prefix;
+    RmResTbl pd_tbl;
+    RmResTbl mr_tbl;
+    RmResTbl qp_tbl;
+    RmResTbl cq_tbl;
+    RmResTbl wqe_ctx_tbl;
+} PVRDMADev;
+#define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
+
+static inline int get_reg_val(PVRDMADev *dev, hwaddr addr, __u32 *val)
+{
+    int idx = addr >> 2;
+
+    if (idx > RDMA_BAR1_REGS_SIZE) {
+        return -EINVAL;
+    }
+
+    *val = dev->regs_data[idx];
+
+    return 0;
+}
+static inline int set_reg_val(PVRDMADev *dev, hwaddr addr, __u32 val)
+{
+    int idx = addr >> 2;
+
+    if (idx > RDMA_BAR1_REGS_SIZE) {
+        return -EINVAL;
+    }
+
+    dev->regs_data[idx] = val;
+
+    return 0;
+}
+static inline int get_uar_val(PVRDMADev *dev, hwaddr addr, __u32 *val)
+{
+    int idx = addr >> 2;
+
+    if (idx > RDMA_BAR2_UAR_SIZE) {
+        return -EINVAL;
+    }
+
+    *val = dev->uar_data[idx];
+
+    return 0;
+}
+static inline int set_uar_val(PVRDMADev *dev, hwaddr addr, __u32 val)
+{
+    int idx = addr >> 2;
+
+    if (idx > RDMA_BAR2_UAR_SIZE) {
+        return -EINVAL;
+    }
+
+    dev->uar_data[idx] = val;
+
+    return 0;
+}
+
+static inline void post_interrupt(PVRDMADev *dev, unsigned vector)
+{
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+
+    if (likely(dev->interrupt_mask == 0)) {
+        msix_notify(pci_dev, vector);
+    }
+}
+
+int execute_command(PVRDMADev *dev);
+
+#endif
diff --git a/hw/net/pvrdma/pvrdma_cmd.c b/hw/net/pvrdma/pvrdma_cmd.c
new file mode 100644
index 0000000..ae1ef99
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_cmd.c
@@ -0,0 +1,322 @@
+#include "qemu/osdep.h"
+#include "hw/hw.h"
+#include "hw/pci/pci.h"
+#include "hw/pci/pci_ids.h"
+#include "hw/net/pvrdma/pvrdma_utils.h"
+#include "hw/net/pvrdma/pvrdma.h"
+#include "hw/net/pvrdma/pvrdma_rm.h"
+#include "hw/net/pvrdma/pvrdma_kdbr.h"
+
+static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_query_port *cmd = &req->query_port;
+    struct pvrdma_cmd_query_port_resp *resp = &rsp->query_port_resp;
+    __u32 max_port_gids, max_port_pkeys;
+
+    pr_dbg("port=%d\n", cmd->port_num);
+
+    if (rm_get_max_port_gids(&max_port_gids) != 0) {
+        return -ENOMEM;
+    }
+
+    if (rm_get_max_port_pkeys(&max_port_pkeys) != 0) {
+        return -ENOMEM;
+    }
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_QUERY_PORT_RESP;
+    resp->hdr.err = 0;
+
+    resp->attrs.state = PVRDMA_PORT_ACTIVE;
+    resp->attrs.max_mtu = PVRDMA_MTU_4096;
+    resp->attrs.active_mtu = PVRDMA_MTU_4096;
+    resp->attrs.gid_tbl_len = max_port_gids;
+    resp->attrs.port_cap_flags = 0;
+    resp->attrs.max_msg_sz = 1024;
+    resp->attrs.bad_pkey_cntr = 0;
+    resp->attrs.qkey_viol_cntr = 0;
+    resp->attrs.pkey_tbl_len = max_port_pkeys;
+    resp->attrs.lid = 0;
+    resp->attrs.sm_lid = 0;
+    resp->attrs.lmc = 0;
+    resp->attrs.max_vl_num = 0;
+    resp->attrs.sm_sl = 0;
+    resp->attrs.subnet_timeout = 0;
+    resp->attrs.init_type_reply = 0;
+    resp->attrs.active_width = 1;
+    resp->attrs.active_speed = 1;
+    resp->attrs.phys_state = 1;
+
+    return 0;
+}
+
+static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_query_pkey *cmd = &req->query_pkey;
+    struct pvrdma_cmd_query_pkey_resp *resp = &rsp->query_pkey_resp;
+
+    pr_dbg("port=%d\n", cmd->port_num);
+    pr_dbg("index=%d\n", cmd->index);
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_QUERY_PKEY_RESP;
+    resp->hdr.err = 0;
+
+    resp->pkey = 0x7FFF;
+    pr_dbg("pkey=0x%x\n", resp->pkey);
+
+    return 0;
+}
+
+static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                     union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_create_pd *cmd = &req->create_pd;
+    struct pvrdma_cmd_create_pd_resp *resp = &rsp->create_pd_resp;
+
+    pr_dbg("context=0x%x\n", cmd->ctx_handle ? cmd->ctx_handle : 0);
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_CREATE_PD_RESP;
+    resp->hdr.err = rm_alloc_pd(dev, &resp->pd_handle, cmd->ctx_handle);
+
+    pr_dbg("ret=%d\n", resp->hdr.err);
+    return resp->hdr.err;
+}
+
+static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_destroy_pd *cmd = &req->destroy_pd;
+
+    pr_dbg("pd_handle=%d\n", cmd->pd_handle);
+
+    rm_dealloc_pd(dev, cmd->pd_handle);
+
+    return 0;
+}
+
+static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                     union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_create_mr *cmd = &req->create_mr;
+    struct pvrdma_cmd_create_mr_resp *resp = &rsp->create_mr_resp;
+
+    pr_dbg("pd_handle=%d\n", cmd->pd_handle);
+    pr_dbg("access_flags=0x%x\n", cmd->access_flags);
+    pr_dbg("flags=0x%x\n", cmd->flags);
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_CREATE_MR_RESP;
+    resp->hdr.err = rm_alloc_mr(dev, cmd, resp);
+
+    pr_dbg("ret=%d\n", resp->hdr.err);
+    return resp->hdr.err;
+}
+
+static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_destroy_mr *cmd = &req->destroy_mr;
+
+    pr_dbg("mr_handle=%d\n", cmd->mr_handle);
+
+    rm_dealloc_mr(dev, cmd->mr_handle);
+
+    return 0;
+}
+
+static int create_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                     union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_create_cq *cmd = &req->create_cq;
+    struct pvrdma_cmd_create_cq_resp *resp = &rsp->create_cq_resp;
+
+    pr_dbg("pdir_dma=0x%llx\n", (long long unsigned int)cmd->pdir_dma);
+    pr_dbg("context=0x%x\n", cmd->ctx_handle ? cmd->ctx_handle : 0);
+    pr_dbg("cqe=%d\n", cmd->cqe);
+    pr_dbg("nchunks=%d\n", cmd->nchunks);
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_CREATE_CQ_RESP;
+    resp->hdr.err = rm_alloc_cq(dev, cmd, resp);
+
+    pr_dbg("ret=%d\n", resp->hdr.err);
+    return resp->hdr.err;
+}
+
+static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_destroy_cq *cmd = &req->destroy_cq;
+
+    pr_dbg("cq_handle=%d\n", cmd->cq_handle);
+
+    rm_dealloc_cq(dev, cmd->cq_handle);
+
+    return 0;
+}
+
+static int create_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                     union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_create_qp *cmd = &req->create_qp;
+    struct pvrdma_cmd_create_qp_resp *resp = &rsp->create_qp_resp;
+
+    if (!dev->ports[0].kdbr_port) {
+        pr_dbg("First QP, registering port 0\n");
+        dev->ports[0].kdbr_port = kdbr_alloc_port(dev);
+        if (!dev->ports[0].kdbr_port) {
+            pr_dbg("Fail to register port\n");
+            return -EIO;
+        }
+    }
+
+    pr_dbg("pd_handle=%d\n", cmd->pd_handle);
+    pr_dbg("pdir_dma=0x%llx\n", (long long unsigned int)cmd->pdir_dma);
+    pr_dbg("total_chunks=%d\n", cmd->total_chunks);
+    pr_dbg("send_chunks=%d\n", cmd->send_chunks);
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_CREATE_QP_RESP;
+    resp->hdr.err = rm_alloc_qp(dev, cmd, resp);
+
+    pr_dbg("ret=%d\n", resp->hdr.err);
+    return resp->hdr.err;
+}
+
+static int modify_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                     union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_modify_qp *cmd = &req->modify_qp;
+
+    pr_dbg("qp_handle=%d\n", cmd->qp_handle);
+
+    memset(rsp, 0, sizeof(*rsp));
+    rsp->hdr.response = cmd->hdr.response;
+    rsp->hdr.ack = PVRDMA_CMD_MODIFY_QP_RESP;
+    rsp->hdr.err = rm_modify_qp(dev, cmd->qp_handle, cmd);
+
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
+}
+
+static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_destroy_qp *cmd = &req->destroy_qp;
+
+    pr_dbg("qp_handle=%d\n", cmd->qp_handle);
+
+    rm_dealloc_qp(dev, cmd->qp_handle);
+
+    return 0;
+}
+
+static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                       union pvrdma_cmd_resp *rsp)
+{
+    int rc;
+    struct pvrdma_cmd_create_bind *cmd = &req->create_bind;
+    u32 max_port_gids;
+#ifdef DEBUG
+    __be64 *subnet = (__be64 *)&cmd->new_gid[0];
+    __be64 *if_id = (__be64 *)&cmd->new_gid[8];
+#endif
+
+    pr_dbg("index=%d\n", cmd->index);
+
+    rc = rm_get_max_port_gids(&max_port_gids);
+    if (rc) {
+        return -EIO;
+    }
+
+    if (cmd->index > max_port_gids) {
+        return -EINVAL;
+    }
+
+    pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index, *subnet, *if_id);
+
+    /* Driver forces to one port only */
+    memcpy(dev->ports[0].gid_tbl[cmd->index].raw, &cmd->new_gid,
+           sizeof(cmd->new_gid));
+
+    return 0;
+}
+
+static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                        union pvrdma_cmd_resp *rsp)
+{
+    /*  TODO: Check the usage of this table */
+
+    struct pvrdma_cmd_destroy_bind *cmd = &req->destroy_bind;
+
+    pr_dbg("clear index %d\n", cmd->index);
+
+    memset(dev->ports[0].gid_tbl[cmd->index].raw, 0,
+           sizeof(dev->ports[0].gid_tbl[cmd->index].raw));
+
+    return 0;
+}
+
+struct cmd_handler {
+    __u32 cmd;
+    int (*exec)(PVRDMADev *dev, union pvrdma_cmd_req *req,
+            union pvrdma_cmd_resp *rsp);
+};
+
+static struct cmd_handler cmd_handlers[] = {
+    {PVRDMA_CMD_QUERY_PORT, query_port},
+    {PVRDMA_CMD_QUERY_PKEY, query_pkey},
+    {PVRDMA_CMD_CREATE_PD, create_pd},
+    {PVRDMA_CMD_DESTROY_PD, destroy_pd},
+    {PVRDMA_CMD_CREATE_MR, create_mr},
+    {PVRDMA_CMD_DESTROY_MR, destroy_mr},
+    {PVRDMA_CMD_CREATE_CQ, create_cq},
+    {PVRDMA_CMD_RESIZE_CQ, NULL},
+    {PVRDMA_CMD_DESTROY_CQ, destroy_cq},
+    {PVRDMA_CMD_CREATE_QP, create_qp},
+    {PVRDMA_CMD_MODIFY_QP, modify_qp},
+    {PVRDMA_CMD_QUERY_QP, NULL},
+    {PVRDMA_CMD_DESTROY_QP, destroy_qp},
+    {PVRDMA_CMD_CREATE_UC, NULL},
+    {PVRDMA_CMD_DESTROY_UC, NULL},
+    {PVRDMA_CMD_CREATE_BIND, create_bind},
+    {PVRDMA_CMD_DESTROY_BIND, destroy_bind},
+};
+
+int execute_command(PVRDMADev *dev)
+{
+    int err = 0xFFFF;
+    DSRInfo *dsr_info;
+
+    dsr_info = &dev->dsr_info;
+
+    pr_dbg("cmd=%d\n", dsr_info->req->hdr.cmd);
+    if (dsr_info->req->hdr.cmd >= sizeof(cmd_handlers) /
+                      sizeof(struct cmd_handler)) {
+        pr_err("Unsupported command\n");
+        goto out;
+    }
+
+    if (!cmd_handlers[dsr_info->req->hdr.cmd].exec) {
+        pr_err("Unsupported command (not implemented yet)\n");
+        goto out;
+    }
+
+    err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
+                            dsr_info->rsp);
+out:
+    set_reg_val(dev, PVRDMA_REG_ERR, err);
+    post_interrupt(dev, INTR_VEC_CMD_RING);
+
+    return (err == 0) ? 0 : -EINVAL;
+}
diff --git a/hw/net/pvrdma/pvrdma_defs.h b/hw/net/pvrdma/pvrdma_defs.h
new file mode 100644
index 0000000..1d0cc11
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_defs.h
@@ -0,0 +1,301 @@
+/*
+ * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of EITHER the GNU General Public License
+ * version 2 as published by the Free Software Foundation or the BSD
+ * 2-Clause License. This program is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
+ * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License version 2 for more details at
+ * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program available in the file COPYING in the main
+ * directory of this source tree.
+ *
+ * The BSD 2-Clause License
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef PVRDMA_DEFS_H
+#define PVRDMA_DEFS_H
+
+#include <hw/net/pvrdma/pvrdma_types.h>
+#include <hw/net/pvrdma/pvrdma_ib_verbs.h>
+#include <hw/net/pvrdma/pvrdma-uapi.h>
+
+/*
+ * Masks and accessors for page directory, which is a two-level lookup:
+ * page directory -> page table -> page. Only one directory for now, but we
+ * could expand that easily. 9 bits for tables, 9 bits for pages, gives one
+ * gigabyte for memory regions and so forth.
+ */
+
+#define PVRDMA_PDIR_SHIFT        18
+#define PVRDMA_PTABLE_SHIFT        9
+#define PVRDMA_PAGE_DIR_DIR(x)        (((x) >> PVRDMA_PDIR_SHIFT) & 0x1)
+#define PVRDMA_PAGE_DIR_TABLE(x)    (((x) >> PVRDMA_PTABLE_SHIFT) & 0x1ff)
+#define PVRDMA_PAGE_DIR_PAGE(x)        ((x) & 0x1ff)
+#define PVRDMA_PAGE_DIR_MAX_PAGES    (1 * 512 * 512)
+#define PVRDMA_MAX_FAST_REG_PAGES    128
+
+/*
+ * Max MSI-X vectors.
+ */
+
+#define PVRDMA_MAX_INTERRUPTS    3
+
+/* Register offsets within PCI resource on BAR1. */
+#define PVRDMA_REG_VERSION    0x00    /* R: Version of device. */
+#define PVRDMA_REG_DSRLOW    0x04    /* W: Device shared region low PA. */
+#define PVRDMA_REG_DSRHIGH    0x08    /* W: Device shared region high PA. */
+#define PVRDMA_REG_CTL        0x0c    /* W: PVRDMA_DEVICE_CTL */
+#define PVRDMA_REG_REQUEST    0x10    /* W: Indicate device request. */
+#define PVRDMA_REG_ERR        0x14    /* R: Device error. */
+#define PVRDMA_REG_ICR        0x18    /* R: Interrupt cause. */
+#define PVRDMA_REG_IMR        0x1c    /* R/W: Interrupt mask. */
+#define PVRDMA_REG_MACL        0x20    /* R/W: MAC address low. */
+#define PVRDMA_REG_MACH        0x24    /* R/W: MAC address high. */
+
+/* Object flags. */
+#define PVRDMA_CQ_FLAG_ARMED_SOL    BIT(0)    /* Armed for solicited-only. */
+#define PVRDMA_CQ_FLAG_ARMED        BIT(1)    /* Armed. */
+#define PVRDMA_MR_FLAG_DMA        BIT(0)    /* DMA region. */
+#define PVRDMA_MR_FLAG_FRMR        BIT(1)    /* Fast reg memory region. */
+
+/*
+ * Atomic operation capability (masked versions are extended atomic
+ * operations.
+ */
+
+#define PVRDMA_ATOMIC_OP_COMP_SWAP    BIT(0) /* Compare and swap. */
+#define PVRDMA_ATOMIC_OP_FETCH_ADD    BIT(1) /* Fetch and add. */
+#define PVRDMA_ATOMIC_OP_MASK_COMP_SWAP    BIT(2) /* Masked compare and swap. */
+#define PVRDMA_ATOMIC_OP_MASK_FETCH_ADD    BIT(3) /* Masked fetch and add. */
+
+/*
+ * Base Memory Management Extension flags to support Fast Reg Memory Regions
+ * and Fast Reg Work Requests. Each flag represents a verb operation and we
+ * must support all of them to qualify for the BMME device cap.
+ */
+
+#define PVRDMA_BMME_FLAG_LOCAL_INV    BIT(0) /* Local Invalidate. */
+#define PVRDMA_BMME_FLAG_REMOTE_INV    BIT(1) /* Remote Invalidate. */
+#define PVRDMA_BMME_FLAG_FAST_REG_WR    BIT(2) /* Fast Reg Work Request. */
+
+/*
+ * GID types. The interpretation of the gid_types bit field in the device
+ * capabilities will depend on the device mode. For now, the device only
+ * supports RoCE as mode, so only the different GID types for RoCE are
+ * defined.
+ */
+
+#define PVRDMA_GID_TYPE_FLAG_ROCE_V1 BIT(0)
+#define PVRDMA_GID_TYPE_FLAG_ROCE_V2 BIT(1)
+
+enum pvrdma_pci_resource {
+    PVRDMA_PCI_RESOURCE_MSIX,    /* BAR0: MSI-X, MMIO. */
+    PVRDMA_PCI_RESOURCE_REG,    /* BAR1: Registers, MMIO. */
+    PVRDMA_PCI_RESOURCE_UAR,    /* BAR2: UAR pages, MMIO, 64-bit. */
+    PVRDMA_PCI_RESOURCE_LAST,    /* Last. */
+};
+
+enum pvrdma_device_ctl {
+    PVRDMA_DEVICE_CTL_ACTIVATE,    /* Activate device. */
+    PVRDMA_DEVICE_CTL_QUIESCE,    /* Quiesce device. */
+    PVRDMA_DEVICE_CTL_RESET,    /* Reset device. */
+};
+
+enum pvrdma_intr_vector {
+    PVRDMA_INTR_VECTOR_RESPONSE,    /* Command response. */
+    PVRDMA_INTR_VECTOR_ASYNC,    /* Async events. */
+    PVRDMA_INTR_VECTOR_CQ,        /* CQ notification. */
+    /* Additional CQ notification vectors. */
+};
+
+enum pvrdma_intr_cause {
+    PVRDMA_INTR_CAUSE_RESPONSE    = (1 << PVRDMA_INTR_VECTOR_RESPONSE),
+    PVRDMA_INTR_CAUSE_ASYNC        = (1 << PVRDMA_INTR_VECTOR_ASYNC),
+    PVRDMA_INTR_CAUSE_CQ        = (1 << PVRDMA_INTR_VECTOR_CQ),
+};
+
+enum pvrdma_intr_type {
+    PVRDMA_INTR_TYPE_INTX,        /* Legacy. */
+    PVRDMA_INTR_TYPE_MSI,        /* MSI. */
+    PVRDMA_INTR_TYPE_MSIX,        /* MSI-X. */
+};
+
+enum pvrdma_gos_bits {
+    PVRDMA_GOS_BITS_UNK,        /* Unknown. */
+    PVRDMA_GOS_BITS_32,        /* 32-bit. */
+    PVRDMA_GOS_BITS_64,        /* 64-bit. */
+};
+
+enum pvrdma_gos_type {
+    PVRDMA_GOS_TYPE_UNK,        /* Unknown. */
+    PVRDMA_GOS_TYPE_LINUX,        /* Linux. */
+};
+
+enum pvrdma_device_mode {
+    PVRDMA_DEVICE_MODE_ROCE,    /* RoCE. */
+    PVRDMA_DEVICE_MODE_IWARP,    /* iWarp. */
+    PVRDMA_DEVICE_MODE_IB,        /* InfiniBand. */
+};
+
+struct pvrdma_gos_info {
+    u32 gos_bits:2;            /* W: PVRDMA_GOS_BITS_ */
+    u32 gos_type:4;            /* W: PVRDMA_GOS_TYPE_ */
+    u32 gos_ver:16;            /* W: Guest OS version. */
+    u32 gos_misc:10;        /* W: Other. */
+    u32 pad;            /* Pad to 8-byte alignment. */
+};
+
+struct pvrdma_device_caps {
+    u64 fw_ver;                /* R: Query device. */
+    __be64 node_guid;
+    __be64 sys_image_guid;
+    u64 max_mr_size;
+    u64 page_size_cap;
+    u64 atomic_arg_sizes;            /* EXP verbs. */
+    u32 exp_comp_mask;            /* EXP verbs. */
+    u32 device_cap_flags2;            /* EXP verbs. */
+    u32 max_fa_bit_boundary;        /* EXP verbs. */
+    u32 log_max_atomic_inline_arg;        /* EXP verbs. */
+    u32 vendor_id;
+    u32 vendor_part_id;
+    u32 hw_ver;
+    u32 max_qp;
+    u32 max_qp_wr;
+    u32 device_cap_flags;
+    u32 max_sge;
+    u32 max_sge_rd;
+    u32 max_cq;
+    u32 max_cqe;
+    u32 max_mr;
+    u32 max_pd;
+    u32 max_qp_rd_atom;
+    u32 max_ee_rd_atom;
+    u32 max_res_rd_atom;
+    u32 max_qp_init_rd_atom;
+    u32 max_ee_init_rd_atom;
+    u32 max_ee;
+    u32 max_rdd;
+    u32 max_mw;
+    u32 max_raw_ipv6_qp;
+    u32 max_raw_ethy_qp;
+    u32 max_mcast_grp;
+    u32 max_mcast_qp_attach;
+    u32 max_total_mcast_qp_attach;
+    u32 max_ah;
+    u32 max_fmr;
+    u32 max_map_per_fmr;
+    u32 max_srq;
+    u32 max_srq_wr;
+    u32 max_srq_sge;
+    u32 max_uar;
+    u32 gid_tbl_len;
+    u16 max_pkeys;
+    u8  local_ca_ack_delay;
+    u8  phys_port_cnt;
+    u8  mode;                /* PVRDMA_DEVICE_MODE_ */
+    u8  atomic_ops;                /* PVRDMA_ATOMIC_OP_* bits */
+    u8  bmme_flags;                /* FRWR Mem Mgmt Extensions */
+    u8  gid_types;                /* PVRDMA_GID_TYPE_FLAG_ */
+    u8  reserved[4];
+};
+
+struct pvrdma_ring_page_info {
+    u32 num_pages;                /* Num pages incl. header. */
+    u32 reserved;                /* Reserved. */
+    u64 pdir_dma;                /* Page directory PA. */
+};
+
+#pragma pack(push, 1)
+
+struct pvrdma_device_shared_region {
+    u32 driver_version;            /* W: Driver version. */
+    u32 pad;                /* Pad to 8-byte align. */
+    struct pvrdma_gos_info gos_info;    /* W: Guest OS information. */
+    u64 cmd_slot_dma;            /* W: Command slot address. */
+    u64 resp_slot_dma;            /* W: Response slot address. */
+    struct pvrdma_ring_page_info async_ring_pages;
+                        /* W: Async ring page info. */
+    struct pvrdma_ring_page_info cq_ring_pages;
+                        /* W: CQ ring page info. */
+    u32 uar_pfn;                /* W: UAR pageframe. */
+    u32 pad2;                /* Pad to 8-byte align. */
+    struct pvrdma_device_caps caps;        /* R: Device capabilities. */
+};
+
+#pragma pack(pop)
+
+
+/* Event types. Currently a 1:1 mapping with enum ib_event. */
+enum pvrdma_eqe_type {
+    PVRDMA_EVENT_CQ_ERR,
+    PVRDMA_EVENT_QP_FATAL,
+    PVRDMA_EVENT_QP_REQ_ERR,
+    PVRDMA_EVENT_QP_ACCESS_ERR,
+    PVRDMA_EVENT_COMM_EST,
+    PVRDMA_EVENT_SQ_DRAINED,
+    PVRDMA_EVENT_PATH_MIG,
+    PVRDMA_EVENT_PATH_MIG_ERR,
+    PVRDMA_EVENT_DEVICE_FATAL,
+    PVRDMA_EVENT_PORT_ACTIVE,
+    PVRDMA_EVENT_PORT_ERR,
+    PVRDMA_EVENT_LID_CHANGE,
+    PVRDMA_EVENT_PKEY_CHANGE,
+    PVRDMA_EVENT_SM_CHANGE,
+    PVRDMA_EVENT_SRQ_ERR,
+    PVRDMA_EVENT_SRQ_LIMIT_REACHED,
+    PVRDMA_EVENT_QP_LAST_WQE_REACHED,
+    PVRDMA_EVENT_CLIENT_REREGISTER,
+    PVRDMA_EVENT_GID_CHANGE,
+};
+
+/* Event queue element. */
+struct pvrdma_eqe {
+    u32 type;    /* Event type. */
+    u32 info;    /* Handle, other. */
+};
+
+/* CQ notification queue element. */
+struct pvrdma_cqne {
+    u32 info;    /* Handle */
+};
+
+static inline void pvrdma_init_cqe(struct pvrdma_cqe *cqe, u64 wr_id, u64 qp)
+{
+    memset(cqe, 0, sizeof(*cqe));
+    cqe->status = PVRDMA_WC_GENERAL_ERR;
+    cqe->wr_id = wr_id;
+    cqe->qp = qp;
+}
+
+#endif /* PVRDMA_DEFS_H */
diff --git a/hw/net/pvrdma/pvrdma_dev_api.h b/hw/net/pvrdma/pvrdma_dev_api.h
new file mode 100644
index 0000000..4887b96
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_dev_api.h
@@ -0,0 +1,342 @@
+/*
+ * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of EITHER the GNU General Public License
+ * version 2 as published by the Free Software Foundation or the BSD
+ * 2-Clause License. This program is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
+ * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License version 2 for more details at
+ * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program available in the file COPYING in the main
+ * directory of this source tree.
+ *
+ * The BSD 2-Clause License
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef PVRDMA_DEV_API_H
+#define PVRDMA_DEV_API_H
+
+#include <hw/net/pvrdma/pvrdma_types.h>
+#include <hw/net/pvrdma/pvrdma_ib_verbs.h>
+
+enum {
+    PVRDMA_CMD_FIRST,
+    PVRDMA_CMD_QUERY_PORT = PVRDMA_CMD_FIRST,
+    PVRDMA_CMD_QUERY_PKEY,
+    PVRDMA_CMD_CREATE_PD,
+    PVRDMA_CMD_DESTROY_PD,
+    PVRDMA_CMD_CREATE_MR,
+    PVRDMA_CMD_DESTROY_MR,
+    PVRDMA_CMD_CREATE_CQ,
+    PVRDMA_CMD_RESIZE_CQ,
+    PVRDMA_CMD_DESTROY_CQ,
+    PVRDMA_CMD_CREATE_QP,
+    PVRDMA_CMD_MODIFY_QP,
+    PVRDMA_CMD_QUERY_QP,
+    PVRDMA_CMD_DESTROY_QP,
+    PVRDMA_CMD_CREATE_UC,
+    PVRDMA_CMD_DESTROY_UC,
+    PVRDMA_CMD_CREATE_BIND,
+    PVRDMA_CMD_DESTROY_BIND,
+    PVRDMA_CMD_MAX,
+};
+
+enum {
+    PVRDMA_CMD_FIRST_RESP = (1 << 31),
+    PVRDMA_CMD_QUERY_PORT_RESP = PVRDMA_CMD_FIRST_RESP,
+    PVRDMA_CMD_QUERY_PKEY_RESP,
+    PVRDMA_CMD_CREATE_PD_RESP,
+    PVRDMA_CMD_DESTROY_PD_RESP_NOOP,
+    PVRDMA_CMD_CREATE_MR_RESP,
+    PVRDMA_CMD_DESTROY_MR_RESP_NOOP,
+    PVRDMA_CMD_CREATE_CQ_RESP,
+    PVRDMA_CMD_RESIZE_CQ_RESP,
+    PVRDMA_CMD_DESTROY_CQ_RESP_NOOP,
+    PVRDMA_CMD_CREATE_QP_RESP,
+    PVRDMA_CMD_MODIFY_QP_RESP,
+    PVRDMA_CMD_QUERY_QP_RESP,
+    PVRDMA_CMD_DESTROY_QP_RESP,
+    PVRDMA_CMD_CREATE_UC_RESP,
+    PVRDMA_CMD_DESTROY_UC_RESP_NOOP,
+    PVRDMA_CMD_CREATE_BIND_RESP_NOOP,
+    PVRDMA_CMD_DESTROY_BIND_RESP_NOOP,
+    PVRDMA_CMD_MAX_RESP,
+};
+
+struct pvrdma_cmd_hdr {
+    u64 response;        /* Key for response lookup. */
+    u32 cmd;        /* PVRDMA_CMD_ */
+    u32 reserved;        /* Reserved. */
+};
+
+struct pvrdma_cmd_resp_hdr {
+    u64 response;        /* From cmd hdr. */
+    u32 ack;        /* PVRDMA_CMD_XXX_RESP */
+    u8 err;            /* Error. */
+    u8 reserved[3];        /* Reserved. */
+};
+
+struct pvrdma_cmd_query_port {
+    struct pvrdma_cmd_hdr hdr;
+    u8 port_num;
+    u8 reserved[7];
+};
+
+struct pvrdma_cmd_query_port_resp {
+    struct pvrdma_cmd_resp_hdr hdr;
+    struct pvrdma_port_attr attrs;
+};
+
+struct pvrdma_cmd_query_pkey {
+    struct pvrdma_cmd_hdr hdr;
+    u8 port_num;
+    u8 index;
+    u8 reserved[6];
+};
+
+struct pvrdma_cmd_query_pkey_resp {
+    struct pvrdma_cmd_resp_hdr hdr;
+    u16 pkey;
+    u8 reserved[6];
+};
+
+struct pvrdma_cmd_create_uc {
+    struct pvrdma_cmd_hdr hdr;
+    u32 pfn; /* UAR page frame number */
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_create_uc_resp {
+    struct pvrdma_cmd_resp_hdr hdr;
+    u32 ctx_handle;
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_destroy_uc {
+    struct pvrdma_cmd_hdr hdr;
+    u32 ctx_handle;
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_create_pd {
+    struct pvrdma_cmd_hdr hdr;
+    u32 ctx_handle;
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_create_pd_resp {
+    struct pvrdma_cmd_resp_hdr hdr;
+    u32 pd_handle;
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_destroy_pd {
+    struct pvrdma_cmd_hdr hdr;
+    u32 pd_handle;
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_create_mr {
+    struct pvrdma_cmd_hdr hdr;
+    u64 start;
+    u64 length;
+    u64 pdir_dma;
+    u32 pd_handle;
+    u32 access_flags;
+    u32 flags;
+    u32 nchunks;
+};
+
+struct pvrdma_cmd_create_mr_resp {
+    struct pvrdma_cmd_resp_hdr hdr;
+    u32 mr_handle;
+    u32 lkey;
+    u32 rkey;
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_destroy_mr {
+    struct pvrdma_cmd_hdr hdr;
+    u32 mr_handle;
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_create_cq {
+    struct pvrdma_cmd_hdr hdr;
+    u64 pdir_dma;
+    u32 ctx_handle;
+    u32 cqe;
+    u32 nchunks;
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_create_cq_resp {
+    struct pvrdma_cmd_resp_hdr hdr;
+    u32 cq_handle;
+    u32 cqe;
+};
+
+struct pvrdma_cmd_resize_cq {
+    struct pvrdma_cmd_hdr hdr;
+    u32 cq_handle;
+    u32 cqe;
+};
+
+struct pvrdma_cmd_resize_cq_resp {
+    struct pvrdma_cmd_resp_hdr hdr;
+    u32 cqe;
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_destroy_cq {
+    struct pvrdma_cmd_hdr hdr;
+    u32 cq_handle;
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_create_qp {
+    struct pvrdma_cmd_hdr hdr;
+    u64 pdir_dma;
+    u32 pd_handle;
+    u32 send_cq_handle;
+    u32 recv_cq_handle;
+    u32 srq_handle;
+    u32 max_send_wr;
+    u32 max_recv_wr;
+    u32 max_send_sge;
+    u32 max_recv_sge;
+    u32 max_inline_data;
+    u32 lkey;
+    u32 access_flags;
+    u16 total_chunks;
+    u16 send_chunks;
+    u16 max_atomic_arg;
+    u8 sq_sig_all;
+    u8 qp_type;
+    u8 is_srq;
+    u8 reserved[3];
+};
+
+struct pvrdma_cmd_create_qp_resp {
+    struct pvrdma_cmd_resp_hdr hdr;
+    u32 qpn;
+    u32 max_send_wr;
+    u32 max_recv_wr;
+    u32 max_send_sge;
+    u32 max_recv_sge;
+    u32 max_inline_data;
+};
+
+struct pvrdma_cmd_modify_qp {
+    struct pvrdma_cmd_hdr hdr;
+    u32 qp_handle;
+    u32 attr_mask;
+    struct pvrdma_qp_attr attrs;
+};
+
+struct pvrdma_cmd_query_qp {
+    struct pvrdma_cmd_hdr hdr;
+    u32 qp_handle;
+    u32 attr_mask;
+};
+
+struct pvrdma_cmd_query_qp_resp {
+    struct pvrdma_cmd_resp_hdr hdr;
+    struct pvrdma_qp_attr attrs;
+};
+
+struct pvrdma_cmd_destroy_qp {
+    struct pvrdma_cmd_hdr hdr;
+    u32 qp_handle;
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_destroy_qp_resp {
+    struct pvrdma_cmd_resp_hdr hdr;
+    u32 events_reported;
+    u8 reserved[4];
+};
+
+struct pvrdma_cmd_create_bind {
+    struct pvrdma_cmd_hdr hdr;
+    u32 mtu;
+    u32 vlan;
+    u32 index;
+    u8 new_gid[16];
+    u8 gid_type;
+    u8 reserved[3];
+};
+
+struct pvrdma_cmd_destroy_bind {
+    struct pvrdma_cmd_hdr hdr;
+    u32 index;
+    u8 dest_gid[16];
+    u8 reserved[4];
+};
+
+union pvrdma_cmd_req {
+    struct pvrdma_cmd_hdr hdr;
+    struct pvrdma_cmd_query_port query_port;
+    struct pvrdma_cmd_query_pkey query_pkey;
+    struct pvrdma_cmd_create_uc create_uc;
+    struct pvrdma_cmd_destroy_uc destroy_uc;
+    struct pvrdma_cmd_create_pd create_pd;
+    struct pvrdma_cmd_destroy_pd destroy_pd;
+    struct pvrdma_cmd_create_mr create_mr;
+    struct pvrdma_cmd_destroy_mr destroy_mr;
+    struct pvrdma_cmd_create_cq create_cq;
+    struct pvrdma_cmd_resize_cq resize_cq;
+    struct pvrdma_cmd_destroy_cq destroy_cq;
+    struct pvrdma_cmd_create_qp create_qp;
+    struct pvrdma_cmd_modify_qp modify_qp;
+    struct pvrdma_cmd_query_qp query_qp;
+    struct pvrdma_cmd_destroy_qp destroy_qp;
+    struct pvrdma_cmd_create_bind create_bind;
+    struct pvrdma_cmd_destroy_bind destroy_bind;
+};
+
+union pvrdma_cmd_resp {
+    struct pvrdma_cmd_resp_hdr hdr;
+    struct pvrdma_cmd_query_port_resp query_port_resp;
+    struct pvrdma_cmd_query_pkey_resp query_pkey_resp;
+    struct pvrdma_cmd_create_uc_resp create_uc_resp;
+    struct pvrdma_cmd_create_pd_resp create_pd_resp;
+    struct pvrdma_cmd_create_mr_resp create_mr_resp;
+    struct pvrdma_cmd_create_cq_resp create_cq_resp;
+    struct pvrdma_cmd_resize_cq_resp resize_cq_resp;
+    struct pvrdma_cmd_create_qp_resp create_qp_resp;
+    struct pvrdma_cmd_query_qp_resp query_qp_resp;
+    struct pvrdma_cmd_destroy_qp_resp destroy_qp_resp;
+};
+
+#endif /* PVRDMA_DEV_API_H */
diff --git a/hw/net/pvrdma/pvrdma_ib_verbs.h b/hw/net/pvrdma/pvrdma_ib_verbs.h
new file mode 100644
index 0000000..e2a23f3
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_ib_verbs.h
@@ -0,0 +1,469 @@
+/*
+ * [PLEASE NOTE:  VMWARE, INC. ELECTS TO USE AND DISTRIBUTE THIS COMPONENT
+ * UNDER THE TERMS OF THE OpenIB.org BSD license.  THE ORIGINAL LICENSE TERMS
+ * ARE REPRODUCED BELOW ONLY AS A REFERENCE.]
+ *
+ * Copyright (c) 2004 Mellanox Technologies Ltd.  All rights reserved.
+ * Copyright (c) 2004 Infinicon Corporation.  All rights reserved.
+ * Copyright (c) 2004 Intel Corporation.  All rights reserved.
+ * Copyright (c) 2004 Topspin Corporation.  All rights reserved.
+ * Copyright (c) 2004 Voltaire Corporation.  All rights reserved.
+ * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
+ * Copyright (c) 2005, 2006, 2007 Cisco Systems.  All rights reserved.
+ * Copyright (c) 2015-2016 VMware, Inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef PVRDMA_IB_VERBS_H
+#define PVRDMA_IB_VERBS_H
+
+#include <linux/types.h>
+
+union pvrdma_gid {
+    u8    raw[16];
+    struct {
+        __be64    subnet_prefix;
+        __be64    interface_id;
+    } global;
+};
+
+enum pvrdma_link_layer {
+    PVRDMA_LINK_LAYER_UNSPECIFIED,
+    PVRDMA_LINK_LAYER_INFINIBAND,
+    PVRDMA_LINK_LAYER_ETHERNET,
+};
+
+enum pvrdma_mtu {
+    PVRDMA_MTU_256  = 1,
+    PVRDMA_MTU_512  = 2,
+    PVRDMA_MTU_1024 = 3,
+    PVRDMA_MTU_2048 = 4,
+    PVRDMA_MTU_4096 = 5,
+};
+
+static inline int pvrdma_mtu_enum_to_int(enum pvrdma_mtu mtu)
+{
+    switch (mtu) {
+    case PVRDMA_MTU_256:    return  256;
+    case PVRDMA_MTU_512:    return  512;
+    case PVRDMA_MTU_1024:    return 1024;
+    case PVRDMA_MTU_2048:    return 2048;
+    case PVRDMA_MTU_4096:    return 4096;
+    default:        return   -1;
+    }
+}
+
+static inline enum pvrdma_mtu pvrdma_mtu_int_to_enum(int mtu)
+{
+    switch (mtu) {
+    case 256:    return PVRDMA_MTU_256;
+    case 512:    return PVRDMA_MTU_512;
+    case 1024:    return PVRDMA_MTU_1024;
+    case 2048:    return PVRDMA_MTU_2048;
+    case 4096:
+    default:    return PVRDMA_MTU_4096;
+    }
+}
+
+enum pvrdma_port_state {
+    PVRDMA_PORT_NOP            = 0,
+    PVRDMA_PORT_DOWN        = 1,
+    PVRDMA_PORT_INIT        = 2,
+    PVRDMA_PORT_ARMED        = 3,
+    PVRDMA_PORT_ACTIVE        = 4,
+    PVRDMA_PORT_ACTIVE_DEFER    = 5,
+};
+
+enum pvrdma_port_cap_flags {
+    PVRDMA_PORT_SM                = 1 <<  1,
+    PVRDMA_PORT_NOTICE_SUP            = 1 <<  2,
+    PVRDMA_PORT_TRAP_SUP            = 1 <<  3,
+    PVRDMA_PORT_OPT_IPD_SUP            = 1 <<  4,
+    PVRDMA_PORT_AUTO_MIGR_SUP        = 1 <<  5,
+    PVRDMA_PORT_SL_MAP_SUP            = 1 <<  6,
+    PVRDMA_PORT_MKEY_NVRAM            = 1 <<  7,
+    PVRDMA_PORT_PKEY_NVRAM            = 1 <<  8,
+    PVRDMA_PORT_LED_INFO_SUP        = 1 <<  9,
+    PVRDMA_PORT_SM_DISABLED            = 1 << 10,
+    PVRDMA_PORT_SYS_IMAGE_GUID_SUP        = 1 << 11,
+    PVRDMA_PORT_PKEY_SW_EXT_PORT_TRAP_SUP    = 1 << 12,
+    PVRDMA_PORT_EXTENDED_SPEEDS_SUP        = 1 << 14,
+    PVRDMA_PORT_CM_SUP            = 1 << 16,
+    PVRDMA_PORT_SNMP_TUNNEL_SUP        = 1 << 17,
+    PVRDMA_PORT_REINIT_SUP            = 1 << 18,
+    PVRDMA_PORT_DEVICE_MGMT_SUP        = 1 << 19,
+    PVRDMA_PORT_VENDOR_CLASS_SUP        = 1 << 20,
+    PVRDMA_PORT_DR_NOTICE_SUP        = 1 << 21,
+    PVRDMA_PORT_CAP_MASK_NOTICE_SUP        = 1 << 22,
+    PVRDMA_PORT_BOOT_MGMT_SUP        = 1 << 23,
+    PVRDMA_PORT_LINK_LATENCY_SUP        = 1 << 24,
+    PVRDMA_PORT_CLIENT_REG_SUP        = 1 << 25,
+    PVRDMA_PORT_IP_BASED_GIDS        = 1 << 26,
+    PVRDMA_PORT_CAP_FLAGS_MAX        = PVRDMA_PORT_IP_BASED_GIDS,
+};
+
+enum pvrdma_port_width {
+    PVRDMA_WIDTH_1X        = 1,
+    PVRDMA_WIDTH_4X        = 2,
+    PVRDMA_WIDTH_8X        = 4,
+    PVRDMA_WIDTH_12X    = 8,
+};
+
+static inline int pvrdma_width_enum_to_int(enum pvrdma_port_width width)
+{
+    switch (width) {
+    case PVRDMA_WIDTH_1X:    return  1;
+    case PVRDMA_WIDTH_4X:    return  4;
+    case PVRDMA_WIDTH_8X:    return  8;
+    case PVRDMA_WIDTH_12X:    return 12;
+    default:        return -1;
+    }
+}
+
+enum pvrdma_port_speed {
+    PVRDMA_SPEED_SDR    = 1,
+    PVRDMA_SPEED_DDR    = 2,
+    PVRDMA_SPEED_QDR    = 4,
+    PVRDMA_SPEED_FDR10    = 8,
+    PVRDMA_SPEED_FDR    = 16,
+    PVRDMA_SPEED_EDR    = 32,
+};
+
+struct pvrdma_port_attr {
+    enum pvrdma_port_state    state;
+    enum pvrdma_mtu        max_mtu;
+    enum pvrdma_mtu        active_mtu;
+    u32            gid_tbl_len;
+    u32            port_cap_flags;
+    u32            max_msg_sz;
+    u32            bad_pkey_cntr;
+    u32            qkey_viol_cntr;
+    u16            pkey_tbl_len;
+    u16            lid;
+    u16            sm_lid;
+    u8            lmc;
+    u8            max_vl_num;
+    u8            sm_sl;
+    u8            subnet_timeout;
+    u8            init_type_reply;
+    u8            active_width;
+    u8            active_speed;
+    u8            phys_state;
+    u8            reserved[2];
+};
+
+struct pvrdma_global_route {
+    union pvrdma_gid    dgid;
+    u32            flow_label;
+    u8            sgid_index;
+    u8            hop_limit;
+    u8            traffic_class;
+    u8            reserved;
+};
+
+struct pvrdma_grh {
+    __be32            version_tclass_flow;
+    __be16            paylen;
+    u8            next_hdr;
+    u8            hop_limit;
+    union pvrdma_gid    sgid;
+    union pvrdma_gid    dgid;
+};
+
+enum pvrdma_ah_flags {
+    PVRDMA_AH_GRH = 1,
+};
+
+enum pvrdma_rate {
+    PVRDMA_RATE_PORT_CURRENT    = 0,
+    PVRDMA_RATE_2_5_GBPS        = 2,
+    PVRDMA_RATE_5_GBPS        = 5,
+    PVRDMA_RATE_10_GBPS        = 3,
+    PVRDMA_RATE_20_GBPS        = 6,
+    PVRDMA_RATE_30_GBPS        = 4,
+    PVRDMA_RATE_40_GBPS        = 7,
+    PVRDMA_RATE_60_GBPS        = 8,
+    PVRDMA_RATE_80_GBPS        = 9,
+    PVRDMA_RATE_120_GBPS        = 10,
+    PVRDMA_RATE_14_GBPS        = 11,
+    PVRDMA_RATE_56_GBPS        = 12,
+    PVRDMA_RATE_112_GBPS        = 13,
+    PVRDMA_RATE_168_GBPS        = 14,
+    PVRDMA_RATE_25_GBPS        = 15,
+    PVRDMA_RATE_100_GBPS        = 16,
+    PVRDMA_RATE_200_GBPS        = 17,
+    PVRDMA_RATE_300_GBPS        = 18,
+};
+
+struct pvrdma_ah_attr {
+    struct pvrdma_global_route    grh;
+    u16                dlid;
+    u16                vlan_id;
+    u8                sl;
+    u8                src_path_bits;
+    u8                static_rate;
+    u8                ah_flags;
+    u8                port_num;
+    u8                dmac[6];
+    u8                reserved;
+};
+
+enum pvrdma_wc_status {
+    PVRDMA_WC_SUCCESS,
+    PVRDMA_WC_LOC_LEN_ERR,
+    PVRDMA_WC_LOC_QP_OP_ERR,
+    PVRDMA_WC_LOC_EEC_OP_ERR,
+    PVRDMA_WC_LOC_PROT_ERR,
+    PVRDMA_WC_WR_FLUSH_ERR,
+    PVRDMA_WC_MW_BIND_ERR,
+    PVRDMA_WC_BAD_RESP_ERR,
+    PVRDMA_WC_LOC_ACCESS_ERR,
+    PVRDMA_WC_REM_INV_REQ_ERR,
+    PVRDMA_WC_REM_ACCESS_ERR,
+    PVRDMA_WC_REM_OP_ERR,
+    PVRDMA_WC_RETRY_EXC_ERR,
+    PVRDMA_WC_RNR_RETRY_EXC_ERR,
+    PVRDMA_WC_LOC_RDD_VIOL_ERR,
+    PVRDMA_WC_REM_INV_RD_REQ_ERR,
+    PVRDMA_WC_REM_ABORT_ERR,
+    PVRDMA_WC_INV_EECN_ERR,
+    PVRDMA_WC_INV_EEC_STATE_ERR,
+    PVRDMA_WC_FATAL_ERR,
+    PVRDMA_WC_RESP_TIMEOUT_ERR,
+    PVRDMA_WC_GENERAL_ERR,
+};
+
+enum pvrdma_wc_opcode {
+    PVRDMA_WC_SEND,
+    PVRDMA_WC_RDMA_WRITE,
+    PVRDMA_WC_RDMA_READ,
+    PVRDMA_WC_COMP_SWAP,
+    PVRDMA_WC_FETCH_ADD,
+    PVRDMA_WC_BIND_MW,
+    PVRDMA_WC_LSO,
+    PVRDMA_WC_LOCAL_INV,
+    PVRDMA_WC_FAST_REG_MR,
+    PVRDMA_WC_MASKED_COMP_SWAP,
+    PVRDMA_WC_MASKED_FETCH_ADD,
+    PVRDMA_WC_RECV = 1 << 7,
+    PVRDMA_WC_RECV_RDMA_WITH_IMM,
+};
+
+enum pvrdma_wc_flags {
+    PVRDMA_WC_GRH            = 1 << 0,
+    PVRDMA_WC_WITH_IMM        = 1 << 1,
+    PVRDMA_WC_WITH_INVALIDATE    = 1 << 2,
+    PVRDMA_WC_IP_CSUM_OK        = 1 << 3,
+    PVRDMA_WC_WITH_SMAC        = 1 << 4,
+    PVRDMA_WC_WITH_VLAN        = 1 << 5,
+    PVRDMA_WC_FLAGS_MAX        = PVRDMA_WC_WITH_VLAN,
+};
+
+enum pvrdma_cq_notify_flags {
+    PVRDMA_CQ_SOLICITED        = 1 << 0,
+    PVRDMA_CQ_NEXT_COMP        = 1 << 1,
+    PVRDMA_CQ_SOLICITED_MASK    = PVRDMA_CQ_SOLICITED |
+                      PVRDMA_CQ_NEXT_COMP,
+    PVRDMA_CQ_REPORT_MISSED_EVENTS    = 1 << 2,
+};
+
+struct pvrdma_qp_cap {
+    u32    max_send_wr;
+    u32    max_recv_wr;
+    u32    max_send_sge;
+    u32    max_recv_sge;
+    u32    max_inline_data;
+    u32    reserved;
+};
+
+enum pvrdma_sig_type {
+    PVRDMA_SIGNAL_ALL_WR,
+    PVRDMA_SIGNAL_REQ_WR,
+};
+
+enum pvrdma_qp_type {
+    PVRDMA_QPT_SMI,
+    PVRDMA_QPT_GSI,
+    PVRDMA_QPT_RC,
+    PVRDMA_QPT_UC,
+    PVRDMA_QPT_UD,
+    PVRDMA_QPT_RAW_IPV6,
+    PVRDMA_QPT_RAW_ETHERTYPE,
+    PVRDMA_QPT_RAW_PACKET = 8,
+    PVRDMA_QPT_XRC_INI = 9,
+    PVRDMA_QPT_XRC_TGT,
+    PVRDMA_QPT_MAX,
+};
+
+enum pvrdma_qp_create_flags {
+    PVRDMA_QP_CREATE_IPOPVRDMA_UD_LSO        = 1 << 0,
+    PVRDMA_QP_CREATE_BLOCK_MULTICAST_LOOPBACK    = 1 << 1,
+};
+
+enum pvrdma_qp_attr_mask {
+    PVRDMA_QP_STATE            = 1 << 0,
+    PVRDMA_QP_CUR_STATE        = 1 << 1,
+    PVRDMA_QP_EN_SQD_ASYNC_NOTIFY    = 1 << 2,
+    PVRDMA_QP_ACCESS_FLAGS        = 1 << 3,
+    PVRDMA_QP_PKEY_INDEX        = 1 << 4,
+    PVRDMA_QP_PORT            = 1 << 5,
+    PVRDMA_QP_QKEY            = 1 << 6,
+    PVRDMA_QP_AV            = 1 << 7,
+    PVRDMA_QP_PATH_MTU        = 1 << 8,
+    PVRDMA_QP_TIMEOUT        = 1 << 9,
+    PVRDMA_QP_RETRY_CNT        = 1 << 10,
+    PVRDMA_QP_RNR_RETRY        = 1 << 11,
+    PVRDMA_QP_RQ_PSN        = 1 << 12,
+    PVRDMA_QP_MAX_QP_RD_ATOMIC    = 1 << 13,
+    PVRDMA_QP_ALT_PATH        = 1 << 14,
+    PVRDMA_QP_MIN_RNR_TIMER        = 1 << 15,
+    PVRDMA_QP_SQ_PSN        = 1 << 16,
+    PVRDMA_QP_MAX_DEST_RD_ATOMIC    = 1 << 17,
+    PVRDMA_QP_PATH_MIG_STATE    = 1 << 18,
+    PVRDMA_QP_CAP            = 1 << 19,
+    PVRDMA_QP_DEST_QPN        = 1 << 20,
+    PVRDMA_QP_ATTR_MASK_MAX        = PVRDMA_QP_DEST_QPN,
+};
+
+enum pvrdma_qp_state {
+    PVRDMA_QPS_RESET,
+    PVRDMA_QPS_INIT,
+    PVRDMA_QPS_RTR,
+    PVRDMA_QPS_RTS,
+    PVRDMA_QPS_SQD,
+    PVRDMA_QPS_SQE,
+    PVRDMA_QPS_ERR,
+};
+
+enum pvrdma_mig_state {
+    PVRDMA_MIG_MIGRATED,
+    PVRDMA_MIG_REARM,
+    PVRDMA_MIG_ARMED,
+};
+
+enum pvrdma_mw_type {
+    PVRDMA_MW_TYPE_1 = 1,
+    PVRDMA_MW_TYPE_2 = 2,
+};
+
+struct pvrdma_qp_attr {
+    enum pvrdma_qp_state    qp_state;
+    enum pvrdma_qp_state    cur_qp_state;
+    enum pvrdma_mtu        path_mtu;
+    enum pvrdma_mig_state    path_mig_state;
+    u32            qkey;
+    u32            rq_psn;
+    u32            sq_psn;
+    u32            dest_qp_num;
+    u32            qp_access_flags;
+    u16            pkey_index;
+    u16            alt_pkey_index;
+    u8            en_sqd_async_notify;
+    u8            sq_draining;
+    u8            max_rd_atomic;
+    u8            max_dest_rd_atomic;
+    u8            min_rnr_timer;
+    u8            port_num;
+    u8            timeout;
+    u8            retry_cnt;
+    u8            rnr_retry;
+    u8            alt_port_num;
+    u8            alt_timeout;
+    u8            reserved[5];
+    struct pvrdma_qp_cap    cap;
+    struct pvrdma_ah_attr    ah_attr;
+    struct pvrdma_ah_attr    alt_ah_attr;
+};
+
+enum pvrdma_wr_opcode {
+    PVRDMA_WR_RDMA_WRITE,
+    PVRDMA_WR_RDMA_WRITE_WITH_IMM,
+    PVRDMA_WR_SEND,
+    PVRDMA_WR_SEND_WITH_IMM,
+    PVRDMA_WR_RDMA_READ,
+    PVRDMA_WR_ATOMIC_CMP_AND_SWP,
+    PVRDMA_WR_ATOMIC_FETCH_AND_ADD,
+    PVRDMA_WR_LSO,
+    PVRDMA_WR_SEND_WITH_INV,
+    PVRDMA_WR_RDMA_READ_WITH_INV,
+    PVRDMA_WR_LOCAL_INV,
+    PVRDMA_WR_FAST_REG_MR,
+    PVRDMA_WR_MASKED_ATOMIC_CMP_AND_SWP,
+    PVRDMA_WR_MASKED_ATOMIC_FETCH_AND_ADD,
+    PVRDMA_WR_BIND_MW,
+    PVRDMA_WR_REG_SIG_MR,
+};
+
+enum pvrdma_send_flags {
+    PVRDMA_SEND_FENCE    = 1 << 0,
+    PVRDMA_SEND_SIGNALED    = 1 << 1,
+    PVRDMA_SEND_SOLICITED    = 1 << 2,
+    PVRDMA_SEND_INLINE    = 1 << 3,
+    PVRDMA_SEND_IP_CSUM    = 1 << 4,
+    PVRDMA_SEND_FLAGS_MAX    = PVRDMA_SEND_IP_CSUM,
+};
+
+enum pvrdma_access_flags {
+    PVRDMA_ACCESS_LOCAL_WRITE    = 1 << 0,
+    PVRDMA_ACCESS_REMOTE_WRITE    = 1 << 1,
+    PVRDMA_ACCESS_REMOTE_READ    = 1 << 2,
+    PVRDMA_ACCESS_REMOTE_ATOMIC    = 1 << 3,
+    PVRDMA_ACCESS_MW_BIND        = 1 << 4,
+    PVRDMA_ZERO_BASED        = 1 << 5,
+    PVRDMA_ACCESS_ON_DEMAND        = 1 << 6,
+    PVRDMA_ACCESS_FLAGS_MAX        = PVRDMA_ACCESS_ON_DEMAND,
+};
+
+enum ib_wc_status {
+    IB_WC_SUCCESS,
+    IB_WC_LOC_LEN_ERR,
+    IB_WC_LOC_QP_OP_ERR,
+    IB_WC_LOC_EEC_OP_ERR,
+    IB_WC_LOC_PROT_ERR,
+    IB_WC_WR_FLUSH_ERR,
+    IB_WC_MW_BIND_ERR,
+    IB_WC_BAD_RESP_ERR,
+    IB_WC_LOC_ACCESS_ERR,
+    IB_WC_REM_INV_REQ_ERR,
+    IB_WC_REM_ACCESS_ERR,
+    IB_WC_REM_OP_ERR,
+    IB_WC_RETRY_EXC_ERR,
+    IB_WC_RNR_RETRY_EXC_ERR,
+    IB_WC_LOC_RDD_VIOL_ERR,
+    IB_WC_REM_INV_RD_REQ_ERR,
+    IB_WC_REM_ABORT_ERR,
+    IB_WC_INV_EECN_ERR,
+    IB_WC_INV_EEC_STATE_ERR,
+    IB_WC_FATAL_ERR,
+    IB_WC_RESP_TIMEOUT_ERR,
+    IB_WC_GENERAL_ERR
+};
+
+#endif /* PVRDMA_IB_VERBS_H */
diff --git a/hw/net/pvrdma/pvrdma_kdbr.c b/hw/net/pvrdma/pvrdma_kdbr.c
new file mode 100644
index 0000000..ec04afd
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_kdbr.c
@@ -0,0 +1,395 @@
+#include <qemu/osdep.h>
+#include <hw/pci/pci.h>
+
+#include <sys/ioctl.h>
+
+#include <hw/net/pvrdma/pvrdma.h>
+#include <hw/net/pvrdma/pvrdma_ib_verbs.h>
+#include <hw/net/pvrdma/pvrdma_rm.h>
+#include <hw/net/pvrdma/pvrdma_kdbr.h>
+#include <hw/net/pvrdma/pvrdma_utils.h>
+#include <hw/net/pvrdma/kdbr.h>
+
+int kdbr_fd = -1;
+
+#define MAX_CONSEQ_CQES_READ 10
+
+typedef struct KdbrCtx {
+    struct kdbr_req req;
+    void *up_ctx;
+    bool is_tx_req;
+} KdbrCtx;
+
+static void (*tx_comp_handler)(int status, unsigned int vendor_err,
+                               void *ctx) = 0;
+static void (*rx_comp_handler)(int status, unsigned int vendor_err,
+                               void *ctx) = 0;
+
+static void kdbr_err_to_pvrdma_err(int kdbr_status, unsigned int *status,
+                                   unsigned int *vendor_err)
+{
+    if (kdbr_status == 0) {
+        *status = IB_WC_SUCCESS;
+        *vendor_err = 0;
+        return;
+    }
+
+    *vendor_err = kdbr_status;
+    switch (kdbr_status) {
+    case KDBR_ERR_CODE_EMPTY_VEC:
+        *status = IB_WC_LOC_LEN_ERR;
+        break;
+    case KDBR_ERR_CODE_NO_MORE_RECV_BUF:
+        *status = IB_WC_REM_OP_ERR;
+        break;
+    case KDBR_ERR_CODE_RECV_BUF_PROT:
+        *status = IB_WC_REM_ACCESS_ERR;
+        break;
+    case KDBR_ERR_CODE_INV_ADDR:
+        *status = IB_WC_LOC_ACCESS_ERR;
+        break;
+    case KDBR_ERR_CODE_INV_CONN_ID:
+        *status = IB_WC_LOC_PROT_ERR;
+        break;
+    case KDBR_ERR_CODE_NO_PEER:
+        *status = IB_WC_LOC_QP_OP_ERR;
+        break;
+    default:
+        *status = IB_WC_GENERAL_ERR;
+        break;
+    }
+}
+
+static void *comp_handler_thread(void *arg)
+{
+    KdbrPort *port = (KdbrPort *)arg;
+    struct kdbr_completion comp[MAX_CONSEQ_CQES_READ];
+    int i, j, rc;
+    KdbrCtx *sctx;
+    unsigned int status, vendor_err;
+
+    while (port->comp_thread.run) {
+        rc = read(port->fd, &comp, sizeof(comp));
+        if (unlikely(rc % sizeof(struct kdbr_completion))) {
+            pr_err("Got unsupported message size (%d) from kdbr\n", rc);
+            continue;
+        }
+        pr_dbg("Processing %ld CQEs from kdbr\n",
+               rc / sizeof(struct kdbr_completion));
+
+        for (i = 0; i < rc / sizeof(struct kdbr_completion); i++) {
+            pr_dbg("comp.req_id=%ld\n", comp[i].req_id);
+            pr_dbg("comp.status=%d\n", comp[i].status);
+
+            sctx = rm_get_wqe_ctx(PVRDMA_DEV(port->dev), comp[i].req_id);
+            if (!sctx) {
+                pr_err("Fail to find ctx for req %ld\n", comp[i].req_id);
+                continue;
+            }
+            pr_dbg("Processing %s CQE\n", sctx->is_tx_req ? "send" : "recv");
+
+            for (j = 0; j < sctx->req.vlen; j++) {
+                pr_dbg("payload=%s\n", (char *)sctx->req.vec[j].iov_base);
+                pvrdma_pci_dma_unmap(port->dev, sctx->req.vec[j].iov_base,
+                                     sctx->req.vec[j].iov_len);
+            }
+
+            kdbr_err_to_pvrdma_err(comp[i].status, &status, &vendor_err);
+            pr_dbg("status=%d\n", status);
+            pr_dbg("vendor_err=0x%x\n", vendor_err);
+
+            if (sctx->is_tx_req) {
+                tx_comp_handler(status, vendor_err, sctx->up_ctx);
+            } else {
+                rx_comp_handler(status, vendor_err, sctx->up_ctx);
+            }
+
+            rm_dealloc_wqe_ctx(PVRDMA_DEV(port->dev), comp[i].req_id);
+            free(sctx);
+        }
+    }
+
+    pr_dbg("Going down\n");
+
+    return NULL;
+}
+
+KdbrPort *kdbr_alloc_port(PVRDMADev *dev)
+{
+    int rc;
+    KdbrPort *port;
+    char name[80] = {0};
+    struct kdbr_reg reg;
+
+    port = malloc(sizeof(KdbrPort));
+    if (!port) {
+        pr_dbg("Fail to allocate memory for port object\n");
+        return NULL;
+    }
+
+    port->dev = PCI_DEVICE(dev);
+
+    pr_dbg("net=0x%llx\n", dev->ports[0].gid_tbl[0].global.subnet_prefix);
+    pr_dbg("guid=0x%llx\n", dev->ports[0].gid_tbl[0].global.interface_id);
+    reg.gid.net_id = dev->ports[0].gid_tbl[0].global.subnet_prefix;
+    reg.gid.id = dev->ports[0].gid_tbl[0].global.interface_id;
+    rc = ioctl(kdbr_fd, KDBR_REGISTER_PORT, &reg);
+    if (rc < 0) {
+        pr_err("Fail to allocate port\n");
+        goto err_free_port;
+    }
+
+    port->num = reg.port;
+
+    sprintf(name, KDBR_FILE_NAME "%d", port->num);
+    port->fd = open(name, O_RDWR);
+    if (port->fd < 0) {
+        pr_err("Fail to open file %s\n", name);
+        goto err_unregister_device;
+    }
+
+    sprintf(name, "pvrdma_comp_%d", port->num);
+    port->comp_thread.run = true;
+    qemu_thread_create(&port->comp_thread.thread, name, comp_handler_thread,
+                       port, QEMU_THREAD_DETACHED);
+
+    pr_info("Port %d (fd %d) allocated\n", port->num, port->fd);
+
+    return port;
+
+err_unregister_device:
+    ioctl(kdbr_fd, KDBR_UNREGISTER_PORT, &port->num);
+
+err_free_port:
+    free(port);
+
+    return NULL;
+}
+
+void kdbr_free_port(KdbrPort *port)
+{
+    int rc;
+
+    if (!port) {
+        return;
+    }
+
+    rc = write(port->fd, (char *)0, 1);
+    port->comp_thread.run = false;
+    close(port->fd);
+
+    rc = ioctl(kdbr_fd, KDBR_UNREGISTER_PORT, &port->num);
+    if (rc < 0) {
+        pr_err("Fail to allocate port\n");
+    }
+
+    free(port);
+}
+
+unsigned long kdbr_open_connection(KdbrPort *port, u32 qpn,
+                                   union pvrdma_gid dgid, u32 dqpn, bool rc_qp)
+{
+    int rc;
+    struct kdbr_connection connection = {0};
+
+    connection.queue_id = qpn;
+    connection.peer.rgid.net_id = dgid.global.subnet_prefix;
+    connection.peer.rgid.id = dgid.global.interface_id;
+    connection.peer.rqueue = dqpn;
+    connection.ack_type = rc_qp ? KDBR_ACK_DELAYED : KDBR_ACK_IMMEDIATE;
+
+    rc = ioctl(port->fd, KDBR_PORT_OPEN_CONN, &connection);
+    if (rc <= 0) {
+        pr_err("Fail to open kdbr connection on port %d fd %d err %d\n",
+               port->num, port->fd, rc);
+        return 0;
+    }
+
+    return (unsigned long)rc;
+}
+
+void kdbr_close_connection(KdbrPort *port, unsigned long connection_id)
+{
+    int rc;
+
+    rc = ioctl(port->fd, KDBR_PORT_CLOSE_CONN, &connection_id);
+    if (rc < 0) {
+        pr_err("Fail to close kdbr connection on port %d\n",
+               port->num);
+    }
+}
+
+void kdbr_register_tx_comp_handler(void (*comp_handler)(int status,
+                                   unsigned int vendor_err, void *ctx))
+{
+    tx_comp_handler = comp_handler;
+}
+
+void kdbr_register_rx_comp_handler(void (*comp_handler)(int status,
+                                   unsigned int vendor_err, void *ctx))
+{
+    rx_comp_handler = comp_handler;
+}
+
+void kdbr_send_wqe(KdbrPort *port, unsigned long connection_id, bool rc_qp,
+                   struct RmSqWqe *wqe, void *ctx)
+{
+    KdbrCtx *sctx;
+    int rc;
+    int i;
+
+    pr_dbg("kdbr_port=%d\n", port->num);
+    pr_dbg("kdbr_connection_id=%ld\n", connection_id);
+    pr_dbg("wqe->hdr.num_sge=%d\n", wqe->hdr.num_sge);
+
+    /* Last minute validation - verify that kdbr supports num_sge */
+    /* TODO: Make sure this will not happen! */
+    if (wqe->hdr.num_sge > KDBR_MAX_IOVEC_LEN) {
+        pr_err("Error: requested %d SGEs where kdbr supports %d\n",
+               wqe->hdr.num_sge, KDBR_MAX_IOVEC_LEN);
+        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_TOO_MANY_SGES, ctx);
+        return;
+    }
+
+    sctx = malloc(sizeof(*sctx));
+    if (!sctx) {
+        pr_err("Fail to allocate kdbr request ctx\n");
+        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+    }
+
+    memset(&sctx->req, 0, sizeof(sctx->req));
+    sctx->req.flags = KDBR_REQ_SIGNATURE | KDBR_REQ_POST_SEND;
+    sctx->req.connection_id = connection_id;
+
+    sctx->up_ctx = ctx;
+    sctx->is_tx_req = 1;
+
+    rc = rm_alloc_wqe_ctx(PVRDMA_DEV(port->dev), &sctx->req.req_id, sctx);
+    if (rc != 0) {
+        pr_err("Fail to allocate request ID\n");
+        free(sctx);
+        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+        return;
+    }
+    sctx->req.vlen = wqe->hdr.num_sge;
+
+    for (i = 0; i < wqe->hdr.num_sge; i++) {
+        struct pvrdma_sge *sge;
+
+        sge = &wqe->sge[i];
+
+        pr_dbg("addr=0x%llx\n", sge->addr);
+        pr_dbg("length=%d\n", sge->length);
+        pr_dbg("lkey=0x%x\n", sge->lkey);
+
+        sctx->req.vec[i].iov_base = pvrdma_pci_dma_map(port->dev, sge->addr,
+                                                       sge->length);
+        sctx->req.vec[i].iov_len = sge->length;
+    }
+
+    if (!rc_qp) {
+        sctx->req.peer.rqueue = wqe->hdr.wr.ud.remote_qpn;
+        sctx->req.peer.rgid.net_id = *((unsigned long *)
+                        &wqe->hdr.wr.ud.av.dgid[0]);
+        sctx->req.peer.rgid.id = *((unsigned long *)
+                        &wqe->hdr.wr.ud.av.dgid[8]);
+    }
+
+    rc = write(port->fd, &sctx->req, sizeof(sctx->req));
+    if (rc < 0) {
+        pr_err("Fail (%d, %d) to post send WQE to port %d, conn_id %ld\n", rc,
+               errno, port->num, connection_id);
+        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_FAIL_KDBR, ctx);
+        return;
+    }
+}
+
+void kdbr_recv_wqe(KdbrPort *port, unsigned long connection_id,
+                   struct RmRqWqe *wqe, void *ctx)
+{
+    KdbrCtx *sctx;
+    int rc;
+    int i;
+
+    pr_dbg("kdbr_port=%d\n", port->num);
+    pr_dbg("kdbr_connection_id=%ld\n", connection_id);
+    pr_dbg("wqe->hdr.num_sge=%d\n", wqe->hdr.num_sge);
+
+    /* Last minute validation - verify that kdbr supports num_sge */
+    if (wqe->hdr.num_sge > KDBR_MAX_IOVEC_LEN) {
+        pr_err("Error: requested %d SGEs where kdbr supports %d\n",
+               wqe->hdr.num_sge, KDBR_MAX_IOVEC_LEN);
+        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_TOO_MANY_SGES, ctx);
+        return;
+    }
+
+    sctx = malloc(sizeof(*sctx));
+    if (!sctx) {
+        pr_err("Fail to allocate kdbr request ctx\n");
+        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+    }
+
+    memset(&sctx->req, 0, sizeof(sctx->req));
+    sctx->req.flags = KDBR_REQ_SIGNATURE | KDBR_REQ_POST_RECV;
+    sctx->req.connection_id = connection_id;
+
+    sctx->up_ctx = ctx;
+    sctx->is_tx_req = 0;
+
+    pr_dbg("sctx=%p\n", sctx);
+    rc = rm_alloc_wqe_ctx(PVRDMA_DEV(port->dev), &sctx->req.req_id, sctx);
+    if (rc != 0) {
+        pr_err("Fail to allocate request ID\n");
+        free(sctx);
+        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+        return;
+    }
+
+    sctx->req.vlen = wqe->hdr.num_sge;
+
+    for (i = 0; i < wqe->hdr.num_sge; i++) {
+        struct pvrdma_sge *sge;
+
+        sge = &wqe->sge[i];
+
+        pr_dbg("addr=0x%llx\n", sge->addr);
+        pr_dbg("length=%d\n", sge->length);
+        pr_dbg("lkey=0x%x\n", sge->lkey);
+
+        sctx->req.vec[i].iov_base = pvrdma_pci_dma_map(port->dev, sge->addr,
+                                                       sge->length);
+        sctx->req.vec[i].iov_len = sge->length;
+    }
+
+    rc = write(port->fd, &sctx->req, sizeof(sctx->req));
+    if (rc < 0) {
+        pr_err("Fail (%d, %d) to post recv WQE to port %d, conn_id %ld\n", rc,
+               errno, port->num, connection_id);
+        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_FAIL_KDBR, ctx);
+        return;
+    }
+}
+
+static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
+{
+    pr_err("No completion handler is registered\n");
+}
+
+int kdbr_init(void)
+{
+    kdbr_register_tx_comp_handler(dummy_comp_handler);
+    kdbr_register_rx_comp_handler(dummy_comp_handler);
+
+    kdbr_fd = open(KDBR_FILE_NAME, 0);
+    if (kdbr_fd < 0) {
+        pr_dbg("Can't connect to kdbr, rc=%d\n", kdbr_fd);
+        return -EIO;
+    }
+
+    return 0;
+}
+
+void kdbr_fini(void)
+{
+    close(kdbr_fd);
+}
diff --git a/hw/net/pvrdma/pvrdma_kdbr.h b/hw/net/pvrdma/pvrdma_kdbr.h
new file mode 100644
index 0000000..293a180
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_kdbr.h
@@ -0,0 +1,53 @@
+/*
+ * QEMU VMWARE paravirtual RDMA QP Operations
+ *
+ * Developed by Oracle & Redhat
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PVRDMA_KDBR_H
+#define PVRDMA_KDBR_H
+
+#include <hw/net/pvrdma/pvrdma_types.h>
+#include <hw/net/pvrdma/pvrdma_ib_verbs.h>
+#include <hw/net/pvrdma/pvrdma_rm.h>
+#include <hw/net/pvrdma/kdbr.h>
+
+typedef struct KdbrCompThread {
+    QemuThread thread;
+    QemuMutex mutex;
+    bool run;
+} KdbrCompThread;
+
+typedef struct KdbrPort {
+    int num;
+    int fd;
+    KdbrCompThread comp_thread;
+    PCIDevice *dev;
+} KdbrPort;
+
+int kdbr_init(void);
+void kdbr_fini(void);
+KdbrPort *kdbr_alloc_port(PVRDMADev *dev);
+void kdbr_free_port(KdbrPort *port);
+void kdbr_register_tx_comp_handler(void (*comp_handler)(int status,
+                                   unsigned int vendor_err, void *ctx));
+void kdbr_register_rx_comp_handler(void (*comp_handler)(int status,
+                                   unsigned int vendor_err, void *ctx));
+unsigned long kdbr_open_connection(KdbrPort *port, u32 qpn,
+                                   union pvrdma_gid dgid, u32 dqpn,
+                                   bool rc_qp);
+void kdbr_close_connection(KdbrPort *port, unsigned long connection_id);
+void kdbr_send_wqe(KdbrPort *port, unsigned long connection_id, bool rc_qp,
+                   struct RmSqWqe *wqe, void *ctx);
+void kdbr_recv_wqe(KdbrPort *port, unsigned long connection_id,
+                   struct RmRqWqe *wqe, void *ctx);
+
+#endif
diff --git a/hw/net/pvrdma/pvrdma_main.c b/hw/net/pvrdma/pvrdma_main.c
new file mode 100644
index 0000000..5db802e
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_main.c
@@ -0,0 +1,667 @@
+#include <qemu/osdep.h>
+#include <hw/hw.h>
+#include <hw/pci/pci.h>
+#include <hw/pci/pci_ids.h>
+#include <hw/pci/msi.h>
+#include <hw/pci/msix.h>
+#include <hw/qdev-core.h>
+#include <hw/qdev-properties.h>
+#include <cpu.h>
+
+#include "hw/net/pvrdma/pvrdma.h"
+#include "hw/net/pvrdma/pvrdma_defs.h"
+#include "hw/net/pvrdma/pvrdma_utils.h"
+#include "hw/net/pvrdma/pvrdma_dev_api.h"
+#include "hw/net/pvrdma/pvrdma_rm.h"
+#include "hw/net/pvrdma/pvrdma_kdbr.h"
+#include "hw/net/pvrdma/pvrdma_qp_ops.h"
+
+static Property pvrdma_dev_properties[] = {
+    DEFINE_PROP_UINT64("sys-image-guid", PVRDMADev, sys_image_guid, 0),
+    DEFINE_PROP_UINT64("node-guid", PVRDMADev, node_guid, 0),
+    DEFINE_PROP_UINT64("network-prefix", PVRDMADev, network_prefix, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void free_dev_ring(PCIDevice *pci_dev, Ring *ring, void *ring_state)
+{
+    ring_free(ring);
+    pvrdma_pci_dma_unmap(pci_dev, ring_state, TARGET_PAGE_SIZE);
+}
+
+static int init_dev_ring(Ring *ring, struct pvrdma_ring **ring_state,
+                         const char *name, PCIDevice *pci_dev,
+                         dma_addr_t dir_addr, u32 num_pages)
+{
+    __u64 *dir, *tbl;
+    int rc = 0;
+
+    pr_dbg("Initializing device ring %s\n", name);
+    pr_dbg("pdir_dma=0x%llx\n", (long long unsigned int)dir_addr);
+    pr_dbg("num_pages=%d\n", num_pages);
+    dir = pvrdma_pci_dma_map(pci_dev, dir_addr, TARGET_PAGE_SIZE);
+    if (!dir) {
+        pr_err("Fail to map to page directory\n");
+        rc = -ENOMEM;
+        goto out;
+    }
+    tbl = pvrdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
+    if (!tbl) {
+        pr_err("Fail to map to page table\n");
+        rc = -ENOMEM;
+        goto out_free_dir;
+    }
+
+    *ring_state = pvrdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
+    if (!*ring_state) {
+        pr_err("Fail to map to ring state\n");
+        rc = -ENOMEM;
+        goto out_free_tbl;
+    }
+    /* RX ring is the second */
+    (struct pvrdma_ring *)(*ring_state)++;
+    rc = ring_init(ring, name, pci_dev, (struct pvrdma_ring *)*ring_state,
+                   (num_pages - 1) * TARGET_PAGE_SIZE /
+                   sizeof(struct pvrdma_cqne), sizeof(struct pvrdma_cqne),
+                   (dma_addr_t *)&tbl[1], (dma_addr_t)num_pages - 1);
+    if (rc != 0) {
+        pr_err("Fail to initialize ring\n");
+        rc = -ENOMEM;
+        goto out_free_ring_state;
+    }
+
+    goto out_free_tbl;
+
+out_free_ring_state:
+    pvrdma_pci_dma_unmap(pci_dev, *ring_state, TARGET_PAGE_SIZE);
+
+out_free_tbl:
+    pvrdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
+
+out_free_dir:
+    pvrdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
+
+out:
+    return rc;
+}
+
+static void free_dsr(PVRDMADev *dev)
+{
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+
+    if (!dev->dsr_info.dsr) {
+        return;
+    }
+
+    free_dev_ring(pci_dev, &dev->dsr_info.async,
+                  dev->dsr_info.async_ring_state);
+
+    free_dev_ring(pci_dev, &dev->dsr_info.cq, dev->dsr_info.cq_ring_state);
+
+    pvrdma_pci_dma_unmap(pci_dev, dev->dsr_info.req,
+                         sizeof(union pvrdma_cmd_req));
+
+    pvrdma_pci_dma_unmap(pci_dev, dev->dsr_info.rsp,
+                         sizeof(union pvrdma_cmd_resp));
+
+    pvrdma_pci_dma_unmap(pci_dev, dev->dsr_info.dsr,
+                         sizeof(struct pvrdma_device_shared_region));
+
+    dev->dsr_info.dsr = NULL;
+}
+
+static int load_dsr(PVRDMADev *dev)
+{
+    int rc = 0;
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+    DSRInfo *dsr_info;
+    struct pvrdma_device_shared_region *dsr;
+
+    free_dsr(dev);
+
+    /* Map to DSR */
+    pr_dbg("dsr_dma=0x%llx\n", (long long unsigned int)dev->dsr_info.dma);
+    dev->dsr_info.dsr = pvrdma_pci_dma_map(pci_dev, dev->dsr_info.dma,
+                                sizeof(struct pvrdma_device_shared_region));
+    if (!dev->dsr_info.dsr) {
+        pr_err("Fail to map to DSR\n");
+        rc = -ENOMEM;
+        goto out;
+    }
+
+    /* Shortcuts */
+    dsr_info = &dev->dsr_info;
+    dsr = dsr_info->dsr;
+
+    /* Map to command slot */
+    pr_dbg("cmd_dma=0x%llx\n", (long long unsigned int)dsr->cmd_slot_dma);
+    dsr_info->req = pvrdma_pci_dma_map(pci_dev, dsr->cmd_slot_dma,
+                                       sizeof(union pvrdma_cmd_req));
+    if (!dsr_info->req) {
+        pr_err("Fail to map to command slot address\n");
+        rc = -ENOMEM;
+        goto out_free_dsr;
+    }
+
+    /* Map to response slot */
+    pr_dbg("rsp_dma=0x%llx\n", (long long unsigned int)dsr->resp_slot_dma);
+    dsr_info->rsp = pvrdma_pci_dma_map(pci_dev, dsr->resp_slot_dma,
+                                       sizeof(union pvrdma_cmd_resp));
+    if (!dsr_info->rsp) {
+        pr_err("Fail to map to response slot address\n");
+        rc = -ENOMEM;
+        goto out_free_req;
+    }
+
+    /* Map to CQ notification ring */
+    rc = init_dev_ring(&dsr_info->cq, &dsr_info->cq_ring_state, "dev_cq",
+                       pci_dev, dsr->cq_ring_pages.pdir_dma,
+                       dsr->cq_ring_pages.num_pages);
+    if (rc != 0) {
+        pr_err("Fail to map to initialize CQ ring\n");
+        rc = -ENOMEM;
+        goto out_free_rsp;
+    }
+
+    /* Map to event notification ring */
+    rc = init_dev_ring(&dsr_info->async, &dsr_info->async_ring_state,
+                       "dev_async", pci_dev, dsr->async_ring_pages.pdir_dma,
+                       dsr->async_ring_pages.num_pages);
+    if (rc != 0) {
+        pr_err("Fail to map to initialize event ring\n");
+        rc = -ENOMEM;
+        goto out_free_rsp;
+    }
+
+    goto out;
+
+out_free_rsp:
+    pvrdma_pci_dma_unmap(pci_dev, dsr_info->rsp, sizeof(union pvrdma_cmd_resp));
+
+out_free_req:
+    pvrdma_pci_dma_unmap(pci_dev, dsr_info->req, sizeof(union pvrdma_cmd_req));
+
+out_free_dsr:
+    pvrdma_pci_dma_unmap(pci_dev, dsr_info->dsr,
+                         sizeof(struct pvrdma_device_shared_region));
+    dsr_info->dsr = NULL;
+
+out:
+    return rc;
+}
+
+static void init_dev_caps(PVRDMADev *dev)
+{
+    struct pvrdma_device_shared_region *dsr;
+
+    if (dev->dsr_info.dsr == NULL) {
+        pr_err("Can't initialized DSR\n");
+        return;
+    }
+
+    dsr = dev->dsr_info.dsr;
+
+    dsr->caps.fw_ver = PVRDMA_FW_VERSION;
+    pr_dbg("fw_ver=0x%lx\n", dsr->caps.fw_ver);
+
+    dsr->caps.mode = PVRDMA_DEVICE_MODE_ROCE;
+    pr_dbg("mode=%d\n", dsr->caps.mode);
+
+    dsr->caps.gid_types |= PVRDMA_GID_TYPE_FLAG_ROCE_V1;
+    pr_dbg("gid_types=0x%x\n", dsr->caps.gid_types);
+
+    dsr->caps.max_uar = RDMA_BAR2_UAR_SIZE;
+    pr_dbg("max_uar=%d\n", dsr->caps.max_uar);
+
+    if (rm_get_max_pds(&dsr->caps.max_pd)) {
+        return;
+    }
+    pr_dbg("max_pd=%d\n", dsr->caps.max_pd);
+
+    if (rm_get_max_gids(&dsr->caps.gid_tbl_len)) {
+        return;
+    }
+    pr_dbg("gid_tbl_len=%d\n", dsr->caps.gid_tbl_len);
+
+    if (rm_get_max_cqs(&dsr->caps.max_cq)) {
+        return;
+    }
+    pr_dbg("max_cq=%d\n", dsr->caps.max_cq);
+
+    if (rm_get_max_cqes(&dsr->caps.max_cqe)) {
+        return;
+    }
+    pr_dbg("max_cqe=%d\n", dsr->caps.max_cqe);
+
+    if (rm_get_max_qps(&dsr->caps.max_qp)) {
+        return;
+    }
+    pr_dbg("max_qp=%d\n", dsr->caps.max_qp);
+
+    dsr->caps.sys_image_guid = cpu_to_be64(dev->sys_image_guid);
+    pr_dbg("sys_image_guid=%llx\n",
+           (long long unsigned int)be64_to_cpu(dsr->caps.sys_image_guid));
+
+    dsr->caps.node_guid = cpu_to_be64(dev->node_guid);
+    pr_dbg("node_guid=%llx\n",
+           (long long unsigned int)be64_to_cpu(dsr->caps.node_guid));
+
+    if (rm_get_phys_port_cnt(&dsr->caps.phys_port_cnt)) {
+        return;
+    }
+    pr_dbg("phys_port_cnt=%d\n", dsr->caps.phys_port_cnt);
+
+    if (rm_get_max_qp_wrs(&dsr->caps.max_qp_wr)) {
+        return;
+    }
+    pr_dbg("max_qp_wr=%d\n", dsr->caps.max_qp_wr);
+
+    if (rm_get_max_sges(&dsr->caps.max_sge)) {
+        return;
+    }
+    pr_dbg("max_sge=%d\n", dsr->caps.max_sge);
+
+    if (rm_get_max_mrs(&dsr->caps.max_mr)) {
+        return;
+    }
+    pr_dbg("max_mr=%d\n", dsr->caps.max_mr);
+
+    if (rm_get_max_pkeys(&dsr->caps.max_pkeys)) {
+        return;
+    }
+    pr_dbg("max_pkeys=%d\n", dsr->caps.max_pkeys);
+
+    if (rm_get_max_ah(&dsr->caps.max_ah)) {
+        return;
+    }
+    pr_dbg("max_ah=%d\n", dsr->caps.max_ah);
+
+    pr_dbg("Initialized\n");
+}
+
+static void free_ports(PVRDMADev *dev)
+{
+    int i;
+
+    for (i = 0; i < MAX_PORTS; i++) {
+        free(dev->ports[i].gid_tbl);
+        kdbr_free_port(dev->ports[i].kdbr_port);
+    }
+}
+
+static int init_ports(PVRDMADev *dev)
+{
+    int i, ret = 0;
+    __u32 max_port_gids;
+    __u32 max_port_pkeys;
+
+    memset(dev->ports, 0, sizeof(dev->ports));
+
+    ret = rm_get_max_port_gids(&max_port_gids);
+    if (ret != 0) {
+        goto err;
+    }
+
+    ret = rm_get_max_port_pkeys(&max_port_pkeys);
+    if (ret != 0) {
+        goto err;
+    }
+
+    for (i = 0; i < MAX_PORTS; i++) {
+        dev->ports[i].state = PVRDMA_PORT_DOWN;
+
+        dev->ports[i].pkey_tbl = malloc(sizeof(*dev->ports[i].pkey_tbl) *
+                                        max_port_pkeys);
+        if (dev->ports[i].gid_tbl == NULL) {
+            goto err_free_ports;
+        }
+
+        memset(dev->ports[i].gid_tbl, 0, sizeof(dev->ports[i].gid_tbl));
+    }
+
+    return 0;
+
+err_free_ports:
+    free_ports(dev);
+
+err:
+    pr_err("Fail to initialize device's ports\n");
+
+    return ret;
+}
+
+static void activate_device(PVRDMADev *dev)
+{
+    set_reg_val(dev, PVRDMA_REG_ERR, 0);
+    pr_dbg("Device activated\n");
+}
+
+static int quiesce_device(PVRDMADev *dev)
+{
+    pr_dbg("Device quiesced\n");
+    return 0;
+}
+
+static int reset_device(PVRDMADev *dev)
+{
+    pr_dbg("Device reset complete\n");
+    return 0;
+}
+
+static uint64_t regs_read(void *opaque, hwaddr addr, unsigned size)
+{
+    PVRDMADev *dev = opaque;
+    __u32 val;
+
+    /* pr_dbg("addr=0x%lx, size=%d\n", addr, size); */
+
+    if (get_reg_val(dev, addr, &val)) {
+        pr_dbg("Error trying to read REG value from address 0x%x\n",
+               (__u32)addr);
+        return -EINVAL;
+    }
+
+    /* pr_dbg("regs[0x%x]=0x%x\n", (__u32)addr, val); */
+
+    return val;
+}
+
+static void regs_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
+{
+    PVRDMADev *dev = opaque;
+
+    /* pr_dbg("addr=0x%lx, val=0x%x, size=%d\n", addr, (uint32_t)val, size); */
+
+    if (set_reg_val(dev, addr, val)) {
+        pr_err("Error trying to set REG value, addr=0x%x, val=0x%lx\n",
+               (__u32)addr, val);
+        return;
+    }
+
+    /* pr_dbg("regs[0x%x]=0x%lx\n", (__u32)addr, val); */
+
+    switch (addr) {
+    case PVRDMA_REG_DSRLOW:
+        dev->dsr_info.dma = val;
+        break;
+    case PVRDMA_REG_DSRHIGH:
+        dev->dsr_info.dma |= val << 32;
+        load_dsr(dev);
+        init_dev_caps(dev);
+        break;
+    case PVRDMA_REG_CTL:
+        switch (val) {
+        case PVRDMA_DEVICE_CTL_ACTIVATE:
+            activate_device(dev);
+            break;
+        case PVRDMA_DEVICE_CTL_QUIESCE:
+            quiesce_device(dev);
+            break;
+        case PVRDMA_DEVICE_CTL_RESET:
+            reset_device(dev);
+            break;
+        }
+    case PVRDMA_REG_IMR:
+        pr_dbg("Interrupt mask=0x%lx\n", val);
+        dev->interrupt_mask = val;
+        break;
+    case PVRDMA_REG_REQUEST:
+        if (val == 0) {
+            execute_command(dev);
+        }
+    default:
+        break;
+    }
+}
+
+static const MemoryRegionOps regs_ops = {
+    .read = regs_read,
+    .write = regs_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .impl = {
+        .min_access_size = sizeof(uint32_t),
+        .max_access_size = sizeof(uint32_t),
+    },
+};
+
+static uint64_t uar_read(void *opaque, hwaddr addr, unsigned size)
+{
+    PVRDMADev *dev = opaque;
+    __u32 val;
+
+    pr_dbg("addr=0x%lx, size=%d\n", addr, size);
+
+    if (get_uar_val(dev, addr, &val)) {
+        pr_dbg("Error trying to read UAR value from address 0x%x\n",
+               (__u32)addr);
+        return -EINVAL;
+    }
+
+    pr_dbg("uar[0x%x]=0x%x\n", (__u32)addr, val);
+
+    return val;
+}
+
+static void uar_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
+{
+    PVRDMADev *dev = opaque;
+
+    /* pr_dbg("addr=0x%lx, val=0x%x, size=%d\n", addr, (uint32_t)val, size); */
+
+    if (set_uar_val(dev, addr, val)) {
+        pr_err("Error trying to set UAR value, addr=0x%x, val=0x%lx\n",
+               (__u32)addr, val);
+        return;
+    }
+
+    /* pr_dbg("uar[0x%x]=0x%lx\n", (__u32)addr, val); */
+
+    switch (addr) {
+    case PVRDMA_UAR_QP_OFFSET:
+        pr_dbg("UAR QP command, addr=0x%x, val=0x%lx\n", (__u32)addr, val);
+        if (val & PVRDMA_UAR_QP_SEND) {
+            qp_send(dev, val & PVRDMA_UAR_HANDLE_MASK);
+        }
+        if (val & PVRDMA_UAR_QP_RECV) {
+            qp_recv(dev, val & PVRDMA_UAR_HANDLE_MASK);
+        }
+        break;
+    case PVRDMA_UAR_CQ_OFFSET:
+        pr_dbg("UAR CQ command, addr=0x%x, val=0x%lx\n", (__u32)addr, val);
+        rm_req_notify_cq(dev, val & PVRDMA_UAR_HANDLE_MASK,
+                 val & ~PVRDMA_UAR_HANDLE_MASK);
+        break;
+    default:
+        pr_err("Unsupported command, addr=0x%x, val=0x%lx\n", (__u32)addr, val);
+        break;
+    }
+}
+
+static const MemoryRegionOps uar_ops = {
+    .read = uar_read,
+    .write = uar_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .impl = {
+        .min_access_size = sizeof(uint32_t),
+        .max_access_size = sizeof(uint32_t),
+    },
+};
+
+static void init_pci_config(PCIDevice *pdev)
+{
+    pdev->config[PCI_INTERRUPT_PIN] = 1;
+}
+
+static void init_bars(PCIDevice *pdev)
+{
+    PVRDMADev *dev = PVRDMA_DEV(pdev);
+
+    /* BAR 0 - MSI-X */
+    memory_region_init(&dev->msix, OBJECT(dev), "pvrdma-msix",
+                       RDMA_BAR0_MSIX_SIZE);
+    pci_register_bar(pdev, RDMA_MSIX_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY,
+                     &dev->msix);
+
+    /* BAR 1 - Registers */
+    memset(&dev->regs_data, 0, RDMA_BAR1_REGS_SIZE);
+    memory_region_init_io(&dev->regs, OBJECT(dev), &regs_ops, dev,
+                          "pvrdma-regs", RDMA_BAR1_REGS_SIZE);
+    pci_register_bar(pdev, RDMA_REG_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY,
+                     &dev->regs);
+
+    /* BAR 2 - UAR */
+    memset(&dev->uar_data, 0, RDMA_BAR2_UAR_SIZE);
+    memory_region_init_io(&dev->uar, OBJECT(dev), &uar_ops, dev, "rdma-uar",
+                          RDMA_BAR2_UAR_SIZE);
+    pci_register_bar(pdev, RDMA_UAR_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY,
+                     &dev->uar);
+}
+
+static void init_regs(PCIDevice *pdev)
+{
+    PVRDMADev *dev = PVRDMA_DEV(pdev);
+
+    set_reg_val(dev, PVRDMA_REG_VERSION, PVRDMA_HW_VERSION);
+    set_reg_val(dev, PVRDMA_REG_ERR, 0xFFFF);
+}
+
+static void uninit_msix(PCIDevice *pdev, int used_vectors)
+{
+    PVRDMADev *dev = PVRDMA_DEV(pdev);
+    int i;
+
+    for (i = 0; i < used_vectors; i++) {
+        msix_vector_unuse(pdev, i);
+    }
+
+    msix_uninit(pdev, &dev->msix, &dev->msix);
+}
+
+static int init_msix(PCIDevice *pdev)
+{
+    PVRDMADev *dev = PVRDMA_DEV(pdev);
+    int i;
+    int rc;
+
+    rc = msix_init(pdev, RDMA_MAX_INTRS, &dev->msix, RDMA_MSIX_BAR_IDX,
+                   RDMA_MSIX_TABLE, &dev->msix, RDMA_MSIX_BAR_IDX,
+                   RDMA_MSIX_PBA, 0, NULL);
+
+    if (rc < 0) {
+        pr_err("Fail to initialize MSI-X\n");
+        return rc;
+    }
+
+    for (i = 0; i < RDMA_MAX_INTRS; i++) {
+        rc = msix_vector_use(PCI_DEVICE(dev), i);
+        if (rc < 0) {
+            pr_err("Fail mark MSI-X vercor %d\n", i);
+            uninit_msix(pdev, i);
+            return rc;
+        }
+    }
+
+    return 0;
+}
+
+static int pvrdma_init(PCIDevice *pdev)
+{
+    int rc;
+    PVRDMADev *dev = PVRDMA_DEV(pdev);
+
+    pr_info("Initializing device %s %x.%x\n", pdev->name,
+            PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+
+    dev->dsr_info.dsr = NULL;
+
+    init_pci_config(pdev);
+
+    init_bars(pdev);
+
+    init_regs(pdev);
+
+    rc = init_msix(pdev);
+    if (rc != 0) {
+        goto out;
+    }
+
+    rc = kdbr_init();
+    if (rc != 0) {
+        goto out;
+    }
+
+    rc = rm_init(dev);
+    if (rc != 0) {
+        goto out;
+    }
+
+    rc = init_ports(dev);
+    if (rc != 0) {
+        goto out;
+    }
+
+    rc = qp_ops_init();
+    if (rc != 0) {
+        goto out;
+    }
+
+out:
+    if (rc != 0) {
+        pr_err("Device fail to load\n");
+    }
+
+    return rc;
+}
+
+static void pvrdma_exit(PCIDevice *pdev)
+{
+    PVRDMADev *dev = PVRDMA_DEV(pdev);
+
+    pr_info("Closing device %s %x.%x\n", pdev->name,
+            PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+
+    qp_ops_fini();
+
+    free_ports(dev);
+
+    rm_fini(dev);
+
+    kdbr_fini();
+
+    free_dsr(dev);
+
+    if (msix_enabled(pdev)) {
+        uninit_msix(pdev, RDMA_MAX_INTRS);
+    }
+}
+
+static void pvrdma_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+
+    k->init = pvrdma_init;
+    k->exit = pvrdma_exit;
+    k->vendor_id = PCI_VENDOR_ID_VMWARE;
+    k->device_id = PCI_DEVICE_ID_VMWARE_PVRDMA;
+    k->revision = 0x00;
+    k->class_id = PCI_CLASS_NETWORK_OTHER;
+
+    dc->desc = "RDMA Device";
+    dc->props = pvrdma_dev_properties;
+    set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
+}
+
+static const TypeInfo pvrdma_info = {
+    .name = PVRDMA_HW_NAME,
+    .parent    = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(PVRDMADev),
+    .class_init = pvrdma_class_init,
+};
+
+static void register_types(void)
+{
+    type_register_static(&pvrdma_info);
+}
+
+type_init(register_types)
diff --git a/hw/net/pvrdma/pvrdma_qp_ops.c b/hw/net/pvrdma/pvrdma_qp_ops.c
new file mode 100644
index 0000000..2db45d9
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_qp_ops.c
@@ -0,0 +1,174 @@
+#include "hw/net/pvrdma/pvrdma.h"
+#include "hw/net/pvrdma/pvrdma_utils.h"
+#include "hw/net/pvrdma/pvrdma_qp_ops.h"
+#include "hw/net/pvrdma/pvrdma_rm.h"
+#include "hw/net/pvrdma/pvrdma-uapi.h"
+#include "hw/net/pvrdma/pvrdma_kdbr.h"
+#include "sysemu/dma.h"
+#include "hw/pci/pci.h"
+
+typedef struct CompHandlerCtx {
+    PVRDMADev *dev;
+    u32 cq_handle;
+    struct pvrdma_cqe cqe;
+} CompHandlerCtx;
+
+/*
+ * 1. Put CQE on send CQ ring
+ * 2. Put CQ number on dsr completion ring
+ * 3. Interrupt host
+ */
+static int post_cqe(PVRDMADev *dev, u32 cq_handle, struct pvrdma_cqe *cqe)
+{
+    struct pvrdma_cqe *cqe1;
+    struct pvrdma_cqne *cqne;
+    RmCQ *cq = rm_get_cq(dev, cq_handle);
+
+    if (!cq) {
+        pr_dbg("Invalid cqn %d\n", cq_handle);
+        return -EINVAL;
+    }
+
+    pr_dbg("cq->comp_type=%d\n", cq->comp_type);
+    if (cq->comp_type == CCT_NONE) {
+        return 0;
+    }
+    cq->comp_type = CCT_NONE;
+
+    /* Step #1: Put CQE on CQ ring */
+    pr_dbg("Writing CQE\n");
+    cqe1 = ring_next_elem_write(&cq->cq);
+    if (!cqe1) {
+        return -EINVAL;
+    }
+
+    memcpy(cqe1, cqe, sizeof(*cqe));
+    ring_write_inc(&cq->cq);
+
+    /* Step #2: Put CQ number on dsr completion ring */
+    pr_dbg("Writing CQNE\n");
+    cqne = ring_next_elem_write(&dev->dsr_info.cq);
+    if (!cqne) {
+        return -EINVAL;
+    }
+
+    cqne->info = cq_handle;
+    ring_write_inc(&dev->dsr_info.cq);
+
+    post_interrupt(dev, INTR_VEC_CMD_COMPLETION_Q);
+
+    return 0;
+}
+
+static void qp_ops_comp_handler(int status, unsigned int vendor_err, void *ctx)
+{
+    CompHandlerCtx *comp_ctx = (CompHandlerCtx *)ctx;
+
+    pr_dbg("cq_handle=%d\n", comp_ctx->cq_handle);
+    pr_dbg("wr_id=%lld\n", comp_ctx->cqe.wr_id);
+    pr_dbg("status=%d\n", status);
+    pr_dbg("vendor_err=0x%x\n", vendor_err);
+    comp_ctx->cqe.status = status;
+    comp_ctx->cqe.vendor_err = vendor_err;
+    post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe);
+    free(ctx);
+}
+
+void qp_ops_fini(void)
+{
+}
+
+int qp_ops_init(void)
+{
+    kdbr_register_tx_comp_handler(qp_ops_comp_handler);
+    kdbr_register_rx_comp_handler(qp_ops_comp_handler);
+
+    return 0;
+}
+
+int qp_send(PVRDMADev *dev, __u32 qp_handle)
+{
+    RmQP *qp;
+    RmSqWqe *wqe;
+
+    qp = rm_get_qp(dev, qp_handle);
+    if (!qp) {
+        return -EINVAL;
+    }
+
+    if (qp->qp_state < PVRDMA_QPS_RTS) {
+        pr_dbg("Invalid QP state for send\n");
+        return -EINVAL;
+    }
+
+    wqe = (struct RmSqWqe *)ring_next_elem_read(&qp->sq);
+    while (wqe) {
+        CompHandlerCtx *comp_ctx;
+
+        pr_dbg("wr_id=%lld\n", wqe->hdr.wr_id);
+        wqe->hdr.num_sge = MIN(wqe->hdr.num_sge,
+                       qp->init_args.max_send_sge);
+
+        /* Prepare CQE */
+        comp_ctx = malloc(sizeof(CompHandlerCtx));
+        comp_ctx->dev = dev;
+        comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
+        comp_ctx->cqe.qp = qp_handle;
+        comp_ctx->cq_handle = qp->init_args.send_cq_handle;
+        comp_ctx->cqe.opcode = wqe->hdr.opcode;
+        /* TODO: Fill rest of the data */
+
+        kdbr_send_wqe(dev->ports[qp->port_num].kdbr_port,
+                      qp->kdbr_connection_id,
+                      qp->init_args.qp_type == PVRDMA_QPT_RC, wqe, comp_ctx);
+
+        ring_read_inc(&qp->sq);
+
+        wqe = ring_next_elem_read(&qp->sq);
+    }
+
+    return 0;
+}
+
+int qp_recv(PVRDMADev *dev, __u32 qp_handle)
+{
+    RmQP *qp;
+    RmRqWqe *wqe;
+
+    qp = rm_get_qp(dev, qp_handle);
+    if (!qp) {
+        return -EINVAL;
+    }
+
+    if (qp->qp_state < PVRDMA_QPS_RTR) {
+        pr_dbg("Invalid QP state for receive\n");
+        return -EINVAL;
+    }
+
+    wqe = (struct RmRqWqe *)ring_next_elem_read(&qp->rq);
+    while (wqe) {
+        CompHandlerCtx *comp_ctx;
+
+        pr_dbg("wr_id=%lld\n", wqe->hdr.wr_id);
+        wqe->hdr.num_sge = MIN(wqe->hdr.num_sge,
+                       qp->init_args.max_send_sge);
+
+        /* Prepare CQE */
+        comp_ctx = malloc(sizeof(CompHandlerCtx));
+        comp_ctx->dev = dev;
+        comp_ctx->cqe.qp = qp_handle;
+        comp_ctx->cq_handle = qp->init_args.recv_cq_handle;
+        comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
+        comp_ctx->cqe.qp = qp_handle;
+        /* TODO: Fill rest of the data */
+
+        kdbr_recv_wqe(dev->ports[qp->port_num].kdbr_port,
+                      qp->kdbr_connection_id, wqe, comp_ctx);
+
+        ring_read_inc(&qp->rq);
+
+        wqe = ring_next_elem_read(&qp->rq);
+    }
+
+    return 0;
+}
diff --git a/hw/net/pvrdma/pvrdma_qp_ops.h b/hw/net/pvrdma/pvrdma_qp_ops.h
new file mode 100644
index 0000000..20125d6
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_qp_ops.h
@@ -0,0 +1,25 @@
+/*
+ * QEMU VMWARE paravirtual RDMA QP Operations
+ *
+ * Developed by Oracle & Redhat
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PVRDMA_QP_H
+#define PVRDMA_QP_H
+
+typedef struct PVRDMADev PVRDMADev;
+
+int qp_ops_init(void);
+void qp_ops_fini(void);
+int qp_send(PVRDMADev *dev, __u32 qp_handle);
+int qp_recv(PVRDMADev *dev, __u32 qp_handle);
+
+#endif
diff --git a/hw/net/pvrdma/pvrdma_ring.c b/hw/net/pvrdma/pvrdma_ring.c
new file mode 100644
index 0000000..34dc1f5
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_ring.c
@@ -0,0 +1,127 @@
+#include <qemu/osdep.h>
+#include <hw/pci/pci.h>
+#include <cpu.h>
+#include <hw/net/pvrdma/pvrdma_ring.h>
+#include <hw/net/pvrdma/pvrdma-uapi.h>
+#include <hw/net/pvrdma/pvrdma_utils.h>
+
+int ring_init(Ring *ring, const char *name, PCIDevice *dev,
+              struct pvrdma_ring *ring_state, size_t max_elems, size_t elem_sz,
+              dma_addr_t *tbl, dma_addr_t npages)
+{
+    int i;
+    int rc = 0;
+
+    strncpy(ring->name, name, MAX_RING_NAME_SZ);
+    ring->name[MAX_RING_NAME_SZ - 1] = 0;
+    pr_info("Initializing %s ring\n", ring->name);
+    ring->dev = dev;
+    ring->ring_state = ring_state;
+    ring->max_elems = max_elems;
+    ring->elem_sz = elem_sz;
+    pr_dbg("ring->elem_sz=%ld\n", ring->elem_sz);
+    pr_dbg("npages=%ld\n", npages);
+    /* TODO: Give a moment to think if we want to redo driver settings
+    atomic_set(&ring->ring_state->prod_tail, 0);
+    atomic_set(&ring->ring_state->cons_head, 0);
+    */
+    ring->npages = npages;
+    ring->pages = malloc(npages * sizeof(void *));
+    for (i = 0; i < npages; i++) {
+        if (!tbl[i]) {
+            pr_err("npages=%ld but tbl[%d] is NULL\n", npages, i);
+            continue;
+        }
+
+        ring->pages[i] = pvrdma_pci_dma_map(dev, tbl[i], TARGET_PAGE_SIZE);
+        if (!ring->pages[i]) {
+            rc = -ENOMEM;
+            pr_err("Fail to map to page %d\n", i);
+            goto out_free;
+        }
+    }
+
+    goto out;
+
+out_free:
+    while (i--) {
+        pvrdma_pci_dma_unmap(dev, ring->pages[i], TARGET_PAGE_SIZE);
+    }
+    free(ring->pages);
+
+out:
+    return rc;
+}
+
+void *ring_next_elem_read(Ring *ring)
+{
+    unsigned int idx = 0, offset;
+
+    /*
+    pr_dbg("%s: t=%d, h=%d\n", ring->name, ring->ring_state->prod_tail,
+           ring->ring_state->cons_head);
+    */
+
+    if (!pvrdma_idx_ring_has_data(ring->ring_state, ring->max_elems, &idx)) {
+        pr_dbg("No more data in ring\n");
+        return NULL;
+    }
+
+    offset = idx * ring->elem_sz;
+    /*
+    pr_dbg("idx=%d\n", idx);
+    pr_dbg("offset=%d\n", offset);
+    */
+    return ring->pages[offset / TARGET_PAGE_SIZE] + (offset % TARGET_PAGE_SIZE);
+}
+
+void ring_read_inc(Ring *ring)
+{
+    pvrdma_idx_ring_inc(&ring->ring_state->cons_head, ring->max_elems);
+    /*
+    pr_dbg("%s: t=%d, h=%d, m=%ld\n", ring->name,
+           ring->ring_state->prod_tail, ring->ring_state->cons_head,
+           ring->max_elems);
+    */
+}
+
+void *ring_next_elem_write(Ring *ring)
+{
+    unsigned int idx, offset, tail;
+
+    /*
+    pr_dbg("%s: t=%d, h=%d\n", ring->name, ring->ring_state->prod_tail,
+           ring->ring_state->cons_head);
+    */
+
+    if (!pvrdma_idx_ring_has_space(ring->ring_state, ring->max_elems, &tail)) {
+        pr_dbg("CQ is full\n");
+        return NULL;
+    }
+
+    idx = pvrdma_idx(&ring->ring_state->prod_tail, ring->max_elems);
+    /* TODO: tail == idx */
+
+    offset = idx * ring->elem_sz;
+    return ring->pages[offset / TARGET_PAGE_SIZE] + (offset % TARGET_PAGE_SIZE);
+}
+
+void ring_write_inc(Ring *ring)
+{
+    pvrdma_idx_ring_inc(&ring->ring_state->prod_tail, ring->max_elems);
+    /*
+    pr_dbg("%s: t=%d, h=%d, m=%ld\n", ring->name,
+           ring->ring_state->prod_tail, ring->ring_state->cons_head,
+           ring->max_elems);
+    */
+}
+
+void ring_free(Ring *ring)
+{
+    while (ring->npages--) {
+        pvrdma_pci_dma_unmap(ring->dev, ring->pages[ring->npages],
+                             TARGET_PAGE_SIZE);
+    }
+
+    free(ring->pages);
+}
diff --git a/hw/net/pvrdma/pvrdma_ring.h b/hw/net/pvrdma/pvrdma_ring.h
new file mode 100644
index 0000000..8a0c448
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_ring.h
@@ -0,0 +1,43 @@
+/*
+ * QEMU VMWARE paravirtual RDMA interface definitions
+ *
+ * Developed by Oracle & Redhat
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PVRDMA_RING_H
+#define PVRDMA_RING_H
+
+#include <qemu/typedefs.h>
+#include <hw/net/pvrdma/pvrdma-uapi.h>
+#include <hw/net/pvrdma/pvrdma_types.h>
+
+#define MAX_RING_NAME_SZ 16
+
+typedef struct Ring {
+    char name[MAX_RING_NAME_SZ];
+    PCIDevice *dev;
+    size_t max_elems;
+    size_t elem_sz;
+    struct pvrdma_ring *ring_state;
+    int npages;
+    void **pages;
+} Ring;
+
+int ring_init(Ring *ring, const char *name, PCIDevice *dev,
+              struct pvrdma_ring *ring_state, size_t max_elems, size_t elem_sz,
+              dma_addr_t *tbl, dma_addr_t npages);
+void *ring_next_elem_read(Ring *ring);
+void ring_read_inc(Ring *ring);
+void *ring_next_elem_write(Ring *ring);
+void ring_write_inc(Ring *ring);
+void ring_free(Ring *ring);
+
+#endif
diff --git a/hw/net/pvrdma/pvrdma_rm.c b/hw/net/pvrdma/pvrdma_rm.c
new file mode 100644
index 0000000..55ca1e5
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_rm.c
@@ -0,0 +1,529 @@
+#include <hw/net/pvrdma/pvrdma.h>
+#include <hw/net/pvrdma/pvrdma_utils.h>
+#include <hw/net/pvrdma/pvrdma_rm.h>
+#include <hw/net/pvrdma/pvrdma-uapi.h>
+#include <hw/net/pvrdma/pvrdma_kdbr.h>
+#include <qemu/bitmap.h>
+#include <qemu/atomic.h>
+#include <cpu.h>
+
+/* Page directory and page tables */
+#define PG_DIR_SZ { TARGET_PAGE_SIZE / sizeof(__u64) }
+#define PG_TBL_SZ { TARGET_PAGE_SIZE / sizeof(__u64) }
+
+/* Global local and remote keys */
+__u64 global_lkey = 1;
+__u64 global_rkey = 1;
+
+static inline int res_tbl_init(const char *name, RmResTbl *tbl, u32 tbl_sz,
+                               u32 res_sz)
+{
+    tbl->tbl = malloc(tbl_sz * res_sz);
+    if (!tbl->tbl) {
+        return -ENOMEM;
+    }
+
+    strncpy(tbl->name, name, MAX_RING_NAME_SZ);
+    tbl->name[MAX_RING_NAME_SZ - 1] = 0;
+
+    tbl->bitmap = bitmap_new(tbl_sz);
+    tbl->tbl_sz = tbl_sz;
+    tbl->res_sz = res_sz;
+    qemu_mutex_init(&tbl->lock);
+
+    return 0;
+}
+
+static inline void res_tbl_free(RmResTbl *tbl)
+{
+    qemu_mutex_destroy(&tbl->lock);
+    free(tbl->tbl);
+    bitmap_zero_extend(tbl->bitmap, tbl->tbl_sz, 0);
+}
+
+static inline void *res_tbl_get(RmResTbl *tbl, u32 handle)
+{
+    pr_dbg("%s, handle=%d\n", tbl->name, handle);
+
+    if ((handle < tbl->tbl_sz) && (test_bit(handle, tbl->bitmap))) {
+        return tbl->tbl + handle * tbl->res_sz;
+    } else {
+        pr_dbg("Invalid handle %d\n", handle);
+        return NULL;
+    }
+}
+
+static inline void *res_tbl_alloc(RmResTbl *tbl, u32 *handle)
+{
+    qemu_mutex_lock(&tbl->lock);
+
+    *handle = find_first_zero_bit(tbl->bitmap, tbl->tbl_sz);
+    if (*handle > tbl->tbl_sz) {
+        pr_dbg("Fail to alloc, bitmap is full\n");
+        qemu_mutex_unlock(&tbl->lock);
+        return NULL;
+    }
+
+    set_bit(*handle, tbl->bitmap);
+
+    qemu_mutex_unlock(&tbl->lock);
+
+    pr_dbg("%s, handle=%d\n", tbl->name, *handle);
+
+    return tbl->tbl + *handle * tbl->res_sz;
+}
+
+static inline void res_tbl_dealloc(RmResTbl *tbl, u32 handle)
+{
+    pr_dbg("%s, handle=%d\n", tbl->name, handle);
+
+    qemu_mutex_lock(&tbl->lock);
+
+    if (handle < tbl->tbl_sz) {
+        clear_bit(handle, tbl->bitmap);
+    }
+
+    qemu_mutex_unlock(&tbl->lock);
+}
+
+int rm_alloc_pd(PVRDMADev *dev, __u32 *pd_handle, __u32 ctx_handle)
+{
+    RmPD *pd;
+
+    pd = res_tbl_alloc(&dev->pd_tbl, pd_handle);
+    if (!pd) {
+        return -ENOMEM;
+    }
+
+    pd->ctx_handle = ctx_handle;
+
+    return 0;
+}
+
+void rm_dealloc_pd(PVRDMADev *dev, __u32 pd_handle)
+{
+    res_tbl_dealloc(&dev->pd_tbl, pd_handle);
+}
+
+RmCQ *rm_get_cq(PVRDMADev *dev, __u32 cq_handle)
+{
+    return res_tbl_get(&dev->cq_tbl, cq_handle);
+}
+
+int rm_alloc_cq(PVRDMADev *dev, struct pvrdma_cmd_create_cq *cmd,
+                struct pvrdma_cmd_create_cq_resp *resp)
+{
+    int rc = 0;
+    RmCQ *cq;
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+    __u64 *dir = 0, *tbl = 0;
+    char ring_name[MAX_RING_NAME_SZ];
+    u32 cqe;
+
+    cq = res_tbl_alloc(&dev->cq_tbl, &resp->cq_handle);
+    if (!cq) {
+        return -ENOMEM;
+    }
+
+    memset(cq, 0, sizeof(RmCQ));
+
+    memcpy(&cq->init_args, cmd, sizeof(*cmd));
+    cq->comp_type = CCT_NONE;
+
+    /* Get pointer to CQ */
+    dir = pvrdma_pci_dma_map(pci_dev, cq->init_args.pdir_dma, TARGET_PAGE_SIZE);
+    if (!dir) {
+        pr_err("Fail to map to CQ page directory\n");
+        rc = -ENOMEM;
+        goto out_free_cq;
+    }
+    tbl = pvrdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
+    if (!tbl) {
+        pr_err("Fail to map to CQ page table\n");
+        rc = -ENOMEM;
+        goto out_free_cq;
+    }
+
+    cq->ring_state = (struct pvrdma_ring *)
+            pvrdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
+    if (!cq->ring_state) {
+        pr_err("Fail to map to CQ header page\n");
+        rc = -ENOMEM;
+        goto out_free_cq;
+    }
+
+    sprintf(ring_name, "cq%d", resp->cq_handle);
+    cqe = MIN(cmd->cqe, dev->dsr_info.dsr->caps.max_cqe);
+    rc = ring_init(&cq->cq, ring_name, pci_dev, &cq->ring_state[1],
+                   cqe, sizeof(struct pvrdma_cqe), (dma_addr_t *)&tbl[1],
+                   cmd->nchunks - 1 /* first page is ring state */);
+    if (rc != 0) {
+        pr_err("Fail to initialize CQ ring\n");
+        rc = -ENOMEM;
+        goto out_free_ring_state;
+    }
+
+
+    resp->cqe = cmd->cqe;
+
+    goto out;
+
+out_free_ring_state:
+    pvrdma_pci_dma_unmap(pci_dev, cq->ring_state, TARGET_PAGE_SIZE);
+
+out_free_cq:
+    rm_dealloc_cq(dev, resp->cq_handle);
+
+out:
+    if (tbl) {
+        pvrdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
+    }
+    if (dir) {
+        pvrdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
+    }
+
+    return rc;
+}
+
+void rm_req_notify_cq(PVRDMADev *dev, __u32 cq_handle, u32 flags)
+{
+    RmCQ *cq;
+
+    pr_dbg("cq_handle=%d, flags=0x%x\n", cq_handle, flags);
+
+    cq = rm_get_cq(dev, cq_handle);
+    if (!cq) {
+        return;
+    }
+
+    cq->comp_type = (flags & PVRDMA_UAR_CQ_ARM_SOL) ? CCT_SOLICITED :
+                     CCT_NEXT_COMP;
+    pr_dbg("comp_type=%d\n", cq->comp_type);
+}
+
+void rm_dealloc_cq(PVRDMADev *dev, __u32 cq_handle)
+{
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+    RmCQ *cq;
+
+    cq = rm_get_cq(dev, cq_handle);
+    if (!cq) {
+        return;
+    }
+
+    ring_free(&cq->cq);
+    pvrdma_pci_dma_unmap(pci_dev, cq->ring_state, TARGET_PAGE_SIZE);
+    res_tbl_dealloc(&dev->cq_tbl, cq_handle);
+}
+
+int rm_alloc_mr(PVRDMADev *dev, struct pvrdma_cmd_create_mr *cmd,
+                struct pvrdma_cmd_create_mr_resp *resp)
+{
+    RmMR *mr;
+
+    mr = res_tbl_alloc(&dev->mr_tbl, &resp->mr_handle);
+    if (!mr) {
+        return -ENOMEM;
+    }
+
+    mr->pd_handle = cmd->pd_handle;
+    resp->lkey = mr->lkey = global_lkey++;
+    resp->rkey = mr->rkey = global_rkey++;
+
+    return 0;
+}
+
+void rm_dealloc_mr(PVRDMADev *dev, __u32 mr_handle)
+{
+    res_tbl_dealloc(&dev->mr_tbl, mr_handle);
+}
+
+int rm_alloc_qp(PVRDMADev *dev, struct pvrdma_cmd_create_qp *cmd,
+                struct pvrdma_cmd_create_qp_resp *resp)
+{
+    int rc = 0;
+    RmQP *qp;
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+    __u64 *dir = 0, *tbl = 0;
+    int wqe_size;
+    char ring_name[MAX_RING_NAME_SZ];
+
+    if (!rm_get_cq(dev, cmd->send_cq_handle) ||
+        !rm_get_cq(dev, cmd->recv_cq_handle)) {
+        pr_err("Invalid send_cqn or recv_cqn (%d, %d)\n",
+               cmd->send_cq_handle, cmd->recv_cq_handle);
+        return -EINVAL;
+    }
+
+    qp = res_tbl_alloc(&dev->qp_tbl, &resp->qpn);
+    if (!qp) {
+        return -EINVAL;
+    }
+
+    memset(qp, 0, sizeof(RmQP));
+
+    memcpy(&qp->init_args, cmd, sizeof(*cmd));
+
+    pr_dbg("qp_type=%d\n", qp->init_args.qp_type);
+    pr_dbg("send_cq_handle=%d\n", qp->init_args.send_cq_handle);
+    pr_dbg("max_send_sge=%d\n", qp->init_args.max_send_sge);
+    pr_dbg("recv_cq_handle=%d\n", qp->init_args.recv_cq_handle);
+    pr_dbg("max_recv_sge=%d\n", qp->init_args.max_recv_sge);
+    pr_dbg("total_chunks=%d\n", cmd->total_chunks);
+    pr_dbg("send_chunks=%d\n", cmd->send_chunks);
+    pr_dbg("recv_chunks=%d\n", cmd->total_chunks - cmd->send_chunks);
+
+    qp->qp_state = PVRDMA_QPS_ERR;
+
+    /* Get pointer to send & recv rings */
+    dir = pvrdma_pci_dma_map(pci_dev, qp->init_args.pdir_dma, TARGET_PAGE_SIZE);
+    if (!dir) {
+        pr_err("Fail to map to QP page directory\n");
+        rc = -ENOMEM;
+        goto out_free_qp;
+    }
+    tbl = pvrdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
+    if (!tbl) {
+        pr_err("Fail to map to QP page table\n");
+        rc = -ENOMEM;
+        goto out_free_qp;
+    }
+
+    /* Send ring */
+    qp->sq_ring_state = (struct pvrdma_ring *)
+            pvrdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
+    if (!qp->sq_ring_state) {
+        pr_err("Fail to map to QP header page\n");
+        rc = -ENOMEM;
+        goto out_free_qp;
+    }
+
+    wqe_size = roundup_pow_of_two(sizeof(struct pvrdma_sq_wqe_hdr) +
+                                  sizeof(struct pvrdma_sge) *
+                                  qp->init_args.max_send_sge);
+    sprintf(ring_name, "qp%d_sq", resp->qpn);
+    rc = ring_init(&qp->sq, ring_name, pci_dev, qp->sq_ring_state,
+                   qp->init_args.max_send_wr, wqe_size,
+                   (dma_addr_t *)&tbl[1], cmd->send_chunks);
+    if (rc != 0) {
+        pr_err("Fail to initialize SQ ring\n");
+        rc = -ENOMEM;
+        goto out_free_ring_state;
+    }
+
+    /* Recv ring */
+    qp->rq_ring_state = &qp->sq_ring_state[1];
+    wqe_size = roundup_pow_of_two(sizeof(struct pvrdma_rq_wqe_hdr) +
+                                  sizeof(struct pvrdma_sge) *
+                                  qp->init_args.max_recv_sge);
+    pr_dbg("wqe_size=%d\n", wqe_size);
+    pr_dbg("pvrdma_rq_wqe_hdr=%ld\n", sizeof(struct pvrdma_rq_wqe_hdr));
+    pr_dbg("pvrdma_sge=%ld\n", sizeof(struct pvrdma_sge));
+    pr_dbg("init_args.max_recv_sge=%d\n", qp->init_args.max_recv_sge);
+    sprintf(ring_name, "qp%d_rq", resp->qpn);
+    rc = ring_init(&qp->rq, ring_name, pci_dev, qp->rq_ring_state,
+                   qp->init_args.max_recv_wr, wqe_size,
+                   (dma_addr_t *)&tbl[2], cmd->total_chunks -
+                   cmd->send_chunks - 1 /* first page is ring state */);
+    if (rc != 0) {
+        pr_err("Fail to initialize RQ ring\n");
+        rc = -ENOMEM;
+        goto out_free_send_ring;
+    }
+
+    resp->max_send_wr = cmd->max_send_wr;
+    resp->max_recv_wr = cmd->max_recv_wr;
+    resp->max_send_sge = cmd->max_send_sge;
+    resp->max_recv_sge = cmd->max_recv_sge;
+    resp->max_inline_data = cmd->max_inline_data;
+
+    goto out;
+
+out_free_send_ring:
+    ring_free(&qp->sq);
+
+out_free_ring_state:
+    pvrdma_pci_dma_unmap(pci_dev, qp->sq_ring_state, TARGET_PAGE_SIZE);
+
+out_free_qp:
+    rm_dealloc_qp(dev, resp->qpn);
+
+out:
+    if (tbl) {
+        pvrdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
+    }
+    if (dir) {
+        pvrdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
+    }
+
+    return rc;
+}
+
+int rm_modify_qp(PVRDMADev *dev, __u32 qp_handle,
+                 struct pvrdma_cmd_modify_qp *modify_qp_args)
+{
+    RmQP *qp;
+
+    pr_dbg("qp_handle=%d\n", qp_handle);
+    pr_dbg("new_state=%d\n", modify_qp_args->attrs.qp_state);
+
+    qp = res_tbl_get(&dev->qp_tbl, qp_handle);
+    if (!qp) {
+        return -EINVAL;
+    }
+
+    pr_dbg("qp_type=%d\n", qp->init_args.qp_type);
+
+    if (modify_qp_args->attr_mask & PVRDMA_QP_PORT) {
+        qp->port_num = modify_qp_args->attrs.port_num - 1;
+    }
+    if (modify_qp_args->attr_mask & PVRDMA_QP_DEST_QPN) {
+        qp->dest_qp_num = modify_qp_args->attrs.dest_qp_num;
+    }
+    if (modify_qp_args->attr_mask & PVRDMA_QP_AV) {
+        qp->dgid = modify_qp_args->attrs.ah_attr.grh.dgid;
+        qp->port_num = modify_qp_args->attrs.ah_attr.port_num - 1;
+    }
+    if (modify_qp_args->attr_mask & PVRDMA_QP_STATE) {
+        qp->qp_state = modify_qp_args->attrs.qp_state;
+    }
+
+    /* kdbr connection */
+    if (qp->qp_state == PVRDMA_QPS_RTR) {
+        qp->kdbr_connection_id =
+            kdbr_open_connection(dev->ports[qp->port_num].kdbr_port,
+                                 qp_handle, qp->dgid, qp->dest_qp_num,
+                                 qp->init_args.qp_type == PVRDMA_QPT_RC);
+        if (qp->kdbr_connection_id == 0) {
+            return -EIO;
+        }
+    }
+
+    return 0;
+}
+
+void rm_dealloc_qp(PVRDMADev *dev, __u32 qp_handle)
+{
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+    RmQP *qp;
+
+    qp = res_tbl_get(&dev->qp_tbl, qp_handle);
+    if (!qp) {
+        return;
+    }
+
+    if (qp->kdbr_connection_id) {
+        kdbr_close_connection(dev->ports[qp->port_num].kdbr_port,
+                              qp->kdbr_connection_id);
+    }
+
+    ring_free(&qp->rq);
+    ring_free(&qp->sq);
+
+    pvrdma_pci_dma_unmap(pci_dev, qp->sq_ring_state, TARGET_PAGE_SIZE);
+
+    res_tbl_dealloc(&dev->qp_tbl, qp_handle);
+}
+
+RmQP *rm_get_qp(PVRDMADev *dev, __u32 qp_handle)
+{
+    return res_tbl_get(&dev->qp_tbl, qp_handle);
+}
+
+void *rm_get_wqe_ctx(PVRDMADev *dev, unsigned long wqe_ctx_id)
+{
+    void **wqe_ctx;
+
+    wqe_ctx = res_tbl_get(&dev->wqe_ctx_tbl, wqe_ctx_id);
+    if (!wqe_ctx) {
+        return NULL;
+    }
+
+    pr_dbg("ctx=%p\n", *wqe_ctx);
+
+    return *wqe_ctx;
+}
+
+int rm_alloc_wqe_ctx(PVRDMADev *dev, unsigned long *wqe_ctx_id, void *ctx)
+{
+    void **wqe_ctx;
+
+    wqe_ctx = res_tbl_alloc(&dev->wqe_ctx_tbl, (u32 *)wqe_ctx_id);
+    if (!wqe_ctx) {
+        return -ENOMEM;
+    }
+
+    pr_dbg("ctx=%p\n", ctx);
+    *wqe_ctx = ctx;
+
+    return 0;
+}
+
+void rm_dealloc_wqe_ctx(PVRDMADev *dev, unsigned long wqe_ctx_id)
+{
+    res_tbl_dealloc(&dev->wqe_ctx_tbl, (u32) wqe_ctx_id);
+}
+
+int rm_init(PVRDMADev *dev)
+{
+    int ret = 0;
+
+    ret = res_tbl_init("PD", &dev->pd_tbl, MAX_PDS, sizeof(RmPD));
+    if (ret != 0) {
+        goto cln_pds;
+    }
+
+    ret = res_tbl_init("CQ", &dev->cq_tbl, MAX_CQS, sizeof(RmCQ));
+    if (ret != 0) {
+        goto cln_cqs;
+    }
+
+    ret = res_tbl_init("MR", &dev->mr_tbl, MAX_MRS, sizeof(RmMR));
+    if (ret != 0) {
+        goto cln_mrs;
+    }
+
+    ret = res_tbl_init("QP", &dev->qp_tbl, MAX_QPS, sizeof(RmQP));
+    if (ret != 0) {
+        goto cln_qps;
+    }
+
+    ret = res_tbl_init("WQE_CTX", &dev->wqe_ctx_tbl, MAX_QPS * MAX_QP_WRS,
+               sizeof(void *));
+    if (ret != 0) {
+        goto cln_wqe_ctxs;
+    }
+
+    goto out;
+
+cln_wqe_ctxs:
+    res_tbl_free(&dev->wqe_ctx_tbl);
+
+cln_qps:
+    res_tbl_free(&dev->qp_tbl);
+
+cln_mrs:
+    res_tbl_free(&dev->mr_tbl);
+
+cln_cqs:
+    res_tbl_free(&dev->cq_tbl);
+
+cln_pds:
+    res_tbl_free(&dev->pd_tbl);
+
+out:
+    if (ret != 0) {
+        pr_err("Fail to initialize RM\n");
+    }
+
+    return ret;
+}
+
+void rm_fini(PVRDMADev *dev)
+{
+    res_tbl_free(&dev->pd_tbl);
+    res_tbl_free(&dev->cq_tbl);
+    res_tbl_free(&dev->mr_tbl);
+    res_tbl_free(&dev->qp_tbl);
+    res_tbl_free(&dev->wqe_ctx_tbl);
+}
diff --git a/hw/net/pvrdma/pvrdma_rm.h b/hw/net/pvrdma/pvrdma_rm.h
new file mode 100644
index 0000000..1d42bc7
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_rm.h
@@ -0,0 +1,214 @@
+/*
+ * QEMU VMWARE paravirtual RDMA - Resource Manager
+ *
+ * Developed by Oracle & Redhat
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PVRDMA_RM_H
+#define PVRDMA_RM_H
+
+#include <hw/net/pvrdma/pvrdma_dev_api.h>
+#include <hw/net/pvrdma/pvrdma-uapi.h>
+#include <hw/net/pvrdma/pvrdma_ring.h>
+#include <hw/net/pvrdma/kdbr.h>
+
+/* TODO: More then 1 port it fails in ib_modify_qp, maybe something with
+ * the MAC of the second port */
+#define MAX_PORTS        1 /* Driver force to 1 see pvrdma_add_gid */
+#define MAX_PORT_GIDS    1
+#define MAX_PORT_PKEYS   1
+#define MAX_PKEYS        1
+#define MAX_PDS          2048
+#define MAX_CQS          2048
+#define MAX_CQES         1024 /* cqe size is 64 */
+#define MAX_QPS          1024
+#define MAX_GIDS         2048
+#define MAX_QP_WRS       1024 /* wqe size is 128 */
+#define MAX_SGES         4
+#define MAX_MRS          2048
+#define MAX_AH           1024
+
+typedef struct PVRDMADev PVRDMADev;
+typedef struct KdbrPort KdbrPort;
+
+#define MAX_RMRESTBL_NAME_SZ 16
+typedef struct RmResTbl {
+    char name[MAX_RMRESTBL_NAME_SZ];
+    unsigned long *bitmap;
+    size_t tbl_sz;
+    size_t res_sz;
+    void *tbl;
+    QemuMutex lock;
+} RmResTbl;
+
+enum cq_comp_type {
+    CCT_NONE,
+    CCT_SOLICITED,
+    CCT_NEXT_COMP,
+};
+
+typedef struct RmPD {
+    __u32 ctx_handle;
+} RmPD;
+
+typedef struct RmCQ {
+    struct pvrdma_cmd_create_cq init_args;
+    struct pvrdma_ring *ring_state;
+    Ring cq;
+    enum cq_comp_type comp_type;
+} RmCQ;
+
+/* MR (DMA region) */
+typedef struct RmMR {
+    __u32 pd_handle;
+    __u32 lkey;
+    __u32 rkey;
+} RmMR;
+
+typedef struct RmSqWqe {
+    struct pvrdma_sq_wqe_hdr hdr;
+    struct pvrdma_sge sge[0];
+} RmSqWqe;
+
+typedef struct RmRqWqe {
+    struct pvrdma_rq_wqe_hdr hdr;
+    struct pvrdma_sge sge[0];
+} RmRqWqe;
+
+typedef struct RmQP {
+    struct pvrdma_cmd_create_qp init_args;
+    enum pvrdma_qp_state qp_state;
+    u8 port_num;
+    u32 dest_qp_num;
+    union pvrdma_gid dgid;
+
+    struct pvrdma_ring *sq_ring_state;
+    Ring sq;
+    struct pvrdma_ring *rq_ring_state;
+    Ring rq;
+
+    unsigned long kdbr_connection_id;
+} RmQP;
+
+typedef struct RmPort {
+    enum pvrdma_port_state state;
+    union pvrdma_gid gid_tbl[MAX_PORT_GIDS];
+    /* TODO: Change type */
+    int *pkey_tbl;
+    KdbrPort *kdbr_port;
+} RmPort;
+
+static inline int rm_get_max_port_gids(__u32 *max_port_gids)
+{
+    *max_port_gids = MAX_PORT_GIDS;
+    return 0;
+}
+
+static inline int rm_get_max_port_pkeys(__u32 *max_port_pkeys)
+{
+    *max_port_pkeys = MAX_PORT_PKEYS;
+    return 0;
+}
+
+static inline int rm_get_max_pkeys(__u16 *max_pkeys)
+{
+    *max_pkeys = MAX_PKEYS;
+    return 0;
+}
+
+static inline int rm_get_max_cqs(__u32 *max_cqs)
+{
+    *max_cqs = MAX_CQS;
+    return 0;
+}
+
+static inline int rm_get_max_cqes(__u32 *max_cqes)
+{
+    *max_cqes = MAX_CQES;
+    return 0;
+}
+
+static inline int rm_get_max_pds(__u32 *max_pds)
+{
+    *max_pds = MAX_PDS;
+    return 0;
+}
+
+static inline int rm_get_max_qps(__u32 *max_qps)
+{
+    *max_qps = MAX_QPS;
+    return 0;
+}
+
+static inline int rm_get_max_gids(__u32 *max_gids)
+{
+    *max_gids = MAX_GIDS;
+    return 0;
+}
+
+static inline int rm_get_max_qp_wrs(__u32 *max_qp_wrs)
+{
+    *max_qp_wrs = MAX_QP_WRS;
+    return 0;
+}
+
+static inline int rm_get_max_sges(__u32 *max_sges)
+{
+    *max_sges = MAX_SGES;
+    return 0;
+}
+
+static inline int rm_get_max_mrs(__u32 *max_mrs)
+{
+    *max_mrs = MAX_MRS;
+    return 0;
+}
+
+static inline int rm_get_phys_port_cnt(__u8 *phys_port_cnt)
+{
+    *phys_port_cnt = MAX_PORTS;
+    return 0;
+}
+
+static inline int rm_get_max_ah(__u32 *max_ah)
+{
+    *max_ah = MAX_AH;
+    return 0;
+}
+
+int rm_init(PVRDMADev *dev);
+void rm_fini(PVRDMADev *dev);
+
+int rm_alloc_pd(PVRDMADev *dev, __u32 *pd_handle, __u32 ctx_handle);
+void rm_dealloc_pd(PVRDMADev *dev, __u32 pd_handle);
+
+RmCQ *rm_get_cq(PVRDMADev *dev, __u32 cq_handle);
+int rm_alloc_cq(PVRDMADev *dev, struct pvrdma_cmd_create_cq *cmd,
+        struct pvrdma_cmd_create_cq_resp *resp);
+void rm_req_notify_cq(PVRDMADev *dev, __u32 cq_handle, u32 flags);
+void rm_dealloc_cq(PVRDMADev *dev, __u32 cq_handle);
+
+int rm_alloc_mr(PVRDMADev *dev, struct pvrdma_cmd_create_mr *cmd,
+        struct pvrdma_cmd_create_mr_resp *resp);
+void rm_dealloc_mr(PVRDMADev *dev, __u32 mr_handle);
+
+RmQP *rm_get_qp(PVRDMADev *dev, __u32 qp_handle);
+int rm_alloc_qp(PVRDMADev *dev, struct pvrdma_cmd_create_qp *cmd,
+        struct pvrdma_cmd_create_qp_resp *resp);
+int rm_modify_qp(PVRDMADev *dev, __u32 qp_handle,
+         struct pvrdma_cmd_modify_qp *modify_qp_args);
+void rm_dealloc_qp(PVRDMADev *dev, __u32 qp_handle);
+
+void *rm_get_wqe_ctx(PVRDMADev *dev, unsigned long wqe_ctx_id);
+int rm_alloc_wqe_ctx(PVRDMADev *dev, unsigned long *wqe_ctx_id, void *ctx);
+void rm_dealloc_wqe_ctx(PVRDMADev *dev, unsigned long wqe_ctx_id);
+
+#endif
diff --git a/hw/net/pvrdma/pvrdma_types.h b/hw/net/pvrdma/pvrdma_types.h
new file mode 100644
index 0000000..22a7cde
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_types.h
@@ -0,0 +1,37 @@
+/*
+ * QEMU VMWARE paravirtual RDMA interface definitions
+ *
+ * Developed by Oracle & Redhat
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PVRDMA_TYPES_H
+#define PVRDMA_TYPES_H
+
+/* TDOD: All defs here should be removed !!! */
+
+#include <stdint.h>
+#include <asm-generic/int-ll64.h>
+
+typedef unsigned char uint8_t;
+typedef uint64_t dma_addr_t;
+
+typedef uint8_t        __u8;
+typedef uint8_t        u8;
+typedef unsigned short __u16;
+typedef unsigned short u16;
+typedef uint64_t       u64;
+typedef uint32_t       u32;
+typedef uint32_t       __u32;
+typedef int32_t       __s32;
+#define __bitwise
+typedef __u64 __bitwise __be64;
+
+#endif
diff --git a/hw/net/pvrdma/pvrdma_utils.c b/hw/net/pvrdma/pvrdma_utils.c
new file mode 100644
index 0000000..0f420e2
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_utils.c
@@ -0,0 +1,36 @@
+#include <qemu/osdep.h>
+#include <cpu.h>
+#include <hw/pci/pci.h>
+#include <hw/net/pvrdma/pvrdma_utils.h>
+#include <hw/net/pvrdma/pvrdma.h>
+
+void pvrdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len)
+{
+    pr_dbg("%p\n", buffer);
+    pci_dma_unmap(dev, buffer, len, DMA_DIRECTION_TO_DEVICE, 0);
+}
+
+void *pvrdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen)
+{
+    void *p;
+    hwaddr len = plen;
+
+    if (!addr) {
+        pr_dbg("addr is NULL\n");
+        return NULL;
+    }
+
+    p = pci_dma_map(dev, addr, &len, DMA_DIRECTION_TO_DEVICE);
+    if (!p) {
+        return NULL;
+    }
+
+    if (len != plen) {
+        pvrdma_pci_dma_unmap(dev, p, len);
+        return NULL;
+    }
+
+    pr_dbg("0x%llx -> %p (len=%ld)\n", (long long unsigned int)addr, p, len);
+
+    return p;
+}
diff --git a/hw/net/pvrdma/pvrdma_utils.h b/hw/net/pvrdma/pvrdma_utils.h
new file mode 100644
index 0000000..da01967
--- /dev/null
+++ b/hw/net/pvrdma/pvrdma_utils.h
@@ -0,0 +1,49 @@
+/*
+ * QEMU VMWARE paravirtual RDMA interface definitions
+ *
+ * Developed by Oracle & Redhat
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PVRDMA_UTILS_H
+#define PVRDMA_UTILS_H
+
+#define pr_info(fmt, ...) \
+    fprintf(stdout, "%s: %-20s (%3d): " fmt, "pvrdma",  __func__, __LINE__,\
+           ## __VA_ARGS__)
+
+#define pr_err(fmt, ...) \
+    fprintf(stderr, "%s: Error at %-20s (%3d): " fmt, "pvrdma", __func__, \
+        __LINE__, ## __VA_ARGS__)
+
+#define DEBUG
+#ifdef DEBUG
+#define pr_dbg(fmt, ...) \
+    fprintf(stdout, "%s: %-20s (%3d): " fmt, "pvrdma", __func__, __LINE__,\
+           ## __VA_ARGS__)
+#else
+#define pr_dbg(fmt, ...)
+#endif
+
+static inline int roundup_pow_of_two(int x)
+{
+    x--;
+    x |= (x >> 1);
+    x |= (x >> 2);
+    x |= (x >> 4);
+    x |= (x >> 8);
+    x |= (x >> 16);
+    return x + 1;
+}
+
+void pvrdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len);
+void *pvrdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
+
+#endif
diff --git a/include/hw/pci/pci_ids.h b/include/hw/pci/pci_ids.h
index d77ca60..a016ad6 100644
--- a/include/hw/pci/pci_ids.h
+++ b/include/hw/pci/pci_ids.h
@@ -167,4 +167,7 @@
 #define PCI_VENDOR_ID_TEWS               0x1498
 #define PCI_DEVICE_ID_TEWS_TPCI200       0x30C8
 
+#define PCI_VENDOR_ID_VMWARE             0x15ad
+#define PCI_DEVICE_ID_VMWARE_PVRDMA      0x0820
+
 #endif
-- 
2.5.5

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Leon Romanovsky 8 years, 10 months ago

On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> From: Yuval Shaia <yuval.shaia@oracle.com>
>
>  Hi,
>
>  General description
>  ===================
>  This is a very early RFC of a new RoCE emulated device
>  that enables guests to use the RDMA stack without having
>  a real hardware in the host.
>
>  The current implementation supports only VM to VM communication
>  on the same host.
>  Down the road we plan to make possible to be able to support
>  inter-machine communication by utilizing physical RoCE devices
>  or Soft RoCE.
>
>  The goals are:
>  - Reach fast and secure loos-less Inter-VM data exchange.
>  - Support remote VMs or bare metal machines.
>  - Allow VMs migration.
>  - Do not require to pin all VM memory.
>
>
>  Objective
>  =========
>  Have a QEMU implementation of the PVRDMA device. We aim to do so without
>  any change in the PVRDMA guest driver which is already merged into the
>  upstream kernel.
>
>
>  RFC status
>  ===========
>  The project is in early development stages and supports
>  only basic send/receive operations.
>
>  We present it so we can get feedbacks on design,
>  feature demands and to receive comments from the
>  community pointing us to the "right" direction.

If to judge by the feedback which you got from RDMA community
for kernel proposal [1], this community failed to understand:
1. Why do you need new module?
2. Why existing solutions are not enough and can't be extended?
3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
   communication via virtual NIC?

Can you please help us to fill this knowledge gap?

[1] http://marc.info/?l=linux-rdma&m=149063626907175&w=2

Thanks

>
>  What does work:
>   - Tested with a basic unit-test:
>     - https://github.com/yuvalshaia/kibpingpong .
>   It works fine with two devices on a single VM, has
>   some issue between two VMs in the same host.
>
>
>  Design
>  ======
>  - Follows the behavior of VMware's pvrdma device, however is not tightly
>    coupled with it and most of the code can be reused if we decide to
>    continue to a Virtio based RDMA device.
>
>  - It exposes 3 BARs:
>     BAR 0 - MSIX, utilize 3 vectors for command ring, async events and
>             completions
>     BAR 1 - Configuration of registers
>     BAR 2 - UAR, used to pass HW commands from driver.
>
>  - The device performs internal management of the RDMA
>    resources (PDs, CQs, QPs, ...), meaning the objects
>    are not directly coupled to a physical RDMA device resources.
>
>  - As backend, the pvrdma device uses KDBR, a new kernel module which
>    is also in RFC phase, read more on the linux-rdma list:
>      - https://www.spinics.net/lists/linux-rdma/msg47951.html
>
>  - All RDMA operations are converted to KDBR module calls which performs
>    the actual transfer between VMs, or, in the future,
>    will utilize a RoCE device (either physical or soft) to be able
>    to communicate with another host.
>
>
> Roadmap (out of order)
> ======================
>  - Utilize the RoCE host driver in order to support peers on external hosts.
>  - Re-use the code for a virtio based device.
>
> Any ideas, comments or suggestions would be highly appreciated.
>
> Thanks,
> Yuval Shaia & Marcel Apfelbaum
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> (Mainly design, coding was done by Yuval)
> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>
> ---
>  hw/net/Makefile.objs            |   5 +
>  hw/net/pvrdma/kdbr.h            | 104 +++++++
>  hw/net/pvrdma/pvrdma-uapi.h     | 261 ++++++++++++++++
>  hw/net/pvrdma/pvrdma.h          | 155 ++++++++++
>  hw/net/pvrdma/pvrdma_cmd.c      | 322 +++++++++++++++++++
>  hw/net/pvrdma/pvrdma_defs.h     | 301 ++++++++++++++++++
>  hw/net/pvrdma/pvrdma_dev_api.h  | 342 ++++++++++++++++++++
>  hw/net/pvrdma/pvrdma_ib_verbs.h | 469 ++++++++++++++++++++++++++++
>  hw/net/pvrdma/pvrdma_kdbr.c     | 395 ++++++++++++++++++++++++
>  hw/net/pvrdma/pvrdma_kdbr.h     |  53 ++++
>  hw/net/pvrdma/pvrdma_main.c     | 667 ++++++++++++++++++++++++++++++++++++++++
>  hw/net/pvrdma/pvrdma_qp_ops.c   | 174 +++++++++++
>  hw/net/pvrdma/pvrdma_qp_ops.h   |  25 ++
>  hw/net/pvrdma/pvrdma_ring.c     | 127 ++++++++
>  hw/net/pvrdma/pvrdma_ring.h     |  43 +++
>  hw/net/pvrdma/pvrdma_rm.c       | 529 +++++++++++++++++++++++++++++++
>  hw/net/pvrdma/pvrdma_rm.h       | 214 +++++++++++++
>  hw/net/pvrdma/pvrdma_types.h    |  37 +++
>  hw/net/pvrdma/pvrdma_utils.c    |  36 +++
>  hw/net/pvrdma/pvrdma_utils.h    |  49 +++
>  include/hw/pci/pci_ids.h        |   3 +
>  21 files changed, 4311 insertions(+)
>  create mode 100644 hw/net/pvrdma/kdbr.h
>  create mode 100644 hw/net/pvrdma/pvrdma-uapi.h
>  create mode 100644 hw/net/pvrdma/pvrdma.h
>  create mode 100644 hw/net/pvrdma/pvrdma_cmd.c
>  create mode 100644 hw/net/pvrdma/pvrdma_defs.h
>  create mode 100644 hw/net/pvrdma/pvrdma_dev_api.h
>  create mode 100644 hw/net/pvrdma/pvrdma_ib_verbs.h
>  create mode 100644 hw/net/pvrdma/pvrdma_kdbr.c
>  create mode 100644 hw/net/pvrdma/pvrdma_kdbr.h
>  create mode 100644 hw/net/pvrdma/pvrdma_main.c
>  create mode 100644 hw/net/pvrdma/pvrdma_qp_ops.c
>  create mode 100644 hw/net/pvrdma/pvrdma_qp_ops.h
>  create mode 100644 hw/net/pvrdma/pvrdma_ring.c
>  create mode 100644 hw/net/pvrdma/pvrdma_ring.h
>  create mode 100644 hw/net/pvrdma/pvrdma_rm.c
>  create mode 100644 hw/net/pvrdma/pvrdma_rm.h
>  create mode 100644 hw/net/pvrdma/pvrdma_types.h
>  create mode 100644 hw/net/pvrdma/pvrdma_utils.c
>  create mode 100644 hw/net/pvrdma/pvrdma_utils.h
>
> diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
> index 610ed3e..a962347 100644
> --- a/hw/net/Makefile.objs
> +++ b/hw/net/Makefile.objs
> @@ -43,3 +43,8 @@ common-obj-$(CONFIG_ROCKER) += rocker/rocker.o rocker/rocker_fp.o \
>                                 rocker/rocker_desc.o rocker/rocker_world.o \
>                                 rocker/rocker_of_dpa.o
>  obj-$(call lnot,$(CONFIG_ROCKER)) += rocker/qmp-norocker.o
> +
> +obj-$(CONFIG_PCI) += pvrdma/pvrdma_ring.o pvrdma/pvrdma_rm.o \
> +		     pvrdma/pvrdma_utils.o pvrdma/pvrdma_qp_ops.o \
> +		     pvrdma/pvrdma_kdbr.o pvrdma/pvrdma_cmd.o \
> +		     pvrdma/pvrdma_main.o
> diff --git a/hw/net/pvrdma/kdbr.h b/hw/net/pvrdma/kdbr.h
> new file mode 100644
> index 0000000..97cb93c
> --- /dev/null
> +++ b/hw/net/pvrdma/kdbr.h
> @@ -0,0 +1,104 @@
> +/*
> + * Kernel Data Bridge driver - API
> + *
> + * Copyright 2016 Red Hat, Inc.
> + * Copyright 2016 Oracle
> + *
> + * Authors:
> + *   Marcel Apfelbaum <marcel@redhat.com>
> + *   Yuval Shaia <yuval.shaia@oracle.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef _KDBR_H
> +#define _KDBR_H
> +
> +#ifdef __KERNEL__
> +#include <linux/uio.h>
> +#define KDBR_MAX_IOVEC_LEN    UIO_FASTIOV
> +#else
> +#include <sys/uio.h>
> +#define KDBR_MAX_IOVEC_LEN    8
> +#endif
> +
> +#define KDBR_FILE_NAME "/dev/kdbr"
> +#define KDBR_MAX_PORTS 255
> +
> +#define KDBR_IOC_MAGIC 0xBA
> +
> +#define KDBR_REGISTER_PORT    _IOWR(KDBR_IOC_MAGIC, 0, struct kdbr_reg)
> +#define KDBR_UNREGISTER_PORT    _IOW(KDBR_IOC_MAGIC, 1, int)
> +#define KDBR_IOC_MAX        2
> +
> +
> +enum kdbr_ack_type {
> +    KDBR_ACK_IMMEDIATE,
> +    KDBR_ACK_DELAYED,
> +};
> +
> +struct kdbr_gid {
> +    unsigned long net_id;
> +    unsigned long id;
> +};
> +
> +struct kdbr_peer {
> +    struct kdbr_gid rgid;
> +    unsigned long rqueue;
> +};
> +
> +struct list_head;
> +struct mutex;
> +struct kdbr_connection {
> +    unsigned long queue_id;
> +    struct kdbr_peer peer;
> +    enum kdbr_ack_type ack_type;
> +    /* TODO: hide the below fields in the .c file */
> +    struct list_head *sg_vecs_list;
> +    struct mutex *sg_vecs_mutex;
> +};
> +
> +struct kdbr_reg {
> +    struct kdbr_gid gid; /* in */
> +    int port; /* out */
> +};
> +
> +#define KDBR_REQ_SIGNATURE    0x000000AB
> +#define KDBR_REQ_POST_RECV    0x00000100
> +#define KDBR_REQ_POST_SEND    0x00000200
> +#define KDBR_REQ_POST_MREG    0x00000300
> +#define KDBR_REQ_POST_RDMA    0x00000400
> +
> +struct kdbr_req {
> +    unsigned int flags; /* 8 bits signature, 8 bits msg_type */
> +    struct iovec vec[KDBR_MAX_IOVEC_LEN];
> +    int vlen; /* <= KDBR_MAX_IOVEC_LEN */
> +    int connection_id;
> +    struct kdbr_peer peer;
> +    unsigned long req_id;
> +};
> +
> +#define KDBR_ERR_CODE_EMPTY_VEC           0x101
> +#define KDBR_ERR_CODE_NO_MORE_RECV_BUF    0x102
> +#define KDBR_ERR_CODE_RECV_BUF_PROT       0x103
> +#define KDBR_ERR_CODE_INV_ADDR            0x104
> +#define KDBR_ERR_CODE_INV_CONN_ID         0x105
> +#define KDBR_ERR_CODE_NO_PEER             0x106
> +
> +struct kdbr_completion {
> +    int connection_id;
> +    unsigned long req_id;
> +    int status; /* 0 = Success */
> +};
> +
> +#define KDBR_PORT_IOC_MAGIC    0xBB
> +
> +#define KDBR_PORT_OPEN_CONN    _IOR(KDBR_PORT_IOC_MAGIC, 0, \
> +                     struct kdbr_connection)
> +#define KDBR_PORT_CLOSE_CONN    _IOR(KDBR_PORT_IOC_MAGIC, 1, int)
> +#define KDBR_PORT_IOC_MAX    4
> +
> +#endif
> +
> diff --git a/hw/net/pvrdma/pvrdma-uapi.h b/hw/net/pvrdma/pvrdma-uapi.h
> new file mode 100644
> index 0000000..0045776
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma-uapi.h
> @@ -0,0 +1,261 @@
> +/*
> + * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of EITHER the GNU General Public License
> + * version 2 as published by the Free Software Foundation or the BSD
> + * 2-Clause License. This program is distributed in the hope that it
> + * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
> + * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
> + * See the GNU General Public License version 2 for more details at
> + * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program available in the file COPYING in the main
> + * directory of this source tree.
> + *
> + * The BSD 2-Clause License
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
> + * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> + * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
> + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
> + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
> + * OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef PVRDMA_UAPI_H
> +#define PVRDMA_UAPI_H
> +
> +#include "qemu/osdep.h"
> +#include "qemu/cutils.h"
> +#include <hw/net/pvrdma/pvrdma_types.h>
> +#include <qemu/compiler.h>
> +#include <qemu/atomic.h>
> +
> +#define PVRDMA_VERSION 17
> +
> +#define PVRDMA_UAR_HANDLE_MASK    0x00FFFFFF    /* Bottom 24 bits. */
> +#define PVRDMA_UAR_QP_OFFSET    0        /* Offset of QP doorbell. */
> +#define PVRDMA_UAR_QP_SEND    BIT(30)        /* Send bit. */
> +#define PVRDMA_UAR_QP_RECV    BIT(31)        /* Recv bit. */
> +#define PVRDMA_UAR_CQ_OFFSET    4        /* Offset of CQ doorbell. */
> +#define PVRDMA_UAR_CQ_ARM_SOL    BIT(29)        /* Arm solicited bit. */
> +#define PVRDMA_UAR_CQ_ARM    BIT(30)        /* Arm bit. */
> +#define PVRDMA_UAR_CQ_POLL    BIT(31)        /* Poll bit. */
> +#define PVRDMA_INVALID_IDX    -1        /* Invalid index. */
> +
> +/* PVRDMA atomic compare and swap */
> +struct pvrdma_exp_cmp_swap {
> +    __u64 swap_val;
> +    __u64 compare_val;
> +    __u64 swap_mask;
> +    __u64 compare_mask;
> +};
> +
> +/* PVRDMA atomic fetch and add */
> +struct pvrdma_exp_fetch_add {
> +    __u64 add_val;
> +    __u64 field_boundary;
> +};
> +
> +/* PVRDMA address vector. */
> +struct pvrdma_av {
> +    __u32 port_pd;
> +    __u32 sl_tclass_flowlabel;
> +    __u8 dgid[16];
> +    __u8 src_path_bits;
> +    __u8 gid_index;
> +    __u8 stat_rate;
> +    __u8 hop_limit;
> +    __u8 dmac[6];
> +    __u8 reserved[6];
> +};
> +
> +/* PVRDMA scatter/gather entry */
> +struct pvrdma_sge {
> +    __u64   addr;
> +    __u32   length;
> +    __u32   lkey;
> +};
> +
> +/* PVRDMA receive queue work request */
> +struct pvrdma_rq_wqe_hdr {
> +    __u64 wr_id;        /* wr id */
> +    __u32 num_sge;        /* size of s/g array */
> +    __u32 total_len;    /* reserved */
> +};
> +/* Use pvrdma_sge (ib_sge) for receive queue s/g array elements. */
> +
> +/* PVRDMA send queue work request */
> +struct pvrdma_sq_wqe_hdr {
> +    __u64 wr_id;        /* wr id */
> +    __u32 num_sge;        /* size of s/g array */
> +    __u32 total_len;    /* reserved */
> +    __u32 opcode;        /* operation type */
> +    __u32 send_flags;    /* wr flags */
> +    union {
> +        __u32 imm_data;
> +        __u32 invalidate_rkey;
> +    } ex;
> +    __u32 reserved;
> +    union {
> +        struct {
> +            __u64 remote_addr;
> +            __u32 rkey;
> +            __u8 reserved[4];
> +        } rdma;
> +        struct {
> +            __u64 remote_addr;
> +            __u64 compare_add;
> +            __u64 swap;
> +            __u32 rkey;
> +            __u32 reserved;
> +        } atomic;
> +        struct {
> +            __u64 remote_addr;
> +            __u32 log_arg_sz;
> +            __u32 rkey;
> +            union {
> +                struct pvrdma_exp_cmp_swap  cmp_swap;
> +                struct pvrdma_exp_fetch_add fetch_add;
> +            } wr_data;
> +        } masked_atomics;
> +        struct {
> +            __u64 iova_start;
> +            __u64 pl_pdir_dma;
> +            __u32 page_shift;
> +            __u32 page_list_len;
> +            __u32 length;
> +            __u32 access_flags;
> +            __u32 rkey;
> +        } fast_reg;
> +        struct {
> +            __u32 remote_qpn;
> +            __u32 remote_qkey;
> +            struct pvrdma_av av;
> +        } ud;
> +    } wr;
> +};
> +/* Use pvrdma_sge (ib_sge) for send queue s/g array elements. */
> +
> +/* Completion queue element. */
> +struct pvrdma_cqe {
> +    __u64 wr_id;
> +    __u64 qp;
> +    __u32 opcode;
> +    __u32 status;
> +    __u32 byte_len;
> +    __u32 imm_data;
> +    __u32 src_qp;
> +    __u32 wc_flags;
> +    __u32 vendor_err;
> +    __u16 pkey_index;
> +    __u16 slid;
> +    __u8 sl;
> +    __u8 dlid_path_bits;
> +    __u8 port_num;
> +    __u8 smac[6];
> +    __u8 reserved2[7]; /* Pad to next power of 2 (64). */
> +};
> +
> +struct pvrdma_ring {
> +    int prod_tail;    /* Producer tail. */
> +    int cons_head;    /* Consumer head. */
> +};
> +
> +struct pvrdma_ring_state {
> +    struct pvrdma_ring tx;    /* Tx ring. */
> +    struct pvrdma_ring rx;    /* Rx ring. */
> +};
> +
> +static inline int pvrdma_idx_valid(__u32 idx, __u32 max_elems)
> +{
> +    /* Generates fewer instructions than a less-than. */
> +    return (idx & ~((max_elems << 1) - 1)) == 0;
> +}
> +
> +static inline __s32 pvrdma_idx(int *var, __u32 max_elems)
> +{
> +    unsigned int idx = atomic_read(var);
> +
> +    if (pvrdma_idx_valid(idx, max_elems)) {
> +        return idx & (max_elems - 1);
> +    }
> +    return PVRDMA_INVALID_IDX;
> +}
> +
> +static inline void pvrdma_idx_ring_inc(int *var, __u32 max_elems)
> +{
> +    __u32 idx = atomic_read(var) + 1;    /* Increment. */
> +
> +    idx &= (max_elems << 1) - 1;        /* Modulo size, flip gen. */
> +    atomic_set(var, idx);
> +}
> +
> +static inline __s32 pvrdma_idx_ring_has_space(const struct pvrdma_ring *r,
> +                          __u32 max_elems, __u32 *out_tail)
> +{
> +    const __u32 tail = atomic_read(&r->prod_tail);
> +    const __u32 head = atomic_read(&r->cons_head);
> +
> +    if (pvrdma_idx_valid(tail, max_elems) &&
> +        pvrdma_idx_valid(head, max_elems)) {
> +        *out_tail = tail & (max_elems - 1);
> +        return tail != (head ^ max_elems);
> +    }
> +    return PVRDMA_INVALID_IDX;
> +}
> +
> +static inline __s32 pvrdma_idx_ring_has_data(const struct pvrdma_ring *r,
> +                         __u32 max_elems, __u32 *out_head)
> +{
> +    const __u32 tail = atomic_read(&r->prod_tail);
> +    const __u32 head = atomic_read(&r->cons_head);
> +
> +    if (pvrdma_idx_valid(tail, max_elems) &&
> +        pvrdma_idx_valid(head, max_elems)) {
> +        *out_head = head & (max_elems - 1);
> +        return tail != head;
> +    }
> +    return PVRDMA_INVALID_IDX;
> +}
> +
> +static inline bool pvrdma_idx_ring_is_valid_idx(const struct pvrdma_ring *r,
> +                        __u32 max_elems, __u32 *idx)
> +{
> +    const __u32 tail = atomic_read(&r->prod_tail);
> +    const __u32 head = atomic_read(&r->cons_head);
> +
> +    if (pvrdma_idx_valid(tail, max_elems) &&
> +        pvrdma_idx_valid(head, max_elems) &&
> +        pvrdma_idx_valid(*idx, max_elems)) {
> +        if (tail > head && (*idx < tail && *idx >= head)) {
> +            return true;
> +        } else if (head > tail && (*idx >= head || *idx < tail)) {
> +            return true;
> +        }
> +    }
> +    return false;
> +}
> +
> +#endif /* PVRDMA_UAPI_H */
> diff --git a/hw/net/pvrdma/pvrdma.h b/hw/net/pvrdma/pvrdma.h
> new file mode 100644
> index 0000000..d6349d4
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma.h
> @@ -0,0 +1,155 @@
> +/*
> + * QEMU VMWARE paravirtual RDMA interface definitions
> + *
> + * Developed by Oracle & Redhat
> + *
> + * Authors:
> + *     Yuval Shaia <yuval.shaia@oracle.com>
> + *     Marcel Apfelbaum <marcel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef PVRDMA_PVRDMA_H
> +#define PVRDMA_PVRDMA_H
> +
> +#include <qemu/osdep.h>
> +#include <hw/pci/pci.h>
> +#include <hw/pci/msix.h>
> +#include <hw/net/pvrdma/pvrdma_kdbr.h>
> +#include <hw/net/pvrdma/pvrdma_rm.h>
> +#include <hw/net/pvrdma/pvrdma_defs.h>
> +#include <hw/net/pvrdma/pvrdma_dev_api.h>
> +#include <hw/net/pvrdma/pvrdma_ring.h>
> +
> +/* BARs */
> +#define RDMA_MSIX_BAR_IDX    0
> +#define RDMA_REG_BAR_IDX     1
> +#define RDMA_UAR_BAR_IDX     2
> +#define RDMA_BAR0_MSIX_SIZE  (16 * 1024)
> +#define RDMA_BAR1_REGS_SIZE  256
> +#define RDMA_BAR2_UAR_SIZE   (16 * 1024)
> +
> +/* MSIX */
> +#define RDMA_MAX_INTRS       3
> +#define RDMA_MSIX_TABLE      0x0000
> +#define RDMA_MSIX_PBA        0x2000
> +
> +/* Interrupts Vectors */
> +#define INTR_VEC_CMD_RING            0
> +#define INTR_VEC_CMD_ASYNC_EVENTS    1
> +#define INTR_VEC_CMD_COMPLETION_Q    2
> +
> +/* HW attributes */
> +#define PVRDMA_HW_NAME       "pvrdma"
> +#define PVRDMA_HW_VERSION    17
> +#define PVRDMA_FW_VERSION    14
> +
> +/* Vendor Errors, codes 100 to FFF kept for kdbr */
> +#define VENDOR_ERR_TOO_MANY_SGES    0x201
> +#define VENDOR_ERR_NOMEM            0x202
> +#define VENDOR_ERR_FAIL_KDBR        0x203
> +
> +typedef struct HWResourceIDs {
> +    unsigned long *local_bitmap;
> +    __u32 *hw_map;
> +} HWResourceIDs;
> +
> +typedef struct DSRInfo {
> +    dma_addr_t dma;
> +    struct pvrdma_device_shared_region *dsr;
> +
> +    union pvrdma_cmd_req *req;
> +    union pvrdma_cmd_resp *rsp;
> +
> +    struct pvrdma_ring *async_ring_state;
> +    Ring async;
> +
> +    struct pvrdma_ring *cq_ring_state;
> +    Ring cq;
> +} DSRInfo;
> +
> +typedef struct PVRDMADev {
> +    PCIDevice parent_obj;
> +    MemoryRegion msix;
> +    MemoryRegion regs;
> +    __u32 regs_data[RDMA_BAR1_REGS_SIZE];
> +    MemoryRegion uar;
> +    __u32 uar_data[RDMA_BAR2_UAR_SIZE];
> +    DSRInfo dsr_info;
> +    int interrupt_mask;
> +    RmPort ports[MAX_PORTS];
> +    u64 sys_image_guid;
> +    u64 node_guid;
> +    u64 network_prefix;
> +    RmResTbl pd_tbl;
> +    RmResTbl mr_tbl;
> +    RmResTbl qp_tbl;
> +    RmResTbl cq_tbl;
> +    RmResTbl wqe_ctx_tbl;
> +} PVRDMADev;
> +#define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
> +
> +static inline int get_reg_val(PVRDMADev *dev, hwaddr addr, __u32 *val)
> +{
> +    int idx = addr >> 2;
> +
> +    if (idx > RDMA_BAR1_REGS_SIZE) {
> +        return -EINVAL;
> +    }
> +
> +    *val = dev->regs_data[idx];
> +
> +    return 0;
> +}
> +static inline int set_reg_val(PVRDMADev *dev, hwaddr addr, __u32 val)
> +{
> +    int idx = addr >> 2;
> +
> +    if (idx > RDMA_BAR1_REGS_SIZE) {
> +        return -EINVAL;
> +    }
> +
> +    dev->regs_data[idx] = val;
> +
> +    return 0;
> +}
> +static inline int get_uar_val(PVRDMADev *dev, hwaddr addr, __u32 *val)
> +{
> +    int idx = addr >> 2;
> +
> +    if (idx > RDMA_BAR2_UAR_SIZE) {
> +        return -EINVAL;
> +    }
> +
> +    *val = dev->uar_data[idx];
> +
> +    return 0;
> +}
> +static inline int set_uar_val(PVRDMADev *dev, hwaddr addr, __u32 val)
> +{
> +    int idx = addr >> 2;
> +
> +    if (idx > RDMA_BAR2_UAR_SIZE) {
> +        return -EINVAL;
> +    }
> +
> +    dev->uar_data[idx] = val;
> +
> +    return 0;
> +}
> +
> +static inline void post_interrupt(PVRDMADev *dev, unsigned vector)
> +{
> +    PCIDevice *pci_dev = PCI_DEVICE(dev);
> +
> +    if (likely(dev->interrupt_mask == 0)) {
> +        msix_notify(pci_dev, vector);
> +    }
> +}
> +
> +int execute_command(PVRDMADev *dev);
> +
> +#endif
> diff --git a/hw/net/pvrdma/pvrdma_cmd.c b/hw/net/pvrdma/pvrdma_cmd.c
> new file mode 100644
> index 0000000..ae1ef99
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_cmd.c
> @@ -0,0 +1,322 @@
> +#include "qemu/osdep.h"
> +#include "hw/hw.h"
> +#include "hw/pci/pci.h"
> +#include "hw/pci/pci_ids.h"
> +#include "hw/net/pvrdma/pvrdma_utils.h"
> +#include "hw/net/pvrdma/pvrdma.h"
> +#include "hw/net/pvrdma/pvrdma_rm.h"
> +#include "hw/net/pvrdma/pvrdma_kdbr.h"
> +
> +static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                      union pvrdma_cmd_resp *rsp)
> +{
> +    struct pvrdma_cmd_query_port *cmd = &req->query_port;
> +    struct pvrdma_cmd_query_port_resp *resp = &rsp->query_port_resp;
> +    __u32 max_port_gids, max_port_pkeys;
> +
> +    pr_dbg("port=%d\n", cmd->port_num);
> +
> +    if (rm_get_max_port_gids(&max_port_gids) != 0) {
> +        return -ENOMEM;
> +    }
> +
> +    if (rm_get_max_port_pkeys(&max_port_pkeys) != 0) {
> +        return -ENOMEM;
> +    }
> +
> +    memset(resp, 0, sizeof(*resp));
> +    resp->hdr.response = cmd->hdr.response;
> +    resp->hdr.ack = PVRDMA_CMD_QUERY_PORT_RESP;
> +    resp->hdr.err = 0;
> +
> +    resp->attrs.state = PVRDMA_PORT_ACTIVE;
> +    resp->attrs.max_mtu = PVRDMA_MTU_4096;
> +    resp->attrs.active_mtu = PVRDMA_MTU_4096;
> +    resp->attrs.gid_tbl_len = max_port_gids;
> +    resp->attrs.port_cap_flags = 0;
> +    resp->attrs.max_msg_sz = 1024;
> +    resp->attrs.bad_pkey_cntr = 0;
> +    resp->attrs.qkey_viol_cntr = 0;
> +    resp->attrs.pkey_tbl_len = max_port_pkeys;
> +    resp->attrs.lid = 0;
> +    resp->attrs.sm_lid = 0;
> +    resp->attrs.lmc = 0;
> +    resp->attrs.max_vl_num = 0;
> +    resp->attrs.sm_sl = 0;
> +    resp->attrs.subnet_timeout = 0;
> +    resp->attrs.init_type_reply = 0;
> +    resp->attrs.active_width = 1;
> +    resp->attrs.active_speed = 1;
> +    resp->attrs.phys_state = 1;
> +
> +    return 0;
> +}
> +
> +static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                      union pvrdma_cmd_resp *rsp)
> +{
> +    struct pvrdma_cmd_query_pkey *cmd = &req->query_pkey;
> +    struct pvrdma_cmd_query_pkey_resp *resp = &rsp->query_pkey_resp;
> +
> +    pr_dbg("port=%d\n", cmd->port_num);
> +    pr_dbg("index=%d\n", cmd->index);
> +
> +    memset(resp, 0, sizeof(*resp));
> +    resp->hdr.response = cmd->hdr.response;
> +    resp->hdr.ack = PVRDMA_CMD_QUERY_PKEY_RESP;
> +    resp->hdr.err = 0;
> +
> +    resp->pkey = 0x7FFF;
> +    pr_dbg("pkey=0x%x\n", resp->pkey);
> +
> +    return 0;
> +}
> +
> +static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                     union pvrdma_cmd_resp *rsp)
> +{
> +    struct pvrdma_cmd_create_pd *cmd = &req->create_pd;
> +    struct pvrdma_cmd_create_pd_resp *resp = &rsp->create_pd_resp;
> +
> +    pr_dbg("context=0x%x\n", cmd->ctx_handle ? cmd->ctx_handle : 0);
> +
> +    memset(resp, 0, sizeof(*resp));
> +    resp->hdr.response = cmd->hdr.response;
> +    resp->hdr.ack = PVRDMA_CMD_CREATE_PD_RESP;
> +    resp->hdr.err = rm_alloc_pd(dev, &resp->pd_handle, cmd->ctx_handle);
> +
> +    pr_dbg("ret=%d\n", resp->hdr.err);
> +    return resp->hdr.err;
> +}
> +
> +static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                      union pvrdma_cmd_resp *rsp)
> +{
> +    struct pvrdma_cmd_destroy_pd *cmd = &req->destroy_pd;
> +
> +    pr_dbg("pd_handle=%d\n", cmd->pd_handle);
> +
> +    rm_dealloc_pd(dev, cmd->pd_handle);
> +
> +    return 0;
> +}
> +
> +static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                     union pvrdma_cmd_resp *rsp)
> +{
> +    struct pvrdma_cmd_create_mr *cmd = &req->create_mr;
> +    struct pvrdma_cmd_create_mr_resp *resp = &rsp->create_mr_resp;
> +
> +    pr_dbg("pd_handle=%d\n", cmd->pd_handle);
> +    pr_dbg("access_flags=0x%x\n", cmd->access_flags);
> +    pr_dbg("flags=0x%x\n", cmd->flags);
> +
> +    memset(resp, 0, sizeof(*resp));
> +    resp->hdr.response = cmd->hdr.response;
> +    resp->hdr.ack = PVRDMA_CMD_CREATE_MR_RESP;
> +    resp->hdr.err = rm_alloc_mr(dev, cmd, resp);
> +
> +    pr_dbg("ret=%d\n", resp->hdr.err);
> +    return resp->hdr.err;
> +}
> +
> +static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                      union pvrdma_cmd_resp *rsp)
> +{
> +    struct pvrdma_cmd_destroy_mr *cmd = &req->destroy_mr;
> +
> +    pr_dbg("mr_handle=%d\n", cmd->mr_handle);
> +
> +    rm_dealloc_mr(dev, cmd->mr_handle);
> +
> +    return 0;
> +}
> +
> +static int create_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                     union pvrdma_cmd_resp *rsp)
> +{
> +    struct pvrdma_cmd_create_cq *cmd = &req->create_cq;
> +    struct pvrdma_cmd_create_cq_resp *resp = &rsp->create_cq_resp;
> +
> +    pr_dbg("pdir_dma=0x%llx\n", (long long unsigned int)cmd->pdir_dma);
> +    pr_dbg("context=0x%x\n", cmd->ctx_handle ? cmd->ctx_handle : 0);
> +    pr_dbg("cqe=%d\n", cmd->cqe);
> +    pr_dbg("nchunks=%d\n", cmd->nchunks);
> +
> +    memset(resp, 0, sizeof(*resp));
> +    resp->hdr.response = cmd->hdr.response;
> +    resp->hdr.ack = PVRDMA_CMD_CREATE_CQ_RESP;
> +    resp->hdr.err = rm_alloc_cq(dev, cmd, resp);
> +
> +    pr_dbg("ret=%d\n", resp->hdr.err);
> +    return resp->hdr.err;
> +}
> +
> +static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                      union pvrdma_cmd_resp *rsp)
> +{
> +    struct pvrdma_cmd_destroy_cq *cmd = &req->destroy_cq;
> +
> +    pr_dbg("cq_handle=%d\n", cmd->cq_handle);
> +
> +    rm_dealloc_cq(dev, cmd->cq_handle);
> +
> +    return 0;
> +}
> +
> +static int create_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                     union pvrdma_cmd_resp *rsp)
> +{
> +    struct pvrdma_cmd_create_qp *cmd = &req->create_qp;
> +    struct pvrdma_cmd_create_qp_resp *resp = &rsp->create_qp_resp;
> +
> +    if (!dev->ports[0].kdbr_port) {
> +        pr_dbg("First QP, registering port 0\n");
> +        dev->ports[0].kdbr_port = kdbr_alloc_port(dev);
> +        if (!dev->ports[0].kdbr_port) {
> +            pr_dbg("Fail to register port\n");
> +            return -EIO;
> +        }
> +    }
> +
> +    pr_dbg("pd_handle=%d\n", cmd->pd_handle);
> +    pr_dbg("pdir_dma=0x%llx\n", (long long unsigned int)cmd->pdir_dma);
> +    pr_dbg("total_chunks=%d\n", cmd->total_chunks);
> +    pr_dbg("send_chunks=%d\n", cmd->send_chunks);
> +
> +    memset(resp, 0, sizeof(*resp));
> +    resp->hdr.response = cmd->hdr.response;
> +    resp->hdr.ack = PVRDMA_CMD_CREATE_QP_RESP;
> +    resp->hdr.err = rm_alloc_qp(dev, cmd, resp);
> +
> +    pr_dbg("ret=%d\n", resp->hdr.err);
> +    return resp->hdr.err;
> +}
> +
> +static int modify_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                     union pvrdma_cmd_resp *rsp)
> +{
> +    struct pvrdma_cmd_modify_qp *cmd = &req->modify_qp;
> +
> +    pr_dbg("qp_handle=%d\n", cmd->qp_handle);
> +
> +    memset(rsp, 0, sizeof(*rsp));
> +    rsp->hdr.response = cmd->hdr.response;
> +    rsp->hdr.ack = PVRDMA_CMD_MODIFY_QP_RESP;
> +    rsp->hdr.err = rm_modify_qp(dev, cmd->qp_handle, cmd);
> +
> +    pr_dbg("ret=%d\n", rsp->hdr.err);
> +    return rsp->hdr.err;
> +}
> +
> +static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                      union pvrdma_cmd_resp *rsp)
> +{
> +    struct pvrdma_cmd_destroy_qp *cmd = &req->destroy_qp;
> +
> +    pr_dbg("qp_handle=%d\n", cmd->qp_handle);
> +
> +    rm_dealloc_qp(dev, cmd->qp_handle);
> +
> +    return 0;
> +}
> +
> +static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                       union pvrdma_cmd_resp *rsp)
> +{
> +    int rc;
> +    struct pvrdma_cmd_create_bind *cmd = &req->create_bind;
> +    u32 max_port_gids;
> +#ifdef DEBUG
> +    __be64 *subnet = (__be64 *)&cmd->new_gid[0];
> +    __be64 *if_id = (__be64 *)&cmd->new_gid[8];
> +#endif
> +
> +    pr_dbg("index=%d\n", cmd->index);
> +
> +    rc = rm_get_max_port_gids(&max_port_gids);
> +    if (rc) {
> +        return -EIO;
> +    }
> +
> +    if (cmd->index > max_port_gids) {
> +        return -EINVAL;
> +    }
> +
> +    pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index, *subnet, *if_id);
> +
> +    /* Driver forces to one port only */
> +    memcpy(dev->ports[0].gid_tbl[cmd->index].raw, &cmd->new_gid,
> +           sizeof(cmd->new_gid));
> +
> +    return 0;
> +}
> +
> +static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +                        union pvrdma_cmd_resp *rsp)
> +{
> +    /*  TODO: Check the usage of this table */
> +
> +    struct pvrdma_cmd_destroy_bind *cmd = &req->destroy_bind;
> +
> +    pr_dbg("clear index %d\n", cmd->index);
> +
> +    memset(dev->ports[0].gid_tbl[cmd->index].raw, 0,
> +           sizeof(dev->ports[0].gid_tbl[cmd->index].raw));
> +
> +    return 0;
> +}
> +
> +struct cmd_handler {
> +    __u32 cmd;
> +    int (*exec)(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +            union pvrdma_cmd_resp *rsp);
> +};
> +
> +static struct cmd_handler cmd_handlers[] = {
> +    {PVRDMA_CMD_QUERY_PORT, query_port},
> +    {PVRDMA_CMD_QUERY_PKEY, query_pkey},
> +    {PVRDMA_CMD_CREATE_PD, create_pd},
> +    {PVRDMA_CMD_DESTROY_PD, destroy_pd},
> +    {PVRDMA_CMD_CREATE_MR, create_mr},
> +    {PVRDMA_CMD_DESTROY_MR, destroy_mr},
> +    {PVRDMA_CMD_CREATE_CQ, create_cq},
> +    {PVRDMA_CMD_RESIZE_CQ, NULL},
> +    {PVRDMA_CMD_DESTROY_CQ, destroy_cq},
> +    {PVRDMA_CMD_CREATE_QP, create_qp},
> +    {PVRDMA_CMD_MODIFY_QP, modify_qp},
> +    {PVRDMA_CMD_QUERY_QP, NULL},
> +    {PVRDMA_CMD_DESTROY_QP, destroy_qp},
> +    {PVRDMA_CMD_CREATE_UC, NULL},
> +    {PVRDMA_CMD_DESTROY_UC, NULL},
> +    {PVRDMA_CMD_CREATE_BIND, create_bind},
> +    {PVRDMA_CMD_DESTROY_BIND, destroy_bind},
> +};
> +
> +int execute_command(PVRDMADev *dev)
> +{
> +    int err = 0xFFFF;
> +    DSRInfo *dsr_info;
> +
> +    dsr_info = &dev->dsr_info;
> +
> +    pr_dbg("cmd=%d\n", dsr_info->req->hdr.cmd);
> +    if (dsr_info->req->hdr.cmd >= sizeof(cmd_handlers) /
> +                      sizeof(struct cmd_handler)) {
> +        pr_err("Unsupported command\n");
> +        goto out;
> +    }
> +
> +    if (!cmd_handlers[dsr_info->req->hdr.cmd].exec) {
> +        pr_err("Unsupported command (not implemented yet)\n");
> +        goto out;
> +    }
> +
> +    err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
> +                            dsr_info->rsp);
> +out:
> +    set_reg_val(dev, PVRDMA_REG_ERR, err);
> +    post_interrupt(dev, INTR_VEC_CMD_RING);
> +
> +    return (err == 0) ? 0 : -EINVAL;
> +}
> diff --git a/hw/net/pvrdma/pvrdma_defs.h b/hw/net/pvrdma/pvrdma_defs.h
> new file mode 100644
> index 0000000..1d0cc11
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_defs.h
> @@ -0,0 +1,301 @@
> +/*
> + * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of EITHER the GNU General Public License
> + * version 2 as published by the Free Software Foundation or the BSD
> + * 2-Clause License. This program is distributed in the hope that it
> + * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
> + * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
> + * See the GNU General Public License version 2 for more details at
> + * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program available in the file COPYING in the main
> + * directory of this source tree.
> + *
> + * The BSD 2-Clause License
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
> + * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> + * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
> + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
> + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
> + * OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef PVRDMA_DEFS_H
> +#define PVRDMA_DEFS_H
> +
> +#include <hw/net/pvrdma/pvrdma_types.h>
> +#include <hw/net/pvrdma/pvrdma_ib_verbs.h>
> +#include <hw/net/pvrdma/pvrdma-uapi.h>
> +
> +/*
> + * Masks and accessors for page directory, which is a two-level lookup:
> + * page directory -> page table -> page. Only one directory for now, but we
> + * could expand that easily. 9 bits for tables, 9 bits for pages, gives one
> + * gigabyte for memory regions and so forth.
> + */
> +
> +#define PVRDMA_PDIR_SHIFT        18
> +#define PVRDMA_PTABLE_SHIFT        9
> +#define PVRDMA_PAGE_DIR_DIR(x)        (((x) >> PVRDMA_PDIR_SHIFT) & 0x1)
> +#define PVRDMA_PAGE_DIR_TABLE(x)    (((x) >> PVRDMA_PTABLE_SHIFT) & 0x1ff)
> +#define PVRDMA_PAGE_DIR_PAGE(x)        ((x) & 0x1ff)
> +#define PVRDMA_PAGE_DIR_MAX_PAGES    (1 * 512 * 512)
> +#define PVRDMA_MAX_FAST_REG_PAGES    128
> +
> +/*
> + * Max MSI-X vectors.
> + */
> +
> +#define PVRDMA_MAX_INTERRUPTS    3
> +
> +/* Register offsets within PCI resource on BAR1. */
> +#define PVRDMA_REG_VERSION    0x00    /* R: Version of device. */
> +#define PVRDMA_REG_DSRLOW    0x04    /* W: Device shared region low PA. */
> +#define PVRDMA_REG_DSRHIGH    0x08    /* W: Device shared region high PA. */
> +#define PVRDMA_REG_CTL        0x0c    /* W: PVRDMA_DEVICE_CTL */
> +#define PVRDMA_REG_REQUEST    0x10    /* W: Indicate device request. */
> +#define PVRDMA_REG_ERR        0x14    /* R: Device error. */
> +#define PVRDMA_REG_ICR        0x18    /* R: Interrupt cause. */
> +#define PVRDMA_REG_IMR        0x1c    /* R/W: Interrupt mask. */
> +#define PVRDMA_REG_MACL        0x20    /* R/W: MAC address low. */
> +#define PVRDMA_REG_MACH        0x24    /* R/W: MAC address high. */
> +
> +/* Object flags. */
> +#define PVRDMA_CQ_FLAG_ARMED_SOL    BIT(0)    /* Armed for solicited-only. */
> +#define PVRDMA_CQ_FLAG_ARMED        BIT(1)    /* Armed. */
> +#define PVRDMA_MR_FLAG_DMA        BIT(0)    /* DMA region. */
> +#define PVRDMA_MR_FLAG_FRMR        BIT(1)    /* Fast reg memory region. */
> +
> +/*
> + * Atomic operation capability (masked versions are extended atomic
> + * operations.
> + */
> +
> +#define PVRDMA_ATOMIC_OP_COMP_SWAP    BIT(0) /* Compare and swap. */
> +#define PVRDMA_ATOMIC_OP_FETCH_ADD    BIT(1) /* Fetch and add. */
> +#define PVRDMA_ATOMIC_OP_MASK_COMP_SWAP    BIT(2) /* Masked compare and swap. */
> +#define PVRDMA_ATOMIC_OP_MASK_FETCH_ADD    BIT(3) /* Masked fetch and add. */
> +
> +/*
> + * Base Memory Management Extension flags to support Fast Reg Memory Regions
> + * and Fast Reg Work Requests. Each flag represents a verb operation and we
> + * must support all of them to qualify for the BMME device cap.
> + */
> +
> +#define PVRDMA_BMME_FLAG_LOCAL_INV    BIT(0) /* Local Invalidate. */
> +#define PVRDMA_BMME_FLAG_REMOTE_INV    BIT(1) /* Remote Invalidate. */
> +#define PVRDMA_BMME_FLAG_FAST_REG_WR    BIT(2) /* Fast Reg Work Request. */
> +
> +/*
> + * GID types. The interpretation of the gid_types bit field in the device
> + * capabilities will depend on the device mode. For now, the device only
> + * supports RoCE as mode, so only the different GID types for RoCE are
> + * defined.
> + */
> +
> +#define PVRDMA_GID_TYPE_FLAG_ROCE_V1 BIT(0)
> +#define PVRDMA_GID_TYPE_FLAG_ROCE_V2 BIT(1)
> +
> +enum pvrdma_pci_resource {
> +    PVRDMA_PCI_RESOURCE_MSIX,    /* BAR0: MSI-X, MMIO. */
> +    PVRDMA_PCI_RESOURCE_REG,    /* BAR1: Registers, MMIO. */
> +    PVRDMA_PCI_RESOURCE_UAR,    /* BAR2: UAR pages, MMIO, 64-bit. */
> +    PVRDMA_PCI_RESOURCE_LAST,    /* Last. */
> +};
> +
> +enum pvrdma_device_ctl {
> +    PVRDMA_DEVICE_CTL_ACTIVATE,    /* Activate device. */
> +    PVRDMA_DEVICE_CTL_QUIESCE,    /* Quiesce device. */
> +    PVRDMA_DEVICE_CTL_RESET,    /* Reset device. */
> +};
> +
> +enum pvrdma_intr_vector {
> +    PVRDMA_INTR_VECTOR_RESPONSE,    /* Command response. */
> +    PVRDMA_INTR_VECTOR_ASYNC,    /* Async events. */
> +    PVRDMA_INTR_VECTOR_CQ,        /* CQ notification. */
> +    /* Additional CQ notification vectors. */
> +};
> +
> +enum pvrdma_intr_cause {
> +    PVRDMA_INTR_CAUSE_RESPONSE    = (1 << PVRDMA_INTR_VECTOR_RESPONSE),
> +    PVRDMA_INTR_CAUSE_ASYNC        = (1 << PVRDMA_INTR_VECTOR_ASYNC),
> +    PVRDMA_INTR_CAUSE_CQ        = (1 << PVRDMA_INTR_VECTOR_CQ),
> +};
> +
> +enum pvrdma_intr_type {
> +    PVRDMA_INTR_TYPE_INTX,        /* Legacy. */
> +    PVRDMA_INTR_TYPE_MSI,        /* MSI. */
> +    PVRDMA_INTR_TYPE_MSIX,        /* MSI-X. */
> +};
> +
> +enum pvrdma_gos_bits {
> +    PVRDMA_GOS_BITS_UNK,        /* Unknown. */
> +    PVRDMA_GOS_BITS_32,        /* 32-bit. */
> +    PVRDMA_GOS_BITS_64,        /* 64-bit. */
> +};
> +
> +enum pvrdma_gos_type {
> +    PVRDMA_GOS_TYPE_UNK,        /* Unknown. */
> +    PVRDMA_GOS_TYPE_LINUX,        /* Linux. */
> +};
> +
> +enum pvrdma_device_mode {
> +    PVRDMA_DEVICE_MODE_ROCE,    /* RoCE. */
> +    PVRDMA_DEVICE_MODE_IWARP,    /* iWarp. */
> +    PVRDMA_DEVICE_MODE_IB,        /* InfiniBand. */
> +};
> +
> +struct pvrdma_gos_info {
> +    u32 gos_bits:2;            /* W: PVRDMA_GOS_BITS_ */
> +    u32 gos_type:4;            /* W: PVRDMA_GOS_TYPE_ */
> +    u32 gos_ver:16;            /* W: Guest OS version. */
> +    u32 gos_misc:10;        /* W: Other. */
> +    u32 pad;            /* Pad to 8-byte alignment. */
> +};
> +
> +struct pvrdma_device_caps {
> +    u64 fw_ver;                /* R: Query device. */
> +    __be64 node_guid;
> +    __be64 sys_image_guid;
> +    u64 max_mr_size;
> +    u64 page_size_cap;
> +    u64 atomic_arg_sizes;            /* EXP verbs. */
> +    u32 exp_comp_mask;            /* EXP verbs. */
> +    u32 device_cap_flags2;            /* EXP verbs. */
> +    u32 max_fa_bit_boundary;        /* EXP verbs. */
> +    u32 log_max_atomic_inline_arg;        /* EXP verbs. */
> +    u32 vendor_id;
> +    u32 vendor_part_id;
> +    u32 hw_ver;
> +    u32 max_qp;
> +    u32 max_qp_wr;
> +    u32 device_cap_flags;
> +    u32 max_sge;
> +    u32 max_sge_rd;
> +    u32 max_cq;
> +    u32 max_cqe;
> +    u32 max_mr;
> +    u32 max_pd;
> +    u32 max_qp_rd_atom;
> +    u32 max_ee_rd_atom;
> +    u32 max_res_rd_atom;
> +    u32 max_qp_init_rd_atom;
> +    u32 max_ee_init_rd_atom;
> +    u32 max_ee;
> +    u32 max_rdd;
> +    u32 max_mw;
> +    u32 max_raw_ipv6_qp;
> +    u32 max_raw_ethy_qp;
> +    u32 max_mcast_grp;
> +    u32 max_mcast_qp_attach;
> +    u32 max_total_mcast_qp_attach;
> +    u32 max_ah;
> +    u32 max_fmr;
> +    u32 max_map_per_fmr;
> +    u32 max_srq;
> +    u32 max_srq_wr;
> +    u32 max_srq_sge;
> +    u32 max_uar;
> +    u32 gid_tbl_len;
> +    u16 max_pkeys;
> +    u8  local_ca_ack_delay;
> +    u8  phys_port_cnt;
> +    u8  mode;                /* PVRDMA_DEVICE_MODE_ */
> +    u8  atomic_ops;                /* PVRDMA_ATOMIC_OP_* bits */
> +    u8  bmme_flags;                /* FRWR Mem Mgmt Extensions */
> +    u8  gid_types;                /* PVRDMA_GID_TYPE_FLAG_ */
> +    u8  reserved[4];
> +};
> +
> +struct pvrdma_ring_page_info {
> +    u32 num_pages;                /* Num pages incl. header. */
> +    u32 reserved;                /* Reserved. */
> +    u64 pdir_dma;                /* Page directory PA. */
> +};
> +
> +#pragma pack(push, 1)
> +
> +struct pvrdma_device_shared_region {
> +    u32 driver_version;            /* W: Driver version. */
> +    u32 pad;                /* Pad to 8-byte align. */
> +    struct pvrdma_gos_info gos_info;    /* W: Guest OS information. */
> +    u64 cmd_slot_dma;            /* W: Command slot address. */
> +    u64 resp_slot_dma;            /* W: Response slot address. */
> +    struct pvrdma_ring_page_info async_ring_pages;
> +                        /* W: Async ring page info. */
> +    struct pvrdma_ring_page_info cq_ring_pages;
> +                        /* W: CQ ring page info. */
> +    u32 uar_pfn;                /* W: UAR pageframe. */
> +    u32 pad2;                /* Pad to 8-byte align. */
> +    struct pvrdma_device_caps caps;        /* R: Device capabilities. */
> +};
> +
> +#pragma pack(pop)
> +
> +
> +/* Event types. Currently a 1:1 mapping with enum ib_event. */
> +enum pvrdma_eqe_type {
> +    PVRDMA_EVENT_CQ_ERR,
> +    PVRDMA_EVENT_QP_FATAL,
> +    PVRDMA_EVENT_QP_REQ_ERR,
> +    PVRDMA_EVENT_QP_ACCESS_ERR,
> +    PVRDMA_EVENT_COMM_EST,
> +    PVRDMA_EVENT_SQ_DRAINED,
> +    PVRDMA_EVENT_PATH_MIG,
> +    PVRDMA_EVENT_PATH_MIG_ERR,
> +    PVRDMA_EVENT_DEVICE_FATAL,
> +    PVRDMA_EVENT_PORT_ACTIVE,
> +    PVRDMA_EVENT_PORT_ERR,
> +    PVRDMA_EVENT_LID_CHANGE,
> +    PVRDMA_EVENT_PKEY_CHANGE,
> +    PVRDMA_EVENT_SM_CHANGE,
> +    PVRDMA_EVENT_SRQ_ERR,
> +    PVRDMA_EVENT_SRQ_LIMIT_REACHED,
> +    PVRDMA_EVENT_QP_LAST_WQE_REACHED,
> +    PVRDMA_EVENT_CLIENT_REREGISTER,
> +    PVRDMA_EVENT_GID_CHANGE,
> +};
> +
> +/* Event queue element. */
> +struct pvrdma_eqe {
> +    u32 type;    /* Event type. */
> +    u32 info;    /* Handle, other. */
> +};
> +
> +/* CQ notification queue element. */
> +struct pvrdma_cqne {
> +    u32 info;    /* Handle */
> +};
> +
> +static inline void pvrdma_init_cqe(struct pvrdma_cqe *cqe, u64 wr_id, u64 qp)
> +{
> +    memset(cqe, 0, sizeof(*cqe));
> +    cqe->status = PVRDMA_WC_GENERAL_ERR;
> +    cqe->wr_id = wr_id;
> +    cqe->qp = qp;
> +}
> +
> +#endif /* PVRDMA_DEFS_H */
> diff --git a/hw/net/pvrdma/pvrdma_dev_api.h b/hw/net/pvrdma/pvrdma_dev_api.h
> new file mode 100644
> index 0000000..4887b96
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_dev_api.h
> @@ -0,0 +1,342 @@
> +/*
> + * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of EITHER the GNU General Public License
> + * version 2 as published by the Free Software Foundation or the BSD
> + * 2-Clause License. This program is distributed in the hope that it
> + * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
> + * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
> + * See the GNU General Public License version 2 for more details at
> + * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program available in the file COPYING in the main
> + * directory of this source tree.
> + *
> + * The BSD 2-Clause License
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
> + * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> + * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
> + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
> + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
> + * OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef PVRDMA_DEV_API_H
> +#define PVRDMA_DEV_API_H
> +
> +#include <hw/net/pvrdma/pvrdma_types.h>
> +#include <hw/net/pvrdma/pvrdma_ib_verbs.h>
> +
> +enum {
> +    PVRDMA_CMD_FIRST,
> +    PVRDMA_CMD_QUERY_PORT = PVRDMA_CMD_FIRST,
> +    PVRDMA_CMD_QUERY_PKEY,
> +    PVRDMA_CMD_CREATE_PD,
> +    PVRDMA_CMD_DESTROY_PD,
> +    PVRDMA_CMD_CREATE_MR,
> +    PVRDMA_CMD_DESTROY_MR,
> +    PVRDMA_CMD_CREATE_CQ,
> +    PVRDMA_CMD_RESIZE_CQ,
> +    PVRDMA_CMD_DESTROY_CQ,
> +    PVRDMA_CMD_CREATE_QP,
> +    PVRDMA_CMD_MODIFY_QP,
> +    PVRDMA_CMD_QUERY_QP,
> +    PVRDMA_CMD_DESTROY_QP,
> +    PVRDMA_CMD_CREATE_UC,
> +    PVRDMA_CMD_DESTROY_UC,
> +    PVRDMA_CMD_CREATE_BIND,
> +    PVRDMA_CMD_DESTROY_BIND,
> +    PVRDMA_CMD_MAX,
> +};
> +
> +enum {
> +    PVRDMA_CMD_FIRST_RESP = (1 << 31),
> +    PVRDMA_CMD_QUERY_PORT_RESP = PVRDMA_CMD_FIRST_RESP,
> +    PVRDMA_CMD_QUERY_PKEY_RESP,
> +    PVRDMA_CMD_CREATE_PD_RESP,
> +    PVRDMA_CMD_DESTROY_PD_RESP_NOOP,
> +    PVRDMA_CMD_CREATE_MR_RESP,
> +    PVRDMA_CMD_DESTROY_MR_RESP_NOOP,
> +    PVRDMA_CMD_CREATE_CQ_RESP,
> +    PVRDMA_CMD_RESIZE_CQ_RESP,
> +    PVRDMA_CMD_DESTROY_CQ_RESP_NOOP,
> +    PVRDMA_CMD_CREATE_QP_RESP,
> +    PVRDMA_CMD_MODIFY_QP_RESP,
> +    PVRDMA_CMD_QUERY_QP_RESP,
> +    PVRDMA_CMD_DESTROY_QP_RESP,
> +    PVRDMA_CMD_CREATE_UC_RESP,
> +    PVRDMA_CMD_DESTROY_UC_RESP_NOOP,
> +    PVRDMA_CMD_CREATE_BIND_RESP_NOOP,
> +    PVRDMA_CMD_DESTROY_BIND_RESP_NOOP,
> +    PVRDMA_CMD_MAX_RESP,
> +};
> +
> +struct pvrdma_cmd_hdr {
> +    u64 response;        /* Key for response lookup. */
> +    u32 cmd;        /* PVRDMA_CMD_ */
> +    u32 reserved;        /* Reserved. */
> +};
> +
> +struct pvrdma_cmd_resp_hdr {
> +    u64 response;        /* From cmd hdr. */
> +    u32 ack;        /* PVRDMA_CMD_XXX_RESP */
> +    u8 err;            /* Error. */
> +    u8 reserved[3];        /* Reserved. */
> +};
> +
> +struct pvrdma_cmd_query_port {
> +    struct pvrdma_cmd_hdr hdr;
> +    u8 port_num;
> +    u8 reserved[7];
> +};
> +
> +struct pvrdma_cmd_query_port_resp {
> +    struct pvrdma_cmd_resp_hdr hdr;
> +    struct pvrdma_port_attr attrs;
> +};
> +
> +struct pvrdma_cmd_query_pkey {
> +    struct pvrdma_cmd_hdr hdr;
> +    u8 port_num;
> +    u8 index;
> +    u8 reserved[6];
> +};
> +
> +struct pvrdma_cmd_query_pkey_resp {
> +    struct pvrdma_cmd_resp_hdr hdr;
> +    u16 pkey;
> +    u8 reserved[6];
> +};
> +
> +struct pvrdma_cmd_create_uc {
> +    struct pvrdma_cmd_hdr hdr;
> +    u32 pfn; /* UAR page frame number */
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_create_uc_resp {
> +    struct pvrdma_cmd_resp_hdr hdr;
> +    u32 ctx_handle;
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_destroy_uc {
> +    struct pvrdma_cmd_hdr hdr;
> +    u32 ctx_handle;
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_create_pd {
> +    struct pvrdma_cmd_hdr hdr;
> +    u32 ctx_handle;
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_create_pd_resp {
> +    struct pvrdma_cmd_resp_hdr hdr;
> +    u32 pd_handle;
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_destroy_pd {
> +    struct pvrdma_cmd_hdr hdr;
> +    u32 pd_handle;
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_create_mr {
> +    struct pvrdma_cmd_hdr hdr;
> +    u64 start;
> +    u64 length;
> +    u64 pdir_dma;
> +    u32 pd_handle;
> +    u32 access_flags;
> +    u32 flags;
> +    u32 nchunks;
> +};
> +
> +struct pvrdma_cmd_create_mr_resp {
> +    struct pvrdma_cmd_resp_hdr hdr;
> +    u32 mr_handle;
> +    u32 lkey;
> +    u32 rkey;
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_destroy_mr {
> +    struct pvrdma_cmd_hdr hdr;
> +    u32 mr_handle;
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_create_cq {
> +    struct pvrdma_cmd_hdr hdr;
> +    u64 pdir_dma;
> +    u32 ctx_handle;
> +    u32 cqe;
> +    u32 nchunks;
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_create_cq_resp {
> +    struct pvrdma_cmd_resp_hdr hdr;
> +    u32 cq_handle;
> +    u32 cqe;
> +};
> +
> +struct pvrdma_cmd_resize_cq {
> +    struct pvrdma_cmd_hdr hdr;
> +    u32 cq_handle;
> +    u32 cqe;
> +};
> +
> +struct pvrdma_cmd_resize_cq_resp {
> +    struct pvrdma_cmd_resp_hdr hdr;
> +    u32 cqe;
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_destroy_cq {
> +    struct pvrdma_cmd_hdr hdr;
> +    u32 cq_handle;
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_create_qp {
> +    struct pvrdma_cmd_hdr hdr;
> +    u64 pdir_dma;
> +    u32 pd_handle;
> +    u32 send_cq_handle;
> +    u32 recv_cq_handle;
> +    u32 srq_handle;
> +    u32 max_send_wr;
> +    u32 max_recv_wr;
> +    u32 max_send_sge;
> +    u32 max_recv_sge;
> +    u32 max_inline_data;
> +    u32 lkey;
> +    u32 access_flags;
> +    u16 total_chunks;
> +    u16 send_chunks;
> +    u16 max_atomic_arg;
> +    u8 sq_sig_all;
> +    u8 qp_type;
> +    u8 is_srq;
> +    u8 reserved[3];
> +};
> +
> +struct pvrdma_cmd_create_qp_resp {
> +    struct pvrdma_cmd_resp_hdr hdr;
> +    u32 qpn;
> +    u32 max_send_wr;
> +    u32 max_recv_wr;
> +    u32 max_send_sge;
> +    u32 max_recv_sge;
> +    u32 max_inline_data;
> +};
> +
> +struct pvrdma_cmd_modify_qp {
> +    struct pvrdma_cmd_hdr hdr;
> +    u32 qp_handle;
> +    u32 attr_mask;
> +    struct pvrdma_qp_attr attrs;
> +};
> +
> +struct pvrdma_cmd_query_qp {
> +    struct pvrdma_cmd_hdr hdr;
> +    u32 qp_handle;
> +    u32 attr_mask;
> +};
> +
> +struct pvrdma_cmd_query_qp_resp {
> +    struct pvrdma_cmd_resp_hdr hdr;
> +    struct pvrdma_qp_attr attrs;
> +};
> +
> +struct pvrdma_cmd_destroy_qp {
> +    struct pvrdma_cmd_hdr hdr;
> +    u32 qp_handle;
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_destroy_qp_resp {
> +    struct pvrdma_cmd_resp_hdr hdr;
> +    u32 events_reported;
> +    u8 reserved[4];
> +};
> +
> +struct pvrdma_cmd_create_bind {
> +    struct pvrdma_cmd_hdr hdr;
> +    u32 mtu;
> +    u32 vlan;
> +    u32 index;
> +    u8 new_gid[16];
> +    u8 gid_type;
> +    u8 reserved[3];
> +};
> +
> +struct pvrdma_cmd_destroy_bind {
> +    struct pvrdma_cmd_hdr hdr;
> +    u32 index;
> +    u8 dest_gid[16];
> +    u8 reserved[4];
> +};
> +
> +union pvrdma_cmd_req {
> +    struct pvrdma_cmd_hdr hdr;
> +    struct pvrdma_cmd_query_port query_port;
> +    struct pvrdma_cmd_query_pkey query_pkey;
> +    struct pvrdma_cmd_create_uc create_uc;
> +    struct pvrdma_cmd_destroy_uc destroy_uc;
> +    struct pvrdma_cmd_create_pd create_pd;
> +    struct pvrdma_cmd_destroy_pd destroy_pd;
> +    struct pvrdma_cmd_create_mr create_mr;
> +    struct pvrdma_cmd_destroy_mr destroy_mr;
> +    struct pvrdma_cmd_create_cq create_cq;
> +    struct pvrdma_cmd_resize_cq resize_cq;
> +    struct pvrdma_cmd_destroy_cq destroy_cq;
> +    struct pvrdma_cmd_create_qp create_qp;
> +    struct pvrdma_cmd_modify_qp modify_qp;
> +    struct pvrdma_cmd_query_qp query_qp;
> +    struct pvrdma_cmd_destroy_qp destroy_qp;
> +    struct pvrdma_cmd_create_bind create_bind;
> +    struct pvrdma_cmd_destroy_bind destroy_bind;
> +};
> +
> +union pvrdma_cmd_resp {
> +    struct pvrdma_cmd_resp_hdr hdr;
> +    struct pvrdma_cmd_query_port_resp query_port_resp;
> +    struct pvrdma_cmd_query_pkey_resp query_pkey_resp;
> +    struct pvrdma_cmd_create_uc_resp create_uc_resp;
> +    struct pvrdma_cmd_create_pd_resp create_pd_resp;
> +    struct pvrdma_cmd_create_mr_resp create_mr_resp;
> +    struct pvrdma_cmd_create_cq_resp create_cq_resp;
> +    struct pvrdma_cmd_resize_cq_resp resize_cq_resp;
> +    struct pvrdma_cmd_create_qp_resp create_qp_resp;
> +    struct pvrdma_cmd_query_qp_resp query_qp_resp;
> +    struct pvrdma_cmd_destroy_qp_resp destroy_qp_resp;
> +};
> +
> +#endif /* PVRDMA_DEV_API_H */
> diff --git a/hw/net/pvrdma/pvrdma_ib_verbs.h b/hw/net/pvrdma/pvrdma_ib_verbs.h
> new file mode 100644
> index 0000000..e2a23f3
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_ib_verbs.h
> @@ -0,0 +1,469 @@
> +/*
> + * [PLEASE NOTE:  VMWARE, INC. ELECTS TO USE AND DISTRIBUTE THIS COMPONENT
> + * UNDER THE TERMS OF THE OpenIB.org BSD license.  THE ORIGINAL LICENSE TERMS
> + * ARE REPRODUCED BELOW ONLY AS A REFERENCE.]
> + *
> + * Copyright (c) 2004 Mellanox Technologies Ltd.  All rights reserved.
> + * Copyright (c) 2004 Infinicon Corporation.  All rights reserved.
> + * Copyright (c) 2004 Intel Corporation.  All rights reserved.
> + * Copyright (c) 2004 Topspin Corporation.  All rights reserved.
> + * Copyright (c) 2004 Voltaire Corporation.  All rights reserved.
> + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
> + * Copyright (c) 2005, 2006, 2007 Cisco Systems.  All rights reserved.
> + * Copyright (c) 2015-2016 VMware, Inc.  All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#ifndef PVRDMA_IB_VERBS_H
> +#define PVRDMA_IB_VERBS_H
> +
> +#include <linux/types.h>
> +
> +union pvrdma_gid {
> +    u8    raw[16];
> +    struct {
> +        __be64    subnet_prefix;
> +        __be64    interface_id;
> +    } global;
> +};
> +
> +enum pvrdma_link_layer {
> +    PVRDMA_LINK_LAYER_UNSPECIFIED,
> +    PVRDMA_LINK_LAYER_INFINIBAND,
> +    PVRDMA_LINK_LAYER_ETHERNET,
> +};
> +
> +enum pvrdma_mtu {
> +    PVRDMA_MTU_256  = 1,
> +    PVRDMA_MTU_512  = 2,
> +    PVRDMA_MTU_1024 = 3,
> +    PVRDMA_MTU_2048 = 4,
> +    PVRDMA_MTU_4096 = 5,
> +};
> +
> +static inline int pvrdma_mtu_enum_to_int(enum pvrdma_mtu mtu)
> +{
> +    switch (mtu) {
> +    case PVRDMA_MTU_256:    return  256;
> +    case PVRDMA_MTU_512:    return  512;
> +    case PVRDMA_MTU_1024:    return 1024;
> +    case PVRDMA_MTU_2048:    return 2048;
> +    case PVRDMA_MTU_4096:    return 4096;
> +    default:        return   -1;
> +    }
> +}
> +
> +static inline enum pvrdma_mtu pvrdma_mtu_int_to_enum(int mtu)
> +{
> +    switch (mtu) {
> +    case 256:    return PVRDMA_MTU_256;
> +    case 512:    return PVRDMA_MTU_512;
> +    case 1024:    return PVRDMA_MTU_1024;
> +    case 2048:    return PVRDMA_MTU_2048;
> +    case 4096:
> +    default:    return PVRDMA_MTU_4096;
> +    }
> +}
> +
> +enum pvrdma_port_state {
> +    PVRDMA_PORT_NOP            = 0,
> +    PVRDMA_PORT_DOWN        = 1,
> +    PVRDMA_PORT_INIT        = 2,
> +    PVRDMA_PORT_ARMED        = 3,
> +    PVRDMA_PORT_ACTIVE        = 4,
> +    PVRDMA_PORT_ACTIVE_DEFER    = 5,
> +};
> +
> +enum pvrdma_port_cap_flags {
> +    PVRDMA_PORT_SM                = 1 <<  1,
> +    PVRDMA_PORT_NOTICE_SUP            = 1 <<  2,
> +    PVRDMA_PORT_TRAP_SUP            = 1 <<  3,
> +    PVRDMA_PORT_OPT_IPD_SUP            = 1 <<  4,
> +    PVRDMA_PORT_AUTO_MIGR_SUP        = 1 <<  5,
> +    PVRDMA_PORT_SL_MAP_SUP            = 1 <<  6,
> +    PVRDMA_PORT_MKEY_NVRAM            = 1 <<  7,
> +    PVRDMA_PORT_PKEY_NVRAM            = 1 <<  8,
> +    PVRDMA_PORT_LED_INFO_SUP        = 1 <<  9,
> +    PVRDMA_PORT_SM_DISABLED            = 1 << 10,
> +    PVRDMA_PORT_SYS_IMAGE_GUID_SUP        = 1 << 11,
> +    PVRDMA_PORT_PKEY_SW_EXT_PORT_TRAP_SUP    = 1 << 12,
> +    PVRDMA_PORT_EXTENDED_SPEEDS_SUP        = 1 << 14,
> +    PVRDMA_PORT_CM_SUP            = 1 << 16,
> +    PVRDMA_PORT_SNMP_TUNNEL_SUP        = 1 << 17,
> +    PVRDMA_PORT_REINIT_SUP            = 1 << 18,
> +    PVRDMA_PORT_DEVICE_MGMT_SUP        = 1 << 19,
> +    PVRDMA_PORT_VENDOR_CLASS_SUP        = 1 << 20,
> +    PVRDMA_PORT_DR_NOTICE_SUP        = 1 << 21,
> +    PVRDMA_PORT_CAP_MASK_NOTICE_SUP        = 1 << 22,
> +    PVRDMA_PORT_BOOT_MGMT_SUP        = 1 << 23,
> +    PVRDMA_PORT_LINK_LATENCY_SUP        = 1 << 24,
> +    PVRDMA_PORT_CLIENT_REG_SUP        = 1 << 25,
> +    PVRDMA_PORT_IP_BASED_GIDS        = 1 << 26,
> +    PVRDMA_PORT_CAP_FLAGS_MAX        = PVRDMA_PORT_IP_BASED_GIDS,
> +};
> +
> +enum pvrdma_port_width {
> +    PVRDMA_WIDTH_1X        = 1,
> +    PVRDMA_WIDTH_4X        = 2,
> +    PVRDMA_WIDTH_8X        = 4,
> +    PVRDMA_WIDTH_12X    = 8,
> +};
> +
> +static inline int pvrdma_width_enum_to_int(enum pvrdma_port_width width)
> +{
> +    switch (width) {
> +    case PVRDMA_WIDTH_1X:    return  1;
> +    case PVRDMA_WIDTH_4X:    return  4;
> +    case PVRDMA_WIDTH_8X:    return  8;
> +    case PVRDMA_WIDTH_12X:    return 12;
> +    default:        return -1;
> +    }
> +}
> +
> +enum pvrdma_port_speed {
> +    PVRDMA_SPEED_SDR    = 1,
> +    PVRDMA_SPEED_DDR    = 2,
> +    PVRDMA_SPEED_QDR    = 4,
> +    PVRDMA_SPEED_FDR10    = 8,
> +    PVRDMA_SPEED_FDR    = 16,
> +    PVRDMA_SPEED_EDR    = 32,
> +};
> +
> +struct pvrdma_port_attr {
> +    enum pvrdma_port_state    state;
> +    enum pvrdma_mtu        max_mtu;
> +    enum pvrdma_mtu        active_mtu;
> +    u32            gid_tbl_len;
> +    u32            port_cap_flags;
> +    u32            max_msg_sz;
> +    u32            bad_pkey_cntr;
> +    u32            qkey_viol_cntr;
> +    u16            pkey_tbl_len;
> +    u16            lid;
> +    u16            sm_lid;
> +    u8            lmc;
> +    u8            max_vl_num;
> +    u8            sm_sl;
> +    u8            subnet_timeout;
> +    u8            init_type_reply;
> +    u8            active_width;
> +    u8            active_speed;
> +    u8            phys_state;
> +    u8            reserved[2];
> +};
> +
> +struct pvrdma_global_route {
> +    union pvrdma_gid    dgid;
> +    u32            flow_label;
> +    u8            sgid_index;
> +    u8            hop_limit;
> +    u8            traffic_class;
> +    u8            reserved;
> +};
> +
> +struct pvrdma_grh {
> +    __be32            version_tclass_flow;
> +    __be16            paylen;
> +    u8            next_hdr;
> +    u8            hop_limit;
> +    union pvrdma_gid    sgid;
> +    union pvrdma_gid    dgid;
> +};
> +
> +enum pvrdma_ah_flags {
> +    PVRDMA_AH_GRH = 1,
> +};
> +
> +enum pvrdma_rate {
> +    PVRDMA_RATE_PORT_CURRENT    = 0,
> +    PVRDMA_RATE_2_5_GBPS        = 2,
> +    PVRDMA_RATE_5_GBPS        = 5,
> +    PVRDMA_RATE_10_GBPS        = 3,
> +    PVRDMA_RATE_20_GBPS        = 6,
> +    PVRDMA_RATE_30_GBPS        = 4,
> +    PVRDMA_RATE_40_GBPS        = 7,
> +    PVRDMA_RATE_60_GBPS        = 8,
> +    PVRDMA_RATE_80_GBPS        = 9,
> +    PVRDMA_RATE_120_GBPS        = 10,
> +    PVRDMA_RATE_14_GBPS        = 11,
> +    PVRDMA_RATE_56_GBPS        = 12,
> +    PVRDMA_RATE_112_GBPS        = 13,
> +    PVRDMA_RATE_168_GBPS        = 14,
> +    PVRDMA_RATE_25_GBPS        = 15,
> +    PVRDMA_RATE_100_GBPS        = 16,
> +    PVRDMA_RATE_200_GBPS        = 17,
> +    PVRDMA_RATE_300_GBPS        = 18,
> +};
> +
> +struct pvrdma_ah_attr {
> +    struct pvrdma_global_route    grh;
> +    u16                dlid;
> +    u16                vlan_id;
> +    u8                sl;
> +    u8                src_path_bits;
> +    u8                static_rate;
> +    u8                ah_flags;
> +    u8                port_num;
> +    u8                dmac[6];
> +    u8                reserved;
> +};
> +
> +enum pvrdma_wc_status {
> +    PVRDMA_WC_SUCCESS,
> +    PVRDMA_WC_LOC_LEN_ERR,
> +    PVRDMA_WC_LOC_QP_OP_ERR,
> +    PVRDMA_WC_LOC_EEC_OP_ERR,
> +    PVRDMA_WC_LOC_PROT_ERR,
> +    PVRDMA_WC_WR_FLUSH_ERR,
> +    PVRDMA_WC_MW_BIND_ERR,
> +    PVRDMA_WC_BAD_RESP_ERR,
> +    PVRDMA_WC_LOC_ACCESS_ERR,
> +    PVRDMA_WC_REM_INV_REQ_ERR,
> +    PVRDMA_WC_REM_ACCESS_ERR,
> +    PVRDMA_WC_REM_OP_ERR,
> +    PVRDMA_WC_RETRY_EXC_ERR,
> +    PVRDMA_WC_RNR_RETRY_EXC_ERR,
> +    PVRDMA_WC_LOC_RDD_VIOL_ERR,
> +    PVRDMA_WC_REM_INV_RD_REQ_ERR,
> +    PVRDMA_WC_REM_ABORT_ERR,
> +    PVRDMA_WC_INV_EECN_ERR,
> +    PVRDMA_WC_INV_EEC_STATE_ERR,
> +    PVRDMA_WC_FATAL_ERR,
> +    PVRDMA_WC_RESP_TIMEOUT_ERR,
> +    PVRDMA_WC_GENERAL_ERR,
> +};
> +
> +enum pvrdma_wc_opcode {
> +    PVRDMA_WC_SEND,
> +    PVRDMA_WC_RDMA_WRITE,
> +    PVRDMA_WC_RDMA_READ,
> +    PVRDMA_WC_COMP_SWAP,
> +    PVRDMA_WC_FETCH_ADD,
> +    PVRDMA_WC_BIND_MW,
> +    PVRDMA_WC_LSO,
> +    PVRDMA_WC_LOCAL_INV,
> +    PVRDMA_WC_FAST_REG_MR,
> +    PVRDMA_WC_MASKED_COMP_SWAP,
> +    PVRDMA_WC_MASKED_FETCH_ADD,
> +    PVRDMA_WC_RECV = 1 << 7,
> +    PVRDMA_WC_RECV_RDMA_WITH_IMM,
> +};
> +
> +enum pvrdma_wc_flags {
> +    PVRDMA_WC_GRH            = 1 << 0,
> +    PVRDMA_WC_WITH_IMM        = 1 << 1,
> +    PVRDMA_WC_WITH_INVALIDATE    = 1 << 2,
> +    PVRDMA_WC_IP_CSUM_OK        = 1 << 3,
> +    PVRDMA_WC_WITH_SMAC        = 1 << 4,
> +    PVRDMA_WC_WITH_VLAN        = 1 << 5,
> +    PVRDMA_WC_FLAGS_MAX        = PVRDMA_WC_WITH_VLAN,
> +};
> +
> +enum pvrdma_cq_notify_flags {
> +    PVRDMA_CQ_SOLICITED        = 1 << 0,
> +    PVRDMA_CQ_NEXT_COMP        = 1 << 1,
> +    PVRDMA_CQ_SOLICITED_MASK    = PVRDMA_CQ_SOLICITED |
> +                      PVRDMA_CQ_NEXT_COMP,
> +    PVRDMA_CQ_REPORT_MISSED_EVENTS    = 1 << 2,
> +};
> +
> +struct pvrdma_qp_cap {
> +    u32    max_send_wr;
> +    u32    max_recv_wr;
> +    u32    max_send_sge;
> +    u32    max_recv_sge;
> +    u32    max_inline_data;
> +    u32    reserved;
> +};
> +
> +enum pvrdma_sig_type {
> +    PVRDMA_SIGNAL_ALL_WR,
> +    PVRDMA_SIGNAL_REQ_WR,
> +};
> +
> +enum pvrdma_qp_type {
> +    PVRDMA_QPT_SMI,
> +    PVRDMA_QPT_GSI,
> +    PVRDMA_QPT_RC,
> +    PVRDMA_QPT_UC,
> +    PVRDMA_QPT_UD,
> +    PVRDMA_QPT_RAW_IPV6,
> +    PVRDMA_QPT_RAW_ETHERTYPE,
> +    PVRDMA_QPT_RAW_PACKET = 8,
> +    PVRDMA_QPT_XRC_INI = 9,
> +    PVRDMA_QPT_XRC_TGT,
> +    PVRDMA_QPT_MAX,
> +};
> +
> +enum pvrdma_qp_create_flags {
> +    PVRDMA_QP_CREATE_IPOPVRDMA_UD_LSO        = 1 << 0,
> +    PVRDMA_QP_CREATE_BLOCK_MULTICAST_LOOPBACK    = 1 << 1,
> +};
> +
> +enum pvrdma_qp_attr_mask {
> +    PVRDMA_QP_STATE            = 1 << 0,
> +    PVRDMA_QP_CUR_STATE        = 1 << 1,
> +    PVRDMA_QP_EN_SQD_ASYNC_NOTIFY    = 1 << 2,
> +    PVRDMA_QP_ACCESS_FLAGS        = 1 << 3,
> +    PVRDMA_QP_PKEY_INDEX        = 1 << 4,
> +    PVRDMA_QP_PORT            = 1 << 5,
> +    PVRDMA_QP_QKEY            = 1 << 6,
> +    PVRDMA_QP_AV            = 1 << 7,
> +    PVRDMA_QP_PATH_MTU        = 1 << 8,
> +    PVRDMA_QP_TIMEOUT        = 1 << 9,
> +    PVRDMA_QP_RETRY_CNT        = 1 << 10,
> +    PVRDMA_QP_RNR_RETRY        = 1 << 11,
> +    PVRDMA_QP_RQ_PSN        = 1 << 12,
> +    PVRDMA_QP_MAX_QP_RD_ATOMIC    = 1 << 13,
> +    PVRDMA_QP_ALT_PATH        = 1 << 14,
> +    PVRDMA_QP_MIN_RNR_TIMER        = 1 << 15,
> +    PVRDMA_QP_SQ_PSN        = 1 << 16,
> +    PVRDMA_QP_MAX_DEST_RD_ATOMIC    = 1 << 17,
> +    PVRDMA_QP_PATH_MIG_STATE    = 1 << 18,
> +    PVRDMA_QP_CAP            = 1 << 19,
> +    PVRDMA_QP_DEST_QPN        = 1 << 20,
> +    PVRDMA_QP_ATTR_MASK_MAX        = PVRDMA_QP_DEST_QPN,
> +};
> +
> +enum pvrdma_qp_state {
> +    PVRDMA_QPS_RESET,
> +    PVRDMA_QPS_INIT,
> +    PVRDMA_QPS_RTR,
> +    PVRDMA_QPS_RTS,
> +    PVRDMA_QPS_SQD,
> +    PVRDMA_QPS_SQE,
> +    PVRDMA_QPS_ERR,
> +};
> +
> +enum pvrdma_mig_state {
> +    PVRDMA_MIG_MIGRATED,
> +    PVRDMA_MIG_REARM,
> +    PVRDMA_MIG_ARMED,
> +};
> +
> +enum pvrdma_mw_type {
> +    PVRDMA_MW_TYPE_1 = 1,
> +    PVRDMA_MW_TYPE_2 = 2,
> +};
> +
> +struct pvrdma_qp_attr {
> +    enum pvrdma_qp_state    qp_state;
> +    enum pvrdma_qp_state    cur_qp_state;
> +    enum pvrdma_mtu        path_mtu;
> +    enum pvrdma_mig_state    path_mig_state;
> +    u32            qkey;
> +    u32            rq_psn;
> +    u32            sq_psn;
> +    u32            dest_qp_num;
> +    u32            qp_access_flags;
> +    u16            pkey_index;
> +    u16            alt_pkey_index;
> +    u8            en_sqd_async_notify;
> +    u8            sq_draining;
> +    u8            max_rd_atomic;
> +    u8            max_dest_rd_atomic;
> +    u8            min_rnr_timer;
> +    u8            port_num;
> +    u8            timeout;
> +    u8            retry_cnt;
> +    u8            rnr_retry;
> +    u8            alt_port_num;
> +    u8            alt_timeout;
> +    u8            reserved[5];
> +    struct pvrdma_qp_cap    cap;
> +    struct pvrdma_ah_attr    ah_attr;
> +    struct pvrdma_ah_attr    alt_ah_attr;
> +};
> +
> +enum pvrdma_wr_opcode {
> +    PVRDMA_WR_RDMA_WRITE,
> +    PVRDMA_WR_RDMA_WRITE_WITH_IMM,
> +    PVRDMA_WR_SEND,
> +    PVRDMA_WR_SEND_WITH_IMM,
> +    PVRDMA_WR_RDMA_READ,
> +    PVRDMA_WR_ATOMIC_CMP_AND_SWP,
> +    PVRDMA_WR_ATOMIC_FETCH_AND_ADD,
> +    PVRDMA_WR_LSO,
> +    PVRDMA_WR_SEND_WITH_INV,
> +    PVRDMA_WR_RDMA_READ_WITH_INV,
> +    PVRDMA_WR_LOCAL_INV,
> +    PVRDMA_WR_FAST_REG_MR,
> +    PVRDMA_WR_MASKED_ATOMIC_CMP_AND_SWP,
> +    PVRDMA_WR_MASKED_ATOMIC_FETCH_AND_ADD,
> +    PVRDMA_WR_BIND_MW,
> +    PVRDMA_WR_REG_SIG_MR,
> +};
> +
> +enum pvrdma_send_flags {
> +    PVRDMA_SEND_FENCE    = 1 << 0,
> +    PVRDMA_SEND_SIGNALED    = 1 << 1,
> +    PVRDMA_SEND_SOLICITED    = 1 << 2,
> +    PVRDMA_SEND_INLINE    = 1 << 3,
> +    PVRDMA_SEND_IP_CSUM    = 1 << 4,
> +    PVRDMA_SEND_FLAGS_MAX    = PVRDMA_SEND_IP_CSUM,
> +};
> +
> +enum pvrdma_access_flags {
> +    PVRDMA_ACCESS_LOCAL_WRITE    = 1 << 0,
> +    PVRDMA_ACCESS_REMOTE_WRITE    = 1 << 1,
> +    PVRDMA_ACCESS_REMOTE_READ    = 1 << 2,
> +    PVRDMA_ACCESS_REMOTE_ATOMIC    = 1 << 3,
> +    PVRDMA_ACCESS_MW_BIND        = 1 << 4,
> +    PVRDMA_ZERO_BASED        = 1 << 5,
> +    PVRDMA_ACCESS_ON_DEMAND        = 1 << 6,
> +    PVRDMA_ACCESS_FLAGS_MAX        = PVRDMA_ACCESS_ON_DEMAND,
> +};
> +
> +enum ib_wc_status {
> +    IB_WC_SUCCESS,
> +    IB_WC_LOC_LEN_ERR,
> +    IB_WC_LOC_QP_OP_ERR,
> +    IB_WC_LOC_EEC_OP_ERR,
> +    IB_WC_LOC_PROT_ERR,
> +    IB_WC_WR_FLUSH_ERR,
> +    IB_WC_MW_BIND_ERR,
> +    IB_WC_BAD_RESP_ERR,
> +    IB_WC_LOC_ACCESS_ERR,
> +    IB_WC_REM_INV_REQ_ERR,
> +    IB_WC_REM_ACCESS_ERR,
> +    IB_WC_REM_OP_ERR,
> +    IB_WC_RETRY_EXC_ERR,
> +    IB_WC_RNR_RETRY_EXC_ERR,
> +    IB_WC_LOC_RDD_VIOL_ERR,
> +    IB_WC_REM_INV_RD_REQ_ERR,
> +    IB_WC_REM_ABORT_ERR,
> +    IB_WC_INV_EECN_ERR,
> +    IB_WC_INV_EEC_STATE_ERR,
> +    IB_WC_FATAL_ERR,
> +    IB_WC_RESP_TIMEOUT_ERR,
> +    IB_WC_GENERAL_ERR
> +};
> +
> +#endif /* PVRDMA_IB_VERBS_H */
> diff --git a/hw/net/pvrdma/pvrdma_kdbr.c b/hw/net/pvrdma/pvrdma_kdbr.c
> new file mode 100644
> index 0000000..ec04afd
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_kdbr.c
> @@ -0,0 +1,395 @@
> +#include <qemu/osdep.h>
> +#include <hw/pci/pci.h>
> +
> +#include <sys/ioctl.h>
> +
> +#include <hw/net/pvrdma/pvrdma.h>
> +#include <hw/net/pvrdma/pvrdma_ib_verbs.h>
> +#include <hw/net/pvrdma/pvrdma_rm.h>
> +#include <hw/net/pvrdma/pvrdma_kdbr.h>
> +#include <hw/net/pvrdma/pvrdma_utils.h>
> +#include <hw/net/pvrdma/kdbr.h>
> +
> +int kdbr_fd = -1;
> +
> +#define MAX_CONSEQ_CQES_READ 10
> +
> +typedef struct KdbrCtx {
> +    struct kdbr_req req;
> +    void *up_ctx;
> +    bool is_tx_req;
> +} KdbrCtx;
> +
> +static void (*tx_comp_handler)(int status, unsigned int vendor_err,
> +                               void *ctx) = 0;
> +static void (*rx_comp_handler)(int status, unsigned int vendor_err,
> +                               void *ctx) = 0;
> +
> +static void kdbr_err_to_pvrdma_err(int kdbr_status, unsigned int *status,
> +                                   unsigned int *vendor_err)
> +{
> +    if (kdbr_status == 0) {
> +        *status = IB_WC_SUCCESS;
> +        *vendor_err = 0;
> +        return;
> +    }
> +
> +    *vendor_err = kdbr_status;
> +    switch (kdbr_status) {
> +    case KDBR_ERR_CODE_EMPTY_VEC:
> +        *status = IB_WC_LOC_LEN_ERR;
> +        break;
> +    case KDBR_ERR_CODE_NO_MORE_RECV_BUF:
> +        *status = IB_WC_REM_OP_ERR;
> +        break;
> +    case KDBR_ERR_CODE_RECV_BUF_PROT:
> +        *status = IB_WC_REM_ACCESS_ERR;
> +        break;
> +    case KDBR_ERR_CODE_INV_ADDR:
> +        *status = IB_WC_LOC_ACCESS_ERR;
> +        break;
> +    case KDBR_ERR_CODE_INV_CONN_ID:
> +        *status = IB_WC_LOC_PROT_ERR;
> +        break;
> +    case KDBR_ERR_CODE_NO_PEER:
> +        *status = IB_WC_LOC_QP_OP_ERR;
> +        break;
> +    default:
> +        *status = IB_WC_GENERAL_ERR;
> +        break;
> +    }
> +}
> +
> +static void *comp_handler_thread(void *arg)
> +{
> +    KdbrPort *port = (KdbrPort *)arg;
> +    struct kdbr_completion comp[MAX_CONSEQ_CQES_READ];
> +    int i, j, rc;
> +    KdbrCtx *sctx;
> +    unsigned int status, vendor_err;
> +
> +    while (port->comp_thread.run) {
> +        rc = read(port->fd, &comp, sizeof(comp));
> +        if (unlikely(rc % sizeof(struct kdbr_completion))) {
> +            pr_err("Got unsupported message size (%d) from kdbr\n", rc);
> +            continue;
> +        }
> +        pr_dbg("Processing %ld CQEs from kdbr\n",
> +               rc / sizeof(struct kdbr_completion));
> +
> +        for (i = 0; i < rc / sizeof(struct kdbr_completion); i++) {
> +            pr_dbg("comp.req_id=%ld\n", comp[i].req_id);
> +            pr_dbg("comp.status=%d\n", comp[i].status);
> +
> +            sctx = rm_get_wqe_ctx(PVRDMA_DEV(port->dev), comp[i].req_id);
> +            if (!sctx) {
> +                pr_err("Fail to find ctx for req %ld\n", comp[i].req_id);
> +                continue;
> +            }
> +            pr_dbg("Processing %s CQE\n", sctx->is_tx_req ? "send" : "recv");
> +
> +            for (j = 0; j < sctx->req.vlen; j++) {
> +                pr_dbg("payload=%s\n", (char *)sctx->req.vec[j].iov_base);
> +                pvrdma_pci_dma_unmap(port->dev, sctx->req.vec[j].iov_base,
> +                                     sctx->req.vec[j].iov_len);
> +            }
> +
> +            kdbr_err_to_pvrdma_err(comp[i].status, &status, &vendor_err);
> +            pr_dbg("status=%d\n", status);
> +            pr_dbg("vendor_err=0x%x\n", vendor_err);
> +
> +            if (sctx->is_tx_req) {
> +                tx_comp_handler(status, vendor_err, sctx->up_ctx);
> +            } else {
> +                rx_comp_handler(status, vendor_err, sctx->up_ctx);
> +            }
> +
> +            rm_dealloc_wqe_ctx(PVRDMA_DEV(port->dev), comp[i].req_id);
> +            free(sctx);
> +        }
> +    }
> +
> +    pr_dbg("Going down\n");
> +
> +    return NULL;
> +}
> +
> +KdbrPort *kdbr_alloc_port(PVRDMADev *dev)
> +{
> +    int rc;
> +    KdbrPort *port;
> +    char name[80] = {0};
> +    struct kdbr_reg reg;
> +
> +    port = malloc(sizeof(KdbrPort));
> +    if (!port) {
> +        pr_dbg("Fail to allocate memory for port object\n");
> +        return NULL;
> +    }
> +
> +    port->dev = PCI_DEVICE(dev);
> +
> +    pr_dbg("net=0x%llx\n", dev->ports[0].gid_tbl[0].global.subnet_prefix);
> +    pr_dbg("guid=0x%llx\n", dev->ports[0].gid_tbl[0].global.interface_id);
> +    reg.gid.net_id = dev->ports[0].gid_tbl[0].global.subnet_prefix;
> +    reg.gid.id = dev->ports[0].gid_tbl[0].global.interface_id;
> +    rc = ioctl(kdbr_fd, KDBR_REGISTER_PORT, &reg);
> +    if (rc < 0) {
> +        pr_err("Fail to allocate port\n");
> +        goto err_free_port;
> +    }
> +
> +    port->num = reg.port;
> +
> +    sprintf(name, KDBR_FILE_NAME "%d", port->num);
> +    port->fd = open(name, O_RDWR);
> +    if (port->fd < 0) {
> +        pr_err("Fail to open file %s\n", name);
> +        goto err_unregister_device;
> +    }
> +
> +    sprintf(name, "pvrdma_comp_%d", port->num);
> +    port->comp_thread.run = true;
> +    qemu_thread_create(&port->comp_thread.thread, name, comp_handler_thread,
> +                       port, QEMU_THREAD_DETACHED);
> +
> +    pr_info("Port %d (fd %d) allocated\n", port->num, port->fd);
> +
> +    return port;
> +
> +err_unregister_device:
> +    ioctl(kdbr_fd, KDBR_UNREGISTER_PORT, &port->num);
> +
> +err_free_port:
> +    free(port);
> +
> +    return NULL;
> +}
> +
> +void kdbr_free_port(KdbrPort *port)
> +{
> +    int rc;
> +
> +    if (!port) {
> +        return;
> +    }
> +
> +    rc = write(port->fd, (char *)0, 1);
> +    port->comp_thread.run = false;
> +    close(port->fd);
> +
> +    rc = ioctl(kdbr_fd, KDBR_UNREGISTER_PORT, &port->num);
> +    if (rc < 0) {
> +        pr_err("Fail to allocate port\n");
> +    }
> +
> +    free(port);
> +}
> +
> +unsigned long kdbr_open_connection(KdbrPort *port, u32 qpn,
> +                                   union pvrdma_gid dgid, u32 dqpn, bool rc_qp)
> +{
> +    int rc;
> +    struct kdbr_connection connection = {0};
> +
> +    connection.queue_id = qpn;
> +    connection.peer.rgid.net_id = dgid.global.subnet_prefix;
> +    connection.peer.rgid.id = dgid.global.interface_id;
> +    connection.peer.rqueue = dqpn;
> +    connection.ack_type = rc_qp ? KDBR_ACK_DELAYED : KDBR_ACK_IMMEDIATE;
> +
> +    rc = ioctl(port->fd, KDBR_PORT_OPEN_CONN, &connection);
> +    if (rc <= 0) {
> +        pr_err("Fail to open kdbr connection on port %d fd %d err %d\n",
> +               port->num, port->fd, rc);
> +        return 0;
> +    }
> +
> +    return (unsigned long)rc;
> +}
> +
> +void kdbr_close_connection(KdbrPort *port, unsigned long connection_id)
> +{
> +    int rc;
> +
> +    rc = ioctl(port->fd, KDBR_PORT_CLOSE_CONN, &connection_id);
> +    if (rc < 0) {
> +        pr_err("Fail to close kdbr connection on port %d\n",
> +               port->num);
> +    }
> +}
> +
> +void kdbr_register_tx_comp_handler(void (*comp_handler)(int status,
> +                                   unsigned int vendor_err, void *ctx))
> +{
> +    tx_comp_handler = comp_handler;
> +}
> +
> +void kdbr_register_rx_comp_handler(void (*comp_handler)(int status,
> +                                   unsigned int vendor_err, void *ctx))
> +{
> +    rx_comp_handler = comp_handler;
> +}
> +
> +void kdbr_send_wqe(KdbrPort *port, unsigned long connection_id, bool rc_qp,
> +                   struct RmSqWqe *wqe, void *ctx)
> +{
> +    KdbrCtx *sctx;
> +    int rc;
> +    int i;
> +
> +    pr_dbg("kdbr_port=%d\n", port->num);
> +    pr_dbg("kdbr_connection_id=%ld\n", connection_id);
> +    pr_dbg("wqe->hdr.num_sge=%d\n", wqe->hdr.num_sge);
> +
> +    /* Last minute validation - verify that kdbr supports num_sge */
> +    /* TODO: Make sure this will not happen! */
> +    if (wqe->hdr.num_sge > KDBR_MAX_IOVEC_LEN) {
> +        pr_err("Error: requested %d SGEs where kdbr supports %d\n",
> +               wqe->hdr.num_sge, KDBR_MAX_IOVEC_LEN);
> +        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_TOO_MANY_SGES, ctx);
> +        return;
> +    }
> +
> +    sctx = malloc(sizeof(*sctx));
> +    if (!sctx) {
> +        pr_err("Fail to allocate kdbr request ctx\n");
> +        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
> +    }
> +
> +    memset(&sctx->req, 0, sizeof(sctx->req));
> +    sctx->req.flags = KDBR_REQ_SIGNATURE | KDBR_REQ_POST_SEND;
> +    sctx->req.connection_id = connection_id;
> +
> +    sctx->up_ctx = ctx;
> +    sctx->is_tx_req = 1;
> +
> +    rc = rm_alloc_wqe_ctx(PVRDMA_DEV(port->dev), &sctx->req.req_id, sctx);
> +    if (rc != 0) {
> +        pr_err("Fail to allocate request ID\n");
> +        free(sctx);
> +        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
> +        return;
> +    }
> +    sctx->req.vlen = wqe->hdr.num_sge;
> +
> +    for (i = 0; i < wqe->hdr.num_sge; i++) {
> +        struct pvrdma_sge *sge;
> +
> +        sge = &wqe->sge[i];
> +
> +        pr_dbg("addr=0x%llx\n", sge->addr);
> +        pr_dbg("length=%d\n", sge->length);
> +        pr_dbg("lkey=0x%x\n", sge->lkey);
> +
> +        sctx->req.vec[i].iov_base = pvrdma_pci_dma_map(port->dev, sge->addr,
> +                                                       sge->length);
> +        sctx->req.vec[i].iov_len = sge->length;
> +    }
> +
> +    if (!rc_qp) {
> +        sctx->req.peer.rqueue = wqe->hdr.wr.ud.remote_qpn;
> +        sctx->req.peer.rgid.net_id = *((unsigned long *)
> +                        &wqe->hdr.wr.ud.av.dgid[0]);
> +        sctx->req.peer.rgid.id = *((unsigned long *)
> +                        &wqe->hdr.wr.ud.av.dgid[8]);
> +    }
> +
> +    rc = write(port->fd, &sctx->req, sizeof(sctx->req));
> +    if (rc < 0) {
> +        pr_err("Fail (%d, %d) to post send WQE to port %d, conn_id %ld\n", rc,
> +               errno, port->num, connection_id);
> +        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_FAIL_KDBR, ctx);
> +        return;
> +    }
> +}
> +
> +void kdbr_recv_wqe(KdbrPort *port, unsigned long connection_id,
> +                   struct RmRqWqe *wqe, void *ctx)
> +{
> +    KdbrCtx *sctx;
> +    int rc;
> +    int i;
> +
> +    pr_dbg("kdbr_port=%d\n", port->num);
> +    pr_dbg("kdbr_connection_id=%ld\n", connection_id);
> +    pr_dbg("wqe->hdr.num_sge=%d\n", wqe->hdr.num_sge);
> +
> +    /* Last minute validation - verify that kdbr supports num_sge */
> +    if (wqe->hdr.num_sge > KDBR_MAX_IOVEC_LEN) {
> +        pr_err("Error: requested %d SGEs where kdbr supports %d\n",
> +               wqe->hdr.num_sge, KDBR_MAX_IOVEC_LEN);
> +        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_TOO_MANY_SGES, ctx);
> +        return;
> +    }
> +
> +    sctx = malloc(sizeof(*sctx));
> +    if (!sctx) {
> +        pr_err("Fail to allocate kdbr request ctx\n");
> +        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
> +    }
> +
> +    memset(&sctx->req, 0, sizeof(sctx->req));
> +    sctx->req.flags = KDBR_REQ_SIGNATURE | KDBR_REQ_POST_RECV;
> +    sctx->req.connection_id = connection_id;
> +
> +    sctx->up_ctx = ctx;
> +    sctx->is_tx_req = 0;
> +
> +    pr_dbg("sctx=%p\n", sctx);
> +    rc = rm_alloc_wqe_ctx(PVRDMA_DEV(port->dev), &sctx->req.req_id, sctx);
> +    if (rc != 0) {
> +        pr_err("Fail to allocate request ID\n");
> +        free(sctx);
> +        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
> +        return;
> +    }
> +
> +    sctx->req.vlen = wqe->hdr.num_sge;
> +
> +    for (i = 0; i < wqe->hdr.num_sge; i++) {
> +        struct pvrdma_sge *sge;
> +
> +        sge = &wqe->sge[i];
> +
> +        pr_dbg("addr=0x%llx\n", sge->addr);
> +        pr_dbg("length=%d\n", sge->length);
> +        pr_dbg("lkey=0x%x\n", sge->lkey);
> +
> +        sctx->req.vec[i].iov_base = pvrdma_pci_dma_map(port->dev, sge->addr,
> +                                                       sge->length);
> +        sctx->req.vec[i].iov_len = sge->length;
> +    }
> +
> +    rc = write(port->fd, &sctx->req, sizeof(sctx->req));
> +    if (rc < 0) {
> +        pr_err("Fail (%d, %d) to post recv WQE to port %d, conn_id %ld\n", rc,
> +               errno, port->num, connection_id);
> +        tx_comp_handler(IB_WC_GENERAL_ERR, VENDOR_ERR_FAIL_KDBR, ctx);
> +        return;
> +    }
> +}
> +
> +static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
> +{
> +    pr_err("No completion handler is registered\n");
> +}
> +
> +int kdbr_init(void)
> +{
> +    kdbr_register_tx_comp_handler(dummy_comp_handler);
> +    kdbr_register_rx_comp_handler(dummy_comp_handler);
> +
> +    kdbr_fd = open(KDBR_FILE_NAME, 0);
> +    if (kdbr_fd < 0) {
> +        pr_dbg("Can't connect to kdbr, rc=%d\n", kdbr_fd);
> +        return -EIO;
> +    }
> +
> +    return 0;
> +}
> +
> +void kdbr_fini(void)
> +{
> +    close(kdbr_fd);
> +}
> diff --git a/hw/net/pvrdma/pvrdma_kdbr.h b/hw/net/pvrdma/pvrdma_kdbr.h
> new file mode 100644
> index 0000000..293a180
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_kdbr.h
> @@ -0,0 +1,53 @@
> +/*
> + * QEMU VMWARE paravirtual RDMA QP Operations
> + *
> + * Developed by Oracle & Redhat
> + *
> + * Authors:
> + *     Yuval Shaia <yuval.shaia@oracle.com>
> + *     Marcel Apfelbaum <marcel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef PVRDMA_KDBR_H
> +#define PVRDMA_KDBR_H
> +
> +#include <hw/net/pvrdma/pvrdma_types.h>
> +#include <hw/net/pvrdma/pvrdma_ib_verbs.h>
> +#include <hw/net/pvrdma/pvrdma_rm.h>
> +#include <hw/net/pvrdma/kdbr.h>
> +
> +typedef struct KdbrCompThread {
> +    QemuThread thread;
> +    QemuMutex mutex;
> +    bool run;
> +} KdbrCompThread;
> +
> +typedef struct KdbrPort {
> +    int num;
> +    int fd;
> +    KdbrCompThread comp_thread;
> +    PCIDevice *dev;
> +} KdbrPort;
> +
> +int kdbr_init(void);
> +void kdbr_fini(void);
> +KdbrPort *kdbr_alloc_port(PVRDMADev *dev);
> +void kdbr_free_port(KdbrPort *port);
> +void kdbr_register_tx_comp_handler(void (*comp_handler)(int status,
> +                                   unsigned int vendor_err, void *ctx));
> +void kdbr_register_rx_comp_handler(void (*comp_handler)(int status,
> +                                   unsigned int vendor_err, void *ctx));
> +unsigned long kdbr_open_connection(KdbrPort *port, u32 qpn,
> +                                   union pvrdma_gid dgid, u32 dqpn,
> +                                   bool rc_qp);
> +void kdbr_close_connection(KdbrPort *port, unsigned long connection_id);
> +void kdbr_send_wqe(KdbrPort *port, unsigned long connection_id, bool rc_qp,
> +                   struct RmSqWqe *wqe, void *ctx);
> +void kdbr_recv_wqe(KdbrPort *port, unsigned long connection_id,
> +                   struct RmRqWqe *wqe, void *ctx);
> +
> +#endif
> diff --git a/hw/net/pvrdma/pvrdma_main.c b/hw/net/pvrdma/pvrdma_main.c
> new file mode 100644
> index 0000000..5db802e
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_main.c
> @@ -0,0 +1,667 @@
> +#include <qemu/osdep.h>
> +#include <hw/hw.h>
> +#include <hw/pci/pci.h>
> +#include <hw/pci/pci_ids.h>
> +#include <hw/pci/msi.h>
> +#include <hw/pci/msix.h>
> +#include <hw/qdev-core.h>
> +#include <hw/qdev-properties.h>
> +#include <cpu.h>
> +
> +#include "hw/net/pvrdma/pvrdma.h"
> +#include "hw/net/pvrdma/pvrdma_defs.h"
> +#include "hw/net/pvrdma/pvrdma_utils.h"
> +#include "hw/net/pvrdma/pvrdma_dev_api.h"
> +#include "hw/net/pvrdma/pvrdma_rm.h"
> +#include "hw/net/pvrdma/pvrdma_kdbr.h"
> +#include "hw/net/pvrdma/pvrdma_qp_ops.h"
> +
> +static Property pvrdma_dev_properties[] = {
> +    DEFINE_PROP_UINT64("sys-image-guid", PVRDMADev, sys_image_guid, 0),
> +    DEFINE_PROP_UINT64("node-guid", PVRDMADev, node_guid, 0),
> +    DEFINE_PROP_UINT64("network-prefix", PVRDMADev, network_prefix, 0),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void free_dev_ring(PCIDevice *pci_dev, Ring *ring, void *ring_state)
> +{
> +    ring_free(ring);
> +    pvrdma_pci_dma_unmap(pci_dev, ring_state, TARGET_PAGE_SIZE);
> +}
> +
> +static int init_dev_ring(Ring *ring, struct pvrdma_ring **ring_state,
> +                         const char *name, PCIDevice *pci_dev,
> +                         dma_addr_t dir_addr, u32 num_pages)
> +{
> +    __u64 *dir, *tbl;
> +    int rc = 0;
> +
> +    pr_dbg("Initializing device ring %s\n", name);
> +    pr_dbg("pdir_dma=0x%llx\n", (long long unsigned int)dir_addr);
> +    pr_dbg("num_pages=%d\n", num_pages);
> +    dir = pvrdma_pci_dma_map(pci_dev, dir_addr, TARGET_PAGE_SIZE);
> +    if (!dir) {
> +        pr_err("Fail to map to page directory\n");
> +        rc = -ENOMEM;
> +        goto out;
> +    }
> +    tbl = pvrdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
> +    if (!tbl) {
> +        pr_err("Fail to map to page table\n");
> +        rc = -ENOMEM;
> +        goto out_free_dir;
> +    }
> +
> +    *ring_state = pvrdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
> +    if (!*ring_state) {
> +        pr_err("Fail to map to ring state\n");
> +        rc = -ENOMEM;
> +        goto out_free_tbl;
> +    }
> +    /* RX ring is the second */
> +    (struct pvrdma_ring *)(*ring_state)++;
> +    rc = ring_init(ring, name, pci_dev, (struct pvrdma_ring *)*ring_state,
> +                   (num_pages - 1) * TARGET_PAGE_SIZE /
> +                   sizeof(struct pvrdma_cqne), sizeof(struct pvrdma_cqne),
> +                   (dma_addr_t *)&tbl[1], (dma_addr_t)num_pages - 1);
> +    if (rc != 0) {
> +        pr_err("Fail to initialize ring\n");
> +        rc = -ENOMEM;
> +        goto out_free_ring_state;
> +    }
> +
> +    goto out_free_tbl;
> +
> +out_free_ring_state:
> +    pvrdma_pci_dma_unmap(pci_dev, *ring_state, TARGET_PAGE_SIZE);
> +
> +out_free_tbl:
> +    pvrdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
> +
> +out_free_dir:
> +    pvrdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
> +
> +out:
> +    return rc;
> +}
> +
> +static void free_dsr(PVRDMADev *dev)
> +{
> +    PCIDevice *pci_dev = PCI_DEVICE(dev);
> +
> +    if (!dev->dsr_info.dsr) {
> +        return;
> +    }
> +
> +    free_dev_ring(pci_dev, &dev->dsr_info.async,
> +                  dev->dsr_info.async_ring_state);
> +
> +    free_dev_ring(pci_dev, &dev->dsr_info.cq, dev->dsr_info.cq_ring_state);
> +
> +    pvrdma_pci_dma_unmap(pci_dev, dev->dsr_info.req,
> +                         sizeof(union pvrdma_cmd_req));
> +
> +    pvrdma_pci_dma_unmap(pci_dev, dev->dsr_info.rsp,
> +                         sizeof(union pvrdma_cmd_resp));
> +
> +    pvrdma_pci_dma_unmap(pci_dev, dev->dsr_info.dsr,
> +                         sizeof(struct pvrdma_device_shared_region));
> +
> +    dev->dsr_info.dsr = NULL;
> +}
> +
> +static int load_dsr(PVRDMADev *dev)
> +{
> +    int rc = 0;
> +    PCIDevice *pci_dev = PCI_DEVICE(dev);
> +    DSRInfo *dsr_info;
> +    struct pvrdma_device_shared_region *dsr;
> +
> +    free_dsr(dev);
> +
> +    /* Map to DSR */
> +    pr_dbg("dsr_dma=0x%llx\n", (long long unsigned int)dev->dsr_info.dma);
> +    dev->dsr_info.dsr = pvrdma_pci_dma_map(pci_dev, dev->dsr_info.dma,
> +                                sizeof(struct pvrdma_device_shared_region));
> +    if (!dev->dsr_info.dsr) {
> +        pr_err("Fail to map to DSR\n");
> +        rc = -ENOMEM;
> +        goto out;
> +    }
> +
> +    /* Shortcuts */
> +    dsr_info = &dev->dsr_info;
> +    dsr = dsr_info->dsr;
> +
> +    /* Map to command slot */
> +    pr_dbg("cmd_dma=0x%llx\n", (long long unsigned int)dsr->cmd_slot_dma);
> +    dsr_info->req = pvrdma_pci_dma_map(pci_dev, dsr->cmd_slot_dma,
> +                                       sizeof(union pvrdma_cmd_req));
> +    if (!dsr_info->req) {
> +        pr_err("Fail to map to command slot address\n");
> +        rc = -ENOMEM;
> +        goto out_free_dsr;
> +    }
> +
> +    /* Map to response slot */
> +    pr_dbg("rsp_dma=0x%llx\n", (long long unsigned int)dsr->resp_slot_dma);
> +    dsr_info->rsp = pvrdma_pci_dma_map(pci_dev, dsr->resp_slot_dma,
> +                                       sizeof(union pvrdma_cmd_resp));
> +    if (!dsr_info->rsp) {
> +        pr_err("Fail to map to response slot address\n");
> +        rc = -ENOMEM;
> +        goto out_free_req;
> +    }
> +
> +    /* Map to CQ notification ring */
> +    rc = init_dev_ring(&dsr_info->cq, &dsr_info->cq_ring_state, "dev_cq",
> +                       pci_dev, dsr->cq_ring_pages.pdir_dma,
> +                       dsr->cq_ring_pages.num_pages);
> +    if (rc != 0) {
> +        pr_err("Fail to map to initialize CQ ring\n");
> +        rc = -ENOMEM;
> +        goto out_free_rsp;
> +    }
> +
> +    /* Map to event notification ring */
> +    rc = init_dev_ring(&dsr_info->async, &dsr_info->async_ring_state,
> +                       "dev_async", pci_dev, dsr->async_ring_pages.pdir_dma,
> +                       dsr->async_ring_pages.num_pages);
> +    if (rc != 0) {
> +        pr_err("Fail to map to initialize event ring\n");
> +        rc = -ENOMEM;
> +        goto out_free_rsp;
> +    }
> +
> +    goto out;
> +
> +out_free_rsp:
> +    pvrdma_pci_dma_unmap(pci_dev, dsr_info->rsp, sizeof(union pvrdma_cmd_resp));
> +
> +out_free_req:
> +    pvrdma_pci_dma_unmap(pci_dev, dsr_info->req, sizeof(union pvrdma_cmd_req));
> +
> +out_free_dsr:
> +    pvrdma_pci_dma_unmap(pci_dev, dsr_info->dsr,
> +                         sizeof(struct pvrdma_device_shared_region));
> +    dsr_info->dsr = NULL;
> +
> +out:
> +    return rc;
> +}
> +
> +static void init_dev_caps(PVRDMADev *dev)
> +{
> +    struct pvrdma_device_shared_region *dsr;
> +
> +    if (dev->dsr_info.dsr == NULL) {
> +        pr_err("Can't initialized DSR\n");
> +        return;
> +    }
> +
> +    dsr = dev->dsr_info.dsr;
> +
> +    dsr->caps.fw_ver = PVRDMA_FW_VERSION;
> +    pr_dbg("fw_ver=0x%lx\n", dsr->caps.fw_ver);
> +
> +    dsr->caps.mode = PVRDMA_DEVICE_MODE_ROCE;
> +    pr_dbg("mode=%d\n", dsr->caps.mode);
> +
> +    dsr->caps.gid_types |= PVRDMA_GID_TYPE_FLAG_ROCE_V1;
> +    pr_dbg("gid_types=0x%x\n", dsr->caps.gid_types);
> +
> +    dsr->caps.max_uar = RDMA_BAR2_UAR_SIZE;
> +    pr_dbg("max_uar=%d\n", dsr->caps.max_uar);
> +
> +    if (rm_get_max_pds(&dsr->caps.max_pd)) {
> +        return;
> +    }
> +    pr_dbg("max_pd=%d\n", dsr->caps.max_pd);
> +
> +    if (rm_get_max_gids(&dsr->caps.gid_tbl_len)) {
> +        return;
> +    }
> +    pr_dbg("gid_tbl_len=%d\n", dsr->caps.gid_tbl_len);
> +
> +    if (rm_get_max_cqs(&dsr->caps.max_cq)) {
> +        return;
> +    }
> +    pr_dbg("max_cq=%d\n", dsr->caps.max_cq);
> +
> +    if (rm_get_max_cqes(&dsr->caps.max_cqe)) {
> +        return;
> +    }
> +    pr_dbg("max_cqe=%d\n", dsr->caps.max_cqe);
> +
> +    if (rm_get_max_qps(&dsr->caps.max_qp)) {
> +        return;
> +    }
> +    pr_dbg("max_qp=%d\n", dsr->caps.max_qp);
> +
> +    dsr->caps.sys_image_guid = cpu_to_be64(dev->sys_image_guid);
> +    pr_dbg("sys_image_guid=%llx\n",
> +           (long long unsigned int)be64_to_cpu(dsr->caps.sys_image_guid));
> +
> +    dsr->caps.node_guid = cpu_to_be64(dev->node_guid);
> +    pr_dbg("node_guid=%llx\n",
> +           (long long unsigned int)be64_to_cpu(dsr->caps.node_guid));
> +
> +    if (rm_get_phys_port_cnt(&dsr->caps.phys_port_cnt)) {
> +        return;
> +    }
> +    pr_dbg("phys_port_cnt=%d\n", dsr->caps.phys_port_cnt);
> +
> +    if (rm_get_max_qp_wrs(&dsr->caps.max_qp_wr)) {
> +        return;
> +    }
> +    pr_dbg("max_qp_wr=%d\n", dsr->caps.max_qp_wr);
> +
> +    if (rm_get_max_sges(&dsr->caps.max_sge)) {
> +        return;
> +    }
> +    pr_dbg("max_sge=%d\n", dsr->caps.max_sge);
> +
> +    if (rm_get_max_mrs(&dsr->caps.max_mr)) {
> +        return;
> +    }
> +    pr_dbg("max_mr=%d\n", dsr->caps.max_mr);
> +
> +    if (rm_get_max_pkeys(&dsr->caps.max_pkeys)) {
> +        return;
> +    }
> +    pr_dbg("max_pkeys=%d\n", dsr->caps.max_pkeys);
> +
> +    if (rm_get_max_ah(&dsr->caps.max_ah)) {
> +        return;
> +    }
> +    pr_dbg("max_ah=%d\n", dsr->caps.max_ah);
> +
> +    pr_dbg("Initialized\n");
> +}
> +
> +static void free_ports(PVRDMADev *dev)
> +{
> +    int i;
> +
> +    for (i = 0; i < MAX_PORTS; i++) {
> +        free(dev->ports[i].gid_tbl);
> +        kdbr_free_port(dev->ports[i].kdbr_port);
> +    }
> +}
> +
> +static int init_ports(PVRDMADev *dev)
> +{
> +    int i, ret = 0;
> +    __u32 max_port_gids;
> +    __u32 max_port_pkeys;
> +
> +    memset(dev->ports, 0, sizeof(dev->ports));
> +
> +    ret = rm_get_max_port_gids(&max_port_gids);
> +    if (ret != 0) {
> +        goto err;
> +    }
> +
> +    ret = rm_get_max_port_pkeys(&max_port_pkeys);
> +    if (ret != 0) {
> +        goto err;
> +    }
> +
> +    for (i = 0; i < MAX_PORTS; i++) {
> +        dev->ports[i].state = PVRDMA_PORT_DOWN;
> +
> +        dev->ports[i].pkey_tbl = malloc(sizeof(*dev->ports[i].pkey_tbl) *
> +                                        max_port_pkeys);
> +        if (dev->ports[i].gid_tbl == NULL) {
> +            goto err_free_ports;
> +        }
> +
> +        memset(dev->ports[i].gid_tbl, 0, sizeof(dev->ports[i].gid_tbl));
> +    }
> +
> +    return 0;
> +
> +err_free_ports:
> +    free_ports(dev);
> +
> +err:
> +    pr_err("Fail to initialize device's ports\n");
> +
> +    return ret;
> +}
> +
> +static void activate_device(PVRDMADev *dev)
> +{
> +    set_reg_val(dev, PVRDMA_REG_ERR, 0);
> +    pr_dbg("Device activated\n");
> +}
> +
> +static int quiesce_device(PVRDMADev *dev)
> +{
> +    pr_dbg("Device quiesced\n");
> +    return 0;
> +}
> +
> +static int reset_device(PVRDMADev *dev)
> +{
> +    pr_dbg("Device reset complete\n");
> +    return 0;
> +}
> +
> +static uint64_t regs_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    PVRDMADev *dev = opaque;
> +    __u32 val;
> +
> +    /* pr_dbg("addr=0x%lx, size=%d\n", addr, size); */
> +
> +    if (get_reg_val(dev, addr, &val)) {
> +        pr_dbg("Error trying to read REG value from address 0x%x\n",
> +               (__u32)addr);
> +        return -EINVAL;
> +    }
> +
> +    /* pr_dbg("regs[0x%x]=0x%x\n", (__u32)addr, val); */
> +
> +    return val;
> +}
> +
> +static void regs_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
> +{
> +    PVRDMADev *dev = opaque;
> +
> +    /* pr_dbg("addr=0x%lx, val=0x%x, size=%d\n", addr, (uint32_t)val, size); */
> +
> +    if (set_reg_val(dev, addr, val)) {
> +        pr_err("Error trying to set REG value, addr=0x%x, val=0x%lx\n",
> +               (__u32)addr, val);
> +        return;
> +    }
> +
> +    /* pr_dbg("regs[0x%x]=0x%lx\n", (__u32)addr, val); */
> +
> +    switch (addr) {
> +    case PVRDMA_REG_DSRLOW:
> +        dev->dsr_info.dma = val;
> +        break;
> +    case PVRDMA_REG_DSRHIGH:
> +        dev->dsr_info.dma |= val << 32;
> +        load_dsr(dev);
> +        init_dev_caps(dev);
> +        break;
> +    case PVRDMA_REG_CTL:
> +        switch (val) {
> +        case PVRDMA_DEVICE_CTL_ACTIVATE:
> +            activate_device(dev);
> +            break;
> +        case PVRDMA_DEVICE_CTL_QUIESCE:
> +            quiesce_device(dev);
> +            break;
> +        case PVRDMA_DEVICE_CTL_RESET:
> +            reset_device(dev);
> +            break;
> +        }
> +    case PVRDMA_REG_IMR:
> +        pr_dbg("Interrupt mask=0x%lx\n", val);
> +        dev->interrupt_mask = val;
> +        break;
> +    case PVRDMA_REG_REQUEST:
> +        if (val == 0) {
> +            execute_command(dev);
> +        }
> +    default:
> +        break;
> +    }
> +}
> +
> +static const MemoryRegionOps regs_ops = {
> +    .read = regs_read,
> +    .write = regs_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +    .impl = {
> +        .min_access_size = sizeof(uint32_t),
> +        .max_access_size = sizeof(uint32_t),
> +    },
> +};
> +
> +static uint64_t uar_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    PVRDMADev *dev = opaque;
> +    __u32 val;
> +
> +    pr_dbg("addr=0x%lx, size=%d\n", addr, size);
> +
> +    if (get_uar_val(dev, addr, &val)) {
> +        pr_dbg("Error trying to read UAR value from address 0x%x\n",
> +               (__u32)addr);
> +        return -EINVAL;
> +    }
> +
> +    pr_dbg("uar[0x%x]=0x%x\n", (__u32)addr, val);
> +
> +    return val;
> +}
> +
> +static void uar_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
> +{
> +    PVRDMADev *dev = opaque;
> +
> +    /* pr_dbg("addr=0x%lx, val=0x%x, size=%d\n", addr, (uint32_t)val, size); */
> +
> +    if (set_uar_val(dev, addr, val)) {
> +        pr_err("Error trying to set UAR value, addr=0x%x, val=0x%lx\n",
> +               (__u32)addr, val);
> +        return;
> +    }
> +
> +    /* pr_dbg("uar[0x%x]=0x%lx\n", (__u32)addr, val); */
> +
> +    switch (addr) {
> +    case PVRDMA_UAR_QP_OFFSET:
> +        pr_dbg("UAR QP command, addr=0x%x, val=0x%lx\n", (__u32)addr, val);
> +        if (val & PVRDMA_UAR_QP_SEND) {
> +            qp_send(dev, val & PVRDMA_UAR_HANDLE_MASK);
> +        }
> +        if (val & PVRDMA_UAR_QP_RECV) {
> +            qp_recv(dev, val & PVRDMA_UAR_HANDLE_MASK);
> +        }
> +        break;
> +    case PVRDMA_UAR_CQ_OFFSET:
> +        pr_dbg("UAR CQ command, addr=0x%x, val=0x%lx\n", (__u32)addr, val);
> +        rm_req_notify_cq(dev, val & PVRDMA_UAR_HANDLE_MASK,
> +                 val & ~PVRDMA_UAR_HANDLE_MASK);
> +        break;
> +    default:
> +        pr_err("Unsupported command, addr=0x%x, val=0x%lx\n", (__u32)addr, val);
> +        break;
> +    }
> +}
> +
> +static const MemoryRegionOps uar_ops = {
> +    .read = uar_read,
> +    .write = uar_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +    .impl = {
> +        .min_access_size = sizeof(uint32_t),
> +        .max_access_size = sizeof(uint32_t),
> +    },
> +};
> +
> +static void init_pci_config(PCIDevice *pdev)
> +{
> +    pdev->config[PCI_INTERRUPT_PIN] = 1;
> +}
> +
> +static void init_bars(PCIDevice *pdev)
> +{
> +    PVRDMADev *dev = PVRDMA_DEV(pdev);
> +
> +    /* BAR 0 - MSI-X */
> +    memory_region_init(&dev->msix, OBJECT(dev), "pvrdma-msix",
> +                       RDMA_BAR0_MSIX_SIZE);
> +    pci_register_bar(pdev, RDMA_MSIX_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY,
> +                     &dev->msix);
> +
> +    /* BAR 1 - Registers */
> +    memset(&dev->regs_data, 0, RDMA_BAR1_REGS_SIZE);
> +    memory_region_init_io(&dev->regs, OBJECT(dev), &regs_ops, dev,
> +                          "pvrdma-regs", RDMA_BAR1_REGS_SIZE);
> +    pci_register_bar(pdev, RDMA_REG_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY,
> +                     &dev->regs);
> +
> +    /* BAR 2 - UAR */
> +    memset(&dev->uar_data, 0, RDMA_BAR2_UAR_SIZE);
> +    memory_region_init_io(&dev->uar, OBJECT(dev), &uar_ops, dev, "rdma-uar",
> +                          RDMA_BAR2_UAR_SIZE);
> +    pci_register_bar(pdev, RDMA_UAR_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY,
> +                     &dev->uar);
> +}
> +
> +static void init_regs(PCIDevice *pdev)
> +{
> +    PVRDMADev *dev = PVRDMA_DEV(pdev);
> +
> +    set_reg_val(dev, PVRDMA_REG_VERSION, PVRDMA_HW_VERSION);
> +    set_reg_val(dev, PVRDMA_REG_ERR, 0xFFFF);
> +}
> +
> +static void uninit_msix(PCIDevice *pdev, int used_vectors)
> +{
> +    PVRDMADev *dev = PVRDMA_DEV(pdev);
> +    int i;
> +
> +    for (i = 0; i < used_vectors; i++) {
> +        msix_vector_unuse(pdev, i);
> +    }
> +
> +    msix_uninit(pdev, &dev->msix, &dev->msix);
> +}
> +
> +static int init_msix(PCIDevice *pdev)
> +{
> +    PVRDMADev *dev = PVRDMA_DEV(pdev);
> +    int i;
> +    int rc;
> +
> +    rc = msix_init(pdev, RDMA_MAX_INTRS, &dev->msix, RDMA_MSIX_BAR_IDX,
> +                   RDMA_MSIX_TABLE, &dev->msix, RDMA_MSIX_BAR_IDX,
> +                   RDMA_MSIX_PBA, 0, NULL);
> +
> +    if (rc < 0) {
> +        pr_err("Fail to initialize MSI-X\n");
> +        return rc;
> +    }
> +
> +    for (i = 0; i < RDMA_MAX_INTRS; i++) {
> +        rc = msix_vector_use(PCI_DEVICE(dev), i);
> +        if (rc < 0) {
> +            pr_err("Fail mark MSI-X vercor %d\n", i);
> +            uninit_msix(pdev, i);
> +            return rc;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static int pvrdma_init(PCIDevice *pdev)
> +{
> +    int rc;
> +    PVRDMADev *dev = PVRDMA_DEV(pdev);
> +
> +    pr_info("Initializing device %s %x.%x\n", pdev->name,
> +            PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
> +
> +    dev->dsr_info.dsr = NULL;
> +
> +    init_pci_config(pdev);
> +
> +    init_bars(pdev);
> +
> +    init_regs(pdev);
> +
> +    rc = init_msix(pdev);
> +    if (rc != 0) {
> +        goto out;
> +    }
> +
> +    rc = kdbr_init();
> +    if (rc != 0) {
> +        goto out;
> +    }
> +
> +    rc = rm_init(dev);
> +    if (rc != 0) {
> +        goto out;
> +    }
> +
> +    rc = init_ports(dev);
> +    if (rc != 0) {
> +        goto out;
> +    }
> +
> +    rc = qp_ops_init();
> +    if (rc != 0) {
> +        goto out;
> +    }
> +
> +out:
> +    if (rc != 0) {
> +        pr_err("Device fail to load\n");
> +    }
> +
> +    return rc;
> +}
> +
> +static void pvrdma_exit(PCIDevice *pdev)
> +{
> +    PVRDMADev *dev = PVRDMA_DEV(pdev);
> +
> +    pr_info("Closing device %s %x.%x\n", pdev->name,
> +            PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
> +
> +    qp_ops_fini();
> +
> +    free_ports(dev);
> +
> +    rm_fini(dev);
> +
> +    kdbr_fini();
> +
> +    free_dsr(dev);
> +
> +    if (msix_enabled(pdev)) {
> +        uninit_msix(pdev, RDMA_MAX_INTRS);
> +    }
> +}
> +
> +static void pvrdma_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
> +
> +    k->init = pvrdma_init;
> +    k->exit = pvrdma_exit;
> +    k->vendor_id = PCI_VENDOR_ID_VMWARE;
> +    k->device_id = PCI_DEVICE_ID_VMWARE_PVRDMA;
> +    k->revision = 0x00;
> +    k->class_id = PCI_CLASS_NETWORK_OTHER;
> +
> +    dc->desc = "RDMA Device";
> +    dc->props = pvrdma_dev_properties;
> +    set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
> +}
> +
> +static const TypeInfo pvrdma_info = {
> +    .name = PVRDMA_HW_NAME,
> +    .parent    = TYPE_PCI_DEVICE,
> +    .instance_size = sizeof(PVRDMADev),
> +    .class_init = pvrdma_class_init,
> +};
> +
> +static void register_types(void)
> +{
> +    type_register_static(&pvrdma_info);
> +}
> +
> +type_init(register_types)
> diff --git a/hw/net/pvrdma/pvrdma_qp_ops.c b/hw/net/pvrdma/pvrdma_qp_ops.c
> new file mode 100644
> index 0000000..2db45d9
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_qp_ops.c
> @@ -0,0 +1,174 @@
> +#include "hw/net/pvrdma/pvrdma.h"
> +#include "hw/net/pvrdma/pvrdma_utils.h"
> +#include "hw/net/pvrdma/pvrdma_qp_ops.h"
> +#include "hw/net/pvrdma/pvrdma_rm.h"
> +#include "hw/net/pvrdma/pvrdma-uapi.h"
> +#include "hw/net/pvrdma/pvrdma_kdbr.h"
> +#include "sysemu/dma.h"
> +#include "hw/pci/pci.h"
> +
> +typedef struct CompHandlerCtx {
> +    PVRDMADev *dev;
> +    u32 cq_handle;
> +    struct pvrdma_cqe cqe;
> +} CompHandlerCtx;
> +
> +/*
> + * 1. Put CQE on send CQ ring
> + * 2. Put CQ number on dsr completion ring
> + * 3. Interrupt host
> + */
> +static int post_cqe(PVRDMADev *dev, u32 cq_handle, struct pvrdma_cqe *cqe)
> +{
> +    struct pvrdma_cqe *cqe1;
> +    struct pvrdma_cqne *cqne;
> +    RmCQ *cq = rm_get_cq(dev, cq_handle);
> +
> +    if (!cq) {
> +        pr_dbg("Invalid cqn %d\n", cq_handle);
> +        return -EINVAL;
> +    }
> +
> +    pr_dbg("cq->comp_type=%d\n", cq->comp_type);
> +    if (cq->comp_type == CCT_NONE) {
> +        return 0;
> +    }
> +    cq->comp_type = CCT_NONE;
> +
> +    /* Step #1: Put CQE on CQ ring */
> +    pr_dbg("Writing CQE\n");
> +    cqe1 = ring_next_elem_write(&cq->cq);
> +    if (!cqe1) {
> +        return -EINVAL;
> +    }
> +
> +    memcpy(cqe1, cqe, sizeof(*cqe));
> +    ring_write_inc(&cq->cq);
> +
> +    /* Step #2: Put CQ number on dsr completion ring */
> +    pr_dbg("Writing CQNE\n");
> +    cqne = ring_next_elem_write(&dev->dsr_info.cq);
> +    if (!cqne) {
> +        return -EINVAL;
> +    }
> +
> +    cqne->info = cq_handle;
> +    ring_write_inc(&dev->dsr_info.cq);
> +
> +    post_interrupt(dev, INTR_VEC_CMD_COMPLETION_Q);
> +
> +    return 0;
> +}
> +
> +static void qp_ops_comp_handler(int status, unsigned int vendor_err, void *ctx)
> +{
> +    CompHandlerCtx *comp_ctx = (CompHandlerCtx *)ctx;
> +
> +    pr_dbg("cq_handle=%d\n", comp_ctx->cq_handle);
> +    pr_dbg("wr_id=%lld\n", comp_ctx->cqe.wr_id);
> +    pr_dbg("status=%d\n", status);
> +    pr_dbg("vendor_err=0x%x\n", vendor_err);
> +    comp_ctx->cqe.status = status;
> +    comp_ctx->cqe.vendor_err = vendor_err;
> +    post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe);
> +    free(ctx);
> +}
> +
> +void qp_ops_fini(void)
> +{
> +}
> +
> +int qp_ops_init(void)
> +{
> +    kdbr_register_tx_comp_handler(qp_ops_comp_handler);
> +    kdbr_register_rx_comp_handler(qp_ops_comp_handler);
> +
> +    return 0;
> +}
> +
> +int qp_send(PVRDMADev *dev, __u32 qp_handle)
> +{
> +    RmQP *qp;
> +    RmSqWqe *wqe;
> +
> +    qp = rm_get_qp(dev, qp_handle);
> +    if (!qp) {
> +        return -EINVAL;
> +    }
> +
> +    if (qp->qp_state < PVRDMA_QPS_RTS) {
> +        pr_dbg("Invalid QP state for send\n");
> +        return -EINVAL;
> +    }
> +
> +    wqe = (struct RmSqWqe *)ring_next_elem_read(&qp->sq);
> +    while (wqe) {
> +        CompHandlerCtx *comp_ctx;
> +
> +        pr_dbg("wr_id=%lld\n", wqe->hdr.wr_id);
> +        wqe->hdr.num_sge = MIN(wqe->hdr.num_sge,
> +                       qp->init_args.max_send_sge);
> +
> +        /* Prepare CQE */
> +        comp_ctx = malloc(sizeof(CompHandlerCtx));
> +        comp_ctx->dev = dev;
> +        comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
> +        comp_ctx->cqe.qp = qp_handle;
> +        comp_ctx->cq_handle = qp->init_args.send_cq_handle;
> +        comp_ctx->cqe.opcode = wqe->hdr.opcode;
> +        /* TODO: Fill rest of the data */
> +
> +        kdbr_send_wqe(dev->ports[qp->port_num].kdbr_port,
> +                      qp->kdbr_connection_id,
> +                      qp->init_args.qp_type == PVRDMA_QPT_RC, wqe, comp_ctx);
> +
> +        ring_read_inc(&qp->sq);
> +
> +        wqe = ring_next_elem_read(&qp->sq);
> +    }
> +
> +    return 0;
> +}
> +
> +int qp_recv(PVRDMADev *dev, __u32 qp_handle)
> +{
> +    RmQP *qp;
> +    RmRqWqe *wqe;
> +
> +    qp = rm_get_qp(dev, qp_handle);
> +    if (!qp) {
> +        return -EINVAL;
> +    }
> +
> +    if (qp->qp_state < PVRDMA_QPS_RTR) {
> +        pr_dbg("Invalid QP state for receive\n");
> +        return -EINVAL;
> +    }
> +
> +    wqe = (struct RmRqWqe *)ring_next_elem_read(&qp->rq);
> +    while (wqe) {
> +        CompHandlerCtx *comp_ctx;
> +
> +        pr_dbg("wr_id=%lld\n", wqe->hdr.wr_id);
> +        wqe->hdr.num_sge = MIN(wqe->hdr.num_sge,
> +                       qp->init_args.max_send_sge);
> +
> +        /* Prepare CQE */
> +        comp_ctx = malloc(sizeof(CompHandlerCtx));
> +        comp_ctx->dev = dev;
> +        comp_ctx->cqe.qp = qp_handle;
> +        comp_ctx->cq_handle = qp->init_args.recv_cq_handle;
> +        comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
> +        comp_ctx->cqe.qp = qp_handle;
> +        /* TODO: Fill rest of the data */
> +
> +        kdbr_recv_wqe(dev->ports[qp->port_num].kdbr_port,
> +                      qp->kdbr_connection_id, wqe, comp_ctx);
> +
> +        ring_read_inc(&qp->rq);
> +
> +        wqe = ring_next_elem_read(&qp->rq);
> +    }
> +
> +    return 0;
> +}
> diff --git a/hw/net/pvrdma/pvrdma_qp_ops.h b/hw/net/pvrdma/pvrdma_qp_ops.h
> new file mode 100644
> index 0000000..20125d6
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_qp_ops.h
> @@ -0,0 +1,25 @@
> +/*
> + * QEMU VMWARE paravirtual RDMA QP Operations
> + *
> + * Developed by Oracle & Redhat
> + *
> + * Authors:
> + *     Yuval Shaia <yuval.shaia@oracle.com>
> + *     Marcel Apfelbaum <marcel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef PVRDMA_QP_H
> +#define PVRDMA_QP_H
> +
> +typedef struct PVRDMADev PVRDMADev;
> +
> +int qp_ops_init(void);
> +void qp_ops_fini(void);
> +int qp_send(PVRDMADev *dev, __u32 qp_handle);
> +int qp_recv(PVRDMADev *dev, __u32 qp_handle);
> +
> +#endif
> diff --git a/hw/net/pvrdma/pvrdma_ring.c b/hw/net/pvrdma/pvrdma_ring.c
> new file mode 100644
> index 0000000..34dc1f5
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_ring.c
> @@ -0,0 +1,127 @@
> +#include <qemu/osdep.h>
> +#include <hw/pci/pci.h>
> +#include <cpu.h>
> +#include <hw/net/pvrdma/pvrdma_ring.h>
> +#include <hw/net/pvrdma/pvrdma-uapi.h>
> +#include <hw/net/pvrdma/pvrdma_utils.h>
> +
> +int ring_init(Ring *ring, const char *name, PCIDevice *dev,
> +              struct pvrdma_ring *ring_state, size_t max_elems, size_t elem_sz,
> +              dma_addr_t *tbl, dma_addr_t npages)
> +{
> +    int i;
> +    int rc = 0;
> +
> +    strncpy(ring->name, name, MAX_RING_NAME_SZ);
> +    ring->name[MAX_RING_NAME_SZ - 1] = 0;
> +    pr_info("Initializing %s ring\n", ring->name);
> +    ring->dev = dev;
> +    ring->ring_state = ring_state;
> +    ring->max_elems = max_elems;
> +    ring->elem_sz = elem_sz;
> +    pr_dbg("ring->elem_sz=%ld\n", ring->elem_sz);
> +    pr_dbg("npages=%ld\n", npages);
> +    /* TODO: Give a moment to think if we want to redo driver settings
> +    atomic_set(&ring->ring_state->prod_tail, 0);
> +    atomic_set(&ring->ring_state->cons_head, 0);
> +    */
> +    ring->npages = npages;
> +    ring->pages = malloc(npages * sizeof(void *));
> +    for (i = 0; i < npages; i++) {
> +        if (!tbl[i]) {
> +            pr_err("npages=%ld but tbl[%d] is NULL\n", npages, i);
> +            continue;
> +        }
> +
> +        ring->pages[i] = pvrdma_pci_dma_map(dev, tbl[i], TARGET_PAGE_SIZE);
> +        if (!ring->pages[i]) {
> +            rc = -ENOMEM;
> +            pr_err("Fail to map to page %d\n", i);
> +            goto out_free;
> +        }
> +    }
> +
> +    goto out;
> +
> +out_free:
> +    while (i--) {
> +        pvrdma_pci_dma_unmap(dev, ring->pages[i], TARGET_PAGE_SIZE);
> +    }
> +    free(ring->pages);
> +
> +out:
> +    return rc;
> +}
> +
> +void *ring_next_elem_read(Ring *ring)
> +{
> +    unsigned int idx = 0, offset;
> +
> +    /*
> +    pr_dbg("%s: t=%d, h=%d\n", ring->name, ring->ring_state->prod_tail,
> +           ring->ring_state->cons_head);
> +    */
> +
> +    if (!pvrdma_idx_ring_has_data(ring->ring_state, ring->max_elems, &idx)) {
> +        pr_dbg("No more data in ring\n");
> +        return NULL;
> +    }
> +
> +    offset = idx * ring->elem_sz;
> +    /*
> +    pr_dbg("idx=%d\n", idx);
> +    pr_dbg("offset=%d\n", offset);
> +    */
> +    return ring->pages[offset / TARGET_PAGE_SIZE] + (offset % TARGET_PAGE_SIZE);
> +}
> +
> +void ring_read_inc(Ring *ring)
> +{
> +    pvrdma_idx_ring_inc(&ring->ring_state->cons_head, ring->max_elems);
> +    /*
> +    pr_dbg("%s: t=%d, h=%d, m=%ld\n", ring->name,
> +           ring->ring_state->prod_tail, ring->ring_state->cons_head,
> +           ring->max_elems);
> +    */
> +}
> +
> +void *ring_next_elem_write(Ring *ring)
> +{
> +    unsigned int idx, offset, tail;
> +
> +    /*
> +    pr_dbg("%s: t=%d, h=%d\n", ring->name, ring->ring_state->prod_tail,
> +           ring->ring_state->cons_head);
> +    */
> +
> +    if (!pvrdma_idx_ring_has_space(ring->ring_state, ring->max_elems, &tail)) {
> +        pr_dbg("CQ is full\n");
> +        return NULL;
> +    }
> +
> +    idx = pvrdma_idx(&ring->ring_state->prod_tail, ring->max_elems);
> +    /* TODO: tail == idx */
> +
> +    offset = idx * ring->elem_sz;
> +    return ring->pages[offset / TARGET_PAGE_SIZE] + (offset % TARGET_PAGE_SIZE);
> +}
> +
> +void ring_write_inc(Ring *ring)
> +{
> +    pvrdma_idx_ring_inc(&ring->ring_state->prod_tail, ring->max_elems);
> +    /*
> +    pr_dbg("%s: t=%d, h=%d, m=%ld\n", ring->name,
> +           ring->ring_state->prod_tail, ring->ring_state->cons_head,
> +           ring->max_elems);
> +    */
> +}
> +
> +void ring_free(Ring *ring)
> +{
> +    while (ring->npages--) {
> +        pvrdma_pci_dma_unmap(ring->dev, ring->pages[ring->npages],
> +                             TARGET_PAGE_SIZE);
> +    }
> +
> +    free(ring->pages);
> +}
> diff --git a/hw/net/pvrdma/pvrdma_ring.h b/hw/net/pvrdma/pvrdma_ring.h
> new file mode 100644
> index 0000000..8a0c448
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_ring.h
> @@ -0,0 +1,43 @@
> +/*
> + * QEMU VMWARE paravirtual RDMA interface definitions
> + *
> + * Developed by Oracle & Redhat
> + *
> + * Authors:
> + *     Yuval Shaia <yuval.shaia@oracle.com>
> + *     Marcel Apfelbaum <marcel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef PVRDMA_RING_H
> +#define PVRDMA_RING_H
> +
> +#include <qemu/typedefs.h>
> +#include <hw/net/pvrdma/pvrdma-uapi.h>
> +#include <hw/net/pvrdma/pvrdma_types.h>
> +
> +#define MAX_RING_NAME_SZ 16
> +
> +typedef struct Ring {
> +    char name[MAX_RING_NAME_SZ];
> +    PCIDevice *dev;
> +    size_t max_elems;
> +    size_t elem_sz;
> +    struct pvrdma_ring *ring_state;
> +    int npages;
> +    void **pages;
> +} Ring;
> +
> +int ring_init(Ring *ring, const char *name, PCIDevice *dev,
> +              struct pvrdma_ring *ring_state, size_t max_elems, size_t elem_sz,
> +              dma_addr_t *tbl, dma_addr_t npages);
> +void *ring_next_elem_read(Ring *ring);
> +void ring_read_inc(Ring *ring);
> +void *ring_next_elem_write(Ring *ring);
> +void ring_write_inc(Ring *ring);
> +void ring_free(Ring *ring);
> +
> +#endif
> diff --git a/hw/net/pvrdma/pvrdma_rm.c b/hw/net/pvrdma/pvrdma_rm.c
> new file mode 100644
> index 0000000..55ca1e5
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_rm.c
> @@ -0,0 +1,529 @@
> +#include <hw/net/pvrdma/pvrdma.h>
> +#include <hw/net/pvrdma/pvrdma_utils.h>
> +#include <hw/net/pvrdma/pvrdma_rm.h>
> +#include <hw/net/pvrdma/pvrdma-uapi.h>
> +#include <hw/net/pvrdma/pvrdma_kdbr.h>
> +#include <qemu/bitmap.h>
> +#include <qemu/atomic.h>
> +#include <cpu.h>
> +
> +/* Page directory and page tables */
> +#define PG_DIR_SZ { TARGET_PAGE_SIZE / sizeof(__u64) }
> +#define PG_TBL_SZ { TARGET_PAGE_SIZE / sizeof(__u64) }
> +
> +/* Global local and remote keys */
> +__u64 global_lkey = 1;
> +__u64 global_rkey = 1;
> +
> +static inline int res_tbl_init(const char *name, RmResTbl *tbl, u32 tbl_sz,
> +                               u32 res_sz)
> +{
> +    tbl->tbl = malloc(tbl_sz * res_sz);
> +    if (!tbl->tbl) {
> +        return -ENOMEM;
> +    }
> +
> +    strncpy(tbl->name, name, MAX_RING_NAME_SZ);
> +    tbl->name[MAX_RING_NAME_SZ - 1] = 0;
> +
> +    tbl->bitmap = bitmap_new(tbl_sz);
> +    tbl->tbl_sz = tbl_sz;
> +    tbl->res_sz = res_sz;
> +    qemu_mutex_init(&tbl->lock);
> +
> +    return 0;
> +}
> +
> +static inline void res_tbl_free(RmResTbl *tbl)
> +{
> +    qemu_mutex_destroy(&tbl->lock);
> +    free(tbl->tbl);
> +    bitmap_zero_extend(tbl->bitmap, tbl->tbl_sz, 0);
> +}
> +
> +static inline void *res_tbl_get(RmResTbl *tbl, u32 handle)
> +{
> +    pr_dbg("%s, handle=%d\n", tbl->name, handle);
> +
> +    if ((handle < tbl->tbl_sz) && (test_bit(handle, tbl->bitmap))) {
> +        return tbl->tbl + handle * tbl->res_sz;
> +    } else {
> +        pr_dbg("Invalid handle %d\n", handle);
> +        return NULL;
> +    }
> +}
> +
> +static inline void *res_tbl_alloc(RmResTbl *tbl, u32 *handle)
> +{
> +    qemu_mutex_lock(&tbl->lock);
> +
> +    *handle = find_first_zero_bit(tbl->bitmap, tbl->tbl_sz);
> +    if (*handle > tbl->tbl_sz) {
> +        pr_dbg("Fail to alloc, bitmap is full\n");
> +        qemu_mutex_unlock(&tbl->lock);
> +        return NULL;
> +    }
> +
> +    set_bit(*handle, tbl->bitmap);
> +
> +    qemu_mutex_unlock(&tbl->lock);
> +
> +    pr_dbg("%s, handle=%d\n", tbl->name, *handle);
> +
> +    return tbl->tbl + *handle * tbl->res_sz;
> +}
> +
> +static inline void res_tbl_dealloc(RmResTbl *tbl, u32 handle)
> +{
> +    pr_dbg("%s, handle=%d\n", tbl->name, handle);
> +
> +    qemu_mutex_lock(&tbl->lock);
> +
> +    if (handle < tbl->tbl_sz) {
> +        clear_bit(handle, tbl->bitmap);
> +    }
> +
> +    qemu_mutex_unlock(&tbl->lock);
> +}
> +
> +int rm_alloc_pd(PVRDMADev *dev, __u32 *pd_handle, __u32 ctx_handle)
> +{
> +    RmPD *pd;
> +
> +    pd = res_tbl_alloc(&dev->pd_tbl, pd_handle);
> +    if (!pd) {
> +        return -ENOMEM;
> +    }
> +
> +    pd->ctx_handle = ctx_handle;
> +
> +    return 0;
> +}
> +
> +void rm_dealloc_pd(PVRDMADev *dev, __u32 pd_handle)
> +{
> +    res_tbl_dealloc(&dev->pd_tbl, pd_handle);
> +}
> +
> +RmCQ *rm_get_cq(PVRDMADev *dev, __u32 cq_handle)
> +{
> +    return res_tbl_get(&dev->cq_tbl, cq_handle);
> +}
> +
> +int rm_alloc_cq(PVRDMADev *dev, struct pvrdma_cmd_create_cq *cmd,
> +                struct pvrdma_cmd_create_cq_resp *resp)
> +{
> +    int rc = 0;
> +    RmCQ *cq;
> +    PCIDevice *pci_dev = PCI_DEVICE(dev);
> +    __u64 *dir = 0, *tbl = 0;
> +    char ring_name[MAX_RING_NAME_SZ];
> +    u32 cqe;
> +
> +    cq = res_tbl_alloc(&dev->cq_tbl, &resp->cq_handle);
> +    if (!cq) {
> +        return -ENOMEM;
> +    }
> +
> +    memset(cq, 0, sizeof(RmCQ));
> +
> +    memcpy(&cq->init_args, cmd, sizeof(*cmd));
> +    cq->comp_type = CCT_NONE;
> +
> +    /* Get pointer to CQ */
> +    dir = pvrdma_pci_dma_map(pci_dev, cq->init_args.pdir_dma, TARGET_PAGE_SIZE);
> +    if (!dir) {
> +        pr_err("Fail to map to CQ page directory\n");
> +        rc = -ENOMEM;
> +        goto out_free_cq;
> +    }
> +    tbl = pvrdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
> +    if (!tbl) {
> +        pr_err("Fail to map to CQ page table\n");
> +        rc = -ENOMEM;
> +        goto out_free_cq;
> +    }
> +
> +    cq->ring_state = (struct pvrdma_ring *)
> +            pvrdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
> +    if (!cq->ring_state) {
> +        pr_err("Fail to map to CQ header page\n");
> +        rc = -ENOMEM;
> +        goto out_free_cq;
> +    }
> +
> +    sprintf(ring_name, "cq%d", resp->cq_handle);
> +    cqe = MIN(cmd->cqe, dev->dsr_info.dsr->caps.max_cqe);
> +    rc = ring_init(&cq->cq, ring_name, pci_dev, &cq->ring_state[1],
> +                   cqe, sizeof(struct pvrdma_cqe), (dma_addr_t *)&tbl[1],
> +                   cmd->nchunks - 1 /* first page is ring state */);
> +    if (rc != 0) {
> +        pr_err("Fail to initialize CQ ring\n");
> +        rc = -ENOMEM;
> +        goto out_free_ring_state;
> +    }
> +
> +
> +    resp->cqe = cmd->cqe;
> +
> +    goto out;
> +
> +out_free_ring_state:
> +    pvrdma_pci_dma_unmap(pci_dev, cq->ring_state, TARGET_PAGE_SIZE);
> +
> +out_free_cq:
> +    rm_dealloc_cq(dev, resp->cq_handle);
> +
> +out:
> +    if (tbl) {
> +        pvrdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
> +    }
> +    if (dir) {
> +        pvrdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
> +    }
> +
> +    return rc;
> +}
> +
> +void rm_req_notify_cq(PVRDMADev *dev, __u32 cq_handle, u32 flags)
> +{
> +    RmCQ *cq;
> +
> +    pr_dbg("cq_handle=%d, flags=0x%x\n", cq_handle, flags);
> +
> +    cq = rm_get_cq(dev, cq_handle);
> +    if (!cq) {
> +        return;
> +    }
> +
> +    cq->comp_type = (flags & PVRDMA_UAR_CQ_ARM_SOL) ? CCT_SOLICITED :
> +                     CCT_NEXT_COMP;
> +    pr_dbg("comp_type=%d\n", cq->comp_type);
> +}
> +
> +void rm_dealloc_cq(PVRDMADev *dev, __u32 cq_handle)
> +{
> +    PCIDevice *pci_dev = PCI_DEVICE(dev);
> +    RmCQ *cq;
> +
> +    cq = rm_get_cq(dev, cq_handle);
> +    if (!cq) {
> +        return;
> +    }
> +
> +    ring_free(&cq->cq);
> +    pvrdma_pci_dma_unmap(pci_dev, cq->ring_state, TARGET_PAGE_SIZE);
> +    res_tbl_dealloc(&dev->cq_tbl, cq_handle);
> +}
> +
> +int rm_alloc_mr(PVRDMADev *dev, struct pvrdma_cmd_create_mr *cmd,
> +                struct pvrdma_cmd_create_mr_resp *resp)
> +{
> +    RmMR *mr;
> +
> +    mr = res_tbl_alloc(&dev->mr_tbl, &resp->mr_handle);
> +    if (!mr) {
> +        return -ENOMEM;
> +    }
> +
> +    mr->pd_handle = cmd->pd_handle;
> +    resp->lkey = mr->lkey = global_lkey++;
> +    resp->rkey = mr->rkey = global_rkey++;
> +
> +    return 0;
> +}
> +
> +void rm_dealloc_mr(PVRDMADev *dev, __u32 mr_handle)
> +{
> +    res_tbl_dealloc(&dev->mr_tbl, mr_handle);
> +}
> +
> +int rm_alloc_qp(PVRDMADev *dev, struct pvrdma_cmd_create_qp *cmd,
> +                struct pvrdma_cmd_create_qp_resp *resp)
> +{
> +    int rc = 0;
> +    RmQP *qp;
> +    PCIDevice *pci_dev = PCI_DEVICE(dev);
> +    __u64 *dir = 0, *tbl = 0;
> +    int wqe_size;
> +    char ring_name[MAX_RING_NAME_SZ];
> +
> +    if (!rm_get_cq(dev, cmd->send_cq_handle) ||
> +        !rm_get_cq(dev, cmd->recv_cq_handle)) {
> +        pr_err("Invalid send_cqn or recv_cqn (%d, %d)\n",
> +               cmd->send_cq_handle, cmd->recv_cq_handle);
> +        return -EINVAL;
> +    }
> +
> +    qp = res_tbl_alloc(&dev->qp_tbl, &resp->qpn);
> +    if (!qp) {
> +        return -EINVAL;
> +    }
> +
> +    memset(qp, 0, sizeof(RmQP));
> +
> +    memcpy(&qp->init_args, cmd, sizeof(*cmd));
> +
> +    pr_dbg("qp_type=%d\n", qp->init_args.qp_type);
> +    pr_dbg("send_cq_handle=%d\n", qp->init_args.send_cq_handle);
> +    pr_dbg("max_send_sge=%d\n", qp->init_args.max_send_sge);
> +    pr_dbg("recv_cq_handle=%d\n", qp->init_args.recv_cq_handle);
> +    pr_dbg("max_recv_sge=%d\n", qp->init_args.max_recv_sge);
> +    pr_dbg("total_chunks=%d\n", cmd->total_chunks);
> +    pr_dbg("send_chunks=%d\n", cmd->send_chunks);
> +    pr_dbg("recv_chunks=%d\n", cmd->total_chunks - cmd->send_chunks);
> +
> +    qp->qp_state = PVRDMA_QPS_ERR;
> +
> +    /* Get pointer to send & recv rings */
> +    dir = pvrdma_pci_dma_map(pci_dev, qp->init_args.pdir_dma, TARGET_PAGE_SIZE);
> +    if (!dir) {
> +        pr_err("Fail to map to QP page directory\n");
> +        rc = -ENOMEM;
> +        goto out_free_qp;
> +    }
> +    tbl = pvrdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
> +    if (!tbl) {
> +        pr_err("Fail to map to QP page table\n");
> +        rc = -ENOMEM;
> +        goto out_free_qp;
> +    }
> +
> +    /* Send ring */
> +    qp->sq_ring_state = (struct pvrdma_ring *)
> +            pvrdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
> +    if (!qp->sq_ring_state) {
> +        pr_err("Fail to map to QP header page\n");
> +        rc = -ENOMEM;
> +        goto out_free_qp;
> +    }
> +
> +    wqe_size = roundup_pow_of_two(sizeof(struct pvrdma_sq_wqe_hdr) +
> +                                  sizeof(struct pvrdma_sge) *
> +                                  qp->init_args.max_send_sge);
> +    sprintf(ring_name, "qp%d_sq", resp->qpn);
> +    rc = ring_init(&qp->sq, ring_name, pci_dev, qp->sq_ring_state,
> +                   qp->init_args.max_send_wr, wqe_size,
> +                   (dma_addr_t *)&tbl[1], cmd->send_chunks);
> +    if (rc != 0) {
> +        pr_err("Fail to initialize SQ ring\n");
> +        rc = -ENOMEM;
> +        goto out_free_ring_state;
> +    }
> +
> +    /* Recv ring */
> +    qp->rq_ring_state = &qp->sq_ring_state[1];
> +    wqe_size = roundup_pow_of_two(sizeof(struct pvrdma_rq_wqe_hdr) +
> +                                  sizeof(struct pvrdma_sge) *
> +                                  qp->init_args.max_recv_sge);
> +    pr_dbg("wqe_size=%d\n", wqe_size);
> +    pr_dbg("pvrdma_rq_wqe_hdr=%ld\n", sizeof(struct pvrdma_rq_wqe_hdr));
> +    pr_dbg("pvrdma_sge=%ld\n", sizeof(struct pvrdma_sge));
> +    pr_dbg("init_args.max_recv_sge=%d\n", qp->init_args.max_recv_sge);
> +    sprintf(ring_name, "qp%d_rq", resp->qpn);
> +    rc = ring_init(&qp->rq, ring_name, pci_dev, qp->rq_ring_state,
> +                   qp->init_args.max_recv_wr, wqe_size,
> +                   (dma_addr_t *)&tbl[2], cmd->total_chunks -
> +                   cmd->send_chunks - 1 /* first page is ring state */);
> +    if (rc != 0) {
> +        pr_err("Fail to initialize RQ ring\n");
> +        rc = -ENOMEM;
> +        goto out_free_send_ring;
> +    }
> +
> +    resp->max_send_wr = cmd->max_send_wr;
> +    resp->max_recv_wr = cmd->max_recv_wr;
> +    resp->max_send_sge = cmd->max_send_sge;
> +    resp->max_recv_sge = cmd->max_recv_sge;
> +    resp->max_inline_data = cmd->max_inline_data;
> +
> +    goto out;
> +
> +out_free_send_ring:
> +    ring_free(&qp->sq);
> +
> +out_free_ring_state:
> +    pvrdma_pci_dma_unmap(pci_dev, qp->sq_ring_state, TARGET_PAGE_SIZE);
> +
> +out_free_qp:
> +    rm_dealloc_qp(dev, resp->qpn);
> +
> +out:
> +    if (tbl) {
> +        pvrdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
> +    }
> +    if (dir) {
> +        pvrdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
> +    }
> +
> +    return rc;
> +}
> +
> +int rm_modify_qp(PVRDMADev *dev, __u32 qp_handle,
> +                 struct pvrdma_cmd_modify_qp *modify_qp_args)
> +{
> +    RmQP *qp;
> +
> +    pr_dbg("qp_handle=%d\n", qp_handle);
> +    pr_dbg("new_state=%d\n", modify_qp_args->attrs.qp_state);
> +
> +    qp = res_tbl_get(&dev->qp_tbl, qp_handle);
> +    if (!qp) {
> +        return -EINVAL;
> +    }
> +
> +    pr_dbg("qp_type=%d\n", qp->init_args.qp_type);
> +
> +    if (modify_qp_args->attr_mask & PVRDMA_QP_PORT) {
> +        qp->port_num = modify_qp_args->attrs.port_num - 1;
> +    }
> +    if (modify_qp_args->attr_mask & PVRDMA_QP_DEST_QPN) {
> +        qp->dest_qp_num = modify_qp_args->attrs.dest_qp_num;
> +    }
> +    if (modify_qp_args->attr_mask & PVRDMA_QP_AV) {
> +        qp->dgid = modify_qp_args->attrs.ah_attr.grh.dgid;
> +        qp->port_num = modify_qp_args->attrs.ah_attr.port_num - 1;
> +    }
> +    if (modify_qp_args->attr_mask & PVRDMA_QP_STATE) {
> +        qp->qp_state = modify_qp_args->attrs.qp_state;
> +    }
> +
> +    /* kdbr connection */
> +    if (qp->qp_state == PVRDMA_QPS_RTR) {
> +        qp->kdbr_connection_id =
> +            kdbr_open_connection(dev->ports[qp->port_num].kdbr_port,
> +                                 qp_handle, qp->dgid, qp->dest_qp_num,
> +                                 qp->init_args.qp_type == PVRDMA_QPT_RC);
> +        if (qp->kdbr_connection_id == 0) {
> +            return -EIO;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +void rm_dealloc_qp(PVRDMADev *dev, __u32 qp_handle)
> +{
> +    PCIDevice *pci_dev = PCI_DEVICE(dev);
> +    RmQP *qp;
> +
> +    qp = res_tbl_get(&dev->qp_tbl, qp_handle);
> +    if (!qp) {
> +        return;
> +    }
> +
> +    if (qp->kdbr_connection_id) {
> +        kdbr_close_connection(dev->ports[qp->port_num].kdbr_port,
> +                              qp->kdbr_connection_id);
> +    }
> +
> +    ring_free(&qp->rq);
> +    ring_free(&qp->sq);
> +
> +    pvrdma_pci_dma_unmap(pci_dev, qp->sq_ring_state, TARGET_PAGE_SIZE);
> +
> +    res_tbl_dealloc(&dev->qp_tbl, qp_handle);
> +}
> +
> +RmQP *rm_get_qp(PVRDMADev *dev, __u32 qp_handle)
> +{
> +    return res_tbl_get(&dev->qp_tbl, qp_handle);
> +}
> +
> +void *rm_get_wqe_ctx(PVRDMADev *dev, unsigned long wqe_ctx_id)
> +{
> +    void **wqe_ctx;
> +
> +    wqe_ctx = res_tbl_get(&dev->wqe_ctx_tbl, wqe_ctx_id);
> +    if (!wqe_ctx) {
> +        return NULL;
> +    }
> +
> +    pr_dbg("ctx=%p\n", *wqe_ctx);
> +
> +    return *wqe_ctx;
> +}
> +
> +int rm_alloc_wqe_ctx(PVRDMADev *dev, unsigned long *wqe_ctx_id, void *ctx)
> +{
> +    void **wqe_ctx;
> +
> +    wqe_ctx = res_tbl_alloc(&dev->wqe_ctx_tbl, (u32 *)wqe_ctx_id);
> +    if (!wqe_ctx) {
> +        return -ENOMEM;
> +    }
> +
> +    pr_dbg("ctx=%p\n", ctx);
> +    *wqe_ctx = ctx;
> +
> +    return 0;
> +}
> +
> +void rm_dealloc_wqe_ctx(PVRDMADev *dev, unsigned long wqe_ctx_id)
> +{
> +    res_tbl_dealloc(&dev->wqe_ctx_tbl, (u32) wqe_ctx_id);
> +}
> +
> +int rm_init(PVRDMADev *dev)
> +{
> +    int ret = 0;
> +
> +    ret = res_tbl_init("PD", &dev->pd_tbl, MAX_PDS, sizeof(RmPD));
> +    if (ret != 0) {
> +        goto cln_pds;
> +    }
> +
> +    ret = res_tbl_init("CQ", &dev->cq_tbl, MAX_CQS, sizeof(RmCQ));
> +    if (ret != 0) {
> +        goto cln_cqs;
> +    }
> +
> +    ret = res_tbl_init("MR", &dev->mr_tbl, MAX_MRS, sizeof(RmMR));
> +    if (ret != 0) {
> +        goto cln_mrs;
> +    }
> +
> +    ret = res_tbl_init("QP", &dev->qp_tbl, MAX_QPS, sizeof(RmQP));
> +    if (ret != 0) {
> +        goto cln_qps;
> +    }
> +
> +    ret = res_tbl_init("WQE_CTX", &dev->wqe_ctx_tbl, MAX_QPS * MAX_QP_WRS,
> +               sizeof(void *));
> +    if (ret != 0) {
> +        goto cln_wqe_ctxs;
> +    }
> +
> +    goto out;
> +
> +cln_wqe_ctxs:
> +    res_tbl_free(&dev->wqe_ctx_tbl);
> +
> +cln_qps:
> +    res_tbl_free(&dev->qp_tbl);
> +
> +cln_mrs:
> +    res_tbl_free(&dev->mr_tbl);
> +
> +cln_cqs:
> +    res_tbl_free(&dev->cq_tbl);
> +
> +cln_pds:
> +    res_tbl_free(&dev->pd_tbl);
> +
> +out:
> +    if (ret != 0) {
> +        pr_err("Fail to initialize RM\n");
> +    }
> +
> +    return ret;
> +}
> +
> +void rm_fini(PVRDMADev *dev)
> +{
> +    res_tbl_free(&dev->pd_tbl);
> +    res_tbl_free(&dev->cq_tbl);
> +    res_tbl_free(&dev->mr_tbl);
> +    res_tbl_free(&dev->qp_tbl);
> +    res_tbl_free(&dev->wqe_ctx_tbl);
> +}
> diff --git a/hw/net/pvrdma/pvrdma_rm.h b/hw/net/pvrdma/pvrdma_rm.h
> new file mode 100644
> index 0000000..1d42bc7
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_rm.h
> @@ -0,0 +1,214 @@
> +/*
> + * QEMU VMWARE paravirtual RDMA - Resource Manager
> + *
> + * Developed by Oracle & Redhat
> + *
> + * Authors:
> + *     Yuval Shaia <yuval.shaia@oracle.com>
> + *     Marcel Apfelbaum <marcel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef PVRDMA_RM_H
> +#define PVRDMA_RM_H
> +
> +#include <hw/net/pvrdma/pvrdma_dev_api.h>
> +#include <hw/net/pvrdma/pvrdma-uapi.h>
> +#include <hw/net/pvrdma/pvrdma_ring.h>
> +#include <hw/net/pvrdma/kdbr.h>
> +
> +/* TODO: More then 1 port it fails in ib_modify_qp, maybe something with
> + * the MAC of the second port */
> +#define MAX_PORTS        1 /* Driver force to 1 see pvrdma_add_gid */
> +#define MAX_PORT_GIDS    1
> +#define MAX_PORT_PKEYS   1
> +#define MAX_PKEYS        1
> +#define MAX_PDS          2048
> +#define MAX_CQS          2048
> +#define MAX_CQES         1024 /* cqe size is 64 */
> +#define MAX_QPS          1024
> +#define MAX_GIDS         2048
> +#define MAX_QP_WRS       1024 /* wqe size is 128 */
> +#define MAX_SGES         4
> +#define MAX_MRS          2048
> +#define MAX_AH           1024
> +
> +typedef struct PVRDMADev PVRDMADev;
> +typedef struct KdbrPort KdbrPort;
> +
> +#define MAX_RMRESTBL_NAME_SZ 16
> +typedef struct RmResTbl {
> +    char name[MAX_RMRESTBL_NAME_SZ];
> +    unsigned long *bitmap;
> +    size_t tbl_sz;
> +    size_t res_sz;
> +    void *tbl;
> +    QemuMutex lock;
> +} RmResTbl;
> +
> +enum cq_comp_type {
> +    CCT_NONE,
> +    CCT_SOLICITED,
> +    CCT_NEXT_COMP,
> +};
> +
> +typedef struct RmPD {
> +    __u32 ctx_handle;
> +} RmPD;
> +
> +typedef struct RmCQ {
> +    struct pvrdma_cmd_create_cq init_args;
> +    struct pvrdma_ring *ring_state;
> +    Ring cq;
> +    enum cq_comp_type comp_type;
> +} RmCQ;
> +
> +/* MR (DMA region) */
> +typedef struct RmMR {
> +    __u32 pd_handle;
> +    __u32 lkey;
> +    __u32 rkey;
> +} RmMR;
> +
> +typedef struct RmSqWqe {
> +    struct pvrdma_sq_wqe_hdr hdr;
> +    struct pvrdma_sge sge[0];
> +} RmSqWqe;
> +
> +typedef struct RmRqWqe {
> +    struct pvrdma_rq_wqe_hdr hdr;
> +    struct pvrdma_sge sge[0];
> +} RmRqWqe;
> +
> +typedef struct RmQP {
> +    struct pvrdma_cmd_create_qp init_args;
> +    enum pvrdma_qp_state qp_state;
> +    u8 port_num;
> +    u32 dest_qp_num;
> +    union pvrdma_gid dgid;
> +
> +    struct pvrdma_ring *sq_ring_state;
> +    Ring sq;
> +    struct pvrdma_ring *rq_ring_state;
> +    Ring rq;
> +
> +    unsigned long kdbr_connection_id;
> +} RmQP;
> +
> +typedef struct RmPort {
> +    enum pvrdma_port_state state;
> +    union pvrdma_gid gid_tbl[MAX_PORT_GIDS];
> +    /* TODO: Change type */
> +    int *pkey_tbl;
> +    KdbrPort *kdbr_port;
> +} RmPort;
> +
> +static inline int rm_get_max_port_gids(__u32 *max_port_gids)
> +{
> +    *max_port_gids = MAX_PORT_GIDS;
> +    return 0;
> +}
> +
> +static inline int rm_get_max_port_pkeys(__u32 *max_port_pkeys)
> +{
> +    *max_port_pkeys = MAX_PORT_PKEYS;
> +    return 0;
> +}
> +
> +static inline int rm_get_max_pkeys(__u16 *max_pkeys)
> +{
> +    *max_pkeys = MAX_PKEYS;
> +    return 0;
> +}
> +
> +static inline int rm_get_max_cqs(__u32 *max_cqs)
> +{
> +    *max_cqs = MAX_CQS;
> +    return 0;
> +}
> +
> +static inline int rm_get_max_cqes(__u32 *max_cqes)
> +{
> +    *max_cqes = MAX_CQES;
> +    return 0;
> +}
> +
> +static inline int rm_get_max_pds(__u32 *max_pds)
> +{
> +    *max_pds = MAX_PDS;
> +    return 0;
> +}
> +
> +static inline int rm_get_max_qps(__u32 *max_qps)
> +{
> +    *max_qps = MAX_QPS;
> +    return 0;
> +}
> +
> +static inline int rm_get_max_gids(__u32 *max_gids)
> +{
> +    *max_gids = MAX_GIDS;
> +    return 0;
> +}
> +
> +static inline int rm_get_max_qp_wrs(__u32 *max_qp_wrs)
> +{
> +    *max_qp_wrs = MAX_QP_WRS;
> +    return 0;
> +}
> +
> +static inline int rm_get_max_sges(__u32 *max_sges)
> +{
> +    *max_sges = MAX_SGES;
> +    return 0;
> +}
> +
> +static inline int rm_get_max_mrs(__u32 *max_mrs)
> +{
> +    *max_mrs = MAX_MRS;
> +    return 0;
> +}
> +
> +static inline int rm_get_phys_port_cnt(__u8 *phys_port_cnt)
> +{
> +    *phys_port_cnt = MAX_PORTS;
> +    return 0;
> +}
> +
> +static inline int rm_get_max_ah(__u32 *max_ah)
> +{
> +    *max_ah = MAX_AH;
> +    return 0;
> +}
> +
> +int rm_init(PVRDMADev *dev);
> +void rm_fini(PVRDMADev *dev);
> +
> +int rm_alloc_pd(PVRDMADev *dev, __u32 *pd_handle, __u32 ctx_handle);
> +void rm_dealloc_pd(PVRDMADev *dev, __u32 pd_handle);
> +
> +RmCQ *rm_get_cq(PVRDMADev *dev, __u32 cq_handle);
> +int rm_alloc_cq(PVRDMADev *dev, struct pvrdma_cmd_create_cq *cmd,
> +        struct pvrdma_cmd_create_cq_resp *resp);
> +void rm_req_notify_cq(PVRDMADev *dev, __u32 cq_handle, u32 flags);
> +void rm_dealloc_cq(PVRDMADev *dev, __u32 cq_handle);
> +
> +int rm_alloc_mr(PVRDMADev *dev, struct pvrdma_cmd_create_mr *cmd,
> +        struct pvrdma_cmd_create_mr_resp *resp);
> +void rm_dealloc_mr(PVRDMADev *dev, __u32 mr_handle);
> +
> +RmQP *rm_get_qp(PVRDMADev *dev, __u32 qp_handle);
> +int rm_alloc_qp(PVRDMADev *dev, struct pvrdma_cmd_create_qp *cmd,
> +        struct pvrdma_cmd_create_qp_resp *resp);
> +int rm_modify_qp(PVRDMADev *dev, __u32 qp_handle,
> +         struct pvrdma_cmd_modify_qp *modify_qp_args);
> +void rm_dealloc_qp(PVRDMADev *dev, __u32 qp_handle);
> +
> +void *rm_get_wqe_ctx(PVRDMADev *dev, unsigned long wqe_ctx_id);
> +int rm_alloc_wqe_ctx(PVRDMADev *dev, unsigned long *wqe_ctx_id, void *ctx);
> +void rm_dealloc_wqe_ctx(PVRDMADev *dev, unsigned long wqe_ctx_id);
> +
> +#endif
> diff --git a/hw/net/pvrdma/pvrdma_types.h b/hw/net/pvrdma/pvrdma_types.h
> new file mode 100644
> index 0000000..22a7cde
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_types.h
> @@ -0,0 +1,37 @@
> +/*
> + * QEMU VMWARE paravirtual RDMA interface definitions
> + *
> + * Developed by Oracle & Redhat
> + *
> + * Authors:
> + *     Yuval Shaia <yuval.shaia@oracle.com>
> + *     Marcel Apfelbaum <marcel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef PVRDMA_TYPES_H
> +#define PVRDMA_TYPES_H
> +
> +/* TDOD: All defs here should be removed !!! */
> +
> +#include <stdint.h>
> +#include <asm-generic/int-ll64.h>
> +
> +typedef unsigned char uint8_t;
> +typedef uint64_t dma_addr_t;
> +
> +typedef uint8_t        __u8;
> +typedef uint8_t        u8;
> +typedef unsigned short __u16;
> +typedef unsigned short u16;
> +typedef uint64_t       u64;
> +typedef uint32_t       u32;
> +typedef uint32_t       __u32;
> +typedef int32_t       __s32;
> +#define __bitwise
> +typedef __u64 __bitwise __be64;
> +
> +#endif
> diff --git a/hw/net/pvrdma/pvrdma_utils.c b/hw/net/pvrdma/pvrdma_utils.c
> new file mode 100644
> index 0000000..0f420e2
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_utils.c
> @@ -0,0 +1,36 @@
> +#include <qemu/osdep.h>
> +#include <cpu.h>
> +#include <hw/pci/pci.h>
> +#include <hw/net/pvrdma/pvrdma_utils.h>
> +#include <hw/net/pvrdma/pvrdma.h>
> +
> +void pvrdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len)
> +{
> +    pr_dbg("%p\n", buffer);
> +    pci_dma_unmap(dev, buffer, len, DMA_DIRECTION_TO_DEVICE, 0);
> +}
> +
> +void *pvrdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen)
> +{
> +    void *p;
> +    hwaddr len = plen;
> +
> +    if (!addr) {
> +        pr_dbg("addr is NULL\n");
> +        return NULL;
> +    }
> +
> +    p = pci_dma_map(dev, addr, &len, DMA_DIRECTION_TO_DEVICE);
> +    if (!p) {
> +        return NULL;
> +    }
> +
> +    if (len != plen) {
> +        pvrdma_pci_dma_unmap(dev, p, len);
> +        return NULL;
> +    }
> +
> +    pr_dbg("0x%llx -> %p (len=%ld)\n", (long long unsigned int)addr, p, len);
> +
> +    return p;
> +}
> diff --git a/hw/net/pvrdma/pvrdma_utils.h b/hw/net/pvrdma/pvrdma_utils.h
> new file mode 100644
> index 0000000..da01967
> --- /dev/null
> +++ b/hw/net/pvrdma/pvrdma_utils.h
> @@ -0,0 +1,49 @@
> +/*
> + * QEMU VMWARE paravirtual RDMA interface definitions
> + *
> + * Developed by Oracle & Redhat
> + *
> + * Authors:
> + *     Yuval Shaia <yuval.shaia@oracle.com>
> + *     Marcel Apfelbaum <marcel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef PVRDMA_UTILS_H
> +#define PVRDMA_UTILS_H
> +
> +#define pr_info(fmt, ...) \
> +    fprintf(stdout, "%s: %-20s (%3d): " fmt, "pvrdma",  __func__, __LINE__,\
> +           ## __VA_ARGS__)
> +
> +#define pr_err(fmt, ...) \
> +    fprintf(stderr, "%s: Error at %-20s (%3d): " fmt, "pvrdma", __func__, \
> +        __LINE__, ## __VA_ARGS__)
> +
> +#define DEBUG
> +#ifdef DEBUG
> +#define pr_dbg(fmt, ...) \
> +    fprintf(stdout, "%s: %-20s (%3d): " fmt, "pvrdma", __func__, __LINE__,\
> +           ## __VA_ARGS__)
> +#else
> +#define pr_dbg(fmt, ...)
> +#endif
> +
> +static inline int roundup_pow_of_two(int x)
> +{
> +    x--;
> +    x |= (x >> 1);
> +    x |= (x >> 2);
> +    x |= (x >> 4);
> +    x |= (x >> 8);
> +    x |= (x >> 16);
> +    return x + 1;
> +}
> +
> +void pvrdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len);
> +void *pvrdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
> +
> +#endif
> diff --git a/include/hw/pci/pci_ids.h b/include/hw/pci/pci_ids.h
> index d77ca60..a016ad6 100644
> --- a/include/hw/pci/pci_ids.h
> +++ b/include/hw/pci/pci_ids.h
> @@ -167,4 +167,7 @@
>  #define PCI_VENDOR_ID_TEWS               0x1498
>  #define PCI_DEVICE_ID_TEWS_TPCI200       0x30C8
>
> +#define PCI_VENDOR_ID_VMWARE             0x15ad
> +#define PCI_DEVICE_ID_VMWARE_PVRDMA      0x0820
> +
>  #endif
> --
> 2.5.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Doug Ledford 8 years, 10 months ago

On 3/30/17 9:13 AM, Leon Romanovsky wrote:
> On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
>> From: Yuval Shaia <yuval.shaia@oracle.com>
>>
>>  Hi,
>>
>>  General description
>>  ===================
>>  This is a very early RFC of a new RoCE emulated device
>>  that enables guests to use the RDMA stack without having
>>  a real hardware in the host.
>>
>>  The current implementation supports only VM to VM communication
>>  on the same host.
>>  Down the road we plan to make possible to be able to support
>>  inter-machine communication by utilizing physical RoCE devices
>>  or Soft RoCE.
>>
>>  The goals are:
>>  - Reach fast and secure loos-less Inter-VM data exchange.
>>  - Support remote VMs or bare metal machines.
>>  - Allow VMs migration.
>>  - Do not require to pin all VM memory.
>>
>>
>>  Objective
>>  =========
>>  Have a QEMU implementation of the PVRDMA device. We aim to do so without
>>  any change in the PVRDMA guest driver which is already merged into the
>>  upstream kernel.
>>
>>
>>  RFC status
>>  ===========
>>  The project is in early development stages and supports
>>  only basic send/receive operations.
>>
>>  We present it so we can get feedbacks on design,
>>  feature demands and to receive comments from the
>>  community pointing us to the "right" direction.
>
> If to judge by the feedback which you got from RDMA community
> for kernel proposal [1], this community failed to understand:
> 1. Why do you need new module?

In this case, this is a qemu module to allow qemu to provide a virt rdma 
device to guests that is compatible with the device provided by VMWare's 
ESX product.  Right now, the vmware_pvrdma driver works only when the 
guest is running on a VMWare ESX server product, this would change that. 
  Marcel mentioned that they are currently making it compatible because 
that's the easiest/quickest thing to do, but in the future they might 
extend beyond what VMWare's virt rdma driver provides/uses and might 
then need to either modify it to work with their extensions or fork and 
create their own virt client driver.

> 2. Why existing solutions are not enough and can't be extended?

This patch is against the qemu source code, not the kernel.  There is no 
other solution in the qemu source code, so there is no existing solution 
to extend.

> 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
>    communication via virtual NIC?

Eventually they want this to work on real hardware, and to be more or 
less transparent to the guest.  They will need to make it independent of 
the kernel hardware/driver in use.  That means their own virt driver, 
then the virt driver will eventually hook into whatever hardware is 
present on the system, or failing that, fall back to soft RoCE or soft 
iWARP if that ever makes it in the kernel.

>
> Can you please help us to fill this knowledge gap?
>
> [1] http://marc.info/?l=linux-rdma&m=149063626907175&w=2

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Adit Ranadive 8 years, 10 months ago

On Thu Mar 30 2017 13:28:21 GMT-0700 (PDT), Doug Ledford wrote:
> On 3/30/17 9:13 AM, Leon Romanovsky wrote:
> > On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > >
> > >  Hi,
> > >
> > >  General description
> > >  ===================
> > >  This is a very early RFC of a new RoCE emulated device
> > >  that enables guests to use the RDMA stack without having
> > >  a real hardware in the host.
> > >
> > >  The current implementation supports only VM to VM communication
> > >  on the same host.
> > >  Down the road we plan to make possible to be able to support
> > >  inter-machine communication by utilizing physical RoCE devices
> > >  or Soft RoCE.
> > >
> > >  The goals are:
> > >  - Reach fast and secure loos-less Inter-VM data exchange.
> > >  - Support remote VMs or bare metal machines.
> > >  - Allow VMs migration.
> > >  - Do not require to pin all VM memory.
> > >
> > >
> > >  Objective
> > >  =========
> > >  Have a QEMU implementation of the PVRDMA device. We aim to do so without
> > >  any change in the PVRDMA guest driver which is already merged into the
> > >  upstream kernel.
> > >
> > >
> > >  RFC status
> > >  ===========
> > >  The project is in early development stages and supports
> > >  only basic send/receive operations.
> > >
> > >  We present it so we can get feedbacks on design,
> > >  feature demands and to receive comments from the
> > >  community pointing us to the "right" direction.
> >
> > If to judge by the feedback which you got from RDMA community
> > for kernel proposal [1], this community failed to understand:
> > 1. Why do you need new module?
> 
> In this case, this is a qemu module to allow qemu to provide a virt rdma device to guests that is compatible with the device provided by VMWare's ESX product.  Right now, the vmware_pvrdma driver works only when the guest is running on a VMWare ESX server product, this would change that.  Marcel mentioned that they are currently making it compatible because that's the easiest/quickest thing to do, but in the future they might extend beyond what VMWare's virt rdma driver provides/uses and might then need to either modify it to work with their extensions or fork and create their own virt client driver.
> 
> > 2. Why existing solutions are not enough and can't be extended?
> 
> This patch is against the qemu source code, not the kernel.  There is no other solution in the qemu source code, so there is no existing solution to extend.
> 
> > 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
> >    communication via virtual NIC?
> 
> Eventually they want this to work on real hardware, and to be more or less transparent to the guest.  They will need to make it independent of the kernel hardware/driver in use.  That means their own virt driver, then the virt driver will eventually hook into whatever hardware is present on the system, or failing that, fall back to soft RoCE or soft iWARP if that ever makes it in the kernel.
>

Hmm, this looks quite interesting. Though I'm not surprised, the PVRDMA 
device spec is relatively straightforward.
I would have definitely mentioned this (if I knew about it) during my 
OFA workshop talk a couple of days ago :).

Doug's right. I mean basically, this looks like a QEMU version of our PVRDMA
backend.

Thanks,
Adit

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Marcel Apfelbaum 8 years, 10 months ago

On 03/31/2017 02:38 AM, Adit Ranadive wrote:
> On Thu Mar 30 2017 13:28:21 GMT-0700 (PDT), Doug Ledford wrote:
>> On 3/30/17 9:13 AM, Leon Romanovsky wrote:
>>> On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
>>>> From: Yuval Shaia <yuval.shaia@oracle.com>
>>>>
>>>>  Hi,
>>>>
>>>>  General description
>>>>  ===================
>>>>  This is a very early RFC of a new RoCE emulated device
>>>>  that enables guests to use the RDMA stack without having
>>>>  a real hardware in the host.
>>>>
>>>>  The current implementation supports only VM to VM communication
>>>>  on the same host.
>>>>  Down the road we plan to make possible to be able to support
>>>>  inter-machine communication by utilizing physical RoCE devices
>>>>  or Soft RoCE.
>>>>
>>>>  The goals are:
>>>>  - Reach fast and secure loos-less Inter-VM data exchange.
>>>>  - Support remote VMs or bare metal machines.
>>>>  - Allow VMs migration.
>>>>  - Do not require to pin all VM memory.
>>>>
>>>>
>>>>  Objective
>>>>  =========
>>>>  Have a QEMU implementation of the PVRDMA device. We aim to do so without
>>>>  any change in the PVRDMA guest driver which is already merged into the
>>>>  upstream kernel.
>>>>
>>>>
>>>>  RFC status
>>>>  ===========
>>>>  The project is in early development stages and supports
>>>>  only basic send/receive operations.
>>>>
>>>>  We present it so we can get feedbacks on design,
>>>>  feature demands and to receive comments from the
>>>>  community pointing us to the "right" direction.
>>>
>>> If to judge by the feedback which you got from RDMA community
>>> for kernel proposal [1], this community failed to understand:
>>> 1. Why do you need new module?
>>
>> In this case, this is a qemu module to allow qemu to provide a virt rdma device to guests that is compatible with the device provided by VMWare's ESX product.  Right now, the vmware_pvrdma driver works only when the guest is running on a VMWare ESX server product, this would change that.  Marcel mentioned that they are currently making it compatible because that's the easiest/quickest thing to do, but in the future they might extend beyond what VMWare's virt rdma driver provides/uses and might then need to either modify it to work with their extensions or fork and create their own virt client driver.
>>
>>> 2. Why existing solutions are not enough and can't be extended?
>>
>> This patch is against the qemu source code, not the kernel.  There is no other solution in the qemu source code, so there is no existing solution to extend.
>>
>>> 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
>>>    communication via virtual NIC?
>>
>> Eventually they want this to work on real hardware, and to be more or less transparent to the guest.  They will need to make it independent of the kernel hardware/driver in use.  That means their own virt driver, then the virt driver will eventually hook into whatever hardware is present on the system, or failing that, fall back to soft RoCE or soft iWARP if that ever makes it in the kernel.
>>
>

Hi Adit,

> Hmm, this looks quite interesting.

Thanks!!

> Though I'm not surprised, the PVRDMA
> device spec is relatively straightforward.

Indeed, the pvrdma driver is clear and well documented,
which made our development much easier.

> I would have definitely mentioned this (if I knew about it) during my
> OFA workshop talk a couple of days ago :).
>

There is always a next OFA workshop :)

Thanks,
Marcel & Yval

> Doug's right. I mean basically, this looks like a QEMU version of our PVRDMA
> backend.
>
> Thanks,
> Adit
>

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Marcel Apfelbaum 8 years, 10 months ago

On 03/30/2017 11:28 PM, Doug Ledford wrote:
> On 3/30/17 9:13 AM, Leon Romanovsky wrote:
>> On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
>>> From: Yuval Shaia <yuval.shaia@oracle.com>
>>>
>>>  Hi,
>>>
>>>  General description
>>>  ===================
>>>  This is a very early RFC of a new RoCE emulated device
>>>  that enables guests to use the RDMA stack without having
>>>  a real hardware in the host.
>>>
>>>  The current implementation supports only VM to VM communication
>>>  on the same host.
>>>  Down the road we plan to make possible to be able to support
>>>  inter-machine communication by utilizing physical RoCE devices
>>>  or Soft RoCE.
>>>
>>>  The goals are:
>>>  - Reach fast and secure loos-less Inter-VM data exchange.
>>>  - Support remote VMs or bare metal machines.
>>>  - Allow VMs migration.
>>>  - Do not require to pin all VM memory.
>>>
>>>
>>>  Objective
>>>  =========
>>>  Have a QEMU implementation of the PVRDMA device. We aim to do so without
>>>  any change in the PVRDMA guest driver which is already merged into the
>>>  upstream kernel.
>>>
>>>
>>>  RFC status
>>>  ===========
>>>  The project is in early development stages and supports
>>>  only basic send/receive operations.
>>>
>>>  We present it so we can get feedbacks on design,
>>>  feature demands and to receive comments from the
>>>  community pointing us to the "right" direction.
>>
>> If to judge by the feedback which you got from RDMA community
>> for kernel proposal [1], this community failed to understand:
>> 1. Why do you need new module?
>
> In this case, this is a qemu module to allow qemu to provide a virt rdma device to guests that is compatible with the device provided by VMWare's ESX product.  Right now, the vmware_pvrdma driver
> works only when the guest is running on a VMWare ESX server product, this would change that.  Marcel mentioned that they are currently making it compatible because that's the easiest/quickest thing to
> do, but in the future they might extend beyond what VMWare's virt rdma driver provides/uses and might then need to either modify it to work with their extensions or fork and create their own virt
> client driver.
>
>> 2. Why existing solutions are not enough and can't be extended?
>
> This patch is against the qemu source code, not the kernel.  There is no other solution in the qemu source code, so there is no existing solution to extend.
>
>> 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
>>    communication via virtual NIC?
>
> Eventually they want this to work on real hardware, and to be more or less transparent to the guest.  They will need to make it independent of the kernel hardware/driver in use.  That means their own
> virt driver, then the virt driver will eventually hook into whatever hardware is present on the system, or failing that, fall back to soft RoCE or soft iWARP if that ever makes it in the kernel.
>
>

Hi Leon and Doug,
Your feedback is much appreciated!

As Doug mentioned, the RFC is a QEMU implementation of a pvrdma device,
so SoftRoCE can't help here (we are emulating a PCI device).

Regarding the new KDBR module (Kernel Data Bridge), as the name suggests is
a bridge between different VMs or between a VM and a hardware/software device
and does not replace it.

Leon, utilizing the Soft RoCE is definitely part of our roadmap from the start,
we find the project a must since most of our systems don't even have real
RDMA hardware, and the question is how do best integrate with it.

Thanks,
Marcel & Yuval


>>
>> Can you please help us to fill this knowledge gap?
>>
>> [1] http://marc.info/?l=linux-rdma&m=149063626907175&w=2
>

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Leon Romanovsky 8 years, 10 months ago

On Fri, Mar 31, 2017 at 06:45:43PM +0300, Marcel Apfelbaum wrote:
> On 03/30/2017 11:28 PM, Doug Ledford wrote:
> > On 3/30/17 9:13 AM, Leon Romanovsky wrote:
> > > On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> > > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > > >
> > > >  Hi,
> > > >
> > > >  General description
> > > >  ===================
> > > >  This is a very early RFC of a new RoCE emulated device
> > > >  that enables guests to use the RDMA stack without having
> > > >  a real hardware in the host.
> > > >
> > > >  The current implementation supports only VM to VM communication
> > > >  on the same host.
> > > >  Down the road we plan to make possible to be able to support
> > > >  inter-machine communication by utilizing physical RoCE devices
> > > >  or Soft RoCE.
> > > >
> > > >  The goals are:
> > > >  - Reach fast and secure loos-less Inter-VM data exchange.
> > > >  - Support remote VMs or bare metal machines.
> > > >  - Allow VMs migration.
> > > >  - Do not require to pin all VM memory.
> > > >
> > > >
> > > >  Objective
> > > >  =========
> > > >  Have a QEMU implementation of the PVRDMA device. We aim to do so without
> > > >  any change in the PVRDMA guest driver which is already merged into the
> > > >  upstream kernel.
> > > >
> > > >
> > > >  RFC status
> > > >  ===========
> > > >  The project is in early development stages and supports
> > > >  only basic send/receive operations.
> > > >
> > > >  We present it so we can get feedbacks on design,
> > > >  feature demands and to receive comments from the
> > > >  community pointing us to the "right" direction.
> > >
> > > If to judge by the feedback which you got from RDMA community
> > > for kernel proposal [1], this community failed to understand:
> > > 1. Why do you need new module?
> >
> > In this case, this is a qemu module to allow qemu to provide a virt rdma device to guests that is compatible with the device provided by VMWare's ESX product.  Right now, the vmware_pvrdma driver
> > works only when the guest is running on a VMWare ESX server product, this would change that.  Marcel mentioned that they are currently making it compatible because that's the easiest/quickest thing to
> > do, but in the future they might extend beyond what VMWare's virt rdma driver provides/uses and might then need to either modify it to work with their extensions or fork and create their own virt
> > client driver.
> >
> > > 2. Why existing solutions are not enough and can't be extended?
> >
> > This patch is against the qemu source code, not the kernel.  There is no other solution in the qemu source code, so there is no existing solution to extend.
> >
> > > 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
> > >    communication via virtual NIC?
> >
> > Eventually they want this to work on real hardware, and to be more or less transparent to the guest.  They will need to make it independent of the kernel hardware/driver in use.  That means their own
> > virt driver, then the virt driver will eventually hook into whatever hardware is present on the system, or failing that, fall back to soft RoCE or soft iWARP if that ever makes it in the kernel.
> >
> >
>
> Hi Leon and Doug,
> Your feedback is much appreciated!
>
> As Doug mentioned, the RFC is a QEMU implementation of a pvrdma device,
> so SoftRoCE can't help here (we are emulating a PCI device).

I just responded to the latest email, but as you understood from my question,
it was related to your KDBR module.

>
> Regarding the new KDBR module (Kernel Data Bridge), as the name suggests is
> a bridge between different VMs or between a VM and a hardware/software device
> and does not replace it.
>
> Leon, utilizing the Soft RoCE is definitely part of our roadmap from the start,
> we find the project a must since most of our systems don't even have real
> RDMA hardware, and the question is how do best integrate with it.

This is exactly the question, you chose as an implementation path to do
it with new module over char device. I'm not against your approach,
but would like to see the list with pros and cons for over possible
solutions if any. Does it make sense to do special ULP to share the data
between different drivers over shared memory?

Thanks

>
> Thanks,
> Marcel & Yuval
>
>
> > >
> > > Can you please help us to fill this knowledge gap?
> > >
> > > [1] http://marc.info/?l=linux-rdma&m=149063626907175&w=2
> >
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Marcel Apfelbaum 8 years, 10 months ago

On 04/03/2017 09:23 AM, Leon Romanovsky wrote:
> On Fri, Mar 31, 2017 at 06:45:43PM +0300, Marcel Apfelbaum wrote:
>> On 03/30/2017 11:28 PM, Doug Ledford wrote:
>>> On 3/30/17 9:13 AM, Leon Romanovsky wrote:
>>>> On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
>>>>> From: Yuval Shaia <yuval.shaia@oracle.com>
>>>>>
>>>>>  Hi,
>>>>>
>>>>>  General description
>>>>>  ===================
>>>>>  This is a very early RFC of a new RoCE emulated device
>>>>>  that enables guests to use the RDMA stack without having
>>>>>  a real hardware in the host.
>>>>>
>>>>>  The current implementation supports only VM to VM communication
>>>>>  on the same host.
>>>>>  Down the road we plan to make possible to be able to support
>>>>>  inter-machine communication by utilizing physical RoCE devices
>>>>>  or Soft RoCE.
>>>>>
>>>>>  The goals are:
>>>>>  - Reach fast and secure loos-less Inter-VM data exchange.
>>>>>  - Support remote VMs or bare metal machines.
>>>>>  - Allow VMs migration.
>>>>>  - Do not require to pin all VM memory.
>>>>>
>>>>>
>>>>>  Objective
>>>>>  =========
>>>>>  Have a QEMU implementation of the PVRDMA device. We aim to do so without
>>>>>  any change in the PVRDMA guest driver which is already merged into the
>>>>>  upstream kernel.
>>>>>
>>>>>
>>>>>  RFC status
>>>>>  ===========
>>>>>  The project is in early development stages and supports
>>>>>  only basic send/receive operations.
>>>>>
>>>>>  We present it so we can get feedbacks on design,
>>>>>  feature demands and to receive comments from the
>>>>>  community pointing us to the "right" direction.
>>>>
>>>> If to judge by the feedback which you got from RDMA community
>>>> for kernel proposal [1], this community failed to understand:
>>>> 1. Why do you need new module?
>>>
>>> In this case, this is a qemu module to allow qemu to provide a virt rdma device to guests that is compatible with the device provided by VMWare's ESX product.  Right now, the vmware_pvrdma driver
>>> works only when the guest is running on a VMWare ESX server product, this would change that.  Marcel mentioned that they are currently making it compatible because that's the easiest/quickest thing to
>>> do, but in the future they might extend beyond what VMWare's virt rdma driver provides/uses and might then need to either modify it to work with their extensions or fork and create their own virt
>>> client driver.
>>>
>>>> 2. Why existing solutions are not enough and can't be extended?
>>>
>>> This patch is against the qemu source code, not the kernel.  There is no other solution in the qemu source code, so there is no existing solution to extend.
>>>
>>>> 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
>>>>    communication via virtual NIC?
>>>
>>> Eventually they want this to work on real hardware, and to be more or less transparent to the guest.  They will need to make it independent of the kernel hardware/driver in use.  That means their own
>>> virt driver, then the virt driver will eventually hook into whatever hardware is present on the system, or failing that, fall back to soft RoCE or soft iWARP if that ever makes it in the kernel.
>>>
>>>
>>
>> Hi Leon and Doug,
>> Your feedback is much appreciated!
>>
>> As Doug mentioned, the RFC is a QEMU implementation of a pvrdma device,
>> so SoftRoCE can't help here (we are emulating a PCI device).
>
> I just responded to the latest email, but as you understood from my question,
> it was related to your KDBR module.
>
>>
>> Regarding the new KDBR module (Kernel Data Bridge), as the name suggests is
>> a bridge between different VMs or between a VM and a hardware/software device
>> and does not replace it.
>>
>> Leon, utilizing the Soft RoCE is definitely part of our roadmap from the start,
>> we find the project a must since most of our systems don't even have real
>> RDMA hardware, and the question is how do best integrate with it.
>
> This is exactly the question, you chose as an implementation path to do
> it with new module over char device. I'm not against your approach,
> but would like to see the list with pros and cons for over possible
> solutions if any. Does it make sense to do special ULP to share the data
> between different drivers over shared memory?

Hi Leon,

Here are some thoughts regarding the Soft RoCE usage in our project.
We thought about using it as backend for QEMU pvrdma device
we didn't how it will support our requirements.

1. Does Soft RoCE support inter process (VM) fast path ? The KDBR
    removes the need for hw resources, emulated or not, concentrating
    on one copy from a VM to another.

2. We needed to support migration, meaning the PVRDMA device must preserve
    the RDMA resources between different hosts. Our solution includes a clear
    separation between the guest resources namespace and the actual hw/sw device.
    This is why the KDBR is intended to run outside the scope of the SoftRoCE
    so it can open/close hw connections independent from the VM.

3. Our intention is for KDBR to be used in other contexts as well when we need
    inter VM data exchange, e.g. backend for virtio devices. We didn't see how this
    kind of requirement can be implemented inside SoftRoce as we don't see any
    connection between them.

4. We don't want all the VM memory to be pinned since it disable memory-over-commit
    which in turn will make the pvrdma device useless.
    We weren't sure how nice would play Soft RoCE with memory pinning and we wanted
    more control on memory management. It may be a solvable issue, but combined
    with the others lead us to our decision to come up with our kernel bridge (char
    device or not, we went for it since it was the easiest to implement for a POC)


Thanks,
Marcel & Yuval

>
> Thanks
>
>>
>> Thanks,
>> Marcel & Yuval
>>
>>
>>>>
>>>> Can you please help us to fill this knowledge gap?
>>>>
>>>> [1] http://marc.info/?l=linux-rdma&m=149063626907175&w=2
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Jason Gunthorpe 8 years, 10 months ago

On Tue, Apr 04, 2017 at 04:38:40PM +0300, Marcel Apfelbaum wrote:

> Here are some thoughts regarding the Soft RoCE usage in our project.
> We thought about using it as backend for QEMU pvrdma device
> we didn't how it will support our requirements.
> 
> 1. Does Soft RoCE support inter process (VM) fast path ? The KDBR
>    removes the need for hw resources, emulated or not, concentrating
>    on one copy from a VM to another.

I'd rather see someone optimize the loopback path of soft roce than
see KDBR :)

> 3. Our intention is for KDBR to be used in other contexts as well when we need
>    inter VM data exchange, e.g. backend for virtio devices. We didn't see how this
>    kind of requirement can be implemented inside SoftRoce as we don't see any
>    connection between them.

KDBR looks like weak RDMA to me, so it is reasonable question why not
use full RDMA with loopback optimization instead of creating something
unique.

IMHO, it also makes more sense for something like KDBR to live as a
RDMA transport, not as a unique char device, it is obviously very
RDMA-like.

.. and the char dev really can't be used when implementing user space
RDMA, that would just make a big mess..

> 4. We don't want all the VM memory to be pinned since it disable memory-over-commit
>    which in turn will make the pvrdma device useless.
>    We weren't sure how nice would play Soft RoCE with memory pinning and we wanted
>    more control on memory management. It may be a solvable issue, but combined
>    with the others lead us to our decision to come up with our kernel bridge (char

soft roce certainly can be optimized to remove the page pin and always
run in an ODP-like mode.

But obviously if you connect pvrdma to real hardware then the page pin
comes back.

>    device or not, we went for it since it was the easiest to
>    implement for a POC)

I can see why it would be easy to implement, but not sure how this
really improves the kernel..

Jason

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Yuval Shaia 8 years, 10 months ago

On Tue, Apr 04, 2017 at 10:01:55AM -0600, Jason Gunthorpe wrote:
> On Tue, Apr 04, 2017 at 04:38:40PM +0300, Marcel Apfelbaum wrote:
> 
> > Here are some thoughts regarding the Soft RoCE usage in our project.
> > We thought about using it as backend for QEMU pvrdma device
> > we didn't how it will support our requirements.
> > 
> > 1. Does Soft RoCE support inter process (VM) fast path ? The KDBR
> >    removes the need for hw resources, emulated or not, concentrating
> >    on one copy from a VM to another.
> 
> I'd rather see someone optimize the loopback path of soft roce than
> see KDBR :)

Can we assume that the optimized loopback path will be as fast as direct
copy between one VM address space to another VM address space?

> 
> > 3. Our intention is for KDBR to be used in other contexts as well when we need
> >    inter VM data exchange, e.g. backend for virtio devices. We didn't see how this
> >    kind of requirement can be implemented inside SoftRoce as we don't see any
> >    connection between them.
> 
> KDBR looks like weak RDMA to me, so it is reasonable question why not
> use full RDMA with loopback optimization instead of creating something
> unique.

True, KDBR exposes RDMA-like API because it's sole user is currently pvrdma
device.
But, by design it can be expand to support other clients for example virtio
device which might have other attributes, can we expect the same from
SoftRoCE?

> 
> IMHO, it also makes more sense for something like KDBR to live as a
> RDMA transport, not as a unique char device, it is obviously very
> RDMA-like.

Can you elaborate more on this?
What exactly it will solve?
How it will be better than kdbr?

As we see it - kdbr, when will be expand to support peers on external
hosts, will be like a ULP.

> 
> .. and the char dev really can't be used when implementing user space
> RDMA, that would just make a big mess..

The position of kdbr is not to be a layer *between* user space and device -
it is *the device* from point of view of the process.

> 
> > 4. We don't want all the VM memory to be pinned since it disable memory-over-commit
> >    which in turn will make the pvrdma device useless.
> >    We weren't sure how nice would play Soft RoCE with memory pinning and we wanted
> >    more control on memory management. It may be a solvable issue, but combined
> >    with the others lead us to our decision to come up with our kernel bridge (char
> 
> soft roce certainly can be optimized to remove the page pin and always
> run in an ODP-like mode.
> 
> But obviously if you connect pvrdma to real hardware then the page pin
> comes back.

The fact that page pin is not needed with Soft RoCE device but is needed
with real RoCE device is exactly where kdbr can help as it isolates this
fact from user space process.

> 
> >    device or not, we went for it since it was the easiest to
> >    implement for a POC)
> 
> I can see why it would be easy to implement, but not sure how this
> really improves the kernel..

Sorry, we didn't mean "easy" but "simple", and simplest solutions are
always preferred.
IMHO, currently there is no good solution to do data copy between two VMs.

> 
> Jason

Can you comment on the second point - migration? Please note that we need
it to work both with Soft RoCE and with real device.

Marcel & Yuval

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Jason Gunthorpe 8 years, 10 months ago

On Thu, Apr 06, 2017 at 10:42:20PM +0300, Yuval Shaia wrote:

> > I'd rather see someone optimize the loopback path of soft roce than
> > see KDBR :)
> 
> Can we assume that the optimized loopback path will be as fast as direct
> copy between one VM address space to another VM address space?

Well, you'd optimize it until it was a direct memory copy, so I think
that is a reasonable starting assumption.

> > > 3. Our intention is for KDBR to be used in other contexts as well when we need
> > >    inter VM data exchange, e.g. backend for virtio devices. We didn't see how this
> > >    kind of requirement can be implemented inside SoftRoce as we don't see any
> > >    connection between them.
> > 
> > KDBR looks like weak RDMA to me, so it is reasonable question why not
> > use full RDMA with loopback optimization instead of creating something
> > unique.
> 
> True, KDBR exposes RDMA-like API because it's sole user is currently
> pvrdma device.  But, by design it can be expand to support other
> clients for example virtio device which might have other attributes,
> can we expect the same from SoftRoCE?

RDMA handles all sorts of complex virtio-like protocols just
fine. Unclear what 'other attributes' would be. Sounds like over
designing??

> > IMHO, it also makes more sense for something like KDBR to live as a
> > RDMA transport, not as a unique char device, it is obviously very
> > RDMA-like.
> 
> Can you elaborate more on this?
> What exactly it will solve?
> How it will be better than kdbr?

If you are going to do RDMA, then the uAPI for it from the kernel
should be the RDMA subsystem, don't invent unique cdevs that overlap
established kernel functionality without a very, very good reason.

> > .. and the char dev really can't be used when implementing user space
> > RDMA, that would just make a big mess..
> 
> The position of kdbr is not to be a layer *between* user space and device -
> it is *the device* from point of view of the process.

Any RDMA device built on top of kdbr certainly needs to support
/dev/uverbs0 and all the usual RDMA stuff, so again, I fail to see the
point of the special cdev.. Trying to mix /dev/uverbs0 and /dev/kdbr
in your provider would be too goofy and weird.

> > But obviously if you connect pvrdma to real hardware then the page pin
> > comes back.
> 
> The fact that page pin is not needed with Soft RoCE device but is needed
> with real RoCE device is exactly where kdbr can help as it isolates this
> fact from user space process.

I don't see how KDBR helps at all.

To do virtual RDMA you must transfer RDMA objects and commands
unmodified from VM to HV and implement a fairly complicated SW stack
inside the HV.

Once you do that, micro-optimizing for same-machine VM-to-VM copy is
not really such a big deal, IMHO.

The big challenge is keeping the real HW (or softrocee) RDMA objects
in sync with the VM ones and implementing some kind of RDMA-in-RDMA
tunnel to enable migration when using today's HW offload.

I see nothing in kdbr that helps with any of this. All it seems to do
is obfuscate the transfer of RDMA objects and commands to the
hypervisor, and make the transition of a RDMA channel from loopback to
network far, far, more complicated.

> Sorry, we didn't mean "easy" but "simple", and simplest solutions
> are always preferred.  IMHO, currently there is no good solution to
> do data copy between two VMs.

Don't confuse 'simple' with under featured. :)

> Can you comment on the second point - migration? Please note that we need
> it to work both with Soft RoCE and with real device.

I don't see how kdbr helps with migration, you still have to setup the
HW NIC and that needs sharing all the RDMA centric objects from VM to
HV.

Jason

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Leon Romanovsky 8 years, 10 months ago

On Tue, Apr 04, 2017 at 04:38:40PM +0300, Marcel Apfelbaum wrote:
> On 04/03/2017 09:23 AM, Leon Romanovsky wrote:
> > On Fri, Mar 31, 2017 at 06:45:43PM +0300, Marcel Apfelbaum wrote:
> > > On 03/30/2017 11:28 PM, Doug Ledford wrote:
> > > > On 3/30/17 9:13 AM, Leon Romanovsky wrote:
> > > > > On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> > > > > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > > > > >
> > > > > >  Hi,
> > > > > >
> > > > > >  General description
> > > > > >  ===================
> > > > > >  This is a very early RFC of a new RoCE emulated device
> > > > > >  that enables guests to use the RDMA stack without having
> > > > > >  a real hardware in the host.
> > > > > >
> > > > > >  The current implementation supports only VM to VM communication
> > > > > >  on the same host.
> > > > > >  Down the road we plan to make possible to be able to support
> > > > > >  inter-machine communication by utilizing physical RoCE devices
> > > > > >  or Soft RoCE.
> > > > > >
> > > > > >  The goals are:
> > > > > >  - Reach fast and secure loos-less Inter-VM data exchange.
> > > > > >  - Support remote VMs or bare metal machines.
> > > > > >  - Allow VMs migration.
> > > > > >  - Do not require to pin all VM memory.
> > > > > >
> > > > > >
> > > > > >  Objective
> > > > > >  =========
> > > > > >  Have a QEMU implementation of the PVRDMA device. We aim to do so without
> > > > > >  any change in the PVRDMA guest driver which is already merged into the
> > > > > >  upstream kernel.
> > > > > >
> > > > > >
> > > > > >  RFC status
> > > > > >  ===========
> > > > > >  The project is in early development stages and supports
> > > > > >  only basic send/receive operations.
> > > > > >
> > > > > >  We present it so we can get feedbacks on design,
> > > > > >  feature demands and to receive comments from the
> > > > > >  community pointing us to the "right" direction.
> > > > >
> > > > > If to judge by the feedback which you got from RDMA community
> > > > > for kernel proposal [1], this community failed to understand:
> > > > > 1. Why do you need new module?
> > > >
> > > > In this case, this is a qemu module to allow qemu to provide a virt rdma device to guests that is compatible with the device provided by VMWare's ESX product.  Right now, the vmware_pvrdma driver
> > > > works only when the guest is running on a VMWare ESX server product, this would change that.  Marcel mentioned that they are currently making it compatible because that's the easiest/quickest thing to
> > > > do, but in the future they might extend beyond what VMWare's virt rdma driver provides/uses and might then need to either modify it to work with their extensions or fork and create their own virt
> > > > client driver.
> > > >
> > > > > 2. Why existing solutions are not enough and can't be extended?
> > > >
> > > > This patch is against the qemu source code, not the kernel.  There is no other solution in the qemu source code, so there is no existing solution to extend.
> > > >
> > > > > 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
> > > > >    communication via virtual NIC?
> > > >
> > > > Eventually they want this to work on real hardware, and to be more or less transparent to the guest.  They will need to make it independent of the kernel hardware/driver in use.  That means their own
> > > > virt driver, then the virt driver will eventually hook into whatever hardware is present on the system, or failing that, fall back to soft RoCE or soft iWARP if that ever makes it in the kernel.
> > > >
> > > >
> > >
> > > Hi Leon and Doug,
> > > Your feedback is much appreciated!
> > >
> > > As Doug mentioned, the RFC is a QEMU implementation of a pvrdma device,
> > > so SoftRoCE can't help here (we are emulating a PCI device).
> >
> > I just responded to the latest email, but as you understood from my question,
> > it was related to your KDBR module.
> >
> > >
> > > Regarding the new KDBR module (Kernel Data Bridge), as the name suggests is
> > > a bridge between different VMs or between a VM and a hardware/software device
> > > and does not replace it.
> > >
> > > Leon, utilizing the Soft RoCE is definitely part of our roadmap from the start,
> > > we find the project a must since most of our systems don't even have real
> > > RDMA hardware, and the question is how do best integrate with it.
> >
> > This is exactly the question, you chose as an implementation path to do
> > it with new module over char device. I'm not against your approach,
> > but would like to see the list with pros and cons for over possible
> > solutions if any. Does it make sense to do special ULP to share the data
> > between different drivers over shared memory?
>
> Hi Leon,
>
> Here are some thoughts regarding the Soft RoCE usage in our project.
> We thought about using it as backend for QEMU pvrdma device
> we didn't how it will support our requirements.
>
> 1. Does Soft RoCE support inter process (VM) fast path ? The KDBR
>    removes the need for hw resources, emulated or not, concentrating
>    on one copy from a VM to another.
>
> 2. We needed to support migration, meaning the PVRDMA device must preserve
>    the RDMA resources between different hosts. Our solution includes a clear
>    separation between the guest resources namespace and the actual hw/sw device.
>    This is why the KDBR is intended to run outside the scope of the SoftRoCE
>    so it can open/close hw connections independent from the VM.
>
> 3. Our intention is for KDBR to be used in other contexts as well when we need
>    inter VM data exchange, e.g. backend for virtio devices. We didn't see how this
>    kind of requirement can be implemented inside SoftRoce as we don't see any
>    connection between them.
>
> 4. We don't want all the VM memory to be pinned since it disable memory-over-commit
>    which in turn will make the pvrdma device useless.
>    We weren't sure how nice would play Soft RoCE with memory pinning and we wanted
>    more control on memory management. It may be a solvable issue, but combined
>    with the others lead us to our decision to come up with our kernel bridge (char
>    device or not, we went for it since it was the easiest to implement for a POC)

I'm not going to repeat Jason's answer, I'm completely agree with him.

Just add my 2 cents. You didn't answer on my question about other possible
implementations. It can be SoftRoCE loopback optimizations, special ULP,
RDMA transport, virtual driver with multiple VFs and single PF.

>
>
> Thanks,
> Marcel & Yuval
>
> >
> > Thanks
> >
> > >
> > > Thanks,
> > > Marcel & Yuval
> > >
> > >
> > > > >
> > > > > Can you please help us to fill this knowledge gap?
> > > > >
> > > > > [1] http://marc.info/?l=linux-rdma&m=149063626907175&w=2
> > > >
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Yuval Shaia 8 years, 10 months ago

On Tue, Apr 04, 2017 at 08:33:49PM +0300, Leon Romanovsky wrote:
> 
> I'm not going to repeat Jason's answer, I'm completely agree with him.
> 
> Just add my 2 cents. You didn't answer on my question about other possible
> implementations. It can be SoftRoCE loopback optimizations, special ULP,
> RDMA transport, virtual driver with multiple VFs and single PF.

Please see my response to Jason's comments - eventually, when a support for
VM to external host communication will be added - kdbr will become ULP as
well.

Marcel & Yuval

> 
> >

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Jason Gunthorpe 8 years, 10 months ago

On Thu, Apr 06, 2017 at 10:45:54PM +0300, Yuval Shaia wrote:

> > Just add my 2 cents. You didn't answer on my question about other possible
> > implementations. It can be SoftRoCE loopback optimizations, special ULP,
> > RDMA transport, virtual driver with multiple VFs and single PF.
> 
> Please see my response to Jason's comments - eventually, when a support for
> VM to external host communication will be added - kdbr will become ULP as
> well.

So, is KDBR only to be used on the HV side? Ie it never shows up in the VM?

That is even weirder, we certainly do not want to see a kernel RDMA
ULP for any of this - the entire point of RDMA is to let user space
implement their protocols without needing a unique kernel component!!

Jason

Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

Posted by Leon Romanovsky 8 years, 10 months ago

On Thu, Mar 30, 2017 at 03:28:21PM -0500, Doug Ledford wrote:
> On 3/30/17 9:13 AM, Leon Romanovsky wrote:
> > On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > >
> > >  Hi,
> > >
> > >  General description
> > >  ===================
> > >  This is a very early RFC of a new RoCE emulated device
> > >  that enables guests to use the RDMA stack without having
> > >  a real hardware in the host.
> > >
> > >  The current implementation supports only VM to VM communication
> > >  on the same host.
> > >  Down the road we plan to make possible to be able to support
> > >  inter-machine communication by utilizing physical RoCE devices
> > >  or Soft RoCE.
> > >
> > >  The goals are:
> > >  - Reach fast and secure loos-less Inter-VM data exchange.
> > >  - Support remote VMs or bare metal machines.
> > >  - Allow VMs migration.
> > >  - Do not require to pin all VM memory.
> > >
> > >
> > >  Objective
> > >  =========
> > >  Have a QEMU implementation of the PVRDMA device. We aim to do so without
> > >  any change in the PVRDMA guest driver which is already merged into the
> > >  upstream kernel.
> > >
> > >
> > >  RFC status
> > >  ===========
> > >  The project is in early development stages and supports
> > >  only basic send/receive operations.
> > >
> > >  We present it so we can get feedbacks on design,
> > >  feature demands and to receive comments from the
> > >  community pointing us to the "right" direction.
> >
> > If to judge by the feedback which you got from RDMA community
> > for kernel proposal [1], this community failed to understand:
> > 1. Why do you need new module?
>
> In this case, this is a qemu module to allow qemu to provide a virt rdma
> device to guests that is compatible with the device provided by VMWare's ESX
> product.  Right now, the vmware_pvrdma driver works only when the guest is
> running on a VMWare ESX server product, this would change that.  Marcel
> mentioned that they are currently making it compatible because that's the
> easiest/quickest thing to do, but in the future they might extend beyond
> what VMWare's virt rdma driver provides/uses and might then need to either
> modify it to work with their extensions or fork and create their own virt
> client driver.

Doug,

As I mentioned during OFA, I just responded to the latest email, but
targeted my questions for their module. Sorry for not being clear about
it.

>
> > 2. Why existing solutions are not enough and can't be extended?
>
> This patch is against the qemu source code, not the kernel.  There is no
> other solution in the qemu source code, so there is no existing solution to
> extend.
>
> > 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
> >    communication via virtual NIC?
>
> Eventually they want this to work on real hardware, and to be more or less
> transparent to the guest.  They will need to make it independent of the
> kernel hardware/driver in use.  That means their own virt driver, then the
> virt driver will eventually hook into whatever hardware is present on the
> system, or failing that, fall back to soft RoCE or soft iWARP if that ever
> makes it in the kernel.
>
>
> >
> > Can you please help us to fill this knowledge gap?
> >
> > [1] http://marc.info/?l=linux-rdma&m=149063626907175&w=2
>