[PATCH v3] migration/rdma: Fix out of order wrid

Li Zhijian posted 1 patch 2 years, 7 months ago
Test checkpatch passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20210927070703.889012-1-lizhijian@cn.fujitsu.com
Maintainers: "Dr. David Alan Gilbert" <dgilbert@redhat.com>, Juan Quintela <quintela@redhat.com>
There is a newer version of this series
migration/rdma.c | 132 +++++++++++++++++++++++++++++++++++------------
1 file changed, 98 insertions(+), 34 deletions(-)
[PATCH v3] migration/rdma: Fix out of order wrid
Posted by Li Zhijian 2 years, 7 months ago
destination:
../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5902,disable-ticketing -incoming rdma:192.168.22.23:8888
qemu-system-x86_64: -spice streaming-video=filter,port=5902,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated
Please use disable-ticketing=on instead
QEMU 6.0.50 monitor - type 'help' for more information
(qemu) trace-event qemu_rdma_block_for_wrid_miss on
(qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got CONTROL RECV (4000)

source:
../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5901,disable-ticketing -S
qemu-system-x86_64: -spice streaming-video=filter,port=5901,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated
Please use disable-ticketing=on instead
QEMU 6.0.50 monitor - type 'help' for more information
(qemu)
(qemu) trace-event qemu_rdma_block_for_wrid_miss on
(qemu) migrate -d rdma:192.168.22.23:8888
source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
(qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got CONTROL RECV (4000)

NOTE: we use soft RoCE as the rdma device.
[root@iaas-rpma images]# rdma link show rxe_eth0/1
link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0

This migration could not be completed when out of order(OOO) CQ event occurs.
The send queue and receive queue shared a same completion queue, and
qemu_rdma_block_for_wrid() will drop the CQs it's not interested in. But
the dropped CQs by qemu_rdma_block_for_wrid() could be later CQs it wants.
So in this case, qemu_rdma_block_for_wrid() will block forever.

OOO cases will occur in both source side and destination side. And a
forever blocking happens on only SEND and RECV are out of order. OOO between
'WRITE RDMA' and 'RECV' doesn't matter.

below the OOO sequence:
       source                             destination
      rdma_write_one()                   qemu_rdma_registration_handle()
1.    S1: post_recv X                    D1: post_recv Y
2.    wait for recv CQ event X
3.                                       D2: post_send X     ---------------+
4.                                       wait for send CQ send event X (D2) |
5.    recv CQ event X reaches (D2)                                          |
6.  +-S2: post_send Y                                                       |
7.  | wait for send CQ event Y                                              |
8.  |                                    recv CQ event Y (S2) (drop it)     |
9.  +-send CQ event Y reaches (S2)                                          |
10.                                      send CQ event X reaches (D2)  -----+
11.                                      wait recv CQ event Y (dropped by (8))

Although a hardware IB works fine in my a hundred of runs, the IB specification
doesn't guaratee the CQ order in such case.

Here we introduce a independent send completion queue to distinguish
ibv_post_send completion queue from the original mixed completion queue.
It helps us to poll the specific CQE we are really interested in.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
V3: rebase code, and combine 2/2 to 1/2
V2: Introduce send completion queue
---
 migration/rdma.c | 132 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 98 insertions(+), 34 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 5c2d113aa94..bb19a5afe73 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -358,9 +358,11 @@ typedef struct RDMAContext {
     struct ibv_context          *verbs;
     struct rdma_event_channel   *channel;
     struct ibv_qp *qp;                      /* queue pair */
-    struct ibv_comp_channel *comp_channel;  /* completion channel */
+    struct ibv_comp_channel *recv_comp_channel;  /* recv completion channel */
+    struct ibv_comp_channel *send_comp_channel;  /* send completion channel */
     struct ibv_pd *pd;                      /* protection domain */
-    struct ibv_cq *cq;                      /* completion queue */
+    struct ibv_cq *recv_cq;                 /* recvieve completion queue */
+    struct ibv_cq *send_cq;                 /* send completion queue */
 
     /*
      * If a previous write failed (perhaps because of a failed
@@ -1060,19 +1062,32 @@ static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma)
     }
 
     /* create completion channel */
-    rdma->comp_channel = ibv_create_comp_channel(rdma->verbs);
-    if (!rdma->comp_channel) {
+    rdma->recv_comp_channel = ibv_create_comp_channel(rdma->verbs);
+    if (!rdma->recv_comp_channel) {
         error_report("failed to allocate completion channel");
         goto err_alloc_pd_cq;
     }
 
     /*
-     * Completion queue can be filled by both read and write work requests,
-     * so must reflect the sum of both possible queue sizes.
+     * Completion queue can be filled by read work requests.
      */
-    rdma->cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
-            NULL, rdma->comp_channel, 0);
-    if (!rdma->cq) {
+    rdma->recv_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
+                                  NULL, rdma->recv_comp_channel, 0);
+    if (!rdma->recv_cq) {
+        error_report("failed to allocate completion queue");
+        goto err_alloc_pd_cq;
+    }
+
+    /* create send completion channel */
+    rdma->send_comp_channel = ibv_create_comp_channel(rdma->verbs);
+    if (!rdma->send_comp_channel) {
+        error_report("failed to allocate completion channel");
+        goto err_alloc_pd_cq;
+    }
+
+    rdma->send_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
+                                  NULL, rdma->send_comp_channel, 0);
+    if (!rdma->send_cq) {
         error_report("failed to allocate completion queue");
         goto err_alloc_pd_cq;
     }
@@ -1083,11 +1098,19 @@ err_alloc_pd_cq:
     if (rdma->pd) {
         ibv_dealloc_pd(rdma->pd);
     }
-    if (rdma->comp_channel) {
-        ibv_destroy_comp_channel(rdma->comp_channel);
+    if (rdma->recv_comp_channel) {
+        ibv_destroy_comp_channel(rdma->recv_comp_channel);
+    }
+    if (rdma->send_comp_channel) {
+        ibv_destroy_comp_channel(rdma->send_comp_channel);
+    }
+    if (rdma->recv_cq) {
+        ibv_destroy_cq(rdma->recv_cq);
+        rdma->recv_cq = NULL;
     }
     rdma->pd = NULL;
-    rdma->comp_channel = NULL;
+    rdma->recv_comp_channel = NULL;
+    rdma->send_comp_channel = NULL;
     return -1;
 
 }
@@ -1104,8 +1127,8 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
     attr.cap.max_recv_wr = 3;
     attr.cap.max_send_sge = 1;
     attr.cap.max_recv_sge = 1;
-    attr.send_cq = rdma->cq;
-    attr.recv_cq = rdma->cq;
+    attr.send_cq = rdma->send_cq;
+    attr.recv_cq = rdma->recv_cq;
     attr.qp_type = IBV_QPT_RC;
 
     ret = rdma_create_qp(rdma->cm_id, rdma->pd, &attr);
@@ -1421,14 +1444,14 @@ static void qemu_rdma_signal_unregister(RDMAContext *rdma, uint64_t index,
  * (of any kind) has completed.
  * Return the work request ID that completed.
  */
-static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
-                               uint32_t *byte_len)
+static uint64_t qemu_rdma_poll(RDMAContext *rdma, struct ibv_cq *cq,
+                               uint64_t *wr_id_out, uint32_t *byte_len)
 {
     int ret;
     struct ibv_wc wc;
     uint64_t wr_id;
 
-    ret = ibv_poll_cq(rdma->cq, 1, &wc);
+    ret = ibv_poll_cq(cq, 1, &wc);
 
     if (!ret) {
         *wr_id_out = RDMA_WRID_NONE;
@@ -1500,7 +1523,8 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
 /* Wait for activity on the completion channel.
  * Returns 0 on success, none-0 on error.
  */
-static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
+static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
+                                       struct ibv_comp_channel *comp_channel)
 {
     struct rdma_cm_event *cm_event;
     int ret = -1;
@@ -1511,7 +1535,7 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
      */
     if (rdma->migration_started_on_destination &&
         migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE) {
-        yield_until_fd_readable(rdma->comp_channel->fd);
+        yield_until_fd_readable(comp_channel->fd);
     } else {
         /* This is the source side, we're in a separate thread
          * or destination prior to migration_fd_process_incoming()
@@ -1522,7 +1546,7 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
          */
         while (!rdma->error_state  && !rdma->received_error) {
             GPollFD pfds[2];
-            pfds[0].fd = rdma->comp_channel->fd;
+            pfds[0].fd = comp_channel->fd;
             pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
             pfds[0].revents = 0;
 
@@ -1580,6 +1604,17 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
     return rdma->error_state;
 }
 
+static struct ibv_comp_channel *to_channel(RDMAContext *rdma, int wrid)
+{
+    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_comp_channel :
+           rdma->recv_comp_channel;
+}
+
+static struct ibv_cq *to_cq(RDMAContext *rdma, int wrid)
+{
+    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_cq : rdma->recv_cq;
+}
+
 /*
  * Block until the next work request has completed.
  *
@@ -1600,13 +1635,15 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
     struct ibv_cq *cq;
     void *cq_ctx;
     uint64_t wr_id = RDMA_WRID_NONE, wr_id_in;
+    struct ibv_comp_channel *ch = to_channel(rdma, wrid_requested);
+    struct ibv_cq *poll_cq = to_cq(rdma, wrid_requested);
 
-    if (ibv_req_notify_cq(rdma->cq, 0)) {
+    if (ibv_req_notify_cq(poll_cq, 0)) {
         return -1;
     }
     /* poll cq first */
     while (wr_id != wrid_requested) {
-        ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
+        ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
         if (ret < 0) {
             return ret;
         }
@@ -1627,12 +1664,12 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
     }
 
     while (1) {
-        ret = qemu_rdma_wait_comp_channel(rdma);
+        ret = qemu_rdma_wait_comp_channel(rdma, ch);
         if (ret) {
             goto err_block_for_wrid;
         }
 
-        ret = ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx);
+        ret = ibv_get_cq_event(ch, &cq, &cq_ctx);
         if (ret) {
             perror("ibv_get_cq_event");
             goto err_block_for_wrid;
@@ -1646,7 +1683,7 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
         }
 
         while (wr_id != wrid_requested) {
-            ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
+            ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
             if (ret < 0) {
                 goto err_block_for_wrid;
             }
@@ -2362,13 +2399,21 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
         rdma_destroy_qp(rdma->cm_id);
         rdma->qp = NULL;
     }
-    if (rdma->cq) {
-        ibv_destroy_cq(rdma->cq);
-        rdma->cq = NULL;
+    if (rdma->recv_cq) {
+        ibv_destroy_cq(rdma->recv_cq);
+        rdma->recv_cq = NULL;
+    }
+    if (rdma->send_cq) {
+        ibv_destroy_cq(rdma->send_cq);
+        rdma->send_cq = NULL;
     }
-    if (rdma->comp_channel) {
-        ibv_destroy_comp_channel(rdma->comp_channel);
-        rdma->comp_channel = NULL;
+    if (rdma->recv_comp_channel) {
+        ibv_destroy_comp_channel(rdma->recv_comp_channel);
+        rdma->recv_comp_channel = NULL;
+    }
+    if (rdma->send_comp_channel) {
+        ibv_destroy_comp_channel(rdma->send_comp_channel);
+        rdma->send_comp_channel = NULL;
     }
     if (rdma->pd) {
         ibv_dealloc_pd(rdma->pd);
@@ -3040,10 +3085,14 @@ static void qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
 {
     QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
     if (io_read) {
-        aio_set_fd_handler(ctx, rioc->rdmain->comp_channel->fd,
+        aio_set_fd_handler(ctx, rioc->rdmain->recv_comp_channel->fd,
+                           false, io_read, io_write, NULL, opaque);
+        aio_set_fd_handler(ctx, rioc->rdmain->send_comp_channel->fd,
                            false, io_read, io_write, NULL, opaque);
     } else {
-        aio_set_fd_handler(ctx, rioc->rdmaout->comp_channel->fd,
+        aio_set_fd_handler(ctx, rioc->rdmaout->recv_comp_channel->fd,
+                           false, io_read, io_write, NULL, opaque);
+        aio_set_fd_handler(ctx, rioc->rdmaout->send_comp_channel->fd,
                            false, io_read, io_write, NULL, opaque);
     }
 }
@@ -3257,7 +3306,22 @@ static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
      */
     while (1) {
         uint64_t wr_id, wr_id_in;
-        int ret = qemu_rdma_poll(rdma, &wr_id_in, NULL);
+        int ret = qemu_rdma_poll(rdma, rdma->recv_cq, &wr_id_in, NULL);
+        if (ret < 0) {
+            error_report("rdma migration: polling error! %d", ret);
+            goto err;
+        }
+
+        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
+
+        if (wr_id == RDMA_WRID_NONE) {
+            break;
+        }
+    }
+
+    while (1) {
+        uint64_t wr_id, wr_id_in;
+        int ret = qemu_rdma_poll(rdma, rdma->send_cq, &wr_id_in, NULL);
         if (ret < 0) {
             error_report("rdma migration: polling error! %d", ret);
             goto err;
-- 
2.31.1




Re: [PATCH v3] migration/rdma: Fix out of order wrid
Posted by lizhijian@fujitsu.com 2 years, 6 months ago
ping


On 27/09/2021 15:07, Li Zhijian wrote:
> destination:
> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5902,disable-ticketing -incoming rdma:192.168.22.23:8888
> qemu-system-x86_64: -spice streaming-video=filter,port=5902,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated
> Please use disable-ticketing=on instead
> QEMU 6.0.50 monitor - type 'help' for more information
> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
> (qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
> qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got CONTROL RECV (4000)
>
> source:
> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5901,disable-ticketing -S
> qemu-system-x86_64: -spice streaming-video=filter,port=5901,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated
> Please use disable-ticketing=on instead
> QEMU 6.0.50 monitor - type 'help' for more information
> (qemu)
> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
> (qemu) migrate -d rdma:192.168.22.23:8888
> source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
> (qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got CONTROL RECV (4000)
>
> NOTE: we use soft RoCE as the rdma device.
> [root@iaas-rpma images]# rdma link show rxe_eth0/1
> link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0
>
> This migration could not be completed when out of order(OOO) CQ event occurs.
> The send queue and receive queue shared a same completion queue, and
> qemu_rdma_block_for_wrid() will drop the CQs it's not interested in. But
> the dropped CQs by qemu_rdma_block_for_wrid() could be later CQs it wants.
> So in this case, qemu_rdma_block_for_wrid() will block forever.
>
> OOO cases will occur in both source side and destination side. And a
> forever blocking happens on only SEND and RECV are out of order. OOO between
> 'WRITE RDMA' and 'RECV' doesn't matter.
>
> below the OOO sequence:
>         source                             destination
>        rdma_write_one()                   qemu_rdma_registration_handle()
> 1.    S1: post_recv X                    D1: post_recv Y
> 2.    wait for recv CQ event X
> 3.                                       D2: post_send X     ---------------+
> 4.                                       wait for send CQ send event X (D2) |
> 5.    recv CQ event X reaches (D2)                                          |
> 6.  +-S2: post_send Y                                                       |
> 7.  | wait for send CQ event Y                                              |
> 8.  |                                    recv CQ event Y (S2) (drop it)     |
> 9.  +-send CQ event Y reaches (S2)                                          |
> 10.                                      send CQ event X reaches (D2)  -----+
> 11.                                      wait recv CQ event Y (dropped by (8))
>
> Although a hardware IB works fine in my a hundred of runs, the IB specification
> doesn't guaratee the CQ order in such case.
>
> Here we introduce a independent send completion queue to distinguish
> ibv_post_send completion queue from the original mixed completion queue.
> It helps us to poll the specific CQE we are really interested in.
>
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> ---
> V3: rebase code, and combine 2/2 to 1/2
> V2: Introduce send completion queue
> ---
>   migration/rdma.c | 132 +++++++++++++++++++++++++++++++++++------------
>   1 file changed, 98 insertions(+), 34 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 5c2d113aa94..bb19a5afe73 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -358,9 +358,11 @@ typedef struct RDMAContext {
>       struct ibv_context          *verbs;
>       struct rdma_event_channel   *channel;
>       struct ibv_qp *qp;                      /* queue pair */
> -    struct ibv_comp_channel *comp_channel;  /* completion channel */
> +    struct ibv_comp_channel *recv_comp_channel;  /* recv completion channel */
> +    struct ibv_comp_channel *send_comp_channel;  /* send completion channel */
>       struct ibv_pd *pd;                      /* protection domain */
> -    struct ibv_cq *cq;                      /* completion queue */
> +    struct ibv_cq *recv_cq;                 /* recvieve completion queue */
> +    struct ibv_cq *send_cq;                 /* send completion queue */
>   
>       /*
>        * If a previous write failed (perhaps because of a failed
> @@ -1060,19 +1062,32 @@ static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma)
>       }
>   
>       /* create completion channel */
> -    rdma->comp_channel = ibv_create_comp_channel(rdma->verbs);
> -    if (!rdma->comp_channel) {
> +    rdma->recv_comp_channel = ibv_create_comp_channel(rdma->verbs);
> +    if (!rdma->recv_comp_channel) {
>           error_report("failed to allocate completion channel");
>           goto err_alloc_pd_cq;
>       }
>   
>       /*
> -     * Completion queue can be filled by both read and write work requests,
> -     * so must reflect the sum of both possible queue sizes.
> +     * Completion queue can be filled by read work requests.
>        */
> -    rdma->cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
> -            NULL, rdma->comp_channel, 0);
> -    if (!rdma->cq) {
> +    rdma->recv_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
> +                                  NULL, rdma->recv_comp_channel, 0);
> +    if (!rdma->recv_cq) {
> +        error_report("failed to allocate completion queue");
> +        goto err_alloc_pd_cq;
> +    }
> +
> +    /* create send completion channel */
> +    rdma->send_comp_channel = ibv_create_comp_channel(rdma->verbs);
> +    if (!rdma->send_comp_channel) {
> +        error_report("failed to allocate completion channel");
> +        goto err_alloc_pd_cq;
> +    }
> +
> +    rdma->send_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
> +                                  NULL, rdma->send_comp_channel, 0);
> +    if (!rdma->send_cq) {
>           error_report("failed to allocate completion queue");
>           goto err_alloc_pd_cq;
>       }
> @@ -1083,11 +1098,19 @@ err_alloc_pd_cq:
>       if (rdma->pd) {
>           ibv_dealloc_pd(rdma->pd);
>       }
> -    if (rdma->comp_channel) {
> -        ibv_destroy_comp_channel(rdma->comp_channel);
> +    if (rdma->recv_comp_channel) {
> +        ibv_destroy_comp_channel(rdma->recv_comp_channel);
> +    }
> +    if (rdma->send_comp_channel) {
> +        ibv_destroy_comp_channel(rdma->send_comp_channel);
> +    }
> +    if (rdma->recv_cq) {
> +        ibv_destroy_cq(rdma->recv_cq);
> +        rdma->recv_cq = NULL;
>       }
>       rdma->pd = NULL;
> -    rdma->comp_channel = NULL;
> +    rdma->recv_comp_channel = NULL;
> +    rdma->send_comp_channel = NULL;
>       return -1;
>   
>   }
> @@ -1104,8 +1127,8 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
>       attr.cap.max_recv_wr = 3;
>       attr.cap.max_send_sge = 1;
>       attr.cap.max_recv_sge = 1;
> -    attr.send_cq = rdma->cq;
> -    attr.recv_cq = rdma->cq;
> +    attr.send_cq = rdma->send_cq;
> +    attr.recv_cq = rdma->recv_cq;
>       attr.qp_type = IBV_QPT_RC;
>   
>       ret = rdma_create_qp(rdma->cm_id, rdma->pd, &attr);
> @@ -1421,14 +1444,14 @@ static void qemu_rdma_signal_unregister(RDMAContext *rdma, uint64_t index,
>    * (of any kind) has completed.
>    * Return the work request ID that completed.
>    */
> -static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
> -                               uint32_t *byte_len)
> +static uint64_t qemu_rdma_poll(RDMAContext *rdma, struct ibv_cq *cq,
> +                               uint64_t *wr_id_out, uint32_t *byte_len)
>   {
>       int ret;
>       struct ibv_wc wc;
>       uint64_t wr_id;
>   
> -    ret = ibv_poll_cq(rdma->cq, 1, &wc);
> +    ret = ibv_poll_cq(cq, 1, &wc);
>   
>       if (!ret) {
>           *wr_id_out = RDMA_WRID_NONE;
> @@ -1500,7 +1523,8 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
>   /* Wait for activity on the completion channel.
>    * Returns 0 on success, none-0 on error.
>    */
> -static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
> +static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
> +                                       struct ibv_comp_channel *comp_channel)
>   {
>       struct rdma_cm_event *cm_event;
>       int ret = -1;
> @@ -1511,7 +1535,7 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>        */
>       if (rdma->migration_started_on_destination &&
>           migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE) {
> -        yield_until_fd_readable(rdma->comp_channel->fd);
> +        yield_until_fd_readable(comp_channel->fd);
>       } else {
>           /* This is the source side, we're in a separate thread
>            * or destination prior to migration_fd_process_incoming()
> @@ -1522,7 +1546,7 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>            */
>           while (!rdma->error_state  && !rdma->received_error) {
>               GPollFD pfds[2];
> -            pfds[0].fd = rdma->comp_channel->fd;
> +            pfds[0].fd = comp_channel->fd;
>               pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>               pfds[0].revents = 0;
>   
> @@ -1580,6 +1604,17 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>       return rdma->error_state;
>   }
>   
> +static struct ibv_comp_channel *to_channel(RDMAContext *rdma, int wrid)
> +{
> +    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_comp_channel :
> +           rdma->recv_comp_channel;
> +}
> +
> +static struct ibv_cq *to_cq(RDMAContext *rdma, int wrid)
> +{
> +    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_cq : rdma->recv_cq;
> +}
> +
>   /*
>    * Block until the next work request has completed.
>    *
> @@ -1600,13 +1635,15 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
>       struct ibv_cq *cq;
>       void *cq_ctx;
>       uint64_t wr_id = RDMA_WRID_NONE, wr_id_in;
> +    struct ibv_comp_channel *ch = to_channel(rdma, wrid_requested);
> +    struct ibv_cq *poll_cq = to_cq(rdma, wrid_requested);
>   
> -    if (ibv_req_notify_cq(rdma->cq, 0)) {
> +    if (ibv_req_notify_cq(poll_cq, 0)) {
>           return -1;
>       }
>       /* poll cq first */
>       while (wr_id != wrid_requested) {
> -        ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
> +        ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
>           if (ret < 0) {
>               return ret;
>           }
> @@ -1627,12 +1664,12 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
>       }
>   
>       while (1) {
> -        ret = qemu_rdma_wait_comp_channel(rdma);
> +        ret = qemu_rdma_wait_comp_channel(rdma, ch);
>           if (ret) {
>               goto err_block_for_wrid;
>           }
>   
> -        ret = ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx);
> +        ret = ibv_get_cq_event(ch, &cq, &cq_ctx);
>           if (ret) {
>               perror("ibv_get_cq_event");
>               goto err_block_for_wrid;
> @@ -1646,7 +1683,7 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
>           }
>   
>           while (wr_id != wrid_requested) {
> -            ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
> +            ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
>               if (ret < 0) {
>                   goto err_block_for_wrid;
>               }
> @@ -2362,13 +2399,21 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>           rdma_destroy_qp(rdma->cm_id);
>           rdma->qp = NULL;
>       }
> -    if (rdma->cq) {
> -        ibv_destroy_cq(rdma->cq);
> -        rdma->cq = NULL;
> +    if (rdma->recv_cq) {
> +        ibv_destroy_cq(rdma->recv_cq);
> +        rdma->recv_cq = NULL;
> +    }
> +    if (rdma->send_cq) {
> +        ibv_destroy_cq(rdma->send_cq);
> +        rdma->send_cq = NULL;
>       }
> -    if (rdma->comp_channel) {
> -        ibv_destroy_comp_channel(rdma->comp_channel);
> -        rdma->comp_channel = NULL;
> +    if (rdma->recv_comp_channel) {
> +        ibv_destroy_comp_channel(rdma->recv_comp_channel);
> +        rdma->recv_comp_channel = NULL;
> +    }
> +    if (rdma->send_comp_channel) {
> +        ibv_destroy_comp_channel(rdma->send_comp_channel);
> +        rdma->send_comp_channel = NULL;
>       }
>       if (rdma->pd) {
>           ibv_dealloc_pd(rdma->pd);
> @@ -3040,10 +3085,14 @@ static void qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
>   {
>       QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>       if (io_read) {
> -        aio_set_fd_handler(ctx, rioc->rdmain->comp_channel->fd,
> +        aio_set_fd_handler(ctx, rioc->rdmain->recv_comp_channel->fd,
> +                           false, io_read, io_write, NULL, opaque);
> +        aio_set_fd_handler(ctx, rioc->rdmain->send_comp_channel->fd,
>                              false, io_read, io_write, NULL, opaque);
>       } else {
> -        aio_set_fd_handler(ctx, rioc->rdmaout->comp_channel->fd,
> +        aio_set_fd_handler(ctx, rioc->rdmaout->recv_comp_channel->fd,
> +                           false, io_read, io_write, NULL, opaque);
> +        aio_set_fd_handler(ctx, rioc->rdmaout->send_comp_channel->fd,
>                              false, io_read, io_write, NULL, opaque);
>       }
>   }
> @@ -3257,7 +3306,22 @@ static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
>        */
>       while (1) {
>           uint64_t wr_id, wr_id_in;
> -        int ret = qemu_rdma_poll(rdma, &wr_id_in, NULL);
> +        int ret = qemu_rdma_poll(rdma, rdma->recv_cq, &wr_id_in, NULL);
> +        if (ret < 0) {
> +            error_report("rdma migration: polling error! %d", ret);
> +            goto err;
> +        }
> +
> +        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
> +
> +        if (wr_id == RDMA_WRID_NONE) {
> +            break;
> +        }
> +    }
> +
> +    while (1) {
> +        uint64_t wr_id, wr_id_in;
> +        int ret = qemu_rdma_poll(rdma, rdma->send_cq, &wr_id_in, NULL);
>           if (ret < 0) {
>               error_report("rdma migration: polling error! %d", ret);
>               goto err;
Re: [PATCH v3] migration/rdma: Fix out of order wrid
Posted by lizhijian@fujitsu.com 2 years, 6 months ago
ping again


On 18/10/2021 18:18, Li, Zhijian/李 智坚 wrote:
> ping
>
>
> On 27/09/2021 15:07, Li Zhijian wrote:
>> destination:
>> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5902,disable-ticketing -incoming rdma:192.168.22.23:8888
>> qemu-system-x86_64: -spice streaming-video=filter,port=5902,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated
>> Please use disable-ticketing=on instead
>> QEMU 6.0.50 monitor - type 'help' for more information
>> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
>> (qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
>> qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got CONTROL RECV (4000)
>>
>> source:
>> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5901,disable-ticketing -S
>> qemu-system-x86_64: -spice streaming-video=filter,port=5901,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated
>> Please use disable-ticketing=on instead
>> QEMU 6.0.50 monitor - type 'help' for more information
>> (qemu)
>> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
>> (qemu) migrate -d rdma:192.168.22.23:8888
>> source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
>> (qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got CONTROL RECV (4000)
>>
>> NOTE: we use soft RoCE as the rdma device.
>> [root@iaas-rpma images]# rdma link show rxe_eth0/1
>> link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0
>>
>> This migration could not be completed when out of order(OOO) CQ event occurs.
>> The send queue and receive queue shared a same completion queue, and
>> qemu_rdma_block_for_wrid() will drop the CQs it's not interested in. But
>> the dropped CQs by qemu_rdma_block_for_wrid() could be later CQs it wants.
>> So in this case, qemu_rdma_block_for_wrid() will block forever.
>>
>> OOO cases will occur in both source side and destination side. And a
>> forever blocking happens on only SEND and RECV are out of order. OOO between
>> 'WRITE RDMA' and 'RECV' doesn't matter.
>>
>> below the OOO sequence:
>>          source                             destination
>>         rdma_write_one()                   qemu_rdma_registration_handle()
>> 1.    S1: post_recv X                    D1: post_recv Y
>> 2.    wait for recv CQ event X
>> 3.                                       D2: post_send X     ---------------+
>> 4.                                       wait for send CQ send event X (D2) |
>> 5.    recv CQ event X reaches (D2)                                          |
>> 6.  +-S2: post_send Y                                                       |
>> 7.  | wait for send CQ event Y                                              |
>> 8.  |                                    recv CQ event Y (S2) (drop it)     |
>> 9.  +-send CQ event Y reaches (S2)                                          |
>> 10.                                      send CQ event X reaches (D2)  -----+
>> 11.                                      wait recv CQ event Y (dropped by (8))
>>
>> Although a hardware IB works fine in my a hundred of runs, the IB specification
>> doesn't guaratee the CQ order in such case.
>>
>> Here we introduce a independent send completion queue to distinguish
>> ibv_post_send completion queue from the original mixed completion queue.
>> It helps us to poll the specific CQE we are really interested in.
>>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>> V3: rebase code, and combine 2/2 to 1/2
>> V2: Introduce send completion queue
>> ---
>>    migration/rdma.c | 132 +++++++++++++++++++++++++++++++++++------------
>>    1 file changed, 98 insertions(+), 34 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 5c2d113aa94..bb19a5afe73 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -358,9 +358,11 @@ typedef struct RDMAContext {
>>        struct ibv_context          *verbs;
>>        struct rdma_event_channel   *channel;
>>        struct ibv_qp *qp;                      /* queue pair */
>> -    struct ibv_comp_channel *comp_channel;  /* completion channel */
>> +    struct ibv_comp_channel *recv_comp_channel;  /* recv completion channel */
>> +    struct ibv_comp_channel *send_comp_channel;  /* send completion channel */
>>        struct ibv_pd *pd;                      /* protection domain */
>> -    struct ibv_cq *cq;                      /* completion queue */
>> +    struct ibv_cq *recv_cq;                 /* recvieve completion queue */
>> +    struct ibv_cq *send_cq;                 /* send completion queue */
>>    
>>        /*
>>         * If a previous write failed (perhaps because of a failed
>> @@ -1060,19 +1062,32 @@ static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma)
>>        }
>>    
>>        /* create completion channel */
>> -    rdma->comp_channel = ibv_create_comp_channel(rdma->verbs);
>> -    if (!rdma->comp_channel) {
>> +    rdma->recv_comp_channel = ibv_create_comp_channel(rdma->verbs);
>> +    if (!rdma->recv_comp_channel) {
>>            error_report("failed to allocate completion channel");
>>            goto err_alloc_pd_cq;
>>        }
>>    
>>        /*
>> -     * Completion queue can be filled by both read and write work requests,
>> -     * so must reflect the sum of both possible queue sizes.
>> +     * Completion queue can be filled by read work requests.
>>         */
>> -    rdma->cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
>> -            NULL, rdma->comp_channel, 0);
>> -    if (!rdma->cq) {
>> +    rdma->recv_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
>> +                                  NULL, rdma->recv_comp_channel, 0);
>> +    if (!rdma->recv_cq) {
>> +        error_report("failed to allocate completion queue");
>> +        goto err_alloc_pd_cq;
>> +    }
>> +
>> +    /* create send completion channel */
>> +    rdma->send_comp_channel = ibv_create_comp_channel(rdma->verbs);
>> +    if (!rdma->send_comp_channel) {
>> +        error_report("failed to allocate completion channel");
>> +        goto err_alloc_pd_cq;
>> +    }
>> +
>> +    rdma->send_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
>> +                                  NULL, rdma->send_comp_channel, 0);
>> +    if (!rdma->send_cq) {
>>            error_report("failed to allocate completion queue");
>>            goto err_alloc_pd_cq;
>>        }
>> @@ -1083,11 +1098,19 @@ err_alloc_pd_cq:
>>        if (rdma->pd) {
>>            ibv_dealloc_pd(rdma->pd);
>>        }
>> -    if (rdma->comp_channel) {
>> -        ibv_destroy_comp_channel(rdma->comp_channel);
>> +    if (rdma->recv_comp_channel) {
>> +        ibv_destroy_comp_channel(rdma->recv_comp_channel);
>> +    }
>> +    if (rdma->send_comp_channel) {
>> +        ibv_destroy_comp_channel(rdma->send_comp_channel);
>> +    }
>> +    if (rdma->recv_cq) {
>> +        ibv_destroy_cq(rdma->recv_cq);
>> +        rdma->recv_cq = NULL;
>>        }
>>        rdma->pd = NULL;
>> -    rdma->comp_channel = NULL;
>> +    rdma->recv_comp_channel = NULL;
>> +    rdma->send_comp_channel = NULL;
>>        return -1;
>>    
>>    }
>> @@ -1104,8 +1127,8 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
>>        attr.cap.max_recv_wr = 3;
>>        attr.cap.max_send_sge = 1;
>>        attr.cap.max_recv_sge = 1;
>> -    attr.send_cq = rdma->cq;
>> -    attr.recv_cq = rdma->cq;
>> +    attr.send_cq = rdma->send_cq;
>> +    attr.recv_cq = rdma->recv_cq;
>>        attr.qp_type = IBV_QPT_RC;
>>    
>>        ret = rdma_create_qp(rdma->cm_id, rdma->pd, &attr);
>> @@ -1421,14 +1444,14 @@ static void qemu_rdma_signal_unregister(RDMAContext *rdma, uint64_t index,
>>     * (of any kind) has completed.
>>     * Return the work request ID that completed.
>>     */
>> -static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
>> -                               uint32_t *byte_len)
>> +static uint64_t qemu_rdma_poll(RDMAContext *rdma, struct ibv_cq *cq,
>> +                               uint64_t *wr_id_out, uint32_t *byte_len)
>>    {
>>        int ret;
>>        struct ibv_wc wc;
>>        uint64_t wr_id;
>>    
>> -    ret = ibv_poll_cq(rdma->cq, 1, &wc);
>> +    ret = ibv_poll_cq(cq, 1, &wc);
>>    
>>        if (!ret) {
>>            *wr_id_out = RDMA_WRID_NONE;
>> @@ -1500,7 +1523,8 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
>>    /* Wait for activity on the completion channel.
>>     * Returns 0 on success, none-0 on error.
>>     */
>> -static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>> +static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
>> +                                       struct ibv_comp_channel *comp_channel)
>>    {
>>        struct rdma_cm_event *cm_event;
>>        int ret = -1;
>> @@ -1511,7 +1535,7 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>>         */
>>        if (rdma->migration_started_on_destination &&
>>            migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE) {
>> -        yield_until_fd_readable(rdma->comp_channel->fd);
>> +        yield_until_fd_readable(comp_channel->fd);
>>        } else {
>>            /* This is the source side, we're in a separate thread
>>             * or destination prior to migration_fd_process_incoming()
>> @@ -1522,7 +1546,7 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>>             */
>>            while (!rdma->error_state  && !rdma->received_error) {
>>                GPollFD pfds[2];
>> -            pfds[0].fd = rdma->comp_channel->fd;
>> +            pfds[0].fd = comp_channel->fd;
>>                pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>>                pfds[0].revents = 0;
>>    
>> @@ -1580,6 +1604,17 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>>        return rdma->error_state;
>>    }
>>    
>> +static struct ibv_comp_channel *to_channel(RDMAContext *rdma, int wrid)
>> +{
>> +    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_comp_channel :
>> +           rdma->recv_comp_channel;
>> +}
>> +
>> +static struct ibv_cq *to_cq(RDMAContext *rdma, int wrid)
>> +{
>> +    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_cq : rdma->recv_cq;
>> +}
>> +
>>    /*
>>     * Block until the next work request has completed.
>>     *
>> @@ -1600,13 +1635,15 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
>>        struct ibv_cq *cq;
>>        void *cq_ctx;
>>        uint64_t wr_id = RDMA_WRID_NONE, wr_id_in;
>> +    struct ibv_comp_channel *ch = to_channel(rdma, wrid_requested);
>> +    struct ibv_cq *poll_cq = to_cq(rdma, wrid_requested);
>>    
>> -    if (ibv_req_notify_cq(rdma->cq, 0)) {
>> +    if (ibv_req_notify_cq(poll_cq, 0)) {
>>            return -1;
>>        }
>>        /* poll cq first */
>>        while (wr_id != wrid_requested) {
>> -        ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
>> +        ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
>>            if (ret < 0) {
>>                return ret;
>>            }
>> @@ -1627,12 +1664,12 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
>>        }
>>    
>>        while (1) {
>> -        ret = qemu_rdma_wait_comp_channel(rdma);
>> +        ret = qemu_rdma_wait_comp_channel(rdma, ch);
>>            if (ret) {
>>                goto err_block_for_wrid;
>>            }
>>    
>> -        ret = ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx);
>> +        ret = ibv_get_cq_event(ch, &cq, &cq_ctx);
>>            if (ret) {
>>                perror("ibv_get_cq_event");
>>                goto err_block_for_wrid;
>> @@ -1646,7 +1683,7 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
>>            }
>>    
>>            while (wr_id != wrid_requested) {
>> -            ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
>> +            ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
>>                if (ret < 0) {
>>                    goto err_block_for_wrid;
>>                }
>> @@ -2362,13 +2399,21 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>>            rdma_destroy_qp(rdma->cm_id);
>>            rdma->qp = NULL;
>>        }
>> -    if (rdma->cq) {
>> -        ibv_destroy_cq(rdma->cq);
>> -        rdma->cq = NULL;
>> +    if (rdma->recv_cq) {
>> +        ibv_destroy_cq(rdma->recv_cq);
>> +        rdma->recv_cq = NULL;
>> +    }
>> +    if (rdma->send_cq) {
>> +        ibv_destroy_cq(rdma->send_cq);
>> +        rdma->send_cq = NULL;
>>        }
>> -    if (rdma->comp_channel) {
>> -        ibv_destroy_comp_channel(rdma->comp_channel);
>> -        rdma->comp_channel = NULL;
>> +    if (rdma->recv_comp_channel) {
>> +        ibv_destroy_comp_channel(rdma->recv_comp_channel);
>> +        rdma->recv_comp_channel = NULL;
>> +    }
>> +    if (rdma->send_comp_channel) {
>> +        ibv_destroy_comp_channel(rdma->send_comp_channel);
>> +        rdma->send_comp_channel = NULL;
>>        }
>>        if (rdma->pd) {
>>            ibv_dealloc_pd(rdma->pd);
>> @@ -3040,10 +3085,14 @@ static void qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
>>    {
>>        QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>        if (io_read) {
>> -        aio_set_fd_handler(ctx, rioc->rdmain->comp_channel->fd,
>> +        aio_set_fd_handler(ctx, rioc->rdmain->recv_comp_channel->fd,
>> +                           false, io_read, io_write, NULL, opaque);
>> +        aio_set_fd_handler(ctx, rioc->rdmain->send_comp_channel->fd,
>>                               false, io_read, io_write, NULL, opaque);
>>        } else {
>> -        aio_set_fd_handler(ctx, rioc->rdmaout->comp_channel->fd,
>> +        aio_set_fd_handler(ctx, rioc->rdmaout->recv_comp_channel->fd,
>> +                           false, io_read, io_write, NULL, opaque);
>> +        aio_set_fd_handler(ctx, rioc->rdmaout->send_comp_channel->fd,
>>                               false, io_read, io_write, NULL, opaque);
>>        }
>>    }
>> @@ -3257,7 +3306,22 @@ static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
>>         */
>>        while (1) {
>>            uint64_t wr_id, wr_id_in;
>> -        int ret = qemu_rdma_poll(rdma, &wr_id_in, NULL);
>> +        int ret = qemu_rdma_poll(rdma, rdma->recv_cq, &wr_id_in, NULL);
>> +        if (ret < 0) {
>> +            error_report("rdma migration: polling error! %d", ret);
>> +            goto err;
>> +        }
>> +
>> +        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
>> +
>> +        if (wr_id == RDMA_WRID_NONE) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    while (1) {
>> +        uint64_t wr_id, wr_id_in;
>> +        int ret = qemu_rdma_poll(rdma, rdma->send_cq, &wr_id_in, NULL);
>>            if (ret < 0) {
>>                error_report("rdma migration: polling error! %d", ret);
>>                goto err;