From: Marcel Apfelbaum
To: qemu-devel@nongnu.org
Date: Wed, 17 Jan 2018 11:54:19 +0200
Message-Id:
 <20180117095421.124787-3-marcel@redhat.com>
In-Reply-To: <20180117095421.124787-1-marcel@redhat.com>
References: <20180117095421.124787-1-marcel@redhat.com>
Subject: [Qemu-devel] [PATCH V8 2/4] docs: add pvrdma device documentation.
Cc: ehabkost@redhat.com, mst@redhat.com, cohuck@redhat.com, f4bug@amsat.org, yuval.shaia@oracle.com, borntraeger@de.ibm.com, pbonzini@redhat.com, marcel@redhat.com, imammedo@redhat.com

Signed-off-by: Marcel Apfelbaum
Signed-off-by: Yuval Shaia
Reviewed-by: Shamir Rabinovitch
---
 docs/pvrdma.txt | 255 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 255 insertions(+)
 create mode 100644 docs/pvrdma.txt

diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
new file mode 100644
index 0000000000..5599318159
--- /dev/null
+++ b/docs/pvrdma.txt
@@ -0,0 +1,255 @@
+Paravirtualized RDMA Device (PVRDMA)
+====================================
+
+
+1. Description
+===============
+PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
+It works with its Linux kernel driver as is; no special guest
+modifications are needed.
+
+While it complies with the VMware device, it can also communicate with bare
+metal RDMA-enabled machines and does not require an RDMA HCA in the host; it
+can work with Soft-RoCE (rxe).
+
+It does not require the whole guest RAM to be pinned, which allows memory
+over-commit. Migration is not implemented yet, but will be possible with
+some HW assistance.
+
+A project presentation accompanies this document:
+- http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf
+
+
+
+2. Setup
+========
+
+
+2.1 Guest setup
+===============
+Fedora 27+ kernels work out of the box; older distributions
+require updating the kernel to 4.14 to include the pvrdma driver.
+
+However, the libpvrdma library needed by user-level software is still
+not available as part of the distributions, so the rdma-core library
+needs to be compiled and optionally installed.
+
+Please follow the instructions at:
+  https://github.com/linux-rdma/rdma-core.git
+
+
+2.2 Host Setup
+==============
+The pvrdma backend is an ibdevice interface that can be exposed
+either by a Soft-RoCE (rxe) device on machines with no RDMA device,
+or by an HCA SRIOV function (VF/PF).
+Note that ibdevice interfaces can't be shared between pvrdma devices;
+each one requires a separate instance (rxe or SRIOV VF).
+
+
+2.2.1 Soft-RoCE backend(rxe)
+===========================
+A stable version of rxe is required; Fedora 27+ or a Linux
+Kernel 4.14+ is preferred.
+
+The rdma_rxe module is part of the Linux Kernel but not loaded by default.
+Install the User Level library (librxe) following the instructions from:
+https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home
+
+Associate an ETH interface with rxe by running:
+   rxe_cfg add eth0
+An rxe0 ibdevice interface will be created and can be used as pvrdma backend.
+
+
+2.2.2 RDMA device Virtual Function backend
+==========================================
+Nothing special is required; the pvrdma device can work not only with
+Ethernet links, but also with InfiniBand links.
+All that is needed is an ibdevice with an active port; for Mellanox cards
+it will be something like mlx5_6, which can be used as the backend.
+
+
+2.2.3 QEMU setup
+================
+Configure QEMU with the --enable-rdma flag after installing
+the required RDMA libraries.
+
+
+
+3. Usage
+========
+Currently the device works only with memory-backed RAM,
+which must be marked as "shared":
+   -m 1G \
+   -object memory-backend-ram,id=mb1,size=1G,share \
+   -numa node,memdev=mb1 \
+
+The pvrdma device is composed of two functions:
+ - Function 0 is a vmxnet Ethernet Device which is redundant in the guest
+   but is required to pass the ibdevice GID using its MAC.
+   Examples:
+     For an rxe backend using the eth0 interface it will use its MAC:
+       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
+     For an SRIOV VF, we take the Ethernet Interface exposed by it:
+       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
+ - Function 1 is the actual device:
+       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
+   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
+ Note: Make sure that the GID at backend-gid-idx matches vmxnet's MAC.
+ The rules of conversion are part of the RoCE spec, but since manual conversion
+ is not required, spotting problems is not hard:
+    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
+             MAC: 7c:fe:90:cb:74:3a
+    Note the difference between the first byte of the MAC and the GID.
+
+
+
+4. Implementation details
+=========================
+
+
+4.1 Overview
+============
+The device acts like a proxy between the Guest Driver and the host
+ibdevice interface.
+On configuration path:
+ - For every hardware resource request (PD/QP/CQ/...) the pvrdma device requests
+   a resource from the backend interface, maintaining a 1-1 mapping
+   between the guest and host.
+On data path:
+ - Every post_send/receive received from the guest is converted into
+   a post_send/receive for the backend. The buffer data is not touched
+   or copied, resulting in near bare-metal performance for large enough buffers.
+ - Completions from the backend interface result in completions for
+   the pvrdma device.
+
+
+4.2 PCI BARs
+============
+PCI Bars:
+    BAR 0 - MSI-X
+        MSI-X vectors:
+        (0) Command - used when execution of a command is completed.
+        (1) Async - not in use.
+        (2) Completion - used when a completion event is placed in
+            device's CQ ring.
+    BAR 1 - Registers
+        --------------------------------------------------------
+        | VERSION |  DSR | CTL | REQ | ERR |  ICR | IMR  | MAC |
+        --------------------------------------------------------
+        DSR - Address of driver/device shared memory used
+              for the command channel, used for passing:
+                - General info such as driver version
+                - Address of 'command' and 'response'
+                - Address of async ring
+                - Address of device's CQ ring
+                - Device capabilities
+        CTL - Device control operations (activate, reset etc)
+        IMR - Set interrupt mask
+        REQ - Command execution register
+        ERR - Operation status
+
+    BAR 2 - UAR
+        ---------------------------------------------------------
+        | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag |
+        ---------------------------------------------------------
+        - Offset 0 used for QP operations (send and recv)
+        - Offset 4 used for CQ operations (arm and poll)
+
+
+4.3 Major flows
+===============
+
+4.3.1 Create CQ
+===============
+    - Guest driver
+        - Allocates pages for CQ ring
+        - Creates page directory (pdir) to hold CQ ring's pages
+        - Initializes CQ ring
+        - Initializes 'Create CQ' command object (cqe, pdir etc)
+        - Copies the command to 'command' address
+        - Writes 0 into REQ register
+    - Device
+        - Reads the request object from the 'command' address
+        - Allocates CQ object and initializes CQ ring based on pdir
+        - Creates the backend CQ
+        - Writes operation status to ERR register
+        - Posts command-interrupt to guest
+    - Guest driver
+        - Reads the HW response code from ERR register
+
+4.3.2 Create QP
+===============
+    - Guest driver
+        - Allocates pages for send and receive rings
+        - Creates page directory (pdir) to hold the rings' pages
+        - Initializes 'Create QP' command object (max_send_wr,
+          send_cq_handle, recv_cq_handle, pdir etc)
+        - Copies the object to 'command' address
+        - Writes 0 into REQ register
+    - Device
+        - Reads the request object from the 'command' address
+        - Allocates the QP object and initializes
+            - Send and recv rings based on pdir
+            - Send and recv ring state
+        - Creates the backend QP
+        - Writes the operation status to ERR register
+        - Posts command-interrupt to guest
+    - Guest driver
+        - Reads the HW response code from ERR register
+
+4.3.3 Post receive
+==================
+    - Guest driver
+        - Initializes a wqe and places it on the recv ring
+        - Writes qpn|qp_recv_bit (31) to the QP offset in UAR
+    - Device
+        - Extracts qpn from UAR
+        - Walks through the ring and does the following for each wqe
+            - Prepares the backend CQE context to be used when
+              receiving completion from backend (wr_id, op_code, emu_cq_num)
+            - For each sge prepares backend sge
+            - Calls backend's post_recv
+
+4.3.4 Process backend events
+============================
+    - Done by a dedicated thread used to process backend events;
+      at initialization it is attached to the device and creates
+      the communication channel.
+    - Thread main loop:
+        - Polls for completions
+        - Extracts emu_cq_num, wr_id and op_code from context
+        - Writes CQE to CQ ring
+        - Writes CQ number to device CQ
+        - Sends completion-interrupt to guest
+        - Deallocates context
+        - Acks the event to backend
+
+
+
+5. Limitations
+==============
+- The device is limited to the features of the VMware device API that the guest
+  Linux driver implements.
+- The memory registration mechanism requires mremap for every page in the buffer in order
+  to map it to a contiguous virtual address range. Since this is not on the data path
+  it should not matter much. If the default max mr size is increased, be aware that
+  memory registration can take up to 0.5 seconds for 1GB of memory.
+- The device requires the target page size to be the same as the host page size,
+  otherwise it will fail to init.
+- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is attached,
+  so it can't work with huge pages. This limitation will be addressed in the future.
+  However, QEMU allocates guest RAM with MADV_HUGEPAGE, so if there are enough huge
+  pages available, QEMU will use them. QEMU will fail to init if the requirements
+  are not met.
+
+
+
+6. Performance
+==============
+By design the pvrdma device exits on each post-send/receive, so for small buffers
+the performance is affected; however for medium buffers it will become close to
+bare metal and from 1MB buffers and up it reaches bare metal performance.
+(tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device)
+
+All the above assumes no memory registration is done on the data path.
-- 
2.13.5
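
Note on the GID/MAC check in section 3 (Usage): the example pair given there
is consistent with the usual modified EUI-64 construction of a link-local
address. The universal/local bit (0x02) of the first MAC byte is flipped,
ff:fe is inserted between the third and fourth bytes, and the result is
prefixed with fe80::/64. The Python sketch below is a hypothetical helper,
not part of QEMU or of this patch. It assumes that the GID at
backend-gid-idx is this MAC-derived link-local GID; other GID table entries
(for example IP-based ones) follow different rules.

```python
def mac_to_link_local_gid(mac: str) -> str:
    """Return the link-local GID derived from a MAC (modified EUI-64).

    Hypothetical helper for illustration only; the name and the output
    format (fully expanded, lowercase) are assumptions chosen to match
    the example in section 3.
    """
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 0xFF for o in octets):
        raise ValueError(f"not a valid MAC address: {mac!r}")

    # Flip the universal/local bit and insert ff:fe in the middle.
    eui64 = [octets[0] ^ 0x02, octets[1], octets[2], 0xFF, 0xFE,
             octets[3], octets[4], octets[5]]

    # fe80::/64 prefix followed by the 64-bit interface identifier.
    groups = ["fe80", "0000", "0000", "0000"]
    groups += [f"{eui64[i]:02x}{eui64[i + 1]:02x}" for i in range(0, 8, 2)]
    return ":".join(groups)


if __name__ == "__main__":
    # MAC taken from the example in section 3.
    print(mac_to_link_local_gid("7c:fe:90:cb:74:3a"))
```

Comparing the helper's output with the GID listed for the backend device
at the chosen index is a quick way to spot a vmxnet3 MAC that does not match.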