From: Thanos Makatos
To: qemu-devel@nongnu.org
Subject: [PATCH] introduce VFIO-over-socket protocol specificaion
Date: Thu, 16 Jul 2020 08:31:43 -0700
Message-Id: <1594913503-52271-1-git-send-email-thanos.makatos@nutanix.com>
X-Mailer: git-send-email 1.7.1
Cc: benjamin.walker@intel.com, elena.ufimtseva@oracle.com,
 tomassetti.andrea@gmail.com, John G Johnson, jag.raman@oracle.com,
 swapnil.ingle@nutanix.com, james.r.harris@intel.com, konrad.wilk@oracle.com,
 yuvalkashtan@gmail.com, dgilbert@redhat.com, raphael.norwitz@nutanix.com,
 ismael@linux.com, alex.williamson@redhat.com, Thanos Makatos,
 Kanth.Ghatraju@oracle.com, stefanha@redhat.com, felipe@nutanix.com,
 marcandre.lureau@redhat.com, tina.zhang@intel.com, changpeng.liu@intel.com
Errors-To:
 qemu-devel-bounces+importer=patchew.org@nongnu.org
Sender: "Qemu-devel"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

This patch introduces the VFIO-over-socket protocol specification, which is
designed to allow devices to be emulated outside QEMU, in a separate
process. VFIO-over-socket reuses the existing VFIO defines, structs and
concepts.

It has been earlier discussed as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"

Signed-off-by: John G Johnson
Signed-off-by: Thanos Makatos
---
 docs/devel/vfio-over-socket.rst | 1135 +++++++++++++++++++++++++++++++++++++++
 1 files changed, 1135 insertions(+), 0 deletions(-)
 create mode 100644 docs/devel/vfio-over-socket.rst

diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-socket.rst
new file mode 100644
index 0000000..723b944
--- /dev/null
+++ b/docs/devel/vfio-over-socket.rst
@@ -0,0 +1,1135 @@
+***************************************
+VFIO-over-socket Protocol Specification
+***************************************
+
+Version 0.1
+
+Introduction
+============
+VFIO-over-socket, also known as vfio-user, is a protocol that allows a device
+to be virtualized in a separate process outside of QEMU. VFIO-over-socket
+devices consist of a generic VFIO device type, living inside QEMU, which we
+call the client, and the core device implementation, living outside QEMU, which
+we call the server. VFIO-over-socket can be the main transport mechanism for
+multi-process QEMU, however it can be used by other applications offering
+device virtualization. Explaining the advantages of a
+disaggregated/multi-process QEMU, and device virtualization outside QEMU in
+general, is beyond the scope of this document.
+
+This document focuses on specifying the VFIO-over-socket protocol.
+VFIO has been chosen for the following reasons:
+
+1) It is a mature and stable API, backed by an extensively used framework.
+2) The existing VFIO client implementation (qemu/hw/vfio/) can be largely
+   reused.
+
+In a proof of concept implementation it has been demonstrated that using VFIO
+over a UNIX domain socket is a viable option. VFIO-over-socket is designed with
+QEMU in mind; however, it could be used by other client applications. The
+VFIO-over-socket protocol does not require that QEMU's VFIO client
+implementation is used in QEMU. None of the VFIO kernel modules are required
+for supporting the protocol, neither in the client nor the server; only the
+source header files are used.
+
+The main idea is to allow a virtual device to function in a separate process in
+the same host over a UNIX domain socket. A UNIX domain socket (AF_UNIX) is
+chosen because we can trivially send file descriptors over it, which in turn
+allows:
+
+* Sharing of guest memory for DMA with the virtual device process.
+* Sharing of virtual device memory with the guest for fast MMIO.
+* Efficient sharing of eventfds for triggering interrupts.
+
+However, other socket types could be used, which would allow the virtual device
+process to run in a separate guest in the same host (AF_VSOCK) or remotely
+(AF_INET). Theoretically the underlying transport doesn't necessarily have to
+be a socket; however, we don't examine such alternatives. In this document we
+focus on using a UNIX domain socket and introduce basic support for the other
+two types of sockets without considering performance implications.
+
+This document does not yet describe any internal details of the server-side
+implementation; however, QEMU's VFIO client implementation will have to be
+adapted according to this protocol in order to support VFIO-over-socket virtual
+devices.
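The file descriptor passing that motivates the choice of AF_UNIX can be illustrated with a small sketch. It is not part of the protocol: the function name `pass_fd_demo` and the use of a temporary file as stand-in "guest memory" are assumptions made for the example, and it needs Python 3.9 or later on a Unix host for `socket.send_fds()`.

```python
import mmap
import os
import socket
import tempfile


def pass_fd_demo(payload: bytes) -> bytes:
    """Send an open file over an AF_UNIX socket pair and map it on the far side.

    Hypothetical helper: the temporary file stands in for guest memory and
    payload must not be empty.
    """
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with client, server, tempfile.TemporaryFile() as backing:
        backing.write(payload)
        backing.flush()
        # send_fds() attaches the descriptor as SCM_RIGHTS ancillary data.
        socket.send_fds(client, [b"fd"], [backing.fileno()])
        _data, fds, _msg_flags, _addr = socket.recv_fds(server, 16, 1)
        received = fds[0]
        try:
            # The received descriptor refers to the same open file, so the
            # receiver can mmap() it and see the sender's data directly.
            with mmap.mmap(received, len(payload)) as view:
                return bytes(view[:])
        finally:
            os.close(received)
```

The receiving side never opens the file by name; it only gets a descriptor, which is what makes sharing anonymous guest memory and eventfds practical.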
+
+VFIO
+====
+VFIO is a framework that allows a physical device to be securely passed through
+to a user space process; the kernel does not drive the device at all.
+Typically, the user space process is a VM and the device is passed through to
+it in order to achieve high performance. VFIO provides an API and the required
+functionality in the kernel. QEMU has adopted VFIO to allow a guest virtual
+machine to directly access physical devices, instead of emulating them in
+software.
+
+VFIO-over-socket reuses the core VFIO concepts defined in its API, but
+implements them as messages to be sent over a UNIX-domain socket. It does not
+change the kernel-based VFIO in any way; in fact none of the VFIO kernel
+modules need to be loaded to use VFIO-over-socket. It is also possible for QEMU
+to concurrently use the current kernel-based VFIO for one guest device, and use
+VFIO-over-socket for another device in the same guest.
+
+VFIO Device Model
+-----------------
+A device under VFIO presents a standard VFIO model to the user process. Many
+of the VFIO operations in the existing kernel model use the ioctl() system
+call, and references to the existing model are called the ioctl()
+implementation in this document.
+
+The following sections describe the set of messages that implement the VFIO
+device model over a UNIX domain socket. In many cases, the messages are direct
+translations of data structures used in the ioctl() implementation. Messages
+derived from ioctl()s will have a name derived from the ioctl() command name.
+E.g., the VFIO_DEVICE_GET_INFO ioctl() command becomes a
+VFIO_USER_DEVICE_GET_INFO message. The purpose of this reuse is to share as
+much code as feasible with the ioctl() implementation.
+
+Client and Server
+^^^^^^^^^^^^^^^^^
+The socket connects two processes together: a client process and a server
+process. In the context of this document, the client process is the process
+emulating a guest virtual machine, such as QEMU.
+The server process is a process that provides device emulation.
+
+Connection Initiation
+^^^^^^^^^^^^^^^^^^^^^
+After the client connects to the server, the initial server message is
+VFIO_USER_VERSION to propose a protocol version and set of capabilities to
+apply to the session. The client replies with a compatible version and set of
+capabilities it will support, or closes the connection if it cannot support the
+advertised version.
+
+Guest Memory Configuration
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+The client uses VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP messages to inform
+the server of the valid guest DMA ranges that the server can access on behalf
+of a device. Guest memory may be accessed by the server via VFIO_USER_DMA_READ
+and VFIO_USER_DMA_WRITE messages over the socket.
+
+An optimization for server access to guest memory is for the client to provide
+file descriptors the server can mmap() to directly access guest memory. Note
+that mmap() privileges cannot be revoked by the client, therefore file
+descriptors should only be exported in environments where the client trusts the
+server not to corrupt guest memory.
+
+Device Information
+^^^^^^^^^^^^^^^^^^
+The client uses a VFIO_USER_DEVICE_GET_INFO message to query the server for
+information about the device. This information includes:
+
+* the device type and capabilities,
+* the number of memory regions the device presents to the guest, and
+* the number of interrupt types the device supports.
+
+Region Information
+^^^^^^^^^^^^^^^^^^
+The client uses VFIO_USER_DEVICE_GET_REGION_INFO messages to query the server
+for information about the device's memory regions. This information describes:
+
+* Read and write permissions, whether it can be memory mapped, and whether it
+  supports additional capabilities.
+* Region index, size, and offset.
+
+When a region can be mapped by the client, the server provides a file
+descriptor which the client can mmap().
+The server is responsible for polling for client updates to memory mapped
+regions.
+
+Region Capabilities
+"""""""""""""""""""
+Some regions have additional capabilities that cannot be described adequately
+by the region info data structure. These capabilities are returned in the
+region info reply in a list similar to PCI capabilities in a PCI device's
+configuration space.
+
+Sparse Regions
+""""""""""""""
+A region can be memory-mappable in whole or in part. When only a subset of a
+region can be mapped by the client, a VFIO_REGION_INFO_CAP_SPARSE_MMAP
+capability is included in the region info reply. This capability describes
+which portions can be mapped by the client.
+
+For example, in a virtual NVMe controller, sparse regions can be used so that
+accesses to the NVMe registers (found in the beginning of BAR0) are trapped (an
+infrequent event), while allowing direct access to the doorbells (an
+extremely frequent event as every I/O submission requires a write to BAR0),
+found right after the NVMe registers in BAR0.
+
+Interrupts
+^^^^^^^^^^
+The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the server for
+the device's interrupt types. The interrupt types are specific to the bus the
+device is attached to, and the client is expected to know the capabilities of
+each interrupt type. The server can signal an interrupt either with
+VFIO_USER_VM_INTERRUPT messages over the socket, or can directly inject
+interrupts into the guest via an event file descriptor. The client configures
+how the server signals an interrupt with VFIO_USER_DEVICE_SET_IRQS messages.
+
+Device Read and Write
+^^^^^^^^^^^^^^^^^^^^^
+When the guest executes load or store operations to device memory, the client
+forwards these operations to the server with VFIO_USER_REGION_READ or
+VFIO_USER_REGION_WRITE messages. The server will reply with data from the
+device on read operations or an acknowledgement on write operations.
+
+DMA
+^^^
+When a device performs DMA accesses to guest memory, the server will forward
+them to the client with VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages.
+These messages can only be used to access guest memory the client has
+configured into the server.
+
+Protocol Specification
+======================
+To distinguish from the base VFIO symbols, all VFIO-over-socket symbols are
+prefixed with vfio_user or VFIO_USER. In revision 0.1, all data is in the
+little-endian format, although this may be relaxed in a future revision in
+cases where the client and server are both big-endian. The messages are
+formatted for seamless reuse of the native VFIO structs. A server can serve:
+
+1) multiple clients, and/or
+2) multiple virtual devices, belonging to one or more clients.
+
+Therefore each message requires a header that uniquely identifies the virtual
+device. It is a server-side implementation detail whether a single server
+handles multiple virtual devices from the same or multiple guests.
+
+Socket
+------
+A single UNIX domain socket is assumed to be used for each device. The location
+of the socket is implementation-specific. Multiplexing clients, devices, and
+servers over the same socket is not supported in this version of the protocol,
+but a device ID field exists in the message header so that support can be
+added in the future without a major version change.
+
+Authentication
+--------------
+For AF_UNIX, we rely on OS mandatory access controls on the socket files,
+therefore it is up to the management layer to set up the socket as required.
+Socket types that span guests or hosts will require a proper authentication
+mechanism. Defining that mechanism is deferred to a future version of the
+protocol.
+
+Request Concurrency
+-------------------
+There can be multiple outstanding requests per virtual device, e.g. a
+frame buffer where the guest does multiple stores to the virtual device.
+The server can execute and reorder non-conflicting requests in parallel,
+depending on the device semantics.
+
+Socket Disconnection Behavior
+-----------------------------
+The server and the client can disconnect from each other, either intentionally
+or unexpectedly. Both the client and the server need to know how to handle such
+events.
+
+Server Disconnection
+^^^^^^^^^^^^^^^^^^^^
+A server disconnecting from the client may indicate that:
+
+1) A virtual device has been restarted, either intentionally (e.g. because of a
+   device update) or unintentionally (e.g. because of a crash). In any case, the
+   virtual device will come back, so the client should not do anything special
+   (e.g. simply reconnect and retry failed operations).
+
+2) A virtual device has been shut down with no intention to be restarted.
+
+It is impossible for the client to know whether or not a failure is
+intermittent or innocuous and should be retried, therefore the client should
+attempt to reconnect to the socket. Since an intentional server restart (e.g.
+due to an upgrade) might take some time, a reasonable timeout should be used.
+In cases where the disconnection is expected (e.g. the guest shutting down), no
+new requests will be sent anyway so this situation doesn't pose a problem. The
+control stack will clean up accordingly.
+
+Parametrizing this behaviour by having the virtual device advertise a
+reasonable reconnect timeout is deferred to a future version of the protocol.
+
+Client Disconnection
+^^^^^^^^^^^^^^^^^^^^
+The client disconnecting from the server primarily means that the QEMU process
+has exited. Currently this means that the guest is shut down, so the device is
+no longer needed and the server can automatically exit. However, there
+can be cases where a client disconnect should not result in a server exit:
+
+1) A single server serving multiple clients.
+2) A multi-process QEMU upgrading itself step by step, which isn't yet
+   implemented.
+
+Therefore, in order for the protocol to be forward compatible, the server should
+take no action when the client disconnects. If anything happens to the client
+process the control stack will know about it and can clean up resources
+accordingly.
+
+Request Retry and Response Timeout
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+QEMU's VFIO retries certain operations if they fail. While this makes sense for
+real HW, we don't know for sure whether it makes sense for virtual devices. A
+failed request is a request that has been successfully sent and has been
+responded to with an error code. Failure to send the request in the first place
+(e.g. because the socket is disconnected) is a different type of error examined
+earlier in the disconnect section.
+
+Defining a retry and timeout scheme is deferred to a future version of the
+protocol.
+
+Commands
+--------
+The following table lists the VFIO message command IDs, and whether the
+message request is sent from the client or the server.
+
++----------------------------------+---------+-------------------+
+| Name                             | Command | Request Direction |
++==================================+=========+===================+
+| VFIO_USER_VERSION                | 1       | server → client   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DMA_MAP                | 2       | client → server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DMA_UNMAP              | 3       | client → server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_GET_INFO        | 4       | client → server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_GET_REGION_INFO | 5       | client → server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_GET_IRQ_INFO    | 6       | client → server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_SET_IRQS        | 7       | client → server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_REGION_READ            | 8       | client → server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_REGION_WRITE           | 9       | client → server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DMA_READ               | 10      | server → client   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DMA_WRITE              | 11      | server → client   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_VM_INTERRUPT           | 12      | server → client   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_RESET           | 13      | client → server   |
++----------------------------------+---------+-------------------+
+
+Header
+------
+All messages are preceded by a 16-byte header that contains basic information
+about the message. The header is followed by message-specific data described
+in the sections below.
+
++----------------+--------+-------------+
+| Name           | Offset | Size        |
++================+========+=============+
+| Device ID      | 0      | 2           |
++----------------+--------+-------------+
+| Message ID     | 2      | 2           |
++----------------+--------+-------------+
+| Command        | 4      | 4           |
++----------------+--------+-------------+
+| Message size   | 8      | 4           |
++----------------+--------+-------------+
+| Flags          | 12     | 4           |
++----------------+--------+-------------+
+|                | +-----+------------+ |
+|                | | Bit | Definition | |
+|                | +=====+============+ |
+|                | | 0   | Reply      | |
+|                | +-----+------------+ |
+|                | | 1   | No_reply   | |
+|                | +-----+------------+ |
++----------------+--------+-------------+
+|                | 16     | variable    |
++----------------+--------+-------------+
+
+* Device ID identifies the destination device of the message. This field is
+  reserved when the server only supports one device per socket.
+* Message ID identifies the message, and is used in the message acknowledgement.
+* Command specifies the command to be executed, listed in the Command Table.
+* Message size contains the size of the entire message, including the header.
+* Flags contains attributes of the message:
+
+  * The reply bit differentiates request messages from reply messages. A reply
+    message acknowledges a previous request with the same message ID.
+  * No_reply indicates that no reply is needed for this request. This is
+    commonly used when multiple requests are sent, and only the last needs
+    acknowledgement.
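As a rough illustration of the header layout above, the following sketch packs and unpacks the 16-byte little-endian header with Python's `struct` module. The helper names (`pack_message`, `unpack_header`, `make_reply`) and the dictionary keys are assumptions made for this example; only the field order, sizes, flag bits and command number come from the tables.

```python
import struct

# Little-endian: Device ID (u16), Message ID (u16), Command (u32),
# Message size (u32), Flags (u32) = 16 bytes in total.
HEADER = struct.Struct("<HHIII")

FLAG_REPLY = 1 << 0     # bit 0: this message is a reply
FLAG_NO_REPLY = 1 << 1  # bit 1: the sender does not want a reply

VFIO_USER_REGION_READ = 8  # command number from the command table


def pack_message(device_id, message_id, command, payload=b"", flags=0):
    """Prefix payload with a header; the size field covers header + payload."""
    size = HEADER.size + len(payload)
    return HEADER.pack(device_id, message_id, command, size, flags) + payload


def unpack_header(data):
    """Decode the first 16 bytes of a message into a dictionary."""
    device_id, message_id, command, size, flags = HEADER.unpack_from(data)
    return {
        "device_id": device_id,
        "message_id": message_id,
        "command": command,
        "size": size,
        "is_reply": bool(flags & FLAG_REPLY),
        "no_reply": bool(flags & FLAG_NO_REPLY),
    }


def make_reply(request, payload=b""):
    """Build a reply that reuses the request's device ID and message ID."""
    hdr = unpack_header(request)
    return pack_message(hdr["device_id"], hdr["message_id"], hdr["command"],
                        payload, FLAG_REPLY)
```

Using an explicit `<` byte order keeps the layout free of padding and independent of the host's endianness, which matches the little-endian requirement of revision 0.1.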
+
+VFIO_USER_VERSION
+-----------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    | 0                      |
++--------------+------------------------+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 1                      |
++--------------+------------------------+
+| Message size | 16 + version length    |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Version      | JSON byte array        |
++--------------+------------------------+
+
+This is the initial message sent by the server after the socket connection is
+established. The version is in JSON format, and the following objects must be
+included:
+
++--------------+--------+---------------------------------------------------+
+| Name         | Type   | Description                                       |
++==============+========+===================================================+
+| version      | object | {"major": <number>, "minor": <number>}            |
+|              |        | Version supported by the sender, e.g. "0.1".      |
++--------------+--------+---------------------------------------------------+
+| type         | string | Fixed to "vfio-user".                             |
++--------------+--------+---------------------------------------------------+
+| capabilities | array  | Reserved. Can be omitted for v0.1, otherwise must |
+|              |        | be empty.                                         |
++--------------+--------+---------------------------------------------------+
+
+Versioning and Feature Support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Upon accepting a connection, the server must send a VFIO_USER_VERSION message
+proposing a protocol version and a set of capabilities. The client compares
+these with the versions and capabilities it supports and sends a
+VFIO_USER_VERSION reply according to the following rules.
+
+* The major version in the reply must be the same as proposed. If the client
+  does not support the proposed major, it closes the connection.
+* The minor version in the reply must be equal to or less than the minor
+  version proposed.
+* The capability list must be a subset of those proposed. If the client
+  requires a capability the server did not include, it closes the connection.
+* If type is not "vfio-user", the client closes the connection.
+
+The protocol major version will only change when incompatible protocol changes
+are made, such as changing the message format. The minor version may change
+when compatible changes are made, such as adding new messages or capabilities.
+Both the client and server must support all minor versions less than the
+maximum minor version they support. E.g., an implementation that supports
+version 1.3 must also support 1.0 through 1.2.
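The client-side rules above can be sketched as a small function. This is one possible reading rather than a reference implementation: the function name `negotiate`, its parameters, and the choice to return `None` for "close the connection" are assumptions of the example.

```python
import json


def negotiate(proposed_json, supported_major, supported_minor, required_caps=()):
    """Return the client's reply as JSON bytes, or None to close the connection.

    Hypothetical helper: applies the versioning rules to the server's proposal.
    """
    proposal = json.loads(proposed_json)
    # Rule: the type must be "vfio-user".
    if proposal.get("type") != "vfio-user":
        return None
    version = proposal["version"]
    # Rule: the major version must match exactly.
    if version["major"] != supported_major:
        return None
    # Rule: a capability the client requires must have been proposed.
    offered = proposal.get("capabilities", [])
    if any(cap not in offered for cap in required_caps):
        return None
    reply = {
        # Rule: the minor version must not exceed the one proposed.
        "version": {
            "major": supported_major,
            "minor": min(version["minor"], supported_minor),
        },
        "type": "vfio-user",
        # Capabilities are reserved in v0.1, so the reply carries none.
        "capabilities": [],
    }
    return json.dumps(reply).encode()
```

Picking `min()` of the two minor versions relies on the requirement that both sides support every minor version below their maximum.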
+
+VFIO_USER_DMA_MAP
+-----------------
+
+VFIO_USER_DMA_UNMAP
+-------------------
+
+Message Format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    | 0                      |
++--------------+------------------------+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | MAP=2, UNMAP=3         |
++--------------+------------------------+
+| Message size | 16 + table size        |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Table        | array of table entries |
++--------------+------------------------+
+
+This message is sent by the client to the server to inform it of the guest
+memory regions the device can access. It must be sent before the device can
+perform any DMA to the guest. It is normally sent directly after the version
+handshake is completed, but may also occur when memory is added to or removed
+from the guest.
+
+The table is an array of the following structure. This structure is 32 bytes
+in size, so the message size will be 16 + (# of table entries * 32). If a
+region being added can be directly mapped by the server, an array of file
+descriptors will be sent as part of the message meta-data. Each region entry
+will have a corresponding file descriptor. On AF_UNIX sockets, the file
+descriptors will be passed as SCM_RIGHTS type ancillary data.
+
+Table entry format
+^^^^^^^^^^^^^^^^^^
+
++-------------+--------+-------------+
+| Name        | Offset | Size        |
++=============+========+=============+
+| Address     | 0      | 8           |
++-------------+--------+-------------+
+| Size        | 8      | 8           |
++-------------+--------+-------------+
+| Offset      | 16     | 8           |
++-------------+--------+-------------+
+| Protections | 24     | 4           |
++-------------+--------+-------------+
+| Flags       | 28     | 4           |
++-------------+--------+-------------+
+|             | +-----+------------+ |
+|             | | Bit | Definition | |
+|             | +=====+============+ |
+|             | | 0   | Mappable   | |
+|             | +-----+------------+ |
++-------------+--------+-------------+
+
+* Address is the base DMA address of the region.
+* Size is the size of the region.
+* Offset is the file offset of the region with respect to the associated file
+  descriptor.
+* Protections are the region's protection attributes as encoded in
+  ``<sys/mman.h>``.
+* Flags contain the following region attributes:
+
+  * Mappable indicates the region can be mapped via the mmap() system call using
+    the file descriptor provided in the message meta-data.
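To make the size arithmetic concrete, here is a sketch that builds a VFIO_USER_DMA_MAP message body from a list of regions. The function name `build_dma_map` and the tuple layout of its input are assumptions for the example; sending the accompanying file descriptors is left out.

```python
import mmap
import struct

HEADER = struct.Struct("<HHIII")  # 16-byte message header, little-endian
# Address (u64), Size (u64), Offset (u64), Protections (u32), Flags (u32).
ENTRY = struct.Struct("<QQQII")

VFIO_USER_DMA_MAP = 2
FLAG_MAPPABLE = 1 << 0  # table entry flag bit 0


def build_dma_map(message_id, regions):
    """Pack a DMA map request.

    regions is a list of (address, size, offset, protections, mappable)
    tuples; this input shape is an assumption of the sketch.
    """
    table = b"".join(
        ENTRY.pack(address, size, offset, prot, FLAG_MAPPABLE if mappable else 0)
        for address, size, offset, prot, mappable in regions
    )
    total = HEADER.size + len(table)  # 16 + (number of entries * 32)
    return HEADER.pack(0, message_id, VFIO_USER_DMA_MAP, total, 0) + table
```

For two regions the message is 16 + 2 * 32 = 80 bytes long. Protection values such as `mmap.PROT_READ | mmap.PROT_WRITE` can be passed straight through, since Python's `mmap` module exposes the platform's constants on Unix.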
+
+VFIO_USER_DEVICE_GET_INFO
+-------------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+----------------------------+
+| Name         | Value                      |
++==============+============================+
+| Device ID    |                            |
++--------------+----------------------------+
+| Message ID   |                            |
++--------------+----------------------------+
+| Command      | 4                          |
++--------------+----------------------------+
+| Message size | 16 in request, 32 in reply |
++--------------+----------------------------+
+| Flags        | Reply bit set in reply     |
++--------------+----------------------------+
+| Device info  | VFIO device info           |
++--------------+----------------------------+
+
+This message is sent by the client to the server to query for basic information
+about the device. Only the message header is needed in the request message.
+The VFIO device info structure is defined in ``<linux/vfio.h>``
+(``struct vfio_device_info``).
+
+VFIO device info format
+^^^^^^^^^^^^^^^^^^^^^^^
+
++-------------+--------+--------------------------+
+| Name        | Offset | Size                     |
++=============+========+==========================+
+| argsz       | 16     | 4                        |
++-------------+--------+--------------------------+
+| flags       | 20     | 4                        |
++-------------+--------+--------------------------+
+|             | +-----+-------------------------+ |
+|             | | Bit | Definition              | |
+|             | +=====+=========================+ |
+|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
+|             | +-----+-------------------------+ |
+|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
+|             | +-----+-------------------------+ |
++-------------+--------+--------------------------+
+| num_regions | 24     | 4                        |
++-------------+--------+--------------------------+
+| num_irqs    | 28     | 4                        |
++-------------+--------+--------------------------+
+
+* argsz is reserved in vfio-user; it is only used in the ioctl() VFIO
+  implementation.
+* flags contains the following device attributes.
+
+  * VFIO_DEVICE_FLAGS_RESET indicates the device supports the
+    VFIO_USER_DEVICE_RESET message.
+  * VFIO_DEVICE_FLAGS_PCI indicates the device is a PCI device.
+
+* num_regions is the number of memory regions the device exposes.
+* num_irqs is the number of distinct interrupt types the device supports.
+
+This version of the protocol only supports PCI devices. Additional devices may
+be supported in future versions.
+
+VFIO_USER_DEVICE_GET_REGION_INFO
+--------------------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    |                        |
++--------------+------------------------+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 5                      |
++--------------+------------------------+
+| Message size | 48 + any caps          |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Region info  | VFIO region info       |
++--------------+------------------------+
+
+This message is sent by the client to the server to query for information about
+device memory regions. The VFIO region info structure is defined in
+``<linux/vfio.h>`` (``struct vfio_region_info``).
+
+VFIO region info format
+^^^^^^^^^^^^^^^^^^^^^^^
+
++------------+--------+------------------------------+
+| Name       | Offset | Size                         |
++============+========+==============================+
+| argsz      | 16     | 4                            |
++------------+--------+------------------------------+
+| flags      | 20     | 4                            |
++------------+--------+------------------------------+
+|            | +-----+-----------------------------+ |
+|            | | Bit | Definition                  | |
+|            | +=====+=============================+ |
+|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
+|            | +-----+-----------------------------+ |
+|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
+|            | +-----+-----------------------------+ |
+|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
+|            | +-----+-----------------------------+ |
+|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
+|            | +-----+-----------------------------+ |
++------------+--------+------------------------------+
+| index      | 24     | 4                            |
++------------+--------+------------------------------+
+| cap_offset | 28     | 4                            |
++------------+--------+------------------------------+
+| size       | 32     | 8                            |
++------------+--------+------------------------------+
+| offset     | 40     | 8                            |
++------------+--------+------------------------------+
+
+* argsz is reserved in vfio-user; it is only used in the ioctl() VFIO
+  implementation.
+* flags are attributes of the region:
+
+  * VFIO_REGION_INFO_FLAG_READ allows client read access to the region.
+  * VFIO_REGION_INFO_FLAG_WRITE allows client write access to the region.
+  * VFIO_REGION_INFO_FLAG_MMAP specifies the client can mmap() the region. When
+    this flag is set, the reply will include a file descriptor in its meta-data.
+    On AF_UNIX sockets, the file descriptors will be passed as SCM_RIGHTS type
+    ancillary data.
+  * VFIO_REGION_INFO_FLAG_CAPS indicates additional capabilities found in the
+    reply.
+
+* index is the index of the memory region being queried; it is the only field
+  that is required to be set in the request message.
+* cap_offset describes where additional region capabilities can be found.
+  cap_offset is relative to the beginning of the VFIO region info structure.
+  The data structure it points to is a VFIO cap header defined in ````.
+* size is the size of the region.
+* offset is the offset given to the mmap() system call for regions with the
+  MMAP attribute. It is also used as the base offset when mapping a VFIO
+  sparse mmap area, described below.
+
+VFIO Region capabilities
+^^^^^^^^^^^^^^^^^^^^^^^^
+The VFIO region information can also include a capabilities list. This list is
+similar to a PCI capability list - each entry has a common header that
+identifies a capability and where the next capability in the list can be found.
+The VFIO capability header format is defined in ```` (``struct
+vfio_info_cap_header``).
+
+VFIO cap header format
+^^^^^^^^^^^^^^^^^^^^^^
+
++---------+--------+------+
+| Name    | Offset | Size |
++=========+========+======+
+| id      | 0      | 2    |
++---------+--------+------+
+| version | 2      | 2    |
++---------+--------+------+
+| next    | 4      | 4    |
++---------+--------+------+
+
+* id is the capability identity.
+* version is a capability-specific version number.
+* next specifies the offset of the next capability in the capability list. It
+  is relative to the beginning of the VFIO region info structure.
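The region info layout and the capability chain described above can be sketched in Python with the ``struct`` module. This is only an illustration under stated assumptions: fields are taken to be little-endian and packed without padding, a ``next`` value of 0 is taken to end the capability list, and the helper names ``parse_region_info`` and ``walk_caps`` are hypothetical, not part of the protocol.

```python
import struct

# Flag bits, taken from the "VFIO region info format" table.
VFIO_REGION_INFO_FLAG_READ = 1 << 0
VFIO_REGION_INFO_FLAG_WRITE = 1 << 1
VFIO_REGION_INFO_FLAG_MMAP = 1 << 2
VFIO_REGION_INFO_FLAG_CAPS = 1 << 3

HEADER_SIZE = 16  # the region info starts right after the message header
# Assumption: fields are little-endian and packed without padding.
REGION_INFO = struct.Struct("<IIIIQQ")  # argsz, flags, index, cap_offset, size, offset
CAP_HEADER = struct.Struct("<HHI")      # id, version, next


def parse_region_info(message):
    """Decode the VFIO region info found at message offset 16."""
    _argsz, flags, index, cap_offset, size, offset = REGION_INFO.unpack_from(
        message, HEADER_SIZE)
    return {
        "index": index,
        "readable": bool(flags & VFIO_REGION_INFO_FLAG_READ),
        "writable": bool(flags & VFIO_REGION_INFO_FLAG_WRITE),
        "mmapable": bool(flags & VFIO_REGION_INFO_FLAG_MMAP),
        "has_caps": bool(flags & VFIO_REGION_INFO_FLAG_CAPS),
        "cap_offset": cap_offset,
        "size": size,
        "offset": offset,
    }


def walk_caps(message, cap_offset):
    """Yield (id, version, position) for every capability header.

    cap_offset and next are both relative to the start of the region info
    structure, so positions are resolved against message[16:].
    """
    info = memoryview(message)[HEADER_SIZE:]
    position, seen = cap_offset, set()
    while position != 0:  # assumption: next == 0 terminates the list
        if position in seen or position + CAP_HEADER.size > len(info):
            raise ValueError("malformed capability list")
        seen.add(position)
        cap_id, version, next_offset = CAP_HEADER.unpack_from(info, position)
        yield cap_id, version, position
        position = next_offset
```

A client would call ``walk_caps`` only when ``has_caps`` is true. The ``seen`` set guards against a reply whose ``next`` offsets loop back on themselves.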
+
+VFIO sparse mmap
+^^^^^^^^^^^^^^^^
+
++------------------+----------------------------------+
+| Name             | Value                            |
++==================+==================================+
+| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
++------------------+----------------------------------+
+| version          | 0x1                              |
++------------------+----------------------------------+
+| next             |                                  |
++------------------+----------------------------------+
+| sparse mmap info | VFIO region info sparse mmap     |
++------------------+----------------------------------+
+
+The only capability supported in this version of the protocol is for sparse
+mmap. This capability is defined when only a subrange of the region supports
+direct access by the client via mmap(). The VFIO sparse mmap area is defined in
+```` (``struct vfio_region_sparse_mmap_area``).
+
+VFIO region info cap sparse mmap
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++----------+--------+------+
+| Name     | Offset | Size |
++==========+========+======+
+| nr_areas | 0      | 4    |
++----------+--------+------+
+| reserved | 4      | 4    |
++----------+--------+------+
+| offset   | 8      | 8    |
++----------+--------+------+
+| size     | 16     | 8    |
++----------+--------+------+
+| ...      |        |      |
++----------+--------+------+
+
+* nr_areas is the number of sparse mmap areas in the region.
+* offset and size describe a single area that can be mapped by the client.
+  There will be nr_areas pairs of offset and size. The offset will be added to
+  the base offset given in the VFIO_USER_DEVICE_GET_REGION_INFO reply to form
+  the offset argument of the subsequent mmap() call.
+
+The VFIO region info cap sparse mmap structure is defined in ```` (``struct
+vfio_region_info_cap_sparse_mmap``).
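As a rough illustration of the sparse mmap layout above, the following Python sketch decodes the sparse mmap info and derives the ``mmap()`` length and offset for each area. It assumes little-endian fields, assumes the input starts at ``nr_areas`` (offset 0 in the table above), and uses hypothetical helper names.

```python
import struct

# Assumption: little-endian, no padding.
SPARSE_HEADER = struct.Struct("<II")  # nr_areas, reserved
SPARSE_AREA = struct.Struct("<QQ")    # offset, size


def parse_sparse_mmap(info):
    """Return the list of (offset, size) areas in a sparse mmap info blob."""
    nr_areas, _reserved = SPARSE_HEADER.unpack_from(info, 0)
    needed = SPARSE_HEADER.size + nr_areas * SPARSE_AREA.size
    if len(info) < needed:
        raise ValueError("truncated sparse mmap info")
    return [
        SPARSE_AREA.unpack_from(info, SPARSE_HEADER.size + i * SPARSE_AREA.size)
        for i in range(nr_areas)
    ]


def mmap_arguments(region_offset, areas):
    """Return (length, offset) pairs for the subsequent mmap() calls.

    Each area offset is added to the base offset reported for the region.
    """
    return [(size, region_offset + area_offset) for area_offset, size in areas]
```

For example, a region whose base offset is ``0x10000`` and which exposes areas at ``0x0`` and ``0x2000`` would be mapped at file offsets ``0x10000`` and ``0x12000``.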
+
+VFIO_USER_DEVICE_GET_IRQ_INFO
+-----------------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    |                        |
++--------------+------------------------+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 6                      |
++--------------+------------------------+
+| Message size | 32                     |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| IRQ info     | VFIO IRQ info          |
++--------------+------------------------+
+
+This message is sent by the client to the server to query for information about
+device interrupt types. The VFIO IRQ info structure is defined in
+```` (``struct vfio_irq_info``).
+
+VFIO IRQ info format
+^^^^^^^^^^^^^^^^^^^^
+
++-------+--------+---------------------------+
+| Name  | Offset | Size                      |
++=======+========+===========================+
+| argsz | 16     | 4                         |
++-------+--------+---------------------------+
+| flags | 20     | 4                         |
++-------+--------+---------------------------+
+|       | +-----+--------------------------+ |
+|       | | Bit | Definition               | |
+|       | +=====+==========================+ |
+|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
+|       | +-----+--------------------------+ |
+|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
+|       | +-----+--------------------------+ |
+|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
+|       | +-----+--------------------------+ |
+|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
+|       | +-----+--------------------------+ |
++-------+--------+---------------------------+
+| index | 24     | 4                         |
++-------+--------+---------------------------+
+| count | 28     | 4                         |
++-------+--------+---------------------------+
+
+* argsz is reserved in vfio-user; it is only used in the
ioctl() VFIO
+  implementation.
+* flags defines IRQ attributes:
+
+  * VFIO_IRQ_INFO_EVENTFD indicates the IRQ type can support server eventfd
+    signalling.
+  * VFIO_IRQ_INFO_MASKABLE indicates that the IRQ type supports the MASK and
+    UNMASK actions in a VFIO_USER_DEVICE_SET_IRQS message.
+  * VFIO_IRQ_INFO_AUTOMASKED indicates the IRQ type masks itself after being
+    triggered, and the client must send an UNMASK action to receive new
+    interrupts.
+  * VFIO_IRQ_INFO_NORESIZE indicates VFIO_USER_DEVICE_SET_IRQS operations set
+    up interrupts as a set, and new subindexes cannot be enabled without
+    disabling the entire type.
+
+* index is the index of the IRQ type being queried; it is the only field that
+  is required to be set in the request message.
+* count describes the number of interrupts of the queried type.
+
+VFIO_USER_DEVICE_SET_IRQS
+-------------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    |                        |
++--------------+------------------------+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 7                      |
++--------------+------------------------+
+| Message size | 36 + any data          |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| IRQ set      | VFIO IRQ set           |
++--------------+------------------------+
+
+This message is sent by the client to the server to set actions for device
+interrupt types. The VFIO IRQ set structure is defined in ````
+(``struct vfio_irq_set``).
+
+VFIO IRQ set format
+^^^^^^^^^^^^^^^^^^^
+
++-------+--------+------------------------------+
+| Name  | Offset | Size                         |
++=======+========+==============================+
+| argsz | 16     | 4                            |
++-------+--------+------------------------------+
+| flags | 20     | 4                            |
++-------+--------+------------------------------+
+|       | +-----+-----------------------------+ |
+|       | | Bit | Definition                  | |
+|       | +=====+=============================+ |
+|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
+|       | +-----+-----------------------------+ |
+|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
+|       | +-----+-----------------------------+ |
+|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
+|       | +-----+-----------------------------+ |
+|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
+|       | +-----+-----------------------------+ |
+|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
+|       | +-----+-----------------------------+ |
+|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
+|       | +-----+-----------------------------+ |
++-------+--------+------------------------------+
+| index | 24     | 4                            |
++-------+--------+------------------------------+
+| start | 28     | 4                            |
++-------+--------+------------------------------+
+| count | 32     | 4                            |
++-------+--------+------------------------------+
+| data  | 36     | variable                     |
++-------+--------+------------------------------+
+
+* argsz is reserved in vfio-user; it is only used in the ioctl() VFIO
+  implementation.
+* flags defines the action performed on the interrupt range. The DATA flags
+  describe the data field sent in the message; the ACTION flags describe the
+  action to be performed. The flags within each set are mutually exclusive.
+
+  * VFIO_IRQ_SET_DATA_NONE indicates there is no data field in the request. The
+    action is performed unconditionally.
+  * VFIO_IRQ_SET_DATA_BOOL indicates the data field is an array of boolean
+    bytes.
The action is performed if the corresponding boolean is true.
+  * VFIO_IRQ_SET_DATA_EVENTFD indicates an array of event file descriptors was
+    sent in the message meta-data. These descriptors will be signalled when the
+    action defined by the action flags occurs. In AF_UNIX sockets, the
+    descriptors are sent as SCM_RIGHTS type ancillary data.
+  * VFIO_IRQ_SET_ACTION_MASK indicates a masking event. It can be used with
+    VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to mask an interrupt, or
+    with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the guest masks
+    the interrupt.
+  * VFIO_IRQ_SET_ACTION_UNMASK indicates an unmasking event. It can be used
+    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to unmask an
+    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the
+    guest unmasks the interrupt.
+  * VFIO_IRQ_SET_ACTION_TRIGGER indicates a triggering event. It can be used
+    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to trigger an
+    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the
+    guest triggers the interrupt.
+
+* index is the index of the IRQ type being set up.
+* start is the first subindex being set.
+* count describes the number of subindexes being set. As a special case, a
+  count of 0 with data flags of VFIO_IRQ_SET_DATA_NONE disables all interrupts
+  of the index.
+* data is an optional field included when the VFIO_IRQ_SET_DATA_BOOL flag is
+  present. It contains an array of booleans that specify whether the action is
+  to be performed on the corresponding subindex. It is used when the action is
+  only performed on a subset of the range specified.
+
+Not all interrupt types support every combination of data and action flags.
+The client must know the capabilities of the device and IRQ index before it
+sends a VFIO_USER_DEVICE_SET_IRQS message.
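To make the flag rules above concrete, here is a minimal Python sketch that builds the VFIO IRQ set payload (the bytes that follow the 16 byte message header). Several points are assumptions rather than protocol text: little-endian encoding, a zero in the reserved ``argsz`` field, the requirement of exactly one DATA flag and one ACTION flag, one boolean byte per subindex, and the helper name ``build_irq_set``.

```python
import struct

# Flag bits, taken from the "VFIO IRQ set format" table.
VFIO_IRQ_SET_DATA_NONE = 1 << 0
VFIO_IRQ_SET_DATA_BOOL = 1 << 1
VFIO_IRQ_SET_DATA_EVENTFD = 1 << 2
VFIO_IRQ_SET_ACTION_MASK = 1 << 3
VFIO_IRQ_SET_ACTION_UNMASK = 1 << 4
VFIO_IRQ_SET_ACTION_TRIGGER = 1 << 5

DATA_FLAGS = (VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_DATA_BOOL
              | VFIO_IRQ_SET_DATA_EVENTFD)
ACTION_FLAGS = (VFIO_IRQ_SET_ACTION_MASK | VFIO_IRQ_SET_ACTION_UNMASK
                | VFIO_IRQ_SET_ACTION_TRIGGER)

# Assumption: little-endian, no padding.
IRQ_SET = struct.Struct("<IIIII")  # argsz, flags, index, start, count


def _exactly_one_bit(value):
    return value != 0 and value & (value - 1) == 0


def build_irq_set(flags, index, start, count, bools=None):
    """Return the IRQ set payload, with boolean data appended if requested."""
    # Assumption: exactly one flag from each set must be present.
    if not _exactly_one_bit(flags & DATA_FLAGS):
        raise ValueError("exactly one DATA flag is required")
    if not _exactly_one_bit(flags & ACTION_FLAGS):
        raise ValueError("exactly one ACTION flag is required")
    data = b""
    if flags & VFIO_IRQ_SET_DATA_BOOL:
        # Assumption: one boolean byte per subindex in the range.
        if bools is None or len(bools) != count:
            raise ValueError("DATA_BOOL needs one boolean per subindex")
        data = bytes(1 if flag else 0 for flag in bools)
    elif bools is not None:
        raise ValueError("boolean data is only sent with DATA_BOOL")
    # argsz is reserved in vfio-user; sending zero is an assumption.
    return IRQ_SET.pack(0, flags, index, start, count) + data
```

With ``VFIO_IRQ_SET_DATA_EVENTFD`` the payload carries no data field; the descriptors travel as ancillary data on the socket, which this sketch does not attempt to show.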
+
+Read and Write Operations
+-------------------------
+
+Not all I/O operations between the client and server can be done via direct
+access of memory mapped with an mmap() call. In these cases, the client and
+server use messages sent over the socket. It is expected that these operations
+will have lower performance than direct access.
+
+The client can access device memory with VFIO_USER_REGION_READ and
+VFIO_USER_REGION_WRITE requests. These share a common data structure that
+appears after the 16 byte message header.
+
+REGION Read/Write Data
+^^^^^^^^^^^^^^^^^^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| Offset | 16     | 8        |
++--------+--------+----------+
+| Region | 24     | 4        |
++--------+--------+----------+
+| Count  | 28     | 4        |
++--------+--------+----------+
+| Data   | 32     | variable |
++--------+--------+----------+
+
+* Offset is the offset into the region being accessed.
+* Region is the index of the region being accessed.
+* Count is the size of the data to be transferred.
+* Data is the data to be read or written.
+
+The server can access guest memory with VFIO_USER_DMA_READ and
+VFIO_USER_DMA_WRITE messages. These also share a common data structure that
+appears after the 16 byte message header.
+
+DMA Read/Write Data
+^^^^^^^^^^^^^^^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| Address | 16     | 8        |
++---------+--------+----------+
+| Count   | 24     | 4        |
++---------+--------+----------+
+| Data    | 28     | variable |
++---------+--------+----------+
+
+* Address is the address of the guest memory area being accessed. This address
+  must have been exported to the server with a VFIO_USER_DMA_MAP message.
+* Count is the size of the data to be transferred.
+* Data is the data to be read or written.
+
+Address and count can also be accessed as ``struct iovec`` from ````.
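The two data structures above map directly onto fixed-size prefixes followed by the data bytes. The Python sketch below packs and unpacks both, assuming little-endian fields without padding; the function names are hypothetical and the 16 byte message header is left out.

```python
import struct

# Assumption: little-endian, no padding. Offsets in the tables above are
# message offsets, so each payload starts at message offset 16.
REGION_RW = struct.Struct("<QII")  # Offset, Region, Count
DMA_RW = struct.Struct("<QI")      # Address, Count


def pack_region_rw(offset, region, data=b"", count=None):
    """Build REGION read/write data; count defaults to len(data)."""
    if count is None:
        count = len(data)
    return REGION_RW.pack(offset, region, count) + data


def unpack_region_rw(payload):
    """Return (offset, region, count, data) from REGION read/write data."""
    offset, region, count = REGION_RW.unpack_from(payload, 0)
    return offset, region, count, bytes(payload[REGION_RW.size:])


def pack_dma_rw(address, data=b"", count=None):
    """Build DMA read/write data; count defaults to len(data)."""
    if count is None:
        count = len(data)
    return DMA_RW.pack(address, count) + data


def unpack_dma_rw(payload):
    """Return (address, count, data) from DMA read/write data."""
    address, count = DMA_RW.unpack_from(payload, 0)
    return address, count, bytes(payload[DMA_RW.size:])
```

Passing ``count`` explicitly covers read requests, which carry a count but no data.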
+
+VFIO_USER_REGION_READ
+---------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    |                        |
++--------------+------------------------+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 8                      |
++--------------+------------------------+
+| Message size | 32 + data size         |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Read info    | REGION read/write data |
++--------------+------------------------+
+
+This request is sent from the client to the server to read from device memory.
+In the request message, there will be no data, and the count field will be the
+amount of data to be read. The reply will include the data read, and its count
+field will be the amount of data read.
+
+VFIO_USER_REGION_WRITE
+----------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    |                        |
++--------------+------------------------+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 9                      |
++--------------+------------------------+
+| Message size | 32 + data size         |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Write info   | REGION read/write data |
++--------------+------------------------+
+
+This request is sent from the client to the server to write to device memory.
+The request message will contain the data to be written, and its count field
+will contain the amount of write data. The count field in the reply will be
+zero.
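The request and reply semantics of VFIO_USER_REGION_READ and VFIO_USER_REGION_WRITE described above could be handled on the server side roughly as follows. This is a sketch only: device regions are modelled as ``bytearray`` objects in a dictionary, the encoding is assumed little-endian, the reply is assumed to echo the request's offset and region fields, and the handler names are hypothetical.

```python
import struct

REGION_RW = struct.Struct("<QII")  # Offset, Region, Count (assumed little-endian)


def handle_region_read(regions, payload):
    """Serve a read: the request has no data, the reply carries it."""
    offset, region, count = REGION_RW.unpack_from(payload, 0)
    memory = regions[region]
    if offset + count > len(memory):
        raise ValueError("read past end of region")
    data = bytes(memory[offset:offset + count])
    # The reply count is the amount of data actually read.
    return REGION_RW.pack(offset, region, len(data)) + data


def handle_region_write(regions, payload):
    """Serve a write: the request carries the data, the reply count is zero."""
    offset, region, count = REGION_RW.unpack_from(payload, 0)
    data = payload[REGION_RW.size:REGION_RW.size + count]
    if len(data) != count:
        raise ValueError("request holds less data than its count field")
    memory = regions[region]
    if offset + count > len(memory):
        raise ValueError("write past end of region")
    memory[offset:offset + count] = data
    return REGION_RW.pack(offset, region, 0)


def message_size(payload):
    """Message size field: 16 byte header plus the payload (32 + data size)."""
    return 16 + len(payload)
```

A read request for 4 bytes therefore has a message size of 32, while its reply has a message size of 36.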
+
+VFIO_USER_DMA_READ
+------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    |                        |
++--------------+------------------------+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 10                     |
++--------------+------------------------+
+| Message size | 28 + data size         |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| DMA info     | DMA read/write data    |
++--------------+------------------------+
+
+This request is sent from the server to the client to read from guest memory.
+In the request message, there will be no data, and the count field will be the
+amount of data to be read. The reply will include the data read, and its count
+field will be the amount of data read.
+
+VFIO_USER_DMA_WRITE
+-------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    |                        |
++--------------+------------------------+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 11                     |
++--------------+------------------------+
+| Message size | 28 + data size         |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| DMA info     | DMA read/write data    |
++--------------+------------------------+
+
+This request is sent from the server to the client to write to guest memory.
+The request message will contain the data to be written, and its count field
+will contain the amount of write data. The count field in the reply will be
+zero.
+
+VFIO_USER_VM_INTERRUPT
+----------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++----------------+------------------------+
+| Name           | Value                  |
++================+========================+
+| Device ID      |                        |
++----------------+------------------------+
+| Message ID     |                        |
++----------------+------------------------+
+| Command        | 12                     |
++----------------+------------------------+
+| Message size   | 24                     |
++----------------+------------------------+
+| Flags          | Reply bit set in reply |
++----------------+------------------------+
+| Interrupt info |                        |
++----------------+------------------------+
+
+This request is sent from the server to the client to signal the device has
+raised an interrupt.
+
+Interrupt info format
+^^^^^^^^^^^^^^^^^^^^^
+
++----------+--------+------+
+| Name     | Offset | Size |
++==========+========+======+
+| Index    | 16     | 4    |
++----------+--------+------+
+| Subindex | 20     | 4    |
++----------+--------+------+
+
+* Index is the interrupt index; it is the same value used in
+  VFIO_USER_DEVICE_SET_IRQS.
+* Subindex is relative to the index, e.g., the vector number used in PCI MSI/X
+  type interrupts.
+
+VFIO_USER_DEVICE_RESET
+----------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    |                        |
++--------------+------------------------+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 13                     |
++--------------+------------------------+
+| Message size | 16                     |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+
+This request is sent from the client to the server to reset the device.
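For the VFIO_USER_VM_INTERRUPT message described above, a client might decode the interrupt info and route it to a per-vector handler. The sketch below assumes little-endian fields and treats the first 16 bytes as an opaque header; ``parse_vm_interrupt``, ``dispatch_vm_interrupt`` and the handler table are hypothetical names for illustration.

```python
import struct

HEADER_SIZE = 16
VM_INTERRUPT_SIZE = 24  # message size given in the table above
INTERRUPT_INFO = struct.Struct("<II")  # Index, Subindex (assumed little-endian)


def parse_vm_interrupt(message):
    """Return (index, subindex) from a complete VM_INTERRUPT message."""
    if len(message) != VM_INTERRUPT_SIZE:
        raise ValueError("VFIO_USER_VM_INTERRUPT must be 24 bytes long")
    return INTERRUPT_INFO.unpack_from(message, HEADER_SIZE)


def dispatch_vm_interrupt(message, handlers):
    """Call the handler registered for (index, subindex), if there is one.

    Returns True when a handler ran and False when none was registered.
    """
    key = parse_vm_interrupt(message)
    handler = handlers.get(key)
    if handler is None:
        return False
    handler(*key)
    return True
```

Keying the table on the ``(index, subindex)`` pair mirrors how the same two values identify an interrupt in a VFIO_USER_DEVICE_SET_IRQS request.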
+
+Appendices
+==========
+
+Unused VFIO ioctl() commands
+----------------------------
+
+The following commands must be handled by the client and not sent to the server:
+
+* VFIO_GET_API_VERSION
+* VFIO_CHECK_EXTENSION
+* VFIO_SET_IOMMU
+* VFIO_GROUP_GET_STATUS
+* VFIO_GROUP_SET_CONTAINER
+* VFIO_GROUP_UNSET_CONTAINER
+* VFIO_GROUP_GET_DEVICE_FD
+* VFIO_IOMMU_GET_INFO
+
+However, once support for live migration for VFIO devices is finalized, some
+of the above commands might have to be handled by the client. This will be
+addressed in a future protocol version.
+
+Live Migration
+--------------
+Currently live migration is not supported for devices passed through via VFIO,
+therefore it is not supported for VFIO-over-socket, either. This is being
+actively worked on in the "Add migration support for VFIO devices" (v25) patch
+series.
+
+VFIO groups and containers
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The current VFIO implementation includes group and container idioms that
+describe how a device relates to the host IOMMU. In the VFIO over socket
+implementation, the IOMMU is implemented in SW by the client, and isn't visible
+to the server. The simplest idea is for the client to put each device into
+its own group and container.
-- 
1.7.1