From: Thanos Makatos
To: qemu-devel@nongnu.org
Subject: [PATCH v7] introduce vfio-user protocol specification
Date: Mon, 30 Nov 2020 08:12:29 -0800
Message-Id: <20201130161229.23164-1-thanos.makatos@nutanix.com>
Cc: benjamin.walker@intel.com, John G Johnson, Swapnil Ingle, "Michael S.
 Tsirkin", Jason Wang, John Levon, Kirti Wankhede, Gerd Hoffmann,
 tina.zhang@intel.com, jag.raman@oracle.com, james.r.harris@intel.com,
 Elena Ufimtseva, Kanth.Ghatraju@oracle.com, Felipe Franciosi,
 marcandre.lureau@redhat.com, Christophe de Dinechin, Yan Zhao,
 konrad.wilk@oracle.com, yuvalkashtan@gmail.com, dgilbert@redhat.com,
 ismael@linux.com, Stefan Hajnoczi, Thanos Makatos, changpeng.liu@intel.com,
 tomassetti.andrea@gmail.com, Cornelia Huck, alex.williamson@redhat.com,
 Raphael Norwitz, xiuchun.lu@intel.com, Paolo Bonzini

This patch introduces the vfio-user protocol specification (formerly
known as VFIO-over-socket), which is designed to allow devices to be
emulated outside QEMU, in a separate process. vfio-user reuses the
existing VFIO defines, structs and concepts.

It was discussed earlier as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"

Signed-off-by: John G Johnson
Signed-off-by: Thanos Makatos

---

Changed since v1:
  * fix coding style issues
  * update MAINTAINERS for VFIO-over-socket
  * add vfio-over-socket to ToC

Changed since v2:
  * fix whitespace

Changed since v3:
  * rename protocol to vfio-user
  * add table of contents
  * fix Unicode problems
  * fix typos and various reStructuredText issues
  * various stylistic improvements
  * add backend program conventions
  * rewrite part of intro, drop QEMU-specific stuff
  * drop QEMU-specific paragraph about implementation
  * explain that passing of FDs isn't necessary
  * minor improvements in the VFIO section
  * various text substitutions for the sake of consistency
  * drop paragraph about client and server, already explained in intro
  * drop device ID
  * drop type from version
  * elaborate on request concurrency
  * convert some inessential paragraphs into notes
  * explain why some existing VFIO defines cannot be reused
  * explain how to make changes to the protocol
  * improve text of DMA map
  * reword comment about existing VFIO commands
  * add reference to Version section
  * reset device on disconnection
  * reword live migration section
  * replace sys/vfio.h with linux/vfio.h
  * drop reference to iovec
  * use argz the same way it is used in VFIO
  * add type field in header for clarity

Changed since v4:
  * introduce support for live migration as defined in
    include/uapi/linux/vfio.h
  * introduce 'max_fds' and 'migration' capabilities
  * remove 'index' from VFIO_USER_DEVICE_GET_IRQ_INFO
  * fix minor typos and reword some text for clarity

Changed since v5:
  * fix minor typos
  * separate VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP
  * clarify meaning of VFIO bitmap size field
  * move version major/minor outside JSON
  * client proposes version first
  * make Errno optional in message header
  * clarification about message ID uniqueness
  * clarify that server->client request can appear in between
    client->server request/reply

Changed since v6:
  * put JSON strings in double quotes
  * clarify reply behavior on error
  * introduce max message size capability
  * clarify semantics when failing to map multiple DMA regions in a
    single command

You can focus on v6 to v7 changes by cloning my fork
(https://github.com/tmakatos/qemu) and doing:

  git diff refs/tags/vfio-user/v6 refs/heads/vfio-user/v7
---
 MAINTAINERS              |    6 +
 docs/devel/index.rst     |    1 +
 docs/devel/vfio-user.rst | 1662 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1669 insertions(+)
 create mode 100644 docs/devel/vfio-user.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 68bc160f41..6a4c662976 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1766,6 +1766,12 @@ F: hw/vfio/ap.c
 F: docs/system/s390x/vfio-ap.rst
 L: qemu-s390x@nongnu.org
 
+vfio-user
+M: John G Johnson
+M: Thanos Makatos
+S: Supported
+F: docs/devel/vfio-user.rst
+
 vhost
 M: Michael S. Tsirkin
 S: Supported
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index f10ed77e4c..2e2cba28c6 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -35,3 +35,4 @@ Contents:
    clocks
    qom
    block-coroutine-wrapper
+   vfio-user
diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
new file mode 100644
index 0000000000..d15a228a1a
--- /dev/null
+++ b/docs/devel/vfio-user.rst
@@ -0,0 +1,1662 @@
+.. include::
+
+********************************
+vfio-user Protocol Specification
+********************************
+
+------------
+Version_ 0.1
+------------
+
+.. contents:: Table of Contents
+
+Introduction
+============
+vfio-user is a protocol that allows a device to be emulated in a separate
+process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
+of a generic VFIO device type, living inside the VMM, which we call the client,
+and the core device implementation, living outside the VMM, which we call the
+server.
+
+The `Linux VFIO ioctl interface `_ has
+been chosen as the base for this protocol for the following reasons:
+
+1) It is a mature and stable API, backed by an extensively used framework.
+2) The existing VFIO client implementation in QEMU (qemu/hw/vfio/) can be
+   largely reused.
+
+.. Note::
+   In a proof of concept implementation it has been demonstrated that using VFIO
+   over a UNIX domain socket is a viable option. vfio-user is designed with
+   QEMU in mind, however it could be used by other client applications. The
+   vfio-user protocol does not require that QEMU's VFIO client implementation
+   is used in QEMU.
+
+None of the VFIO kernel modules are required for supporting the protocol, in
+either the client or the server; only the source header files are used.
+
+The main idea is to allow a virtual device to function in a separate process in
+the same host over a UNIX domain socket.
+A UNIX domain socket (AF_UNIX) is chosen because file descriptors can be
+trivially sent over it, which in turn allows:
+
+* Sharing of client memory for DMA with the server.
+* Sharing of server memory with the client for fast MMIO.
+* Efficient sharing of eventfds for triggering interrupts.
+
+Other socket types could be used which allow the server to run in a separate
+guest in the same host (AF_VSOCK) or remotely (AF_INET). Theoretically the
+underlying transport does not necessarily have to be a socket, however we do
+not examine such alternatives. In this protocol version we focus on using a
+UNIX domain socket and introduce basic support for the other two types of
+sockets without considering performance implications.
+
+While passing of file descriptors is desirable for performance reasons, it is
+not necessary for either the client or the server to support it in order to
+implement the protocol. There is always an in-band, message-passing fallback
+mechanism.
+
+VFIO
+====
+VFIO is a framework that allows a physical device to be securely passed through
+to a user space process; the device-specific kernel driver does not drive the
+device at all. Typically, the user space process is a VMM and the device is
+passed through to it in order to achieve high performance. VFIO provides an API
+and the required functionality in the kernel. QEMU has adopted VFIO to allow a
+guest to directly access physical devices, instead of emulating them in
+software.
+
+vfio-user reuses the core VFIO concepts defined in its API, but implements them
+as messages to be sent over a socket. It does not change the kernel-based VFIO
+in any way; in fact, none of the VFIO kernel modules need to be loaded to use
+vfio-user. It is also possible for the client to concurrently use the current
+kernel-based VFIO for one device, and vfio-user for another device.
+
+VFIO Device Model
+-----------------
+A device under VFIO presents a standard interface to the user process. Many of
+the VFIO operations in the existing interface use the ioctl() system call, and
+references to the existing interface are called the ioctl() implementation in
+this document.
+
+The following sections describe the set of messages that implement the VFIO
+interface over a socket. In many cases, the messages are direct translations of
+data structures used in the ioctl() implementation. Messages derived from
+ioctl()s will have a name derived from the ioctl() command name. E.g., the
+VFIO_DEVICE_GET_INFO ioctl() command becomes a VFIO_USER_DEVICE_GET_INFO
+message. The purpose of this reuse is to share as much code as feasible with
+the ioctl() implementation.
+
+Connection Initiation
+^^^^^^^^^^^^^^^^^^^^^
+After the client connects to the server, the initial client message is
+VFIO_USER_VERSION to propose a protocol version and set of capabilities to
+apply to the session. The server replies with a compatible version and set of
+capabilities it supports, or closes the connection if it cannot support the
+advertised version.
+
+DMA Memory Configuration
+^^^^^^^^^^^^^^^^^^^^^^^^
+The client uses VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP messages to inform
+the server of the valid DMA ranges that the server can access on behalf
+of a device. DMA memory may be accessed by the server via VFIO_USER_DMA_READ
+and VFIO_USER_DMA_WRITE messages over the socket.
+
+An optimization for server access to client memory is for the client to provide
+file descriptors the server can mmap() to directly access client memory. Note
+that mmap() privileges cannot be revoked by the client, therefore file
+descriptors should only be exported in environments where the client trusts the
+server not to corrupt guest memory.
+
+Device Information
+^^^^^^^^^^^^^^^^^^
+The client uses a VFIO_USER_DEVICE_GET_INFO message to query the server for
+information about the device.
+This information includes:
+
+* the device type and whether it supports reset (``VFIO_DEVICE_FLAGS_``),
+* the number of device regions, and
+* the number of interrupt types the device supports.
+
+Region Information
+^^^^^^^^^^^^^^^^^^
+The client uses VFIO_USER_DEVICE_GET_REGION_INFO messages to query the server
+for information about the device's memory regions. This information describes:
+
+* Read and write permissions, whether it can be memory mapped, and whether it
+  supports additional capabilities (``VFIO_REGION_INFO_CAP_``).
+* Region index, size, and offset.
+
+When a region can be mapped by the client, the server provides a file
+descriptor which the client can mmap(). The server is responsible for polling
+for client updates to memory mapped regions.
+
+Region Capabilities
+"""""""""""""""""""
+Some regions have additional capabilities that cannot be described adequately
+by the region info data structure. These capabilities are returned in the
+region info reply in a list similar to PCI capabilities in a PCI device's
+configuration space.
+
+Sparse Regions
+""""""""""""""
+A region can be memory-mappable in whole or in part. When only a subset of a
+region can be mapped by the client, a VFIO_REGION_INFO_CAP_SPARSE_MMAP
+capability is included in the region info reply. This capability describes
+which portions can be mapped by the client.
+
+.. Note::
+   For example, in a virtual NVMe controller, sparse regions can be used so
+   that accesses to the NVMe registers (found in the beginning of BAR0) are
+   trapped (an infrequent event), while allowing direct access to the doorbells
+   (an extremely frequent event as every I/O submission requires a write to
+   BAR0), found right after the NVMe registers in BAR0.
+
+Device-Specific Regions
+"""""""""""""""""""""""
+
+A device can define regions additional to the standard ones (e.g. PCI indexes
+0-8).
+This is achieved by including a VFIO_REGION_INFO_CAP_TYPE capability in the
+region info reply of a device-specific region. Such regions are reflected
+in ``struct vfio_device_info.num_regions``. Thus, for PCI devices this value can
+be equal to, or higher than, VFIO_PCI_NUM_REGIONS.
+
+Interrupts
+^^^^^^^^^^
+The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the server for
+the device's interrupt types. The interrupt types are specific to the bus the
+device is attached to, and the client is expected to know the capabilities of
+each interrupt type. The server can signal an interrupt either with
+VFIO_USER_VM_INTERRUPT messages over the socket, or can directly inject
+interrupts into the guest via an event file descriptor. The client configures
+how the server signals an interrupt with VFIO_USER_DEVICE_SET_IRQS messages.
+
+Device Read and Write
+^^^^^^^^^^^^^^^^^^^^^
+When the guest executes load or store operations to device memory, the client
+forwards these operations to the server with VFIO_USER_REGION_READ or
+VFIO_USER_REGION_WRITE messages. The server will reply with data from the
+device on read operations or an acknowledgement on write operations.
+
+DMA
+^^^
+When a device performs DMA accesses to guest memory, the server will forward
+them to the client with VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages.
+These messages can only be used to access guest memory the client has
+configured into the server.
+
+Protocol Specification
+======================
+To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed
+with vfio_user or VFIO_USER. In revision 0.1, all data is in the little-endian
+format, although this may be relaxed in a future revision in cases where the
+client and server are both big-endian. The messages are formatted for seamless
+reuse of the native VFIO structs.
+
+Socket
+------
+
+A server can serve:
+
+1) one or more clients, and/or
+2) one or more virtual devices, belonging to one or more clients.
+
+The current protocol specification requires a dedicated socket per
+client/server connection. It is a server-side implementation detail whether a
+single server handles multiple virtual devices from the same or multiple
+clients. The location of the socket is implementation-specific. Multiplexing
+clients, devices, and servers over the same socket is not supported in this
+version of the protocol.
+
+Authentication
+--------------
+For AF_UNIX, we rely on OS mandatory access controls on the socket files,
+therefore it is up to the management layer to set up the socket as required.
+Socket types that span guests or hosts will require a proper authentication
+mechanism. Defining that mechanism is deferred to a future version of the
+protocol.
+
+Command Concurrency
+-------------------
+A client may pipeline multiple commands without waiting for previous command
+replies. The server will process commands in the order they are received. A
+consequence of this is that if a client issues a command with the *No_reply*
+bit, then subsequently issues a command without *No_reply*, the older command
+will have been processed before the reply to the younger command is sent by the
+server. The client must be aware of the device's capability to process
+concurrent commands if pipelining is used. For example, pipelining allows
+multiple client threads to concurrently access device memory; the client must
+ensure these accesses obey device semantics.
+
+An example is a frame buffer device, where the device may allow concurrent
+access to different areas of video memory, but may have indeterminate behavior
+if concurrent accesses are performed to command or status registers.
+
+Note that unrelated messages sent from the server to the client can appear in
+between a client to server request/reply and vice versa.
+
+Socket Disconnection Behavior
+-----------------------------
+The server and the client can disconnect from each other, either intentionally
+or unexpectedly. Both the client and the server need to know how to handle such
+events.
+
+Server Disconnection
+^^^^^^^^^^^^^^^^^^^^
+A server disconnecting from the client may indicate that:
+
+1) A virtual device has been restarted, either intentionally (e.g. because of a
+   device update) or unintentionally (e.g. because of a crash).
+2) A virtual device has been shut down with no intention to be restarted.
+
+It is impossible for the client to know whether or not a failure is
+intermittent or innocuous and should be retried, therefore the client should
+reset the VFIO device when it detects the socket has been disconnected.
+Error recovery will be driven by the guest's device error handling
+behavior.
+
+Client Disconnection
+^^^^^^^^^^^^^^^^^^^^
+The client disconnecting from the server primarily means that the client
+has exited. Currently, this means that the guest is shut down, so the device is
+no longer needed and the server can automatically exit. However, there
+can be cases where a client disconnection should not result in a server exit:
+
+1) A single server serving multiple clients.
+2) A multi-process QEMU upgrading itself step by step, which is not yet
+   implemented.
+
+Therefore, in order for the protocol to be forward compatible, the server
+should take no action when the client disconnects. If anything happens to the
+client, the control stack will know about it and can clean up resources
+accordingly.
+
+Request Retry and Response Timeout
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+A failed command is a command that has been successfully sent and has been
+responded to with an error code. Failure to send the command in the first place
+(e.g. because the socket is disconnected) is a different type of error examined
+earlier in the disconnect section.
+
+.. Note::
+   QEMU's VFIO retries certain operations if they fail. While this makes sense
+   for real HW, we don't know for sure whether it makes sense for virtual
+   devices.
+
+Defining a retry and timeout scheme is deferred to a future version of the
+protocol.
+
+.. _Commands:
+
+Commands
+--------
+The following table lists the VFIO message command IDs, and whether the
+message command is sent from the client or the server.
+
++----------------------------------+---------+-------------------+
+| Name                             | Command | Request Direction |
++==================================+=========+===================+
+| VFIO_USER_VERSION                | 1       | client -> server  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DMA_MAP                | 2       | client -> server  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DMA_UNMAP              | 3       | client -> server  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_GET_INFO        | 4       | client -> server  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_GET_REGION_INFO | 5       | client -> server  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_GET_IRQ_INFO    | 6       | client -> server  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_SET_IRQS        | 7       | client -> server  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_REGION_READ            | 8       | client -> server  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_REGION_WRITE           | 9       | client -> server  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DMA_READ               | 10      | server -> client  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DMA_WRITE              | 11      | server -> client  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_VM_INTERRUPT           | 12      | server -> client  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_RESET           | 13      | client -> server  |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DIRTY_PAGES            | 14      | client -> server  |
++----------------------------------+---------+-------------------+
+
+
+.. Note:: Some VFIO defines cannot be reused since their values are
+   architecture-specific (e.g. VFIO_IOMMU_MAP_DMA).
+
+Header
+------
+All messages, both command messages and reply messages, are preceded by a
+header that contains basic information about the message. The header is
+followed by message-specific data described in the sections below.
+
++----------------+--------+-------------+
+| Name           | Offset | Size        |
++================+========+=============+
+| Message ID     | 0      | 2           |
++----------------+--------+-------------+
+| Command        | 2      | 2           |
++----------------+--------+-------------+
+| Message size   | 4      | 4           |
++----------------+--------+-------------+
+| Flags          | 8      | 4           |
++----------------+--------+-------------+
+|                | +-----+------------+ |
+|                | | Bit | Definition | |
+|                | +=====+============+ |
+|                | | 0-3 | Type       | |
+|                | +-----+------------+ |
+|                | | 4   | No_reply   | |
+|                | +-----+------------+ |
+|                | | 5   | Error      | |
+|                | +-----+------------+ |
++----------------+--------+-------------+
+| Error          | 12     | 4           |
++----------------+--------+-------------+
+|                | 16     | variable    |
++----------------+--------+-------------+
+
+* *Message ID* identifies the message, and is echoed in the command's reply
+  message. Message IDs belong entirely to the sender, can be re-used (even
+  concurrently) and the receiver must not make any assumptions about their
+  uniqueness.
+* *Command* specifies the command to be executed, listed in Commands_.
+* *Message size* contains the size of the entire message, including the header.
+* *Flags* contains attributes of the message:
+
+  * The *Type* bits indicate the message type.
+
+    * *Command* (value 0x0) indicates a command message.
+    * *Reply* (value 0x1) indicates a reply message acknowledging a previous
+      command with the same message ID.
+
+  * *No_reply* in a command message indicates that no reply is needed for this
+    command. This is commonly used when multiple commands are sent, and only
+    the last needs acknowledgement.
+  * *Error* in a reply message indicates the command being acknowledged had
+    an error. In this case, the *Error* field will be valid.
+
+* *Error* in a reply message is an optional UNIX errno value. It may be zero
+  even if the Error bit is set in Flags. It is reserved in a command message.
+
+Each command message in Commands_ must be replied to with a reply message,
+unless the message sets the *No_reply* bit. The reply consists of the header
+with the *Reply* bit set, plus any additional data.
+
+If an error occurs, the reply message must only include the reply header.
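
As an illustration of the header layout above (not part of the patch), the following Python sketch packs and parses the 16-byte header with the standard `struct` module, using the little-endian encoding stated for revision 0.1. The helper names `pack_message`, `unpack_header` and `make_error_reply` are hypothetical, and echoing the command number in a reply is an assumption drawn from the per-command message tables rather than an explicit rule in the Header section.

```python
import struct

# Header layout from the table: little-endian, 16 bytes in total.
#   Message ID (2) | Command (2) | Message size (4) | Flags (4) | Error (4)
HEADER = struct.Struct("<HHIII")

TYPE_MASK = 0xF      # Flags bits 0-3: message type
TYPE_COMMAND = 0x0
TYPE_REPLY = 0x1
NO_REPLY = 1 << 4    # Flags bit 4
ERROR = 1 << 5       # Flags bit 5


def pack_message(msg_id, command, payload=b"", flags=TYPE_COMMAND, error=0):
    """Prepend the header; Message size covers the header plus the payload."""
    size = HEADER.size + len(payload)
    return HEADER.pack(msg_id, command, size, flags, error) + payload


def unpack_header(data):
    """Decode the first 16 bytes of a message into a dict."""
    msg_id, command, size, flags, error = HEADER.unpack_from(data)
    return {
        "msg_id": msg_id,
        "command": command,
        "size": size,
        "type": flags & TYPE_MASK,
        "no_reply": bool(flags & NO_REPLY),
        "error_flag": bool(flags & ERROR),
        "errno": error,
    }


def make_error_reply(request, errno_value):
    """Build an error reply: header only, echoing the request's message ID.

    Echoing the command number is an assumption of this sketch.
    """
    req = unpack_header(request)
    return pack_message(req["msg_id"], req["command"], b"",
                        TYPE_REPLY | ERROR, errno_value)
```

For instance, `pack_message(7, 8, b"\x00" * 4)` yields a 20-byte message whose Message size field is 20, and `make_error_reply` always returns exactly 16 bytes, matching the rule that an error reply carries only the header.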
+
+VFIO_USER_VERSION
+-----------------
+
+This is the initial message sent by the client after the socket connection is
+established:
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+-------------------------------------------+
+| Name         | Value                                     |
++==============+===========================================+
+| Message ID   |                                           |
++--------------+-------------------------------------------+
+| Command      | 1                                         |
++--------------+-------------------------------------------+
+| Message size | 16 + version header + version data length |
++--------------+-------------------------------------------+
+| Flags        | Reply bit set in reply                    |
++--------------+-------------------------------------------+
+| Error        | 0/errno                                   |
++--------------+-------------------------------------------+
+| Version      | version header                            |
++--------------+-------------------------------------------+
+
+Version Header Format
+^^^^^^^^^^^^^^^^^^^^^
+
++---------------+--------+------------------------------------------------+
+| Name          | Offset | Size                                           |
++===============+========+================================================+
+| version major | 16     | 2                                              |
++---------------+--------+------------------------------------------------+
+| version minor | 18     | 2                                              |
++---------------+--------+------------------------------------------------+
+| version data  | 20     | variable (including terminating NUL            |
+|               |        | character). Optional.                          |
++---------------+--------+------------------------------------------------+
+
+Version Data Format
+^^^^^^^^^^^^^^^^^^^
+
+The version data is an optional JSON byte array with the following format:
+
++--------------------+------------------+-----------------------------------+
+| Name               | Type             | Description                       |
++====================+==================+===================================+
+| ``"capabilities"`` | collection of    | Contains common capabilities that |
+|                    | name/value pairs | the sender supports. Optional.    |
++--------------------+------------------+-----------------------------------+
+
+Capabilities:
+
++--------------------+------------------+-------------------------------------+
+| Name               | Type             | Description                         |
++====================+==================+=====================================+
+| ``"max_fds"``      | number           | Maximum number of file descriptors  |
+|                    |                  | that can be received by the sender. |
+|                    |                  | Optional. If not specified then the |
+|                    |                  | receiver must assume                |
+|                    |                  | ``"max_fds"=1``.                    |
++--------------------+------------------+-------------------------------------+
+| ``"max_msg_size"`` | number           | Maximum message size in bytes that  |
+|                    |                  | the receiver can handle, including  |
+|                    |                  | the header. Optional. If not        |
+|                    |                  | specified then the receiver must    |
+|                    |                  | assume ``"max_msg_size"=4096``.     |
++--------------------+------------------+-------------------------------------+
+| ``"migration"``    | collection of    | Migration capability parameters. If |
+|                    | name/value pairs | missing then migration is not       |
+|                    |                  | supported by the sender.            |
++--------------------+------------------+-------------------------------------+
+
+The migration capability contains the following name/value pairs:
+
++--------------+--------+-----------------------------------------------+
+| Name         | Type   | Description                                   |
++==============+========+===============================================+
+| ``"pgsize"`` | number | Page size of dirty pages bitmap. The smaller  |
+|              |        | of the client and server values is used.      |
++--------------+--------+-----------------------------------------------+
+
+
+.. _Version:
+
+Versioning and Feature Support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Upon establishing a connection, the client must send a VFIO_USER_VERSION
+message proposing a protocol version and a set of capabilities. The server
+compares these with the versions and capabilities it supports and sends a
+VFIO_USER_VERSION reply according to the following rules.
+
+* The major version in the reply must be the same as proposed. If the server
+  does not support the proposed major version, it closes the connection.
+* The minor version in the reply must be equal to or less than the minor
+  version proposed.
+* The capability list must be a subset of those proposed. If the server
+  requires a capability the client did not include, it closes the connection.
+
+The protocol major version will only change when incompatible protocol changes
+are made, such as changing the message format. The minor version may change
+when compatible changes are made, such as adding new messages or capabilities.
+Both the client and the server must support all minor versions less than the
+maximum minor version they support. E.g., an implementation that supports
+version 1.3 must also support 1.0 through 1.2.
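
The three reply rules above can be sketched as a small server-side function. This is a minimal illustration, not part of the patch: the name `negotiate_version` is hypothetical, capabilities are reduced to sets of names (the JSON values are ignored), and `required_caps` is an assumed way of expressing "the server requires a capability".

```python
def negotiate_version(proposed, supported_major, supported_minor,
                      supported_caps, required_caps=frozenset()):
    """Return (major, minor, capabilities) for the VFIO_USER_VERSION reply,
    or None when the server should close the connection instead.

    proposed is the client's (major, minor, capability_names) tuple.
    """
    major, minor, caps = proposed
    # Rule 1: the major version must match exactly.
    if major != supported_major:
        return None
    # Rule 3 (second half): a capability the server requires is missing.
    if not set(required_caps) <= set(caps):
        return None
    # Rule 2: reply with a minor version no greater than the one proposed.
    # Rule 3 (first half): reply with a subset of the proposed capabilities.
    return major, min(minor, supported_minor), set(caps) & set(supported_caps)
```

A client proposing `(0, 3, {"max_fds", "migration"})` to a server that supports 0.1 with only `max_fds` would get `(0, 1, {"max_fds"})` back under this sketch.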
+
+When making a change to this specification, the protocol version number must
+be included in the form "added in version X.Y".
+
+
+VFIO_USER_DMA_MAP
+-----------------
+
+Message Format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 2                      |
++--------------+------------------------+
+| Message size | 16 + table size        |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Error        | 0/errno                |
++--------------+------------------------+
+| Table        | array of table entries |
++--------------+------------------------+
+
+This command message is sent by the client to the server to inform it of the
+memory regions the server can access. It must be sent before the server can
+perform any DMA to the client. It is normally sent directly after the version
+handshake is completed, but may also occur when memory is added to the client,
+or if the client uses a vIOMMU. If the client does not expect the server to
+perform DMA then it does not need to send VFIO_USER_DMA_MAP commands to the
+server. If the server does not need to perform DMA then it can ignore such
+commands but it must still reply to them.
The table is an array of the
+following structure:
+
+Table entry format
+^^^^^^^^^^^^^^^^^^
+
++-------------+--------+-------------+
+| Name        | Offset | Size        |
++=============+========+=============+
+| Address     | 0      | 8           |
++-------------+--------+-------------+
+| Size        | 8      | 8           |
++-------------+--------+-------------+
+| Offset      | 16     | 8           |
++-------------+--------+-------------+
+| Protections | 24     | 4           |
++-------------+--------+-------------+
+| Flags       | 28     | 4           |
++-------------+--------+-------------+
+|             | +-----+------------+ |
+|             | | Bit | Definition | |
+|             | +=====+============+ |
+|             | | 0   | Mappable   | |
+|             | +-----+------------+ |
++-------------+--------+-------------+
+
+* *Address* is the base DMA address of the region.
+* *Size* is the size of the region.
+* *Offset* is the file offset of the region with respect to the associated file
+  descriptor.
+* *Protections* are the region's protection attributes as encoded in
+  ````.
+* *Flags* contains the following region attributes:
+
+  * *Mappable* indicates that the region can be mapped via the mmap() system
+    call using the file descriptor provided in the message meta-data.
+
+This structure is 32 bytes in size, so the message size is:
+16 + (# of table entries * 32).
+
+If a DMA region being added can be directly mapped by the server, an array of
+file descriptors must be sent as part of the message meta-data. Each mappable
+region entry must have a corresponding file descriptor. On AF_UNIX sockets, the
+file descriptors must be passed as SCM_RIGHTS type ancillary data. Otherwise,
+if a DMA region cannot be directly mapped by the server, it can be accessed by
+the server using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, explained
+in `Read and Write Operations`_. A command to map over an existing region must
+be failed by the server with ``EEXIST`` set in the error field of the reply.
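As an illustration of the table entry layout and the size formula above, the following sketch packs two entries and counts the file descriptors that would accompany them. It assumes little-endian encoding (this section does not state a byte order), uses hypothetical helper names, and the protection value ``0x3`` assumes the Linux encoding of ``PROT_READ | PROT_WRITE``.

```python
import struct

HEADER_SIZE = 16                          # header size used in the size formula
DMA_MAP_ENTRY = struct.Struct("<QQQII")   # address, size, offset, protections, flags
DMA_FLAG_MAPPABLE = 1 << 0                # bit 0 of the Flags field

def pack_dma_map_table(regions):
    """Pack (address, size, offset, protections, flags) tuples into a table."""
    return b"".join(DMA_MAP_ENTRY.pack(*region) for region in regions)

def dma_map_message_size(num_entries):
    """Message size: 16 + (# of table entries * 32)."""
    return HEADER_SIZE + num_entries * DMA_MAP_ENTRY.size

def fds_needed(regions):
    """Each mappable entry must have a corresponding file descriptor."""
    return sum(1 for region in regions if region[4] & DMA_FLAG_MAPPABLE)

# Sample values chosen for illustration only.
regions = [
    (0x00000000, 0x100000, 0, 0x3, DMA_FLAG_MAPPABLE),  # server may mmap() this
    (0x40000000, 0x1000, 0, 0x3, 0),  # reached via VFIO_USER_DMA_READ/WRITE
]
table = pack_dma_map_table(regions)
```

Only the first sample region is mappable, so a single descriptor would be passed as ancillary data alongside an 80-byte message.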
+
+Adding multiple DMA regions can partially fail. The response does not indicate
+which regions were added and which were not, therefore it is a client
+implementation detail how to recover from the failure.
+
+.. Note::
+   The server can optionally remove successfully added DMA regions making this
+   operation atomic.
+   The client can recover by attempting to unmap one by one all the DMA regions
+   in the VFIO_USER_DMA_MAP command, ignoring failures for regions that do not
+   exist.
+
+
+
+VFIO_USER_DMA_UNMAP
+-------------------
+
+Message Format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 3                      |
++--------------+------------------------+
+| Message size | 16 + table size        |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Error        | 0/errno                |
++--------------+------------------------+
+| Table        | array of table entries |
++--------------+------------------------+
+
+This command message is sent by the client to the server to inform it that a
+DMA region, previously made available via a VFIO_USER_DMA_MAP command message,
+is no longer available for DMA. It typically occurs when memory is removed
+from the client or if the client uses a vIOMMU. If the client does not expect
+the server to perform DMA then it does not need to send VFIO_USER_DMA_UNMAP
+commands to the server. If the server does not need to perform DMA then it can
+ignore such commands but it must still reply to them.
The table is an
+array of the following structure:
+
+Table entry format
+^^^^^^^^^^^^^^^^^^
+
++--------------+--------+---------------------------------------+
+| Name         | Offset | Size                                  |
++==============+========+=======================================+
+| Address      | 0      | 8                                     |
++--------------+--------+---------------------------------------+
+| Size         | 8      | 8                                     |
++--------------+--------+---------------------------------------+
+| Offset       | 16     | 8                                     |
++--------------+--------+---------------------------------------+
+| Protections  | 24     | 4                                     |
++--------------+--------+---------------------------------------+
+| Flags        | 28     | 4                                     |
++--------------+--------+---------------------------------------+
+|              | +-----+--------------------------------------+ |
+|              | | Bit | Definition                           | |
+|              | +=====+======================================+ |
+|              | | 0   | VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP | |
+|              | +-----+--------------------------------------+ |
++--------------+--------+---------------------------------------+
+| VFIO Bitmaps | 32     | variable                              |
++--------------+--------+---------------------------------------+
+
+* *Address* is the base DMA address of the region.
+* *Size* is the size of the region.
+* *Offset* is the file offset of the region with respect to the associated file
+  descriptor.
+* *Protections* are the region's protection attributes as encoded in
+  ````.
+* *Flags* contains the following region attributes:
+
+  * *VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP* indicates that a dirty page bitmap
+    must be populated before unmapping the DMA region. The client must provide
+    a ``struct vfio_bitmap`` in the VFIO bitmaps field for each region, with
+    the ``vfio_bitmap.pgsize`` and ``vfio_bitmap.size`` fields initialized.
+
+* *VFIO Bitmaps* contains one ``struct vfio_bitmap`` per region if
+  ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is set in Flags.
+
+VFIO bitmap format
+^^^^^^^^^^^^^^^^^^
+
++--------+--------+------+
+| Name   | Offset | Size |
++========+========+======+
+| pgsize | 0      | 8    |
++--------+--------+------+
+| size   | 8      | 8    |
++--------+--------+------+
+| data   | 16     | 8    |
++--------+--------+------+
+
+* *pgsize* is the page size for the bitmap, in bytes.
+* *size* is the size for the bitmap, in bytes, excluding the VFIO bitmap header.
+* *data* This field is unused in vfio-user.
+
+The VFIO bitmap structure is defined in ````
+(``struct vfio_bitmap``).
+
+If ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is not set in Flags then the size
+of the message is: 16 + (# of table entries * 32).
+If ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is set in Flags then the size of
+the message is: 16 + (# of table entries * 56).
+
+Upon receiving a VFIO_USER_DMA_UNMAP command, if the file descriptor is mapped
+then the server must release all references to that DMA region before replying,
+which includes potentially in-flight DMA transactions. Removing a portion of a
+DMA region is possible. If the VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP bit is set
+in the request, the server must append to the header the ``struct vfio_bitmap``
+received in the command followed by the bitmap, for each region. Thus, the
+message size the client should expect is the size of the header plus the size
+of ``struct vfio_bitmap`` plus ``vfio_bitmap.size`` bytes for each region. Each
+bit in the bitmap represents one page of size ``vfio_bitmap.pgsize``.
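The size rules above can be checked with a short calculation. The sketch below is illustrative only: the helper names are hypothetical, the encoding is assumed to be little-endian, the bitmap length is taken to be the minimum number of whole bytes (an implementation may round differently), and the bit ordering within a byte (least significant bit first) is an assumption that this section does not specify.

```python
import struct

HEADER_SIZE = 16                      # message header
ENTRY_SIZE = 32                       # one table entry without a bitmap
VFIO_BITMAP = struct.Struct("<QQQ")   # pgsize, size, data (unused in vfio-user)

def unmap_command_size(num_entries, get_dirty_bitmap):
    """16 + entries * 32, or 16 + entries * 56 when bitmaps are requested."""
    per_entry = ENTRY_SIZE + (VFIO_BITMAP.size if get_dirty_bitmap else 0)
    return HEADER_SIZE + num_entries * per_entry

def bitmap_size(region_size, pgsize):
    """Bytes needed for one bit per page (assumed rounding: whole bytes)."""
    pages = -(-region_size // pgsize)   # ceiling division
    return -(-pages // 8)

def unmap_reply_size(bitmap_sizes):
    """Header plus, per region, a struct vfio_bitmap followed by its bitmap."""
    return HEADER_SIZE + sum(VFIO_BITMAP.size + size for size in bitmap_sizes)

def dirty_page_addresses(bitmap, pgsize, base):
    """Addresses of dirty pages, assuming bit 0 of byte 0 is the first page."""
    return [base + (byte_index * 8 + bit) * pgsize
            for byte_index, byte in enumerate(bitmap)
            for bit in range(8) if byte & (1 << bit)]
```

Under these assumptions, a 1 MiB region tracked at 4096-byte granularity needs a 32-byte bitmap, and a reply covering two such regions is 128 bytes long.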
+
+
+VFIO_USER_DEVICE_GET_INFO
+-------------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+----------------------------+
+| Name         | Value                      |
++==============+============================+
+| Message ID   |                            |
++--------------+----------------------------+
+| Command      | 4                          |
++--------------+----------------------------+
+| Message size | 16 in command, 32 in reply |
++--------------+----------------------------+
+| Flags        | Reply bit set in reply     |
++--------------+----------------------------+
+| Error        | 0/errno                    |
++--------------+----------------------------+
+| Device info  | VFIO device info           |
++--------------+----------------------------+
+
+This command message is sent by the client to the server to query for basic
+information about the device. Only the message header is needed in the command
+message. The VFIO device info structure is defined in ````
+(``struct vfio_device_info``).
+
+VFIO device info format
+^^^^^^^^^^^^^^^^^^^^^^^
+
++-------------+--------+--------------------------+
+| Name        | Offset | Size                     |
++=============+========+==========================+
+| argsz       | 16     | 4                        |
++-------------+--------+--------------------------+
+| flags       | 20     | 4                        |
++-------------+--------+--------------------------+
+|             | +-----+-------------------------+ |
+|             | | Bit | Definition              | |
+|             | +=====+=========================+ |
+|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
+|             | +-----+-------------------------+ |
+|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
+|             | +-----+-------------------------+ |
++-------------+--------+--------------------------+
+| num_regions | 24     | 4                        |
++-------------+--------+--------------------------+
+| num_irqs    | 28     | 4                        |
++-------------+--------+--------------------------+
+
+* *argsz* is the size of the VFIO device info structure.
+* *flags* contains the following device attributes.
+
+  * VFIO_DEVICE_FLAGS_RESET indicates that the device supports the
+    VFIO_USER_DEVICE_RESET message.
+  * VFIO_DEVICE_FLAGS_PCI indicates that the device is a PCI device.
+
+* *num_regions* is the number of memory regions that the device exposes.
+* *num_irqs* is the number of distinct interrupt types that the device supports.
+
+This version of the protocol only supports PCI devices. Additional devices may
+be supported in future versions.
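To make the reply layout concrete, the following sketch decodes the device info fields at the offsets listed in the table above. It is a hedged example: little-endian encoding is assumed, ``parse_device_info`` is a hypothetical helper, and the sample reply is fabricated with a zeroed header purely to exercise the parser.

```python
import struct

HEADER_SIZE = 16
DEVICE_INFO = struct.Struct("<IIII")   # argsz, flags, num_regions, num_irqs
VFIO_DEVICE_FLAGS_RESET = 1 << 0
VFIO_DEVICE_FLAGS_PCI = 1 << 1

def parse_device_info(reply):
    """Decode the device info that follows the 16-byte header in a reply."""
    argsz, flags, num_regions, num_irqs = DEVICE_INFO.unpack_from(reply, HEADER_SIZE)
    return {
        "argsz": argsz,
        "supports_reset": bool(flags & VFIO_DEVICE_FLAGS_RESET),
        "is_pci": bool(flags & VFIO_DEVICE_FLAGS_PCI),
        "num_regions": num_regions,
        "num_irqs": num_irqs,
    }

# Fabricated reply for illustration: zeroed header plus sample field values.
example_reply = bytes(HEADER_SIZE) + DEVICE_INFO.pack(16, 0b11, 9, 5)
info = parse_device_info(example_reply)
```

The sample reply is 32 bytes long, matching the reply size given in the message format table.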
+
+VFIO_USER_DEVICE_GET_REGION_INFO
+--------------------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 5                      |
++--------------+------------------------+
+| Message size | 48 + any caps          |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Error        | 0/errno                |
++--------------+------------------------+
+| Region info  | VFIO region info       |
++--------------+------------------------+
+
+This command message is sent by the client to the server to query for
+information about device memory regions. The VFIO region info structure is
+defined in ```` (``struct vfio_region_info``). Since the client
+does not know the size of the capabilities, the size of the reply it should
+expect is 48 plus any capabilities whose size is indicated in the size field of
+the reply header.
+
+VFIO region info format
+^^^^^^^^^^^^^^^^^^^^^^^
+
++------------+--------+------------------------------+
+| Name       | Offset | Size                         |
++============+========+==============================+
+| argsz      | 16     | 4                            |
++------------+--------+------------------------------+
+| flags      | 20     | 4                            |
++------------+--------+------------------------------+
+|            | +-----+-----------------------------+ |
+|            | | Bit | Definition                  | |
+|            | +=====+=============================+ |
+|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
+|            | +-----+-----------------------------+ |
+|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
+|            | +-----+-----------------------------+ |
+|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
+|            | +-----+-----------------------------+ |
+|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
+|            | +-----+-----------------------------+ |
++------------+--------+------------------------------+
+| index      | 24     | 4                            |
++------------+--------+------------------------------+
+| cap_offset | 28     | 4                            |
++------------+--------+------------------------------+
+| size       | 32     | 8                            |
++------------+--------+------------------------------+
+| offset     | 40     | 8                            |
++------------+--------+------------------------------+
+
+* *argsz* is the size of the VFIO region info structure plus the
+  size of any region capabilities returned.
+* *flags* are attributes of the region:
+
+  * *VFIO_REGION_INFO_FLAG_READ* allows client read access to the region.
+  * *VFIO_REGION_INFO_FLAG_WRITE* allows client write access to the region.
+  * *VFIO_REGION_INFO_FLAG_MMAP* specifies the client can mmap() the region.
+    When this flag is set, the reply will include a file descriptor in its
+    meta-data. On AF_UNIX sockets, the file descriptors will be passed as
+    SCM_RIGHTS type ancillary data.
+  * *VFIO_REGION_INFO_FLAG_CAPS* indicates additional capabilities found in the
+    reply.
+
+* *index* is the index of the memory region being queried; it is the only
+  field that is required to be set in the command message.
+* *cap_offset* describes where additional region capabilities can be found.
+  cap_offset is relative to the beginning of the VFIO region info structure.
+  The data structure it points to is a VFIO cap header defined in
+  ````.
+* *size* is the size of the region.
+* *offset* is the offset given to the mmap() system call for regions with the
+  MMAP attribute. It is also used as the base offset when mapping a VFIO
+  sparse mmap area, described below.
+
+VFIO Region capabilities
+^^^^^^^^^^^^^^^^^^^^^^^^
+The VFIO region information can also include a capabilities list. This list is
+similar to a PCI capability list - each entry has a common header that
+identifies a capability and where the next capability in the list can be found.
+The VFIO capability header format is defined in ```` (``struct
+vfio_info_cap_header``).
+
+VFIO cap header format
+^^^^^^^^^^^^^^^^^^^^^^
+
++---------+--------+------+
+| Name    | Offset | Size |
++=========+========+======+
+| id      | 0      | 2    |
++---------+--------+------+
+| version | 2      | 2    |
++---------+--------+------+
+| next    | 4      | 4    |
++---------+--------+------+
+
+* *id* is the capability identity.
+* *version* is a capability-specific version number.
+* *next* specifies the offset of the next capability in the capability list. It
+  is relative to the beginning of the VFIO region info structure.
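Walking the capability list described above could look like the sketch below. Several details are assumptions rather than statements from this section: the encoding is taken to be little-endian, a *next* value of 0 is taken to terminate the list, and ``iter_region_caps`` is a hypothetical helper that guards against a malformed chain looping back on itself.

```python
import struct

HEADER_SIZE = 16
REGION_INFO = struct.Struct("<IIIIQQ")  # argsz, flags, index, cap_offset, size, offset
CAP_HEADER = struct.Struct("<HHI")      # id, version, next
VFIO_REGION_INFO_FLAG_CAPS = 1 << 3

def iter_region_caps(reply):
    """Yield (id, version, position) for each capability in a reply.

    Positions are relative to the start of the region info structure,
    which begins right after the 16-byte message header.
    """
    info = reply[HEADER_SIZE:]
    argsz, flags, index, cap_offset, size, offset = REGION_INFO.unpack_from(info, 0)
    if not flags & VFIO_REGION_INFO_FLAG_CAPS:
        return
    seen = set()
    position = cap_offset
    # Assumption: next == 0 marks the end of the list.
    while position and position not in seen:
        seen.add(position)
        cap_id, version, next_offset = CAP_HEADER.unpack_from(info, position)
        yield cap_id, version, position
        position = next_offset
```

A caller would typically dispatch on the yielded *id* and then parse the capability-specific data that follows the 8-byte cap header at the yielded position.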
+
+VFIO sparse mmap
+^^^^^^^^^^^^^^^^
+
++------------------+----------------------------------+
+| Name             | Value                            |
++==================+==================================+
+| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
++------------------+----------------------------------+
+| version          | 0x1                              |
++------------------+----------------------------------+
+| next             |                                  |
++------------------+----------------------------------+
+| sparse mmap info | VFIO region info sparse mmap     |
++------------------+----------------------------------+
+
+This capability is defined when only a subrange of the region supports
+direct access by the client via mmap(). The VFIO sparse mmap area is defined in
+```` (``struct vfio_region_sparse_mmap_area``).
+
+VFIO region info cap sparse mmap
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++----------+--------+------+
+| Name     | Offset | Size |
++==========+========+======+
+| nr_areas | 0      | 4    |
++----------+--------+------+
+| reserved | 4      | 4    |
++----------+--------+------+
+| offset   | 8      | 8    |
++----------+--------+------+
+| size     | 16     | 8    |
++----------+--------+------+
+| ...      |        |      |
++----------+--------+------+
+
+* *nr_areas* is the number of sparse mmap areas in the region.
+* *offset* and *size* describe a single area that can be mapped by the client.
+  There will be nr_areas pairs of offset and size. The offset will be added to
+  the base offset given in the VFIO_USER_DEVICE_GET_REGION_INFO to form the
+  offset argument of the subsequent mmap() call.
+
+The VFIO region info cap sparse mmap structure is defined in ````
+(``struct vfio_region_info_cap_sparse_mmap``).
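The offset arithmetic for sparse areas can be sketched as follows. This is an illustration under stated assumptions: little-endian encoding, a buffer that starts at *nr_areas* (offset 0 in the table above, that is, after the cap header), and a hypothetical helper name ``sparse_mmap_ranges``.

```python
import struct

SPARSE_HEADER = struct.Struct("<II")   # nr_areas, reserved
SPARSE_AREA = struct.Struct("<QQ")     # offset, size

def sparse_mmap_ranges(body, region_offset):
    """Return (mmap_offset, length) pairs for each sparse area.

    body starts at nr_areas; region_offset is the offset field returned by
    VFIO_USER_DEVICE_GET_REGION_INFO for the same region.
    """
    nr_areas, _reserved = SPARSE_HEADER.unpack_from(body, 0)
    ranges = []
    for i in range(nr_areas):
        area_offset, area_size = SPARSE_AREA.unpack_from(
            body, SPARSE_HEADER.size + i * SPARSE_AREA.size)
        # The area offset is added to the region's base offset to form the
        # offset argument of the subsequent mmap() call.
        ranges.append((region_offset + area_offset, area_size))
    return ranges
```

Each returned pair can be passed as the offset and length arguments of a mapping call on the file descriptor received with the region info reply.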
+
+VFIO Region Type
+^^^^^^^^^^^^^^^^
+
++------------------+---------------------------+
+| Name             | Value                     |
++==================+===========================+
+| id               | VFIO_REGION_INFO_CAP_TYPE |
++------------------+---------------------------+
+| version          | 0x1                       |
++------------------+---------------------------+
+| next             |                           |
++------------------+---------------------------+
+| region info type | VFIO region info type     |
++------------------+---------------------------+
+
+This capability is defined when a region is specific to the device.
+
+VFIO region info type
+^^^^^^^^^^^^^^^^^^^^^
+
+The VFIO region info type is defined in ````
+(``struct vfio_region_info_cap_type``).
+
++---------+--------+------+
+| Name    | Offset | Size |
++=========+========+======+
+| type    | 0      | 4    |
++---------+--------+------+
+| subtype | 4      | 4    |
++---------+--------+------+
+
+The only device-specific region type and subtype supported by vfio-user are
+VFIO_REGION_TYPE_MIGRATION (3) and VFIO_REGION_SUBTYPE_MIGRATION (1).
+
+VFIO Device Migration Info
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The beginning of the subregion must contain
+``struct vfio_device_migration_info``, defined in ````. This
+subregion is accessed like any other part of a standard vfio-user PCI region
+using VFIO_USER_REGION_READ/VFIO_USER_REGION_WRITE.
+
++---------------+--------+-----------------------------+
+| Name          | Offset | Size                        |
++===============+========+=============================+
+| device_state  | 0      | 4                           |
++---------------+--------+-----------------------------+
+|               | +-----+----------------------------+ |
+|               | | Bit | Definition                 | |
+|               | +=====+============================+ |
+|               | | 0   | VFIO_DEVICE_STATE_RUNNING  | |
+|               | +-----+----------------------------+ |
+|               | | 1   | VFIO_DEVICE_STATE_SAVING   | |
+|               | +-----+----------------------------+ |
+|               | | 2   | VFIO_DEVICE_STATE_RESUMING | |
+|               | +-----+----------------------------+ |
++---------------+--------+-----------------------------+
+| reserved      | 4      | 4                           |
++---------------+--------+-----------------------------+
+| pending_bytes | 8      | 8                           |
++---------------+--------+-----------------------------+
+| data_offset   | 16     | 8                           |
++---------------+--------+-----------------------------+
+| data_size     | 24     | 8                           |
++---------------+--------+-----------------------------+
+
+* *device_state* defines the state of the device:
+
+  The client initiates device state transition by writing the intended state.
+  The server must respond only after it has successfully transitioned to the new
+  state. If an error occurs then the server must respond to the
+  VFIO_USER_REGION_WRITE operation with the Error field set accordingly and
+  must remain at the previous state, or in case of internal error it must
+  transition to the error state, defined as
+  VFIO_DEVICE_STATE_RESUMING | VFIO_DEVICE_STATE_SAVING. The client must
+  re-read the device state in order to determine it afresh.
+
+  The following device states are defined:
+
+  +-----------+---------+----------+-----------------------------------+
+  | _RESUMING | _SAVING | _RUNNING | Description                       |
+  +===========+=========+==========+===================================+
+  | 0         | 0       | 0        | Device is stopped.                |
+  +-----------+---------+----------+-----------------------------------+
+  | 0         | 0       | 1        | Device is running, default state. |
+  +-----------+---------+----------+-----------------------------------+
+  | 0         | 1       | 0        | Stop-and-copy state               |
+  +-----------+---------+----------+-----------------------------------+
+  | 0         | 1       | 1        | Pre-copy state                    |
+  +-----------+---------+----------+-----------------------------------+
+  | 1         | 0       | 0        | Resuming                          |
+  +-----------+---------+----------+-----------------------------------+
+  | 1         | 0       | 1        | Invalid state                     |
+  +-----------+---------+----------+-----------------------------------+
+  | 1         | 1       | 0        | Error state                       |
+  +-----------+---------+----------+-----------------------------------+
+  | 1         | 1       | 1        | Invalid state                     |
+  +-----------+---------+----------+-----------------------------------+
+
+  Valid state transitions are shown in the following table:
+
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | |darr| From / To |rarr| | Stopped | Running | Stop-and-copy | Pre-copy | Resuming |
+  +=========================+=========+=========+===============+==========+==========+
+  | Stopped                 | \-      | 0       | 0             | 0        | 0        |
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | Running                 | 1       | \-      | 1             | 1        | 1        |
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | 
Stop-and-copy           | 1       | 0       | \-            | 0        | 0        |
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | Pre-copy                | 0       | 0       | 1             | \-       | 0        |
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | Resuming                | 0       | 1       | 0             | 0        | \-       |
+  +-------------------------+---------+---------+---------------+----------+----------+
+
+  A device is migrated to the destination as follows:
+
+  * The source client transitions the device state from the running state to
+    the pre-copy state. This transition is optional for the client but must be
+    supported by the server. The source server starts sending device state data
+    to the source client through the migration region while the device is
+    running.
+
+  * The source client transitions the device state from the running state or the
+    pre-copy state to the stop-and-copy state. The source server stops the
+    device, saves device state and sends it to the source client through the
+    migration region.
+
+  The source client is responsible for sending the migration data to the
+  destination client.
+
+  A device is resumed on the destination as follows:
+
+  * The destination client transitions the device state from the running state
+    to the resuming state. The destination server uses the device state data
+    received through the migration region to resume the device.
+
+  * The destination client provides saved device state to the destination
+    server and then transitions the device back to the running state.
+
+* *reserved* This field is reserved and any access to it must be ignored by the
+  server.
+
+* *pending_bytes* Remaining bytes to be migrated by the server. This field is
+  read only.
+
+* *data_offset* Offset in the migration region where the client must:
+
+  * read from, during the pre-copy or stop-and-copy state, or
+
+  * write to, during the resuming state.
+
+  This field is read only.
+
+* *data_size* Contains the size, in bytes, of the amount of data copied to:
+
+  * the source migration region by the source server during the pre-copy or
+    stop-and-copy state, or
+
+  * the destination migration region by the destination client during the
+    resuming state.
+
+Device-specific data must be stored at any position after
+``struct vfio_device_migration_info``. Note that the migration region can be
+memory mappable, even partially. In practice, only the migration data portion
+can be memory mapped.
+
+The client processes device state data during the pre-copy and the
+stop-and-copy state in the following iterative manner:
+
+  1. The client reads ``pending_bytes`` to mark a new iteration. Repeated reads
+     of this field are idempotent. If there is no migration data to be
+     consumed then the next step depends on the current device state:
+
+     * pre-copy: the client must try again.
+
+     * stop-and-copy: this procedure can end and the device can now start
+       resuming on the destination.
+
+  2. The client reads ``data_offset``, at which point the server must make
+     available a portion of migration data at this offset to be read by the
+     client, which must happen *before* completing the read operation. The
+     amount of data to be read must be stored in the ``data_size`` field, which
+     the client reads next.
+
+  3. The client reads ``data_size`` to determine the amount of migration data
+     available.
+
+  4. The client reads and processes the migration data.
+
+  5. Go to step 1.
+
+Note that the client can transition the device from the pre-copy state to the
+stop-and-copy state at any time; ``pending_bytes`` does not need to become zero.
+
+The client initializes the device state on the destination by setting the
+device state in the resuming state and writing the migration data to the
+destination migration region at ``data_offset`` offset.
The client can write the
+source migration data in an iterative manner and the server must consume this
+data before completing each write operation, updating the ``data_offset`` field.
+The server must apply the source migration data on the device resume state. The
+client must write data in the same order and transaction size as read.
+
+If an error occurs then the server must fail the read or write operation. It is
+an implementation detail of the client how to handle errors.
+
+VFIO_USER_DEVICE_GET_IRQ_INFO
+-----------------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 6                      |
++--------------+------------------------+
+| Message size | 32                     |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Error        | 0/errno                |
++--------------+------------------------+
+| IRQ info     | VFIO IRQ info          |
++--------------+------------------------+
+
+This command message is sent by the client to the server to query for
+information about device interrupt types. The VFIO IRQ info structure is
+defined in ```` (``struct vfio_irq_info``).
+
+VFIO IRQ info format
+^^^^^^^^^^^^^^^^^^^^
+
++-------+--------+---------------------------+
+| Name  | Offset | Size                      |
++=======+========+===========================+
+| argsz | 16     | 4                         |
++-------+--------+---------------------------+
+| flags | 20     | 4                         |
++-------+--------+---------------------------+
+|       | +-----+--------------------------+ |
+|       | | Bit | Definition               | |
+|       | +=====+==========================+ |
+|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
+|       | +-----+--------------------------+ |
+|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
+|       | +-----+--------------------------+ |
+|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
+|       | +-----+--------------------------+ |
+|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
+|       | +-----+--------------------------+ |
++-------+--------+---------------------------+
+| index | 24     | 4                         |
++-------+--------+---------------------------+
+| count | 28     | 4                         |
++-------+--------+---------------------------+
+
+* *argsz* is the size of the VFIO IRQ info structure.
+* *flags* defines IRQ attributes:
+
+  * *VFIO_IRQ_INFO_EVENTFD* indicates the IRQ type can support server eventfd
+    signalling.
+  * *VFIO_IRQ_INFO_MASKABLE* indicates that the IRQ type supports the MASK and
+    UNMASK actions in a VFIO_USER_DEVICE_SET_IRQS message.
+  * *VFIO_IRQ_INFO_AUTOMASKED* indicates the IRQ type masks itself after being
+    triggered, and the client must send an UNMASK action to receive new
+    interrupts.
+  * *VFIO_IRQ_INFO_NORESIZE* indicates VFIO_USER_DEVICE_SET_IRQS operations set
+    up interrupts as a set, and new sub-indexes cannot be enabled without
+    disabling the entire type.
+
+* *index* is the index of the IRQ type being queried; it is the only field
+  that is required to be set in the command message.
+* *count* describes the number of interrupts of the queried type.
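A query and its reply for one IRQ type can be sketched as below. The sketch assumes little-endian encoding and uses hypothetical helper names; filling in *argsz* in the command is a choice made here, since the text above only requires *index* to be set.

```python
import struct

HEADER_SIZE = 16
IRQ_INFO = struct.Struct("<IIII")      # argsz, flags, index, count
IRQ_INFO_FLAGS = {
    0: "VFIO_IRQ_INFO_EVENTFD",
    1: "VFIO_IRQ_INFO_MASKABLE",
    2: "VFIO_IRQ_INFO_AUTOMASKED",
    3: "VFIO_IRQ_INFO_NORESIZE",
}

def irq_info_query(index):
    """Build the IRQ info payload for a command; only index is required."""
    return IRQ_INFO.pack(IRQ_INFO.size, 0, index, 0)

def parse_irq_info(reply):
    """Decode index, count and the set of flag names from a reply."""
    argsz, flags, index, count = IRQ_INFO.unpack_from(reply, HEADER_SIZE)
    names = {name for bit, name in IRQ_INFO_FLAGS.items() if flags & (1 << bit)}
    return index, count, names
```

The payload is 16 bytes, so together with the 16-byte header the message is 32 bytes long, as stated in the message format table.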
+
+VFIO_USER_DEVICE_SET_IRQS
+-------------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 7                      |
++--------------+------------------------+
+| Message size | 36 + any data          |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Error        | 0/errno                |
++--------------+------------------------+
+| IRQ set      | VFIO IRQ set           |
++--------------+------------------------+
+
+This command message is sent by the client to the server to set actions for
+device interrupt types. The VFIO IRQ set structure is defined in
+```` (``struct vfio_irq_set``).
+
+VFIO IRQ set format
+^^^^^^^^^^^^^^^^^^^
+
++-------+--------+------------------------------+
+| Name  | Offset | Size                         |
++=======+========+==============================+
+| argsz | 16     | 4                            |
++-------+--------+------------------------------+
+| flags | 20     | 4                            |
++-------+--------+------------------------------+
+|       | +-----+-----------------------------+ |
+|       | | Bit | Definition                  | |
+|       | +=====+=============================+ |
+|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
+|       | +-----+-----------------------------+ |
+|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
+|       | +-----+-----------------------------+ |
+|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
+|       | +-----+-----------------------------+ |
+|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
+|       | +-----+-----------------------------+ |
+|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
+|       | +-----+-----------------------------+ |
+|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
+|       | +-----+-----------------------------+ |
++-------+--------+------------------------------+
+| index | 24     | 4                            |
++-------+--------+------------------------------+
+| start | 28     | 4                            |
++-------+--------+------------------------------+
+| count | 32     | 4                            |
++-------+--------+------------------------------+
+| data  | 36     | variable                     |
++-------+--------+------------------------------+
+
+* *argsz* is the size of the VFIO IRQ set structure, including any *data* field.
+* *flags* defines the action performed on the interrupt range. The DATA flags
+  describe the data field sent in the message; the ACTION flags describe the
+  action to be performed. The flags are mutually exclusive for both sets.
+
+  * *VFIO_IRQ_SET_DATA_NONE* indicates there is no data field in the command.
+    The action is performed unconditionally.
+  * *VFIO_IRQ_SET_DATA_BOOL* indicates the data field is an array of boolean
+    bytes. The action is performed if the corresponding boolean is true.
+  * *VFIO_IRQ_SET_DATA_EVENTFD* indicates an array of event file descriptors
+    was sent in the message meta-data. These descriptors will be signalled when
+    the action defined by the action flags occurs. On AF_UNIX sockets, the
+    descriptors are sent as SCM_RIGHTS type ancillary data.
+  * *VFIO_IRQ_SET_ACTION_MASK* indicates a masking event. It can be used with
+    VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to mask an interrupt, or
+    with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the guest masks
+    the interrupt.
+  * *VFIO_IRQ_SET_ACTION_UNMASK* indicates an unmasking event. It can be used
+    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to unmask an
+    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the
+    guest unmasks the interrupt.
+  * *VFIO_IRQ_SET_ACTION_TRIGGER* indicates a triggering event. It can be used
+    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to trigger an
+    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the
+    server triggers the interrupt.
+
+* *index* is the index of the IRQ type being set up.
+* *start* is the first sub-index being set.
+* *count* describes the number of sub-indexes being set. As a special case, a
+  count of 0 with data flags of VFIO_IRQ_SET_DATA_NONE disables all interrupts
+  of the index.
+* *data* is an optional field included when the
+  VFIO_IRQ_SET_DATA_BOOL flag is present. It contains an array of booleans
+  that specify whether the action is to be performed on the corresponding
+  index. It's used when the action is only performed on a subset of the range
+  specified.
+
+Not all interrupt types support every combination of data and action flags.
+The client must know the capabilities of the device and IRQ index before it
+sends a VFIO_USER_DEVICE_SET_IRQS message.
+
+.. _Read and Write Operations:
+
+Read and Write Operations
+-------------------------
+
+Not all I/O operations between the client and server can be done via direct
+access of memory mapped with an mmap() call. In these cases, the client and
+server use messages sent over the socket. It is expected that these operations
+will have lower performance than direct access.
+
+The client can access server memory with VFIO_USER_REGION_READ and
+VFIO_USER_REGION_WRITE commands. These share a common data structure that
+appears after the message header.
+
+REGION Read/Write Data
+^^^^^^^^^^^^^^^^^^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| Offset | 16     | 8        |
++--------+--------+----------+
+| Region | 24     | 4        |
++--------+--------+----------+
+| Count  | 28     | 4        |
++--------+--------+----------+
+| Data   | 32     | variable |
++--------+--------+----------+
+
+* *Offset* is the offset into the region being accessed.
+* *Region* is the index of the region being accessed.
+* *Count* is the size of the data to be transferred.
+* *Data* is the data to be read or written.
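The REGION read/write data layout above can be sketched with Python's `struct` module. This is only an illustration, not part of the specification: the little-endian byte order and the helper names `pack_region_rw` and `unpack_region_rw` are assumptions made for the example. Offsets in the table are relative to the start of the message, so the 16-byte header precedes these fields.

```python
import struct

HEADER_SIZE = 16  # implied by the first field sitting at offset 16

# Offset (8 bytes), Region (4 bytes), Count (4 bytes).
# "<" (little-endian, no padding) is an assumption for illustration.
REGION_RW = struct.Struct("<QII")


def pack_region_rw(offset, region, count, data=b""):
    """Return the bytes that follow the message header (hypothetical helper)."""
    return REGION_RW.pack(offset, region, count) + data


def unpack_region_rw(payload):
    """Split a payload into (offset, region, count, data) (hypothetical helper)."""
    offset, region, count = REGION_RW.unpack_from(payload)
    return offset, region, count, payload[REGION_RW.size:]


# A 4-byte transfer at offset 0x1000 of region 2.
payload = pack_region_rw(0x1000, 2, 4, b"\xde\xad\xbe\xef")
```

The fixed part is 16 bytes, which together with the header places the data at offset 32, matching the table.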
+
+The server can access client memory with VFIO_USER_DMA_READ and
+VFIO_USER_DMA_WRITE messages. These also share a common data structure that
+appears after the message header.
+
+DMA Read/Write Data
+^^^^^^^^^^^^^^^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| Address | 16     | 8        |
++---------+--------+----------+
+| Count   | 24     | 4        |
++---------+--------+----------+
+| Data    | 28     | variable |
++---------+--------+----------+
+
+* *Address* is the area of client memory being accessed. This address must have
+  been previously exported to the server with a VFIO_USER_DMA_MAP message.
+* *Count* is the size of the data to be transferred.
+* *Data* is the data to be read or written.
+
+VFIO_USER_REGION_READ
+---------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 8                      |
++--------------+------------------------+
+| Message size | 32 + data size         |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Error        | 0/errno                |
++--------------+------------------------+
+| Read info    | REGION read/write data |
++--------------+------------------------+
+
+This command message is sent from the client to the server to read from server
+memory. In the command message, there is no data, and the count is the amount
+of data to be read. The reply message must include the data read, and its count
+field is the amount of data read.
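As a rough sketch of the VFIO_USER_REGION_READ exchange described above: the command carries no data and its count field states how much to read, while the reply carries the data and the amount actually read. The message header is treated as 16 opaque bytes because its layout is not defined in this section. The function names and the little-endian packing are assumptions made for the example.

```python
import struct

HEADER_SIZE = 16                   # opaque here; layout is defined elsewhere
REGION_RW = struct.Struct("<QII")  # Offset, Region, Count (assumed little-endian)
VFIO_USER_REGION_READ = 8          # command number from the table above


def region_read_command_body(offset, region, count):
    """Body of a read command: no data, count = bytes requested."""
    return REGION_RW.pack(offset, region, count)


def region_read_reply_body(offset, region, data):
    """Body of a read reply: count = bytes actually read, then the data."""
    return REGION_RW.pack(offset, region, len(data)) + data


def message_size(body):
    """Value of the message size field: header plus body."""
    return HEADER_SIZE + len(body)


# Request 8 bytes at offset 0x40 of region 0, and a matching reply.
cmd = region_read_command_body(0x40, 0, 8)
reply = region_read_reply_body(0x40, 0, b"\x00" * 8)
```

With these inputs the command has a message size of 32 (no data) and the reply has a message size of 40 (32 plus 8 bytes of data).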
+
+VFIO_USER_REGION_WRITE
+----------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 9                      |
++--------------+------------------------+
+| Message size | 32 + data size         |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Error        | 0/errno                |
++--------------+------------------------+
+| Write info   | REGION read/write data |
++--------------+------------------------+
+
+This command message is sent from the client to the server to write to server
+memory. The command message must contain the data to be written, and its count
+field must contain the amount of write data. The count field in the reply
+message must be zero.
+
+VFIO_USER_DMA_READ
+------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 10                     |
++--------------+------------------------+
+| Message size | 28 + data size         |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Error        | 0/errno                |
++--------------+------------------------+
+| DMA info     | DMA read/write data    |
++--------------+------------------------+
+
+This command message is sent from the server to the client to read from client
+memory. In the command message, there is no data, and the count must be
+the amount of data to be read. The reply message must include the data read,
+and its count field must be the amount of data read.
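The DMA read/write data used by VFIO_USER_DMA_READ can be sketched in the same way. Note that the data starts at offset 28, so the fixed part is 12 bytes with no padding after the 4-byte count. The helper names and the little-endian byte order below are assumptions for illustration only.

```python
import struct

HEADER_SIZE = 16
# Address (8 bytes), Count (4 bytes); "<" disables padding, giving 12 bytes.
DMA_RW = struct.Struct("<QI")


def dma_read_command_body(address, count):
    """Body of a DMA read command: no data, count = bytes requested."""
    return DMA_RW.pack(address, count)


def dma_read_reply_body(address, data):
    """Body of a DMA read reply: count = bytes read, followed by the data."""
    return DMA_RW.pack(address, len(data)) + data


def parse_dma_body(body):
    """Return (address, count, data); data is empty for a read command."""
    address, count = DMA_RW.unpack_from(body)
    data = body[DMA_RW.size:]
    if len(data) not in (0, count):
        raise ValueError("data length does not match count")
    return address, count, data


cmd = dma_read_command_body(0x2000, 16)
reply = dma_read_reply_body(0x2000, b"abcd")
```

Using an explicitly packed format matters here: a layout that padded the structure to 16 bytes would place the data at offset 32 instead of 28.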
+
+VFIO_USER_DMA_WRITE
+-------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 11                     |
++--------------+------------------------+
+| Message size | 28 + data size         |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Error        | 0/errno                |
++--------------+------------------------+
+| DMA info     | DMA read/write data    |
++--------------+------------------------+
+
+This command message is sent from the server to the client to write to client
+memory. The command message must contain the data to be written, and its count
+field must contain the amount of write data. The count field in the reply
+message must be zero.
+
+VFIO_USER_VM_INTERRUPT
+----------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++----------------+------------------------+
+| Name           | Value                  |
++================+========================+
+| Message ID     |                        |
++----------------+------------------------+
+| Command        | 12                     |
++----------------+------------------------+
+| Message size   | 20                     |
++----------------+------------------------+
+| Flags          | Reply bit set in reply |
++----------------+------------------------+
+| Error          | 0/errno                |
++----------------+------------------------+
+| Interrupt info |                        |
++----------------+------------------------+
+
+This command message is sent from the server to the client to signal the device
+has raised an interrupt.
+
+Interrupt info format
+^^^^^^^^^^^^^^^^^^^^^
+
++-----------+--------+------+
+| Name      | Offset | Size |
++===========+========+======+
+| Sub-index | 16     | 4    |
++-----------+--------+------+
+
+* *Sub-index* is relative to the IRQ index, e.g., the vector number used in PCI
+  MSI/X type interrupts.
+
+VFIO_USER_DEVICE_RESET
+----------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Message ID   |                        |
++--------------+------------------------+
+| Command      | 13                     |
++--------------+------------------------+
+| Message size | 16                     |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Error        | 0/errno                |
++--------------+------------------------+
+
+This command message is sent from the client to the server to reset the device.
+
+VFIO_USER_DIRTY_PAGES
+---------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------------+------------------------+
+| Name               | Value                  |
++====================+========================+
+| Message ID         |                        |
++--------------------+------------------------+
+| Command            | 14                     |
++--------------------+------------------------+
+| Message size       | 16                     |
++--------------------+------------------------+
+| Flags              | Reply bit set in reply |
++--------------------+------------------------+
+| Error              | 0/errno                |
++--------------------+------------------------+
+| VFIO Dirty bitmap  |                        |
++--------------------+------------------------+
+
+This command is analogous to VFIO_IOMMU_DIRTY_PAGES. It is sent by the client
+to the server in order to control logging of dirty pages, usually during a live
+migration.
+The VFIO dirty bitmap structure is defined in ``<linux/vfio.h>``
+(``struct vfio_iommu_type1_dirty_bitmap``).
+
+VFIO Dirty Bitmap Format
+^^^^^^^^^^^^^^^^^^^^^^^^
+
++-------+--------+-----------------------------------------+
+| Name  | Offset | Size                                    |
++=======+========+=========================================+
+| argsz | 0      | 4                                       |
++-------+--------+-----------------------------------------+
+| flags | 4      | 4                                       |
++-------+--------+-----------------------------------------+
+|       | +-----+----------------------------------------+ |
+|       | | Bit | Definition                             | |
+|       | +=====+========================================+ |
+|       | | 0   | VFIO_IOMMU_DIRTY_PAGES_FLAG_START      | |
+|       | +-----+----------------------------------------+ |
+|       | | 1   | VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP       | |
+|       | +-----+----------------------------------------+ |
+|       | | 2   | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | |
+|       | +-----+----------------------------------------+ |
++-------+--------+-----------------------------------------+
+| data  | 8      | 4                                       |
++-------+--------+-----------------------------------------+
+
+* *argsz* is the size of the VFIO dirty bitmap info structure.
+
+* *flags* defines the action to be performed by the server:
+
+  * *VFIO_IOMMU_DIRTY_PAGES_FLAG_START* instructs the server to start logging
+    pages it dirties. Logging continues until explicitly disabled by
+    VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP.
+
+  * *VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP* instructs the server to stop logging
+    dirty pages.
+
+  * *VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP* requests that the server return
+    the dirty bitmap for a specific IOVA range. The IOVA range is specified by
+    the "VFIO dirty bitmap get" structure, which must immediately follow the
+    "VFIO dirty bitmap" structure, explained next.
+    This operation is only valid
+    if logging of dirty pages has been previously started. The server must
+    respond the same way it does for VFIO_USER_DMA_UNMAP (the dirty pages
+    bitmap must follow the response header).
+
+  These flags are mutually exclusive with each other.
+
+* *data* is unused in vfio-user.
+
+VFIO Dirty Bitmap Get Format
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++--------+--------+------+
+| Name   | Offset | Size |
++========+========+======+
+| iova   | 0      | 8    |
++--------+--------+------+
+| size   | 8      | 8    |
++--------+--------+------+
+| bitmap | 16     | 24   |
++--------+--------+------+
+
+* *iova* is the IOVA offset.
+
+* *size* is the size of the IOVA region.
+
+* *bitmap* is the VFIO bitmap (``struct vfio_bitmap``), with the same semantics
+  as VFIO_USER_DMA_UNMAP.
+
+
+Appendices
+==========
+
+Unused VFIO ioctl() commands
+----------------------------
+
+The following VFIO commands do not have an equivalent vfio-user command:
+
+* VFIO_GET_API_VERSION
+* VFIO_CHECK_EXTENSION
+* VFIO_SET_IOMMU
+* VFIO_GROUP_GET_STATUS
+* VFIO_GROUP_SET_CONTAINER
+* VFIO_GROUP_UNSET_CONTAINER
+* VFIO_GROUP_GET_DEVICE_FD
+* VFIO_IOMMU_GET_INFO
+
+However, once support for live migration for VFIO devices is finalized, some
+of the above commands may have to be handled by the client in their
+corresponding vfio-user form. This will be addressed in a future protocol
+version.
+
+VFIO groups and containers
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The current VFIO implementation includes group and container idioms that
+describe how a device relates to the host IOMMU. In the vfio-user
+implementation, the IOMMU is implemented in software by the client, and is not
+visible to the server. The simplest approach would be for the client to put
+each device into its own group and container.
+
+Backend Program Conventions
+---------------------------
+
+vfio-user backend program conventions are based on the vhost-user ones.
+
+* The backend program must not daemonize itself.
+* No assumptions must be made as to what access the backend program has on the
+  system.
+* File descriptors 0, 1 and 2 must exist, must have regular
+  stdin/stdout/stderr semantics, and can be redirected.
+* The backend program must honor the SIGTERM signal.
+* The backend program must accept the following command line options:
+
+  * ``--socket-path=PATH``: path to UNIX domain socket,
+  * ``--fd=FDNUM``: file descriptor for UNIX domain socket, incompatible with
+    ``--socket-path``
+
+* The backend program must be accompanied by a JSON file stored under
+  ``/usr/share/vfio-user``.
-- 
2.12.2