[PATCH] introduce VFIO-over-socket protocol specificaion

Thanos Makatos posted 1 patch 4 years, 2 months ago
Test checkpatch failed
Test docker-mingw@fedora failed
Test FreeBSD passed
Test docker-quick@centos7 passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/1594913503-52271-1-git-send-email-thanos.makatos@nutanix.com
There is a newer version of this series
docs/devel/vfio-over-socket.rst | 1135 +++++++++++++++++++++++++++++++++++++++
1 files changed, 1135 insertions(+), 0 deletions(-)
create mode 100644 docs/devel/vfio-over-socket.rst
[PATCH] introduce VFIO-over-socket protocol specificaion
Posted by Thanos Makatos 4 years, 2 months ago
This patch introduces the VFIO-over-socket protocol specification, which
is designed to allow devices to be emulated outside QEMU, in a separate
process. VFIO-over-socket reuses the existing VFIO defines, structs and
concepts.

It has been earlier discussed as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
---
 docs/devel/vfio-over-socket.rst | 1135 +++++++++++++++++++++++++++++++++++++++
 1 files changed, 1135 insertions(+), 0 deletions(-)
 create mode 100644 docs/devel/vfio-over-socket.rst

diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-socket.rst
new file mode 100644
index 0000000..723b944
--- /dev/null
+++ b/docs/devel/vfio-over-socket.rst
@@ -0,0 +1,1135 @@
+***************************************
+VFIO-over-socket Protocol Specification
+***************************************
+
+Version 0.1
+
+Introduction
+============
+VFIO-over-socket, also known as vfio-user, is a protocol that allows a device
+to be virtualized in a separate process outside of QEMU. VFIO-over-socket
+devices consist of a generic VFIO device type, living inside QEMU, which we
+call the client, and the core device implementation, living outside QEMU, which
+we call the server. VFIO-over-socket can be the main transport mechanism for
+multi-process QEMU, however it can be used by other applications offering
+device virtualization. Explaining the advantages of a
+disaggregated/multi-process QEMU, and device virtualization outside QEMU in
+general, is beyond the scope of this document.
+
+This document focuses on specifying the VFIO-over-socket protocol. VFIO has
+been chosen for the following reasons:
+
+1) It is a mature and stable API, backed by an extensively used framework.
+2) The existing VFIO client implementation (qemu/hw/vfio/) can be largely
+   reused.
+
+In a proof of concept implementation it has been demonstrated that using VFIO
+over a UNIX domain socket is a viable option. VFIO-over-socket is designed with
+QEMU in mind, however it could be used by other client applications. The
+VFIO-over-socket protocol does not require that QEMU's VFIO client
+implementation is used in QEMU. None of the VFIO kernel modules are required
+for supporting the protocol, neither in the client nor the server, only the
+source header files are used.
+
+The main idea is to allow a virtual device to function in a separate process in
+the same host over a UNIX domain socket. A UNIX domain socket (AF_UNIX) is
+chosen because we can trivially send file descriptors over it, which in turn
+allows:
+
+* Sharing of guest memory for DMA with the virtual device process.
+* Sharing of virtual device memory with the guest for fast MMIO.
+* Efficient sharing of eventfd's for triggering interrupts.
+
+However, other socket types could be used which allows the virtual device
+process to run in a separate guest in the same host (AF_VSOCK) or remotely
+(AF_INET). Theoretically the underlying transport doesn't necessarily have to
+be a socket, however we don't examine such alternatives. In this document we
+focus on using a UNIX domain socket and introduce basic support for the other
+two types of sockets without considering performance implications.
+
+This document does not yet describe any internal details of the server-side
+implementation, however QEMU's VFIO client implementation will have to be
+adapted according to this protocol in order to support VFIO-over-socket virtual
+devices.
+
+VFIO
+====
+VFIO is a framework that allows a physical device to be securely passed through
+to a user space process; the kernel does not drive the device at all.
+Typically, the user space process is a VM and the device is passed through to
+it in order to achieve high performance. VFIO provides an API and the required
+functionality in the kernel. QEMU has adopted VFIO to allow a guest virtual
+machine to directly access physical devices, instead of emulating them in
+software
+
+VFIO-over-socket reuses the core VFIO concepts defined in its API, but
+implements them as messages to be sent over a UNIX-domain socket. It does not
+change the kernel-based VFIO in any way, in fact none of the VFIO kernel
+modules need to be loaded to use VFIO-over-socket. It is also possible for QEMU
+to concurrently use the current kernel-based VFIO for one guest device, and use
+VFIO-over-socket for another device in the same guest.
+
+VFIO Device Model
+-----------------
+A device under VFIO presents a standard VFIO model to the user process. Many
+of the VFIO operations in the existing kernel model use the ioctl() system
+call, and references to the existing model are called the ioctl()
+implementation in this document.
+
+The following sections describe the set of messages that implement the VFIO
+device model over a UNIX domain socket. In many cases, the messages are direct
+translations of data structures used in the ioctl() implementation. Messages
+derived from ioctl()s will have a name derived from the ioctl() command name.
+E.g., the VFIO_GET_INFO ioctl() command becomes a VFIO_USER_GET_INFO message.
+The purpose for this reuse is to share as much code as feasible with the
+ioctl() implementation.
+
+Client and Server
+^^^^^^^^^^^^^^^^^
+The socket connects two processes together: a client process and a server
+process. In the context of this document, the client process is the process
+emulating a guest virtual machine, such as QEMU. The server process is a
+process that provides device emulation.
+
+Connection Initiation
+^^^^^^^^^^^^^^^^^^^^^
+After the client connects to the server, the initial server message is
+VFIO_USER_VERSION to propose a protocol version and set of capabilities to
+apply to the session. The client replies with a compatible version and set of
+capabilities it will support, or closes the connection if it cannot support the
+advertised version.
+
+Guest Memory Configuration
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+The client uses VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP messages to inform
+the server of the valid guest DMA ranges that the server can access on behalf
+of a device. Guest memory may be accessed by the server via VFIO_USER_DMA_READ
+and VFIO_USER_DMA_WRITE messages over the socket.
+
+An optimization for server access to guest memory is for the client to provide
+file descriptors the server can mmap() to directly access guest memory. Note
+that mmap() privileges cannot be revoked by the client, therefore file
+descriptors should only be exported in environments where the client trusts the
+server not to corrupt guest memory.
+
+Device Information
+^^^^^^^^^^^^^^^^^^
+The client uses a VFIO_USER_DEVICE_GET_INFO message to query the server for
+information about the device. This information includes:
+
+* The device type and capabilities,
+* the number of memory regions, and
+* the device presents to the guest the number of interrupt types the device
+  supports.
+
+Region Information
+^^^^^^^^^^^^^^^^^^
+The client uses VFIO_USER_DEVICE_GET_REGION_INFO messages to query the server
+for information about the device's memory regions. This information describes:
+
+* Read and write permissions, whether it can be memory mapped, and whether it
+  supports additional capabilities.
+* Region index, size, and offset.
+
+When a region can be mapped by the client, the server provides a file
+descriptor which the client can mmap(). The server is responsible for polling
+for client updates to memory mapped regions.
+
+Region Capabilities
+"""""""""""""""""""
+Some regions have additional capabilities that cannot be described adequately
+by the region info data structure. These capabilities are returned in the
+region info reply in a list similar to PCI capabilities in a PCI device's
+configuration space. 
+
+Sparse Regions
+""""""""""""""
+A region can be memory-mappable in whole or in part. When only a subset of a
+region can be mapped by the client, a VFIO_REGION_INFO_CAP_SPARSE_MMAP
+capability is included in the region info reply. This capability describes
+which portions can be mapped by the client.
+
+For example, in a virtual NVMe controller, sparse regions can be used so that
+accesses to the NVMe registers (found in the beginning of BAR0) are trapped (an
+infrequent an event), while allowing direct access to the doorbells (an
+extremely frequent event as every I/O submission requires a write to BAR0),
+found right after the NVMe registers in BAR0.
+
+Interrupts
+^^^^^^^^^^
+The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the server for
+the device's interrupt types. The interrupt types are specific to the bus the
+device is attached to, and the client is expected to know the capabilities of
+each interrupt type. The server can signal an interrupt either with
+VFIO_USER_VM_INTERRUPT messages over the socket, or can directly inject
+interrupts into the guest via an event file descriptor. The client configures
+how the server signals an interrupt with VFIO_USER_SET_IRQS messages.
+
+Device Read and Write
+^^^^^^^^^^^^^^^^^^^^^
+When the guest executes load or store operations to device memory, the client
+forwards these operations to the server with VFIO_USER_REGION_READ or
+VFIO_USER_REGION_WRITE messages. The server will reply with data from the
+device on read operations or an acknowledgement on write operations.
+
+DMA
+^^^
+When a device performs DMA accesses to guest memory, the server will forward
+them to the client with VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages.
+These messages can only be used to access guest memory the client has
+configured into the server.
+
+Protocol Specification
+======================
+To distinguish from the base VFIO symbols, all VFIO-over-socket symbols are
+prefixed with vfio_user or VFIO_USER. In revision 0.1, all data is in the
+little-endian format, although this may be relaxed in future revision in cases
+where the client and server are both big-endian. The messages are formatted
+for seamless reuse of the native VFIO structs. A server can serve:
+
+1) multiple clients, and/or
+2) multiple virtual devices, belonging to one or more clients.
+
+Therefore each message requires a header that uniquely identifies the virtual
+device. It is a server-side implementation detail whether a single server
+handles multiple virtual devices from the same or multiple guests.
+
+Socket
+------
+A single UNIX domain socket is assumed to be used for each device. The location
+of the socket is implementation-specific. Multiplexing clients, devices, and
+servers over the same socket is not supported in this version of the protocol,
+but a device ID field exists in the message header so that a future support can
+be added without a major version change.
+
+Authentication
+--------------
+For AF_UNIX, we rely on OS mandatory access controls on the socket files,
+therefore it is up to the management layer to set up the socket as required.
+Socket types than span guests or hosts will require a proper authentication
+mechanism. Defining that mechanism is deferred to a future version of the
+protocol.
+
+Request Concurrency
+-------------------
+There can be multiple outstanding requests per virtual device, e.g. a
+frame buffer where the guest does multiple stores to the virtual device. The
+server can execute and reorder non-conflicting requests in parallel, depending
+on the device semantics.
+
+Socket Disconnection Behavior
+-----------------------------
+The server and the client can disconnect from each other, either intentionally
+or unexpectedly. Both the client and the server need to know how to handle such
+events.
+
+Server Disconnection
+^^^^^^^^^^^^^^^^^^^^
+A server disconnecting from the client may indicate that:
+
+1) A virtual device has been restarted, either intentionally (e.g. because of a
+device update) or unintentionally (e.g. because of a crash). In any case, the
+virtual device will come back so the client should not do anything (e.g. simply
+reconnect and retry failed operations).
+
+2) A virtual device has been shut down with no intention to be restarted.
+
+It is impossible for the client to know whether or not a failure is
+intermittent or innocuous and should be retried, therefore the client should
+attempt to reconnect to the socket. Since an intentional server restart (e.g.
+due to an upgrade) might take some time, a reasonable timeout should be used.
+In cases where the disconnection is expected (e.g. the guest shutting down), no
+new requests will be sent anyway so this situation doesn't pose a problem. The
+control stack will clean up accordingly.
+
+Parametrizing this behaviour by having the virtual device advertise a
+reasonable reconnect is deferred to a future version of the protocol.
+
+Client Disconnection
+^^^^^^^^^^^^^^^^^^^^
+The client disconnecting from the server primarily means that the QEMU process
+has exited. Currently this means that the guest is shut down so the device is
+no longer needed therefore the server can automatically exit. However, there
+can be cases where a client disconnect should not result in a server exit:
+
+1) A single server serving multiple clients.
+2) A multi-process QEMU upgrading itself step by step, which isn't yet
+   implemented.
+
+Therefore in order for the protocol to be forward compatible the server should
+take no action when the client disconnects. If anything happens to the client
+process the control stack will know about it and can clean up resources
+accordingly.
+
+Request Retry and Response Timeout
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+QEMU's VFIO retries certain operations if they fail. While this makes sense for
+real HW, we don't know for sure whether it makes sense for virtual devices. A
+failed request is a request that has been successfully sent and has been
+responded to with an error code. Failure to send the request in the first place
+(e.g. because the socket is disconnected) is a different type of error examined
+earlier in the disconnect section.
+
+Defining a retry and timeout scheme if deferred to a future version of the
+protocol.
+
+Commands
+--------
+The following table lists the VFIO message command IDs, and whether the
+message request is sent from the client or the server.
+
++----------------------------------+---------+-------------------+
+| Name                             | Command | Request Direction |
++==================================+=========+===================+
+| VFIO_USER_VERSION                | 1       | server \ufffd\ufffd\ufffd client   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DMA_MAP                | 2       | client \ufffd\ufffd\ufffd server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DMA_UNMAP              | 3       | client \ufffd\ufffd\ufffd server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_GET_INFO        | 4       | client \ufffd\ufffd\ufffd server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_GET_REGION_INFO | 5       | client \ufffd\ufffd\ufffd server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_GET_IRQ_INFO    | 6       | client \ufffd\ufffd\ufffd server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DEVICE_SET_IRQS        | 7       | client \ufffd\ufffd\ufffd server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_REGION_READ            | 8       | client \ufffd\ufffd\ufffd server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_REGION_WRITE           | 9       | client \ufffd\ufffd\ufffd server   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DMA_READ               | 10      | server \ufffd\ufffd\ufffd client   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_DMA_READ               | 11      | server \ufffd\ufffd\ufffd client   |
++----------------------------------+---------+-------------------+
+| VFIO_USER_VM_INTERRUPT           | 12      | server \ufffd\ufffd\ufffd client   |
++----------------------------------+---------+-------------------+
+| VFIO_DEVICE_RESET                | 13      | client \ufffd\ufffd\ufffd server   |
++----------------------------------+---------+-------------------+
+
+Header
+------
+All messages are preceded by a 16 byte header that contains basic information
+about the message. The header is followed by message-specific data described
+in the sections below.
+
++----------------+--------+-------------+
+| Name           | Offset | Size        |
++================+========+=============+
+| Device ID      | 0      | 2           |
++----------------+--------+-------------+
+| Message ID     | 2      | 2           |
++----------------+--------+-------------+
+| Command        | 4      | 4           |
++----------------+--------+-------------+
+| Message size   | 8      | 4           |
++----------------+--------+-------------+
+| Flags          | 12     | 4           |
++----------------+--------+-------------+
+|                | +-----+------------+ |
+|                | | Bit | Definition | |
+|                | +=====+============+ |
+|                | | 0   | Reply      | |
+|                | +-----+------------+ |
+|                | | 1   | No_reply   | |
+|                | +-----+------------+ | 
++----------------+--------+-------------+
+| <message data> | 16     | variable    |
++----------------+--------+-------------+
+
+* Device ID identifies the destination device of the message. This field is
+  reserved when the server only supports one device per socket.
+* Message ID identifies the message, and is used in the message acknowledgement.
+* Command specifies the command to be executed, listed in the Command Table.
+* Message size contains the size of the entire message, including the header.
+* Flags contains attributes of the message:
+
+  * The reply bit differentiates request messages from reply messages. A reply
+    message acknowledges a previous request with the same message ID.
+  * No_reply indicates that no reply is needed for this request. This is
+    commonly used when multiple requests are sent, and only the last needs
+    acknowledgement.
+
+VFIO_USER_VERSION
+-----------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    | 0                      |
++--------------+------------------------+
+| Message ID   | <ID>                   |
++--------------+------------------------+
+| Command      | 1                      |
++--------------+------------------------+
+| Message size | 16 + version length    |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Version      | JSON byte array        |
++--------------+------------------------+
+
+This is the initial message sent by the server after the socket connection is
+established. The version is in JSON format, and the following objects must be
+included:
+
++--------------+--------+---------------------------------------------------+
+| Name         | Type   | Description                                       |
++==============+========+===================================================+
+| version      | object | {\ufffd\ufffd\ufffdmajor\ufffd\ufffd\ufffd: <number>, \ufffd\ufffd\ufffdminor\ufffd\ufffd\ufffd: <number>}            |
+|              |        | Version supported by the sender, e.g. \ufffd\ufffd\ufffd0.1\ufffd\ufffd\ufffd.      |
++--------------+--------+---------------------------------------------------+
+| type         | string | Fixed to \ufffd\ufffd\ufffdvfio-user\ufffd\ufffd\ufffd.                             |
++--------------+--------+---------------------------------------------------+
+| capabilities | array  | Reserved. Can be omitted for v0.1, otherwise must |
+|              |        | be empty.                                         |
++--------------+--------+---------------------------------------------------+
+
+Versioning and Feature Support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Upon accepting a connection, the server must send a VFIO_USER_VERSION message
+proposing a protocol version and a set of capabilities. The client compares
+these with the versions and capabilities it supports and sends a
+VFIO_USER_VERSION reply according to the following rules.
+
+* The major version in the reply must be the same as proposed. If the client
+  does not support the proposed major, it closes the connection.
+* The minor version in the reply must be equal to or less than the minor
+  version proposed.
+* The capability list must be a subset of those proposed. If the client
+  requires a capability the server did not include, it closes the connection.
+* If type is not \ufffd\ufffd\ufffdvfio-user\ufffd\ufffd\ufffd, the client closes the connection.
+
+The protocol major version will only change when incompatible protocol changes
+are made, such as changing the message format. The minor version may change
+when compatible changes are made, such as adding new messages or capabilities,
+Both the client and server must support all minor versions less than the
+maximum minor version it supports. E.g., an implementation that supports
+version 1.3 must also support 1.0 through 1.2.
+
+VFIO_USER_DMA_MAP
+-----------------
+
+VFIO_USER_DMA_UNMAP
+-------------------
+
+Message Format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    | 0                      |
++--------------+------------------------+
+| Message ID   | <ID>                   |
++--------------+------------------------+
+| Command      | MAP=2, UNMAP=3         |
++--------------+------------------------+
+| Message size | 16 + table size        |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Table        | array of table entries |
++--------------+------------------------+
+
+This message is sent by the client to the server to inform it of the guest
+memory regions the device can access. It must be sent before the device can
+perform any DMA to the guest. It is normally sent directly after the version
+handshake is completed, but may also occur when memory is added or subtracted
+in the guest.
+
+The table is an array of the following structure. This structure is 32 bytes
+in size, so the message size will be 16 + (# of table entries * 32). If a
+region being added can be directly mapped by the server, an array of file
+descriptors will be sent as part of the message meta-data. Each region entry
+will have a corresponding file descriptor. On AF_UNIX sockets, the file
+descriptors will be passed as SCM_RIGHTS type ancillary data.
+
+Table entry format
+^^^^^^^^^^^^^^^^^^
+
++-------------+--------+-------------+
+| Name        | Offset | Size        |
++=============+========+=============+
+| Address     | 0      | 8           |
++-------------+--------+-------------+
+| Size        | 8      | 8           |
++-------------+--------+-------------+
+| Offset      | 16     | 8           |
++-------------+--------+-------------+
+| Protections | 24     | 4           |
++-------------+--------+-------------+
+| Flags       | 28     | 4           |
++-------------+--------+-------------+
+|             | +-----+------------+ |
+|             | | Bit | Definition | |
+|             | +=====+============+ |
+|             | | 0   | Mappable   | |
+|             | +-----+------------+ |
++-------------+--------+-------------+
+
+* Address is the base DMA address of the region.
+* Size is the size of the region.
+* Offset is the file offset of the region with respect to the associated file
+  descriptor.
+* Protections are the region's protection attributes as encoded in
+  ``<sys/mman.h>``.
+* Flags contain the following region attributes:
+
+  * Mappable indicate the region can be mapped via the mmap() system call using
+    the file descriptor provided in the message meta-data.
+
+VFIO_USER_DEVICE_GET_INFO
+-------------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+----------------------------+
+| Name         | Value                      |
++==============+============================+
+| Device ID    | <ID>                       |
++--------------+----------------------------+
+| Message ID   | <ID>                       |
++--------------+----------------------------+
+| Command      | 4                          |
++--------------+----------------------------+
+| Message size | 16 in request, 32 in reply |
++--------------+----------------------------+
+| Flags        | Reply bit set in reply     |
++--------------+----------------------------+
+| Device info  | VFIO device info           |
++--------------+----------------------------+
+
+This message is sent by the client to the server to query for basic information
+about the device. Only the message header is needed in the request message.
+The VFIO device info structure is defined in ``<sys/vfio.h>`` (``struct
+vfio_device_info``).
+
+VFIO device info format
+^^^^^^^^^^^^^^^^^^^^^^^
+
++-------------+--------+--------------------------+
+| Name        | Offset | Size                     |
++=============+========+==========================+
+| argsz       | 16     | 4                        |
++-------------+--------+--------------------------+
+| flags       | 20     | 4                        |
++-------------+--------+--------------------------+
+|             | +-----+-------------------------+ |
+|             | | Bit | Definition              | |
+|             | +=====+=========================+ |
+|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
+|             | +-----+-------------------------+ |
+|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
+|             | +-----+-------------------------+ |
++-------------+--------+--------------------------+
+| num_regions | 24     | 4                        |
++-------------+--------+--------------------------+
+| num_irqs    | 28     | 4                        |
++-------------+--------+--------------------------+
+
+* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
+  implementation.
+* flags contains the following device attributes.
+
+  * VFIO_DEVICE_FLAGS_RESET indicates the device supports the
+    VFIO_USER_DEVICE_RESET message.
+  * VFIO_DEVICE_FLAGS_PCI indicates the device is a PCI device.
+
+* num_regions is the number of memory regions the device exposes.
+* num_irqs is the number of distinct interrupt types the device supports.
+
+This version of the protocol only supports PCI devices. Additional devices may
+be supported in future versions. 
+
+VFIO_USER_DEVICE_GET_REGION_INFO
+--------------------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------+
+| Name         | Value            |
++==============+==================+
+| Device ID    | <ID>             |
++--------------+------------------+
+| Message ID   | <ID>             |
++--------------+------------------+
+| Command      | 5                | 
++--------------+------------------+
+| Message size | 48 + any caps    |
++--------------+------------------+
+| Flags Reply  | bit set in reply |
++--------------+------------------+
+| Region info  | VFIO region info |
++--------------+------------------+
+
+This message is sent by the client to the server to query for information about
+device memory regions. The VFIO region info structure is defined in
+``<sys/vfio.h>`` (``struct vfio_region_info``).
+
+VFIO region info format
+^^^^^^^^^^^^^^^^^^^^^^^
+
++------------+--------+------------------------------+
+| Name       | Offset | Size                         |
++============+========+==============================+
+| argsz      | 16     | 4                            |
++------------+--------+------------------------------+
+| flags      | 20     | 4                            |
++------------+--------+------------------------------+
+|            | +-----+-----------------------------+ |
+|            | | Bit | Definition                  | |
+|            | +=====+=============================+ |
+|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
+|            | +-----+-----------------------------+ |
+|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
+|            | +-----+-----------------------------+ |
+|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
+|            | +-----+-----------------------------+ |
+|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
+|            | +-----+-----------------------------+ |
++------------+--------+------------------------------+
+| index      | 24     | 4                            |
++------------+--------+------------------------------+
+| cap_offset | 28     | 4                            |
++------------+--------+------------------------------+
+| size       | 32     | 8                            |
++------------+--------+------------------------------+
+| offset     | 40     | 8                            |
++------------+--------+------------------------------+
+
+* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
+  implementation.
+* flags are attributes of the region:
+
+  * VFIO_REGION_INFO_FLAG_READ allows client read access to the region.
+  * VFIO_REGION_INFO_FLAG_WRITE allows client write access region.
+  * VFIO_REGION_INFO_FLAG_MMAP specifies the client can mmap() the region. When
+    this flag is set, the reply will include a file descriptor in its meta-data.
+    On AF_UNIX sockets, the file descriptors will be passed as SCM_RIGHTS type
+    ancillary data.
+  * VFIO_REGION_INFO_FLAG_CAPS indicates additional capabilities found in the
+    reply.
+
+* index is the index of memory region being queried, it is the only field that
+  is required to be set in the request message.
+* cap_offset describes where additional region capabilities can be found.
+  cap_offset is relative to the beginning of the VFIO region info structure.
+  The data structure it points is a VFIO cap header defined in ``<sys/vfio.h>``.
+* size is the size of the region.
+* offset is the offset given to the mmap() system call for regions with the
+  MMAP attribute. It is also used as the base offset when mapping a VFIO
+  sparse mmap area, described below.
+
+VFIO Region capabilities
+^^^^^^^^^^^^^^^^^^^^^^^^
+The VFIO region information can also include a capabilities list. This list is
+similar to a PCI capability list - each entry has a common header that
+identifies a capability and where the next capability in the list can be found.
+The VFIO capability header format is defined in ``<sys/vfio.h>`` (``struct
+vfio_info_cap_header``).
+
+VFIO cap header format
+^^^^^^^^^^^^^^^^^^^^^^
+
++---------+--------+------+
+| Name    | Offset | Size |
++=========+========+======+
+| id      | 0      | 2    |
++---------+--------+------+
+| version | 2      | 2    |
++---------+--------+------+
+| next    | 4      | 4    |
++---------+--------+------+
+
+* id is the capability identity.
+* version is a capability-specific version number.
+* next specifies the offset of the next capability in the capability list. It
+  is relative to the beginning of the VFIO region info structure.
+
+VFIO sparse mmap
+^^^^^^^^^^^^^^^^
+
++------------------+----------------------------------+
+| Name             | Value                            |
++==================+==================================+
+| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
++------------------+----------------------------------+
+| version          | 0x1                              |
++------------------+----------------------------------+
+| next             | <next>                           |
++------------------+----------------------------------+
+| sparse mmap info | VFIO region info sparse mmap     |
++------------------+----------------------------------+
+
+The only capability supported in this version of the protocol is for sparse
+mmap. This capability is defined when only a subrange of the region supports
+direct access by the client via mmap(). The VFIO sparse mmap area is defined in
+``<sys/vfio.h>`` (``struct vfio_region_sparse_mmap_area``).
+
+VFIO region info cap sparse mmap
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
++----------+--------+------+
+| Name     | Offset | Size |
++==========+========+======+
+| nr_areas | 0      | 4    |
++----------+--------+------+
+| reserved | 4      | 4    |
++----------+--------+------+
+| offset   | 8      | 8    |
++----------+--------+------+
+| size     | 16     | 9    |
++----------+--------+------+
+| ...      |        |      |
++----------+--------+------+
+
+* nr_areas is the number of sparse mmap areas in the region.
+* offset and size describe a single area that can be mapped by the client.
+  There will be nr_areas pairs of offset and size. The offset will be added to
+  the base offset given in the VFIO_USER_DEVICE_GET_REGION_INFO to form the
+  offset argument of the subsequent mmap() call.
+
+The VFIO sparse mmap area is defined in ``<sys/vfio.h>`` (``struct
+vfio_region_info_cap_sparse_mmap``).
+
+VFIO_USER_DEVICE_GET_IRQ_INFO
+-----------------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    | <ID>                   |
++--------------+------------------------+
+| Message ID   | <ID>                   |
++--------------+------------------------+
+| Command      | 6                      |
++--------------+------------------------+
+| Message size | 32                     |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| IRQ info     | VFIO IRQ info          |
++--------------+------------------------+
+
+This message is sent by the client to the server to query for information about
+device interrupt types. The VFIO IRQ info structure is defined in
+``<sys/vfio.h>`` (``struct vfio_irq_info``).
+
+VFIO IRQ info format
+^^^^^^^^^^^^^^^^^^^^
+
++-------+--------+---------------------------+
+| Name  | Offset | Size                      |
++=======+========+===========================+
+| argsz | 16     | 4                         |
++-------+--------+---------------------------+
+| flags | 20     | 4                         |
++-------+--------+---------------------------+
+|       | +-----+--------------------------+ |
+|       | | Bit | Definition               | |
+|       | +=====+==========================+ |
+|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
+|       | +-----+--------------------------+ |
+|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
+|       | +-----+--------------------------+ |
+|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
+|       | +-----+--------------------------+ |
+|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
+|       | +-----+--------------------------+ |
++-------+--------+---------------------------+
+| index | 24     | 4                         |
++-------+--------+---------------------------+
+| count | 28     | 4                         |
++-------+--------+---------------------------+
+
+* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
+  implementation.
+* flags defines IRQ attributes:
+
+  * VFIO_IRQ_INFO_EVENTFD indicates the IRQ type can support server eventfd
+    signalling.
+  * VFIO_IRQ_INFO_MASKABLE indicates that the IRQ type supports the MASK and
+    UNMASK actions in a VFIO_USER_DEVICE_SET_IRQS message.
+  * VFIO_IRQ_INFO_AUTOMASKED indicates the IRQ type masks itself after being
+    triggered, and the client must send an UNMASK action to receive new
+    interrupts.
+  * VFIO_IRQ_INFO_NORESIZE indicates VFIO_USER_SET_IRQS operations setup
+    interrupts as a set, and new subindexes cannot be enabled without disabling
+    the entire type.
+
+* index is the index of IRQ type being queried, it is the only field that is
+  required to be set in the request message.
+* count describes the number of interrupts of the queried type.
+
+VFIO_USER_DEVICE_SET_IRQS
+-------------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
+| Device ID    | <ID>                   |
++--------------+------------------------+
+| Message ID   | <ID>                   |
++--------------+------------------------+
+| Command      | 7                      |
++--------------+------------------------+
+| Message size | 36 + any data          |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| IRQ set      | VFIO IRQ set           |
++--------------+------------------------+
+
+This message is sent by the client to the server to set actions for device
+interrupt types. The VFIO IRQ set structure is defined in ``<sys/vfio.h>``
+(``struct vfio_irq_set``).
+
+VFIO IRQ info format
+^^^^^^^^^^^^^^^^^^^^
+
++-------+--------+------------------------------+
+| Name  | Offset | Size                         |
++=======+========+==============================+
+| argsz | 6      | 4                            |
++-------+--------+------------------------------+
+| flags | 20     | 4                            |
++-------+--------+------------------------------+
+|       | +-----+-----------------------------+ |
+|       | | Bit | Definition                  | |
+|       | +=====+=============================+ |
+|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
+|       | +-----+-----------------------------+ |
+|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
+|       | +-----+-----------------------------+ |
+|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
+|       | +-----+-----------------------------+ |
+|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
+|       | +-----+-----------------------------+ |
+|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
+|       | +-----+-----------------------------+ |
+|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
+|       | +-----+-----------------------------+ |
++-------+--------+------------------------------+
+| index | 24     | 4                            |
++-------+--------+------------------------------+
+| start | 28     | 4                            |
++-------+--------+------------------------------+
+| count | 32     | 4                            |
++-------+--------+------------------------------+
+| data  | 36     | variable                     |
++-------+--------+------------------------------+
+
+* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
+  implementation.
+* flags defines the action performed on the interrupt range. The DATA flags
+  describe the data field sent in the message; the ACTION flags describe the
+  action to be performed. The flags are mutually exclusive for both sets.
+
+  * VFIO_IRQ_SET_DATA_NONE indicates there is no data field in the request. The
+    action is performed unconditionally.
+  * VFIO_IRQ_SET_DATA_BOOL indicates the data field is an array of boolean
+    bytes. The action is performed if the corresponding boolean is true.
+  * VFIO_IRQ_SET_DATA_EVENTFD indicates an array of event file descriptors was
+    sent in the message meta-data. These descriptors will be signalled when the
+    action defined by the action flags occurs. In AF_UNIX sockets, the
+    descriptors are sent as SCM_RIGHTS type ancillary data.
+  * VFIO_IRQ_SET_ACTION_MASK indicates a masking event. It can be used with
+    VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to mask an interrupt, or
+    with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the guest masks
+    the interrupt. 
+  * VFIO_IRQ_SET_ACTION_UNMASK indicates an unmasking event. It can be used
+    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to unmask an
+    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the
+    guest unmasks the interrupt. 
+  * VFIO_IRQ_SET_ACTION_TRIGGER indicates a triggering event. It can be used
+    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to trigger an
+    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the
+    guest triggers the interrupt.
+
+* index is the index of IRQ type being setup.
+* start is the start of the subindex being set.
+* count describes the number of sub-indexes being set. As a special case, a
+  count of 0 with data flags of VFIO_IRQ_SET_DATA_NONE disables all interrupts
+  of the index data is an optional field included when the
+  VFIO_IRQ_SET_DATA_BOOL flag is present. It contains an array of booleans
+  that specify whether the action is to be performed on the corresponding
+  index. It's used when the action is only performed on a subset of the range
+  specified.
+
+Not all interrupt types support every combination of data and action flags.
+The client must know the capabilities of the device and IRQ index before it
+sends a VFIO_USER_DEVICE_SET_IRQ message.
+
+Read and Write Operations
+-------------------------
+
+Not all I/O operations between the client and server can be done via direct
+access of memory mapped with an mmap() call. In these cases, the client and
+server use messages sent over the socket. It is expected that these operations
+will have lower performance than direct access.
+
+The client can access device memory with VFIO_USER_REGION_READ and
+VFIO_USER_REGION_WRITE requests. These share a common data structure that
+appears after the 16 byte message header. 
+
+REGION Read/Write Data
+^^^^^^^^^^^^^^^^^^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| Offset | 16     | 8        |
++--------+--------+----------+
+| Region | 24     | 4        |
++--------+--------+----------+
+| Count  | 28     | 4        |
++--------+--------+----------+
+| Data   | 32     | variable |
++--------+--------+----------+
+
+* Offset into the region being accessed.
+* Region is the index of the region being accessed.
+* Count is the size of the data to be transferred.
+* Data is the data to be read or written.
+
+The server can access guest memory with VFIO_USER_DMA_READ and
+VFIO_USER_DMA_WRITE messages. These also share a common data structure that
+appears after the 16 byte message header.
+
+DMA Read/Write Data
+^^^^^^^^^^^^^^^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| Address | 16     | 8        |
++---------+--------+----------+
+| Count   | 24     | 4        |
++---------+--------+----------+
+| Data    | 28     | variable |
++---------+--------+----------+
+
+* Address is the area of guest memory being accessed. This address must have
+  been exported to the server with a VFIO_USER_DMA_MAP message.
+* Count is the size of the data to be transferred.
+* Data is the data to be read or written.
+
+Address and count can also be accessed as ``struct iovec`` from ``<sys/uio.h>``.
+
+VFIO_USER_REGION_READ
+---------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    | <ID>                   |
++--------------+------------------------+
+| Message ID   | <ID>                   |
++--------------+------------------------+
+| Command      | 8                      |
++--------------+------------------------+
+| Message size | 32 + data size         |
++--------------+------------------------+
+| Flags Reply  | bit set in reply       |
++--------------+------------------------+
+| Read info    | REGION read/write data |
++--------------+------------------------+
+
+This request is sent from the client to the server to read from device memory.
+In the request messages, there will be no data, and the count field will be the
+amount of data to be read. The reply will include the data read, and its count
+field will be the amount of data read.
+
+VFIO_USER_REGION_WRITE
+----------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    | <ID>                   |
++--------------+------------------------+
+| Message ID   | <ID>                   |
++--------------+------------------------+
+| Command      | 9                      |
++--------------+------------------------+
+| Message size | 32 + data size         |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| Write info   | REGION read write data |
++--------------+------------------------+
+
+This request is sent from the client to the server to write to device memory.
+The request message will contain the data to be written, and its count field
+will contain the amount of write data. The count field in the reply will be
+zero.
+
+VFIO_USER_DMA_READ
+------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+---------------------+
+| Name         | Value               |
++==============+=====================+
+| Device ID    | <ID>                |
++--------------+---------------------+
+| Message ID   | <ID>                |
++--------------+---------------------+
+| Command      | 10                  |
++--------------+---------------------+
+| Message size | 28 + data size      |
++--------------+---------------------+
+| Flags Reply  | bit set in reply    |
++--------------+---------------------+
+| DMA info     | DMA read/write data |
++--------------+---------------------+
+
+This request is sent from the server to the client to read from guest memory.
+In the request messages, there will be no data, and the count field will be the
+amount of data to be read. The reply will include the data read, and its count
+field will be the amount of data read.
+
+VFIO_USER_DMA_WRITE
+-------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    | <ID>                   |
++--------------+------------------------+
+| Message ID   | <ID>                   |
++--------------+------------------------+
+| Command      | 11                     |
++--------------+------------------------+
+| Message size | 28 + data size         |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+| DMA info     | DMA read/write data    |
++--------------+------------------------+
+
+This request is sent from the server to the client to write to guest memory.
+The request message will contain the data to be written, and its count field
+will contain the amount of write data. The count field in the reply will be
+zero.
+
+VFIO_USER_VM_INTERRUPT
+----------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++----------------+------------------------+
+| Name           | Value                  |
++================+========================+
+| Device ID      | <ID>                   |
++----------------+------------------------+
+| Message ID     | <ID>                   |
++----------------+------------------------+
+| Command        | 12                     |
++----------------+------------------------+
+| Message size   | 24                     |
++----------------+------------------------+
+| Flags          | Reply bit set in reply |
++----------------+------------------------+
+| Interrupt info | <interrupt>            |
++----------------+------------------------+
+
+This request is sent from the server to the client to signal the device has
+raised an interrupt.
+
+Interrupt info format
+^^^^^^^^^^^^^^^^^^^^^
+
++----------+--------+------+
+| Name     | Offset | Size |
++==========+========+======+
+| Index    | 16     | 4    |
++----------+--------+------+
+| Subindex | 20     | 4    |
++----------+--------+------+
+
+* Index is the interrupt index; it is the same value used in VFIO_USER_SET_IRQS.
+* Subindex is relative to the index, e.g., the vector number used in PCI MSI/X
+  type interrupts.
+
+VFIO_USER_DEVICE_RESET
+----------------------
+
+Message format
+^^^^^^^^^^^^^^
+
++--------------+------------------------+
+| Name         | Value                  |
++==============+========================+
+| Device ID    | <ID>                   |
++--------------+------------------------+
+| Message ID   | <ID>                   |
++--------------+------------------------+
+| Command      | 13                     |
++--------------+------------------------+
+| Message size | 16                     |
++--------------+------------------------+
+| Flags        | Reply bit set in reply |
++--------------+------------------------+
+
+This request is sent from the client to the server to reset the device.
+
+Appendices
+==========
+
+Unused VFIO ioctl() commands
+----------------------------
+
+The following commands must be handled by the client and not sent to the server:
+
+* VFIO_GET_API_VERSION
+* VFIO_CHECK_EXTENSION
+* VFIO_SET_IOMMU
+* VFIO_GROUP_GET_STATUS
+* VFIO_GROUP_SET_CONTAINER
+* VFIO_GROUP_UNSET_CONTAINER
+* VFIO_GROUP_GET_DEVICE_FD
+* VFIO_IOMMU_GET_INFO
+
+However, once support for live migration for VFIO devices is finalized some
+of the above commands might have to be handled by the client. This will be
+addressed in a future protocol version.
+
+Live Migration
+--------------
+Currently live migration is not supported for devices passed through via VFIO,
+therefore it is not supported for VFIO-over-socket, either. This is being
+actively worked on in the "Add migration support for VFIO devices" (v25) patch
+series.
+
+VFIO groups and containers
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The current VFIO implementation includes group and container idioms that
+describe how a device relates to the host IOMMU. In the VFIO over socket
+implementation, the IOMMU is implemented in SW by the client, and isn't visible
+to the server. The simplest idea is for the client is to put each device into
+its own group and container.
-- 
1.7.1


Re: [PATCH] introduce VFIO-over-socket protocol specificaion
Posted by no-reply@patchew.org 4 years, 2 months ago
Patchew URL: https://patchew.org/QEMU/1594913503-52271-1-git-send-email-thanos.makatos@nutanix.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      block/qapi.o
  CC      block/file-win32.o

Warning, treated as error:
/tmp/qemu-test/src/docs/devel/vfio-over-socket.rst:281:Malformed table.

+----------------------------------+---------+-------------------+
---
  CC      crypto/hmac.o
  CC      crypto/hmac-nettle.o
  CC      crypto/desrfb.o
make: *** [Makefile:1090: docs/devel/index.html] Error 2
make: *** Waiting for unfinished jobs....
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 708, in <module>
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=739a80f85e9146d59e3826f0d63daa6c', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-gwtqm3ly/src/docker-src.2020-07-16-12.51.55.8263:/var/tmp/qemu:z,ro', 'qemu/fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=739a80f85e9146d59e3826f0d63daa6c
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-gwtqm3ly/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    3m4.939s
user    0m8.716s


The full log is available at
http://patchew.org/logs/1594913503-52271-1-git-send-email-thanos.makatos@nutanix.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Re: [PATCH] introduce VFIO-over-socket protocol specificaion
Posted by no-reply@patchew.org 4 years, 2 months ago
Patchew URL: https://patchew.org/QEMU/1594913503-52271-1-git-send-email-thanos.makatos@nutanix.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 1594913503-52271-1-git-send-email-thanos.makatos@nutanix.com
Subject: [PATCH] introduce VFIO-over-socket protocol specificaion

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
aa6155e introduce VFIO-over-socket protocol specificaion

=== OUTPUT BEGIN ===
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#20: 
new file mode 100644

ERROR: trailing whitespace
#167: FILE: docs/devel/vfio-over-socket.rst:143:
+configuration space. $

ERROR: trailing whitespace
#360: FILE: docs/devel/vfio-over-socket.rst:336:
+|                | +-----+------------+ | $

ERROR: trailing whitespace
#572: FILE: docs/devel/vfio-over-socket.rst:548:
+be supported in future versions. $

ERROR: trailing whitespace
#587: FILE: docs/devel/vfio-over-socket.rst:563:
+| Command      | 5                | $

ERROR: trailing whitespace
#874: FILE: docs/devel/vfio-over-socket.rst:850:
+    the interrupt. $

ERROR: trailing whitespace
#878: FILE: docs/devel/vfio-over-socket.rst:854:
+    guest unmasks the interrupt. $

ERROR: trailing whitespace
#908: FILE: docs/devel/vfio-over-socket.rst:884:
+appears after the 16 byte message header. $

total: 7 errors, 1 warnings, 1135 lines checked

Commit aa6155e147a9 (introduce VFIO-over-socket protocol specificaion) has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/1594913503-52271-1-git-send-email-thanos.makatos@nutanix.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Re: [PATCH] introduce VFIO-over-socket protocol specificaion
Posted by Stefan Hajnoczi 4 years, 2 months ago
On Thu, Jul 16, 2020 at 08:31:43AM -0700, Thanos Makatos wrote:
> This patch introduces the VFIO-over-socket protocol specification, which
> is designed to allow devices to be emulated outside QEMU, in a separate
> process. VFIO-over-socket reuses the existing VFIO defines, structs and
> concepts.
> 
> It has been earlier discussed as an RFC in:
> "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> 
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
> ---
>  docs/devel/vfio-over-socket.rst | 1135 +++++++++++++++++++++++++++++++++++++++
>  1 files changed, 1135 insertions(+), 0 deletions(-)
>  create mode 100644 docs/devel/vfio-over-socket.rst

This is exciting! The spec is clear enough that I feel I could start
writing a client/server. There is enough functionality here to implement
real-world devices. Can you share links to client/server
implementations?

It would be useful to introduce a standard way of enumerating,
launching, and controlling device emulation processes. That doesn't need
to be part of this specification document though. In vhost-user there
are conventions for command-line parameters, process lifecycle, etc that
make it easier for management tools to run device processes (the
"Backend program conventions" section in vhost-user.rst).

> diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-socket.rst
> new file mode 100644
> index 0000000..723b944
> --- /dev/null
> +++ b/docs/devel/vfio-over-socket.rst
> @@ -0,0 +1,1135 @@
> +***************************************
> +VFIO-over-socket Protocol Specification
> +***************************************
> +
> +Version 0.1

Please include a reference to the section below explaining how
versioning works.

Also, are there rules about noting versions when updating the spec? For
example:

  When making a change to this specification, the protocol version
  number must be included:

    The `foo` field contains ... Added in version 1.3.

> +
> +Introduction
> +============
> +VFIO-over-socket, also known as vfio-user, is a protocol that allows a device

vfio-user is shorted. Now is the best time to start consistently using
"vfio-user" as the name for this protocol. Want to drop the name
VFIO-over-socket?

> +to be virtualized in a separate process outside of QEMU. VFIO-over-socket

Is there anything QEMU-specific about this protocol?

I think the scope of this protocol is more general and it could be
described as:

  allows device emulation in a separate process outside of a Virtual
  Machine Monitor (VMM).

(Or "emulator" instead of VMM, if you prefer.)

> +devices consist of a generic VFIO device type, living inside QEMU, which we

s/QEMU/the VMM/

> +call the client, and the core device implementation, living outside QEMU, which
> +we call the server. VFIO-over-socket can be the main transport mechanism for
> +multi-process QEMU, however it can be used by other applications offering
> +device virtualization. Explaining the advantages of a
> +disaggregated/multi-process QEMU, and device virtualization outside QEMU in
> +general, is beyond the scope of this document.
> +
> +This document focuses on specifying the VFIO-over-socket protocol. VFIO has
> +been chosen for the following reasons:

It's a little subtle here that "VFIO" means the Linux VFIO ioctl
interface. Expanding it a bit makes the rest of the text clearer:

  The `Linux VFIO ioctl
  interface
  <https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_ has
  been chosen as the base for this protocol for the following reasons:

> +
> +1) It is a mature and stable API, backed by an extensively used framework.
> +2) The existing VFIO client implementation (qemu/hw/vfio/) can be largely
> +   reused.
> +
> +In a proof of concept implementation it has been demonstrated that using VFIO
> +over a UNIX domain socket is a viable option. VFIO-over-socket is designed with
> +QEMU in mind, however it could be used by other client applications. The
> +VFIO-over-socket protocol does not require that QEMU's VFIO client
> +implementation is used in QEMU. None of the VFIO kernel modules are required
> +for supporting the protocol, neither in the client nor the server, only the
> +source header files are used.

In the long run only the last sentence in this paragraph needs to be in
the specification.

The proof of concept and QEMU-specific info is good to know for the
discussion right now, but I don't think this needs to be in the
specification.

> +
> +The main idea is to allow a virtual device to function in a separate process in
> +the same host over a UNIX domain socket. A UNIX domain socket (AF_UNIX) is
> +chosen because we can trivially send file descriptors over it, which in turn
> +allows:
> +
> +* Sharing of guest memory for DMA with the virtual device process.

Should the spec start consistently using "server" instead of various
terms like "virtual device process"?

> +* Sharing of virtual device memory with the guest for fast MMIO.

The VIRTIO spec restricts itself to the terms "device" and "driver" so
that the scope isn't too specific to virtualization of software devices.
I think the same might be good here because who knows how
VFIO-over-socket will be used in the future. Maybe it will be used
outside the context of traditional guests. For example, it's a great
interface for writing device emulation tests and fuzzers!

I suggest consistently using "client" (for the guest) and "server" (for
the device emulation process). That way the spec isn't focussed on a
specific use case.

> +* Efficient sharing of eventfd's for triggering interrupts.
> +
> +However, other socket types could be used which allows the virtual device
> +process to run in a separate guest in the same host (AF_VSOCK) or remotely
> +(AF_INET). Theoretically the underlying transport doesn't necessarily have to
> +be a socket, however we don't examine such alternatives. In this document we
> +focus on using a UNIX domain socket and introduce basic support for the other
> +two types of sockets without considering performance implications.

This is a good place to be explicit about the protocol requirements:

Is file-descriptor passing necessary?

Is shared-memory necessary?

Is there always an in-band message-passing fallback for any operation
that can be optimized via fd passing?

By stating something about this in the spec it's easier to ensure that
the protocol continues to follow these design parameters in the future
when other people make modifications.

> +This document does not yet describe any internal details of the server-side
> +implementation, however QEMU's VFIO client implementation will have to be
> +adapted according to this protocol in order to support VFIO-over-socket virtual
> +devices.

This paragraph about the QEMU implementation status can be dropped from
the specification.

> +
> +VFIO
> +====
> +VFIO is a framework that allows a physical device to be securely passed through
> +to a user space process; the kernel does not drive the device at all.
> +Typically, the user space process is a VM and the device is passed through to
> +it in order to achieve high performance. VFIO provides an API and the required
> +functionality in the kernel. QEMU has adopted VFIO to allow a guest virtual
> +machine to directly access physical devices, instead of emulating them in
> +software

s/software/software./

> +
> +VFIO-over-socket reuses the core VFIO concepts defined in its API, but
> +implements them as messages to be sent over a UNIX-domain socket. It does not

Can "UNIX-domain" be dropped here? If the protocol design is intended to
support other transports then it helps to avoid mentioning a specific
transport even if only AF_UNIX will be implemented in the beginning.

> +change the kernel-based VFIO in any way, in fact none of the VFIO kernel
> +modules need to be loaded to use VFIO-over-socket. It is also possible for QEMU
> +to concurrently use the current kernel-based VFIO for one guest device, and use
> +VFIO-over-socket for another device in the same guest.
> +
> +VFIO Device Model
> +-----------------
> +A device under VFIO presents a standard VFIO model to the user process. Many

I'm not sure what the meaning of the first sentence is.

s/standard VFIO model/standard interface/ ?

> +of the VFIO operations in the existing kernel model use the ioctl() system
> +call, and references to the existing model are called the ioctl()
> +implementation in this document.
> +
> +The following sections describe the set of messages that implement the VFIO
> +device model over a UNIX domain socket. In many cases, the messages are direct
> +translations of data structures used in the ioctl() implementation. Messages
> +derived from ioctl()s will have a name derived from the ioctl() command name.
> +E.g., the VFIO_GET_INFO ioctl() command becomes a VFIO_USER_GET_INFO message.
> +The purpose for this reuse is to share as much code as feasible with the
> +ioctl() implementation.
> +
> +Client and Server
> +^^^^^^^^^^^^^^^^^
> +The socket connects two processes together: a client process and a server
> +process. In the context of this document, the client process is the process
> +emulating a guest virtual machine, such as QEMU. The server process is a
> +process that provides device emulation.
> +
> +Connection Initiation
> +^^^^^^^^^^^^^^^^^^^^^
> +After the client connects to the server, the initial server message is
> +VFIO_USER_VERSION to propose a protocol version and set of capabilities to
> +apply to the session. The client replies with a compatible version and set of
> +capabilities it will support, or closes the connection if it cannot support the
> +advertised version.
> +
> +Guest Memory Configuration
> +^^^^^^^^^^^^^^^^^^^^^^^^^^
> +The client uses VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP messages to inform
> +the server of the valid guest DMA ranges that the server can access on behalf
> +of a device. Guest memory may be accessed by the server via VFIO_USER_DMA_READ
> +and VFIO_USER_DMA_WRITE messages over the socket.

"Guest" is a virtualization term. "DMA Memory Configuration" and "DMA
memory may be accessed ..." would be more general.

> +
> +An optimization for server access to guest memory is for the client to provide
> +file descriptors the server can mmap() to directly access guest memory. Note
> +that mmap() privileges cannot be revoked by the client, therefore file
> +descriptors should only be exported in environments where the client trusts the
> +server not to corrupt guest memory.
> +
> +Device Information
> +^^^^^^^^^^^^^^^^^^
> +The client uses a VFIO_USER_DEVICE_GET_INFO message to query the server for
> +information about the device. This information includes:
> +
> +* The device type and capabilities,
> +* the number of memory regions, and

VFIO calls them "device regions", which is clearer:
1. It cannot be confused with "DMA Memory Regions"
2. It allows for I/O space in additional to Memory space

I suggest using "device regions" consistently.

> +* the device presents to the guest the number of interrupt types the device
> +  supports.
> +
> +Region Information
> +^^^^^^^^^^^^^^^^^^
> +The client uses VFIO_USER_DEVICE_GET_REGION_INFO messages to query the server
> +for information about the device's memory regions. This information describes:
> +
> +* Read and write permissions, whether it can be memory mapped, and whether it
> +  supports additional capabilities.
> +* Region index, size, and offset.
> +
> +When a region can be mapped by the client, the server provides a file
> +descriptor which the client can mmap(). The server is responsible for polling
> +for client updates to memory mapped regions.
> +
> +Region Capabilities
> +"""""""""""""""""""
> +Some regions have additional capabilities that cannot be described adequately
> +by the region info data structure. These capabilities are returned in the
> +region info reply in a list similar to PCI capabilities in a PCI device's
> +configuration space. 
> +
> +Sparse Regions
> +""""""""""""""
> +A region can be memory-mappable in whole or in part. When only a subset of a
> +region can be mapped by the client, a VFIO_REGION_INFO_CAP_SPARSE_MMAP
> +capability is included in the region info reply. This capability describes
> +which portions can be mapped by the client.
> +
> +For example, in a virtual NVMe controller, sparse regions can be used so that
> +accesses to the NVMe registers (found in the beginning of BAR0) are trapped (an
> +infrequent an event), while allowing direct access to the doorbells (an
> +extremely frequent event as every I/O submission requires a write to BAR0),
> +found right after the NVMe registers in BAR0.
> +
> +Interrupts
> +^^^^^^^^^^
> +The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the server for
> +the device's interrupt types. The interrupt types are specific to the bus the
> +device is attached to, and the client is expected to know the capabilities of
> +each interrupt type. The server can signal an interrupt either with
> +VFIO_USER_VM_INTERRUPT messages over the socket, or can directly inject
> +interrupts into the guest via an event file descriptor. The client configures
> +how the server signals an interrupt with VFIO_USER_SET_IRQS messages.
> +
> +Device Read and Write
> +^^^^^^^^^^^^^^^^^^^^^
> +When the guest executes load or store operations to device memory, the client
> +forwards these operations to the server with VFIO_USER_REGION_READ or
> +VFIO_USER_REGION_WRITE messages. The server will reply with data from the
> +device on read operations or an acknowledgement on write operations.
> +
> +DMA
> +^^^
> +When a device performs DMA accesses to guest memory, the server will forward
> +them to the client with VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages.
> +These messages can only be used to access guest memory the client has
> +configured into the server.
> +
> +Protocol Specification
> +======================
> +To distinguish from the base VFIO symbols, all VFIO-over-socket symbols are
> +prefixed with vfio_user or VFIO_USER. In revision 0.1, all data is in the
> +little-endian format, although this may be relaxed in future revision in cases
> +where the client and server are both big-endian. The messages are formatted
> +for seamless reuse of the native VFIO structs. A server can serve:
> +
> +1) multiple clients, and/or
> +2) multiple virtual devices, belonging to one or more clients.
> +
> +Therefore each message requires a header that uniquely identifies the virtual
> +device. It is a server-side implementation detail whether a single server
> +handles multiple virtual devices from the same or multiple guests.
> +
> +Socket
> +------
> +A single UNIX domain socket is assumed to be used for each device. The location
> +of the socket is implementation-specific. Multiplexing clients, devices, and
> +servers over the same socket is not supported in this version of the protocol,
> +but a device ID field exists in the message header so that a future support can
> +be added without a major version change.

There is a version negotiation mechanism for making protocol changes.
I'm not sure there is any advantage to including the multiplexing header
now? I suggest keeping it simple and dropping the unused header.

> +
> +Authentication
> +--------------
> +For AF_UNIX, we rely on OS mandatory access controls on the socket files,
> +therefore it is up to the management layer to set up the socket as required.
> +Socket types than span guests or hosts will require a proper authentication
> +mechanism. Defining that mechanism is deferred to a future version of the
> +protocol.
> +
> +Request Concurrency
> +-------------------
> +There can be multiple outstanding requests per virtual device, e.g. a
> +frame buffer where the guest does multiple stores to the virtual device. The
> +server can execute and reorder non-conflicting requests in parallel, depending
> +on the device semantics.

The word "request" has not been used up to this point in the spec. What
does "request" mean here?

I'm not sure if this section is talking about vfio-user protocol message
exchanges (e.g. a client->server message followed by a server->client
message) or something else?

What about the ordering semantics at the vfio-user protocol level? For
example, if a client sends multiple VFIO_USER_DMA_MAP/UNMAP messages
then the order matters. What are the rules? I wonder if maybe
concurrency is a special case and really only applies to a subset of
protocol commands?

I'm not sure how a client would exploit parallelism in this protocol.
Can you give an example of a case where there would be multiple commands
pending on a single device?

> +
> +Socket Disconnection Behavior
> +-----------------------------
> +The server and the client can disconnect from each other, either intentionally
> +or unexpectedly. Both the client and the server need to know how to handle such
> +events.
> +
> +Server Disconnection
> +^^^^^^^^^^^^^^^^^^^^
> +A server disconnecting from the client may indicate that:
> +
> +1) A virtual device has been restarted, either intentionally (e.g. because of a
> +device update) or unintentionally (e.g. because of a crash). In any case, the
> +virtual device will come back so the client should not do anything (e.g. simply
> +reconnect and retry failed operations).
> +
> +2) A virtual device has been shut down with no intention to be restarted.
> +
> +It is impossible for the client to know whether or not a failure is
> +intermittent or innocuous and should be retried, therefore the client should
> +attempt to reconnect to the socket. Since an intentional server restart (e.g.
> +due to an upgrade) might take some time, a reasonable timeout should be used.
> +In cases where the disconnection is expected (e.g. the guest shutting down), no
> +new requests will be sent anyway so this situation doesn't pose a problem. The
> +control stack will clean up accordingly.
> +
> +Parametrizing this behaviour by having the virtual device advertise a

s/Parametrizing/Parameterizing/

> +reasonable reconnect is deferred to a future version of the protocol.

No mention is made of recovering state or how disconnect maps to VFIO
device types (PCI, etc.). Does a disconnect mean that the device has
been reset?

> +
> +Client Disconnection
> +^^^^^^^^^^^^^^^^^^^^
> +The client disconnecting from the server primarily means that the QEMU process
> +has exited. Currently this means that the guest is shut down so the device is
> +no longer needed therefore the server can automatically exit. However, there
> +can be cases where a client disconnect should not result in a server exit:
> +
> +1) A single server serving multiple clients.
> +2) A multi-process QEMU upgrading itself step by step, which isn't yet
> +   implemented.
> +
> +Therefore in order for the protocol to be forward compatible the server should
> +take no action when the client disconnects. If anything happens to the client
> +process the control stack will know about it and can clean up resources
> +accordingly.
> +
> +Request Retry and Response Timeout
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +QEMU's VFIO retries certain operations if they fail. While this makes sense for
> +real HW, we don't know for sure whether it makes sense for virtual devices. A
> +failed request is a request that has been successfully sent and has been
> +responded to with an error code. Failure to send the request in the first place
> +(e.g. because the socket is disconnected) is a different type of error examined
> +earlier in the disconnect section.
> +
> +Defining a retry and timeout scheme if deferred to a future version of the
> +protocol.
> +
> +Commands
> +--------
> +The following table lists the VFIO message command IDs, and whether the
> +message request is sent from the client or the server.
> +
> ++----------------------------------+---------+-------------------+
> +| Name                             | Command | Request Direction |
> ++==================================+=========+===================+
> +| VFIO_USER_VERSION                | 1       | server ???????? client   |

Unicode issues? I see "????????" instead of the characters that were
supposed to be here.

> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_MAP                | 2       | client ???????? server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_UNMAP              | 3       | client ???????? server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_INFO        | 4       | client ???????? server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_REGION_INFO | 5       | client ???????? server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_IRQ_INFO    | 6       | client ???????? server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_SET_IRQS        | 7       | client ???????? server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_REGION_READ            | 8       | client ???????? server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_REGION_WRITE           | 9       | client ???????? server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_READ               | 10      | server ???????? client   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_READ               | 11      | server ???????? client   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_VM_INTERRUPT           | 12      | server ???????? client   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_DEVICE_RESET                | 13      | client ???????? server   |
> ++----------------------------------+---------+-------------------+

Why is this command named VFIO_DEVICE_RESET and not
VFIO_USER_DEVICE_RESET?

> +
> +Header
> +------
> +All messages are preceded by a 16 byte header that contains basic information

I suggest dropping "16 byte" here because the table already shows this
and it is easy to forget to update the text when changing the table.

> +about the message. The header is followed by message-specific data described
> +in the sections below.
> +
> ++----------------+--------+-------------+
> +| Name           | Offset | Size        |
> ++================+========+=============+
> +| Device ID      | 0      | 2           |
> ++----------------+--------+-------------+
> +| Message ID     | 2      | 2           |
> ++----------------+--------+-------------+
> +| Command        | 4      | 4           |
> ++----------------+--------+-------------+
> +| Message size   | 8      | 4           |
> ++----------------+--------+-------------+
> +| Flags          | 12     | 4           |
> ++----------------+--------+-------------+
> +|                | +-----+------------+ |
> +|                | | Bit | Definition | |
> +|                | +=====+============+ |
> +|                | | 0   | Reply      | |
> +|                | +-----+------------+ |
> +|                | | 1   | No_reply   | |
> +|                | +-----+------------+ | 
> ++----------------+--------+-------------+
> +| <message data> | 16     | variable    |
> ++----------------+--------+-------------+
> +
> +* Device ID identifies the destination device of the message. This field is
> +  reserved when the server only supports one device per socket.

Clearer:
s/is reserved/must be 0/

> +* Message ID identifies the message, and is used in the message acknowledgement.

Are Message IDs global or per-device?

> +* Command specifies the command to be executed, listed in the Command Table.
> +* Message size contains the size of the entire message, including the header.
> +* Flags contains attributes of the message:
> +
> +  * The reply bit differentiates request messages from reply messages. A reply
> +    message acknowledges a previous request with the same message ID.
> +  * No_reply indicates that no reply is needed for this request. This is
> +    commonly used when multiple requests are sent, and only the last needs
> +    acknowledgement.

Is_Reply and Dont_Reply are more self-explanatory to me, but this is
probably subjective.

> +
> +VFIO_USER_VERSION
> +-----------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | 0                      |
> ++--------------+------------------------+

The Device ID is always 0 and this message is sent just once regardless
of the number of devices? Please describe this explicitly so it's clear
that this message is per-session and not per-device.

There could also be a special NO_DEVICE (0xff) value but I don't think
it matters.

> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | 1                      |
> ++--------------+------------------------+
> +| Message size | 16 + version length    |
> ++--------------+------------------------+
> +| Flags        | Reply bit set in reply |
> ++--------------+------------------------+
> +| Version      | JSON byte array        |
> ++--------------+------------------------+
> +
> +This is the initial message sent by the server after the socket connection is
> +established. The version is in JSON format, and the following objects must be
> +included:
> +
> ++--------------+--------+---------------------------------------------------+
> +| Name         | Type   | Description                                       |
> ++==============+========+===================================================+
> +| version      | object | {???????major????????: <number>, ???????minor????????: <number>}            |

More Unicode issues.

> +|              |        | Version supported by the sender, e.g. ???????0.1????????.      |
> ++--------------+--------+---------------------------------------------------+
> +| type         | string | Fixed to ???????vfio-user????????.                             |
> ++--------------+--------+---------------------------------------------------+
> +| capabilities | array  | Reserved. Can be omitted for v0.1, otherwise must |
> +|              |        | be empty.                                         |
> ++--------------+--------+---------------------------------------------------+

s/array/array of strings/ ?

> +
> +Versioning and Feature Support
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +Upon accepting a connection, the server must send a VFIO_USER_VERSION message
> +proposing a protocol version and a set of capabilities. The client compares
> +these with the versions and capabilities it supports and sends a
> +VFIO_USER_VERSION reply according to the following rules.
> +
> +* The major version in the reply must be the same as proposed. If the client
> +  does not support the proposed major, it closes the connection.
> +* The minor version in the reply must be equal to or less than the minor
> +  version proposed.
> +* The capability list must be a subset of those proposed. If the client
> +  requires a capability the server did not include, it closes the connection.
> +* If type is not ???????vfio-user????????, the client closes the connection.

Is there a rationale for this field? In order to get to the point where
this JSON is parsed we must already implement the vfio-user protocol
(e.g. parse the header).

> +
> +The protocol major version will only change when incompatible protocol changes
> +are made, such as changing the message format. The minor version may change
> +when compatible changes are made, such as adding new messages or capabilities,
> +Both the client and server must support all minor versions less than the
> +maximum minor version it supports. E.g., an implementation that supports
> +version 1.3 must also support 1.0 through 1.2.
> +
> +VFIO_USER_DMA_MAP
> +-----------------
> +
> +VFIO_USER_DMA_UNMAP
> +-------------------
> +
> +Message Format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | 0                      |
> ++--------------+------------------------+

Is this message per-session instead of per-device?

This has implications on IOMMU semantics. Two different devices sharing
a connection have access to the same DMA memory?

> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | MAP=2, UNMAP=3         |
> ++--------------+------------------------+
> +| Message size | 16 + table size        |
> ++--------------+------------------------+
> +| Flags        | Reply bit set in reply |
> ++--------------+------------------------+
> +| Table        | array of table entries |
> ++--------------+------------------------+
> +
> +This message is sent by the client to the server to inform it of the guest
> +memory regions the device can access. It must be sent before the device can
> +perform any DMA to the guest. It is normally sent directly after the version
> +handshake is completed, but may also occur when memory is added or subtracted
> +in the guest.
> +
> +The table is an array of the following structure. This structure is 32 bytes
> +in size, so the message size will be 16 + (# of table entries * 32). If a
> +region being added can be directly mapped by the server, an array of file
> +descriptors will be sent as part of the message meta-data. Each region entry
> +will have a corresponding file descriptor. On AF_UNIX sockets, the file
> +descriptors will be passed as SCM_RIGHTS type ancillary data.
> +
> +Table entry format
> +^^^^^^^^^^^^^^^^^^
> +
> ++-------------+--------+-------------+
> +| Name        | Offset | Size        |
> ++=============+========+=============+
> +| Address     | 0      | 8           |
> ++-------------+--------+-------------+
> +| Size        | 8      | 8           |
> ++-------------+--------+-------------+
> +| Offset      | 16     | 8           |
> ++-------------+--------+-------------+
> +| Protections | 24     | 4           |
> ++-------------+--------+-------------+
> +| Flags       | 28     | 4           |
> ++-------------+--------+-------------+
> +|             | +-----+------------+ |
> +|             | | Bit | Definition | |
> +|             | +=====+============+ |
> +|             | | 0   | Mappable   | |
> +|             | +-----+------------+ |
> ++-------------+--------+-------------+
> +
> +* Address is the base DMA address of the region.
> +* Size is the size of the region.
> +* Offset is the file offset of the region with respect to the associated file
> +  descriptor.
> +* Protections are the region's protection attributes as encoded in
> +  ``<sys/mman.h>``.
> +* Flags contain the following region attributes:
> +
> +  * Mappable indicate the region can be mapped via the mmap() system call using
> +    the file descriptor provided in the message meta-data.

What are the semantics of map/unmap with respect to mapping over
existing regions, unmapping a subset of an existing region, etc?

What are the concerns for MAP/UNMAP while the device is operational and
may be performing DMA? Should this command be combined into a single
VFIO_USER_SET_DMA_REGIONS command with Flags Bit 1 indicating Add/Delete
so that a single message can atomically add and delete DMA regions?

What is the format of the reply to this message?

> +
> +VFIO_USER_DEVICE_GET_INFO
> +-------------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+----------------------------+
> +| Name         | Value                      |
> ++==============+============================+
> +| Device ID    | <ID>                       |
> ++--------------+----------------------------+
> +| Message ID   | <ID>                       |
> ++--------------+----------------------------+
> +| Command      | 4                          |
> ++--------------+----------------------------+
> +| Message size | 16 in request, 32 in reply |
> ++--------------+----------------------------+
> +| Flags        | Reply bit set in reply     |
> ++--------------+----------------------------+
> +| Device info  | VFIO device info           |
> ++--------------+----------------------------+
> +
> +This message is sent by the client to the server to query for basic information
> +about the device. Only the message header is needed in the request message.
> +The VFIO device info structure is defined in ``<sys/vfio.h>`` (``struct
> +vfio_device_info``).
> +
> +VFIO device info format
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> ++-------------+--------+--------------------------+
> +| Name        | Offset | Size                     |
> ++=============+========+==========================+
> +| argsz       | 16     | 4                        |
> ++-------------+--------+--------------------------+
> +| flags       | 20     | 4                        |
> ++-------------+--------+--------------------------+
> +|             | +-----+-------------------------+ |
> +|             | | Bit | Definition              | |
> +|             | +=====+=========================+ |
> +|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
> +|             | +-----+-------------------------+ |
> +|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
> +|             | +-----+-------------------------+ |
> ++-------------+--------+--------------------------+
> +| num_regions | 24     | 4                        |
> ++-------------+--------+--------------------------+
> +| num_irqs    | 28     | 4                        |
> ++-------------+--------+--------------------------+
> +
> +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
> +  implementation.
> +* flags contains the following device attributes.
> +
> +  * VFIO_DEVICE_FLAGS_RESET indicates the device supports the
> +    VFIO_USER_DEVICE_RESET message.

Why is VFIO_USER_DEVICE_RESET support optional?

> +  * VFIO_DEVICE_FLAGS_PCI indicates the device is a PCI device.
> +
> +* num_regions is the number of memory regions the device exposes.
> +* num_irqs is the number of distinct interrupt types the device supports.
> +
> +This version of the protocol only supports PCI devices. Additional devices may
> +be supported in future versions. 
> +
> +VFIO_USER_DEVICE_GET_REGION_INFO
> +--------------------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------+
> +| Name         | Value            |
> ++==============+==================+
> +| Device ID    | <ID>             |
> ++--------------+------------------+
> +| Message ID   | <ID>             |
> ++--------------+------------------+
> +| Command      | 5                | 
> ++--------------+------------------+
> +| Message size | 48 + any caps    |
> ++--------------+------------------+

The client does not know how much space capabilities need when sending
the request. Should the client send a 48-byte request and expect the
server to reply with a 48+ byte response?

> +DMA Read/Write Data
> +^^^^^^^^^^^^^^^^^^^
> +
> ++---------+--------+----------+
> +| Name    | Offset | Size     |
> ++=========+========+==========+
> +| Address | 16     | 8        |
> ++---------+--------+----------+
> +| Count   | 24     | 4        |
> ++---------+--------+----------+
> +| Data    | 28     | variable |
> ++---------+--------+----------+
> +
> +* Address is the area of guest memory being accessed. This address must have
> +  been exported to the server with a VFIO_USER_DMA_MAP message.
> +* Count is the size of the data to be transferred.
> +* Data is the data to be read or written.
> +
> +Address and count can also be accessed as ``struct iovec`` from ``<sys/uio.h>``.

This seems to be incorrect since the count field is 4 bytes but struct
iovec::iov_len is size_t (64-bit on 64-bit hosts).

> +
> +VFIO_USER_REGION_READ
> +---------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | <ID>                   |
> ++--------------+------------------------+
> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | 8                      |
> ++--------------+------------------------+
> +| Message size | 32 + data size         |
> ++--------------+------------------------+
> +| Flags Reply  | bit set in reply       |
> ++--------------+------------------------+
> +| Read info    | REGION read/write data |
> ++--------------+------------------------+
> +
> +This request is sent from the client to the server to read from device memory.
> +In the request messages, there will be no data, and the count field will be the
> +amount of data to be read. The reply will include the data read, and its count
> +field will be the amount of data read.
> +
> +VFIO_USER_REGION_WRITE
> +----------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | <ID>                   |
> ++--------------+------------------------+
> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | 9                      |
> ++--------------+------------------------+
> +| Message size | 32 + data size         |
> ++--------------+------------------------+
> +| Flags        | Reply bit set in reply |
> ++--------------+------------------------+
> +| Write info   | REGION read write data |
> ++--------------+------------------------+
> +
> +This request is sent from the client to the server to write to device memory.
> +The request message will contain the data to be written, and its count field
> +will contain the amount of write data. The count field in the reply will be
> +zero.
> +
> +VFIO_USER_DMA_READ
> +------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+---------------------+
> +| Name         | Value               |
> ++==============+=====================+
> +| Device ID    | <ID>                |
> ++--------------+---------------------+
> +| Message ID   | <ID>                |
> ++--------------+---------------------+
> +| Command      | 10                  |
> ++--------------+---------------------+
> +| Message size | 28 + data size      |
> ++--------------+---------------------+
> +| Flags Reply  | bit set in reply    |
> ++--------------+---------------------+
> +| DMA info     | DMA read/write data |
> ++--------------+---------------------+
> +
> +This request is sent from the server to the client to read from guest memory.
> +In the request messages, there will be no data, and the count field will be the
> +amount of data to be read. The reply will include the data read, and its count
> +field will be the amount of data read.
> +
> +VFIO_USER_DMA_WRITE
> +-------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | <ID>                   |
> ++--------------+------------------------+
> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | 11                     |
> ++--------------+------------------------+
> +| Message size | 28 + data size         |
> ++--------------+------------------------+
> +| Flags        | Reply bit set in reply |
> ++--------------+------------------------+
> +| DMA info     | DMA read/write data    |
> ++--------------+------------------------+
> +
> +This request is sent from the server to the client to write to guest memory.
> +The request message will contain the data to be written, and its count field
> +will contain the amount of write data. The count field in the reply will be
> +zero.
> +
> +VFIO_USER_VM_INTERRUPT
> +----------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++----------------+------------------------+
> +| Name           | Value                  |
> ++================+========================+
> +| Device ID      | <ID>                   |
> ++----------------+------------------------+
> +| Message ID     | <ID>                   |
> ++----------------+------------------------+
> +| Command        | 12                     |
> ++----------------+------------------------+
> +| Message size   | 24                     |
> ++----------------+------------------------+
> +| Flags          | Reply bit set in reply |
> ++----------------+------------------------+
> +| Interrupt info | <interrupt>            |
> ++----------------+------------------------+
> +
> +This request is sent from the server to the client to signal the device has
> +raised an interrupt.
> +
> +Interrupt info format
> +^^^^^^^^^^^^^^^^^^^^^
> +
> ++----------+--------+------+
> +| Name     | Offset | Size |
> ++==========+========+======+
> +| Index    | 16     | 4    |
> ++----------+--------+------+
> +| Subindex | 20     | 4    |
> ++----------+--------+------+
> +
> +* Index is the interrupt index; it is the same value used in VFIO_USER_SET_IRQS.
> +* Subindex is relative to the index, e.g., the vector number used in PCI MSI/X
> +  type interrupts.
> +
> +VFIO_USER_DEVICE_RESET
> +----------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | <ID>                   |
> ++--------------+------------------------+
> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | 13                     |
> ++--------------+------------------------+
> +| Message size | 16                     |
> ++--------------+------------------------+
> +| Flags        | Reply bit set in reply |
> ++--------------+------------------------+
> +
> +This request is sent from the client to the server to reset the device.

How are errors treated in any of these commands? For example, if memory
addresses are out-of-bounds?
RE: [PATCH] introduce VFIO-over-socket protocol specificaion
Posted by Thanos Makatos 4 years, 2 months ago
> -----Original Message-----
> From: Stefan Hajnoczi <stefanha@redhat.com>
> Sent: 17 July 2020 13:18
> To: Thanos Makatos <thanos.makatos@nutanix.com>
> Cc: qemu-devel@nongnu.org; alex.williamson@redhat.com;
> benjamin.walker@intel.com; elena.ufimtseva@oracle.com;
> jag.raman@oracle.com; Swapnil Ingle <swapnil.ingle@nutanix.com>;
> james.r.harris@intel.com; konrad.wilk@oracle.com; Raphael Norwitz
> <raphael.norwitz@nutanix.com>; marcandre.lureau@redhat.com;
> Kanth.Ghatraju@oracle.com; Felipe Franciosi <felipe@nutanix.com>;
> tina.zhang@intel.com; changpeng.liu@intel.com; dgilbert@redhat.com;
> tomassetti.andrea@gmail.com; yuvalkashtan@gmail.com;
> ismael@linux.com; John G Johnson <john.g.johnson@oracle.com>
> Subject: Re: [PATCH] introduce VFIO-over-socket protocol specificaion
> 
> On Thu, Jul 16, 2020 at 08:31:43AM -0700, Thanos Makatos wrote:
> > This patch introduces the VFIO-over-socket protocol specification, which
> > is designed to allow devices to be emulated outside QEMU, in a separate
> > process. VFIO-over-socket reuses the existing VFIO defines, structs and
> > concepts.
> >
> > It has been earlier discussed as an RFC in:
> > "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> >
> > Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> > Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
> > ---
> >  docs/devel/vfio-over-socket.rst | 1135
> +++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 1135 insertions(+), 0 deletions(-)
> >  create mode 100644 docs/devel/vfio-over-socket.rst
> 
> This is exciting! The spec is clear enough that I feel I could start
> writing a client/server. There is enough functionality here to implement
> real-world devices. Can you share links to client/server
> implementations?

Strictly speaking there's no client/server implementation yet as we're waiting
until the protocol is finalized.

The closest server implementation we have is an NVMe controller implemented in
SPDK with MUSER:
(https://github.com/tmakatos/spdk/tree/rfc-vfio-over-socket and
https://github.com/tmakatos/muser/tree/vfio-over-socket).

The closest client implementation we have is libvfio in the VFIO-over-socket
PoC.

Neither of these implementation use the new protocol, but they're similar in
spirit.

John is working on the VFIO changes, the only thing left to do is the DMA/IOMMU
changes we made in the last review round, they'll be on their GitHub site soon.


> 
> launching, and controlling device emulation processes. That doesn't need
> It would be useful to introduce a standard way of enumerating,
> to be part of this specification document though. In vhost-user there
> are conventions for command-line parameters, process lifecycle, etc that
> make it easier for management tools to run device processes (the
> "Backend program conventions" section in vhost-user.rst).

Sure, we'll come up with something similar based on those.

> 
> > diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-
> socket.rst
> > new file mode 100644
> > index 0000000..723b944
> > --- /dev/null
> > +++ b/docs/devel/vfio-over-socket.rst
> > @@ -0,0 +1,1135 @@
> > +***************************************
> > +VFIO-over-socket Protocol Specification
> > +***************************************
> > +
> > +Version 0.1
> 
> Please include a reference to the section below explaining how
> versioning works.

I'm not sure I understand, do you mean we should add something like the
following (right below "Version 0.1"):

"Refer to section 1.2.3 on how versioning works."

?

> 
> Also, are there rules about noting versions when updating the spec? For
> example:
> 
>   When making a change to this specification, the protocol version
>   number must be included:
> 
>     The `foo` field contains ... Added in version 1.3.

OK, we'll add the rule as per your recommendation.

> 
> > +
> > +Introduction
> > +============
> > +VFIO-over-socket, also known as vfio-user, is a protocol that allows a
> device
> 
> vfio-user is shorted. Now is the best time to start consistently using
> "vfio-user" as the name for this protocol. Want to drop the name
> VFIO-over-socket?

"vfio-user" it is.

> 
> > +to be virtualized in a separate process outside of QEMU. VFIO-over-
> socket
> 
> Is there anything QEMU-specific about this protocol?
> 
> I think the scope of this protocol is more general and it could be
> described as:
> 
>   allows device emulation in a separate process outside of a Virtual
>   Machine Monitor (VMM).
> 
> (Or "emulator" instead of VMM, if you prefer.)
> 
> > +devices consist of a generic VFIO device type, living inside QEMU, which
> we
> 
> s/QEMU/the VMM/

Correct, QEMU is simply our main use case but the protocol is agnostic to that.
We'll drop QEMU and rephrase based on your suggestion.

> 
> > +call the client, and the core device implementation, living outside QEMU,
> which
> > +we call the server. VFIO-over-socket can be the main transport
> mechanism for
> > +multi-process QEMU, however it can be used by other applications
> offering
> > +device virtualization. Explaining the advantages of a
> > +disaggregated/multi-process QEMU, and device virtualization outside
> QEMU in
> > +general, is beyond the scope of this document.
> > +
> > +This document focuses on specifying the VFIO-over-socket protocol. VFIO
> has
> > +been chosen for the following reasons:
> 
> It's a little subtle here that "VFIO" means the Linux VFIO ioctl
> interface. Expanding it a bit makes the rest of the text clearer:
> 
>   The `Linux VFIO ioctl
>   interface
>   <https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_ has
>   been chosen as the base for this protocol for the following reasons:

OK.

> 
> > +
> > +1) It is a mature and stable API, backed by an extensively used
> framework.
> > +2) The existing VFIO client implementation (qemu/hw/vfio/) can be
> largely
> > +   reused.
> > +
> > +In a proof of concept implementation it has been demonstrated that
> using VFIO
> > +over a UNIX domain socket is a viable option. VFIO-over-socket is
> designed with
> > +QEMU in mind, however it could be used by other client applications. The
> > +VFIO-over-socket protocol does not require that QEMU's VFIO client
> > +implementation is used in QEMU. None of the VFIO kernel modules are
> required
> > +for supporting the protocol, neither in the client nor the server, only the
> > +source header files are used.
> 
> In the long run only the last sentence in this paragraph needs to be in
> the specification.

OK.

> 
> The proof of concept and QEMU-specific info is good to know for the
> discussion right now, but I don't think this needs to be in the
> specification.

OK.

> 
> > +
> > +The main idea is to allow a virtual device to function in a separate process
> in
> > +the same host over a UNIX domain socket. A UNIX domain socket
> (AF_UNIX) is
> > +chosen because we can trivially send file descriptors over it, which in turn
> > +allows:
> > +
> > +* Sharing of guest memory for DMA with the virtual device process.
> 
> Should the spec start consistently using "server" instead of various
> terms like "virtual device process"?

Indeed, that would be better.

> 
> > +* Sharing of virtual device memory with the guest for fast MMIO.
> 
> The VIRTIO spec restricts itself to the terms "device" and "driver" so
> that the scope isn't too specific to virtualization of software devices.
> I think the same might be good here because who knows how
> VFIO-over-socket will be used in the future. Maybe it will be used
> outside the context of traditional guests. For example, it's a great
> interface for writing device emulation tests and fuzzers!
> 
> I suggest consistently using "client" (for the guest) and "server" (for
> the device emulation process). That way the spec isn't focussed on a
> specific use case.

OK.

> 
> > +* Efficient sharing of eventfd's for triggering interrupts.
> > +
> > +However, other socket types could be used which allows the virtual
> device
> > +process to run in a separate guest in the same host (AF_VSOCK) or
> remotely
> > +(AF_INET). Theoretically the underlying transport doesn't necessarily have
> to
> > +be a socket, however we don't examine such alternatives. In this
> document we
> > +focus on using a UNIX domain socket and introduce basic support for the
> other
> > +two types of sockets without considering performance implications.
> 
> This is a good place to be explicit about the protocol requirements:
> 
> Is file-descriptor passing necessary?
> 
> Is shared-memory necessary?
> 
> Is there always an in-band message-passing fallback for any operation
> that can be optimized via fd passing?
> 
> By stating something about this in the spec it's easier to ensure that
> the protocol continues to follow these design parameters in the future
> when other people make modifications.

Passing of file descriptors isn't necessary. We explain that later in
"Guest Memory Configuration" and we'll also state it here for clarity.

> 
> > +This document does not yet describe any internal details of the server-
> side
> > +implementation, however QEMU's VFIO client implementation will have
> to be
> > +adapted according to this protocol in order to support VFIO-over-socket
> virtual
> > +devices.
> 
> This paragraph about the QEMU implementation status can be dropped from
> the specification.

OK.

> 
> > +
> > +VFIO
> > +====
> > +VFIO is a framework that allows a physical device to be securely passed
> through
> > +to a user space process; the kernel does not drive the device at all.
> > +Typically, the user space process is a VM and the device is passed through
> to
> > +it in order to achieve high performance. VFIO provides an API and the
> required
> > +functionality in the kernel. QEMU has adopted VFIO to allow a guest
> virtual
> > +machine to directly access physical devices, instead of emulating them in
> > +software
> 
> s/software/software./

OK.

> 
> > +
> > +VFIO-over-socket reuses the core VFIO concepts defined in its API, but
> > +implements them as messages to be sent over a UNIX-domain socket. It
> does not
> 
> Can "UNIX-domain" be dropped here? If the protocol design is intended to
> support other transports then it helps to avoid mentioning a specific
> transport even if only AF_UNIX will be implemented in the beginning.

OK.

> 
> > +change the kernel-based VFIO in any way, in fact none of the VFIO kernel
> > +modules need to be loaded to use VFIO-over-socket. It is also possible for
> QEMU
> > +to concurrently use the current kernel-based VFIO for one guest device,
> and use
> > +VFIO-over-socket for another device in the same guest.
> > +
> > +VFIO Device Model
> > +-----------------
> > +A device under VFIO presents a standard VFIO model to the user process.
> Many
> 
> I'm not sure what the meaning of the first sentence is.
> 
> s/standard VFIO model/standard interface/ ?

OK.

> 
> > +of the VFIO operations in the existing kernel model use the ioctl() system
> > +call, and references to the existing model are called the ioctl()
> > +implementation in this document.
> > +
> > +The following sections describe the set of messages that implement the
> VFIO
> > +device model over a UNIX domain socket. In many cases, the messages
> are direct
> > +translations of data structures used in the ioctl() implementation.
> Messages
> > +derived from ioctl()s will have a name derived from the ioctl() command
> name.
> > +E.g., the VFIO_GET_INFO ioctl() command becomes a
> VFIO_USER_GET_INFO message.
> > +The purpose for this reuse is to share as much code as feasible with the
> > +ioctl() implementation.
> > +
> > +Client and Server
> > +^^^^^^^^^^^^^^^^^
> > +The socket connects two processes together: a client process and a server
> > +process. In the context of this document, the client process is the process
> > +emulating a guest virtual machine, such as QEMU. The server process is a
> > +process that provides device emulation.
> > +
> > +Connection Initiation
> > +^^^^^^^^^^^^^^^^^^^^^
> > +After the client connects to the server, the initial server message is
> > +VFIO_USER_VERSION to propose a protocol version and set of capabilities
> to
> > +apply to the session. The client replies with a compatible version and set
> of
> > +capabilities it will support, or closes the connection if it cannot support the
> > +advertised version.
> > +
> > +Guest Memory Configuration
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +The client uses VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP
> messages to inform
> > +the server of the valid guest DMA ranges that the server can access on
> behalf
> > +of a device. Guest memory may be accessed by the server via
> VFIO_USER_DMA_READ
> > +and VFIO_USER_DMA_WRITE messages over the socket.
> 
> "Guest" is a virtualization term. "DMA Memory Configuration" and "DMA
> memory may be accessed ..." would be more general.

OK.

> 
> > +
> > +An optimization for server access to guest memory is for the client to
> provide
> > +file descriptors the server can mmap() to directly access guest memory.
> Note
> > +that mmap() privileges cannot be revoked by the client, therefore file
> > +descriptors should only be exported in environments where the client
> trusts the
> > +server not to corrupt guest memory.
> > +
> > +Device Information
> > +^^^^^^^^^^^^^^^^^^
> > +The client uses a VFIO_USER_DEVICE_GET_INFO message to query the
> server for
> > +information about the device. This information includes:
> > +
> > +* The device type and capabilities,
> > +* the number of memory regions, and
> 
> VFIO calls them "device regions", which is clearer:
> 1. It cannot be confused with "DMA Memory Regions"
> 2. It allows for I/O space in additional to Memory space
> 
> I suggest using "device regions" consistently.

OK.

> 
> > +* the device presents to the guest the number of interrupt types the
> device
> > +  supports.
> > +
> > +Region Information
> > +^^^^^^^^^^^^^^^^^^
> > +The client uses VFIO_USER_DEVICE_GET_REGION_INFO messages to
> query the server
> > +for information about the device's memory regions. This information
> describes:
> > +
> > +* Read and write permissions, whether it can be memory mapped, and
> whether it
> > +  supports additional capabilities.
> > +* Region index, size, and offset.
> > +
> > +When a region can be mapped by the client, the server provides a file
> > +descriptor which the client can mmap(). The server is responsible for
> polling
> > +for client updates to memory mapped regions.
> > +
> > +Region Capabilities
> > +"""""""""""""""""""
> > +Some regions have additional capabilities that cannot be described
> adequately
> > +by the region info data structure. These capabilities are returned in the
> > +region info reply in a list similar to PCI capabilities in a PCI device's
> > +configuration space.
> > +
> > +Sparse Regions
> > +""""""""""""""
> > +A region can be memory-mappable in whole or in part. When only a
> subset of a
> > +region can be mapped by the client, a
> VFIO_REGION_INFO_CAP_SPARSE_MMAP
> > +capability is included in the region info reply. This capability describes
> > +which portions can be mapped by the client.
> > +
> > +For example, in a virtual NVMe controller, sparse regions can be used so
> that
> > +accesses to the NVMe registers (found in the beginning of BAR0) are
> trapped (an
> > +infrequent an event), while allowing direct access to the doorbells (an
> > +extremely frequent event as every I/O submission requires a write to
> BAR0),
> > +found right after the NVMe registers in BAR0.
> > +
> > +Interrupts
> > +^^^^^^^^^^
> > +The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query
> the server for
> > +the device's interrupt types. The interrupt types are specific to the bus
> the
> > +device is attached to, and the client is expected to know the capabilities of
> > +each interrupt type. The server can signal an interrupt either with
> > +VFIO_USER_VM_INTERRUPT messages over the socket, or can directly
> inject
> > +interrupts into the guest via an event file descriptor. The client configures
> > +how the server signals an interrupt with VFIO_USER_SET_IRQS messages.
> > +
> > +Device Read and Write
> > +^^^^^^^^^^^^^^^^^^^^^
> > +When the guest executes load or store operations to device memory, the
> client
> > +forwards these operations to the server with VFIO_USER_REGION_READ
> or
> > +VFIO_USER_REGION_WRITE messages. The server will reply with data
> from the
> > +device on read operations or an acknowledgement on write operations.
> > +
> > +DMA
> > +^^^
> > +When a device performs DMA accesses to guest memory, the server will
> forward
> > +them to the client with VFIO_USER_DMA_READ and
> VFIO_USER_DMA_WRITE messages.
> > +These messages can only be used to access guest memory the client has
> > +configured into the server.
> > +
> > +Protocol Specification
> > +======================
> > +To distinguish from the base VFIO symbols, all VFIO-over-socket symbols
> are
> > +prefixed with vfio_user or VFIO_USER. In revision 0.1, all data is in the
> > +little-endian format, although this may be relaxed in future revision in
> cases
> > +where the client and server are both big-endian. The messages are
> formatted
> > +for seamless reuse of the native VFIO structs. A server can serve:
> > +
> > +1) multiple clients, and/or
> > +2) multiple virtual devices, belonging to one or more clients.
> > +
> > +Therefore each message requires a header that uniquely identifies the
> virtual
> > +device. It is a server-side implementation detail whether a single server
> > +handles multiple virtual devices from the same or multiple guests.
> > +
> > +Socket
> > +------
> > +A single UNIX domain socket is assumed to be used for each device. The
> location
> > +of the socket is implementation-specific. Multiplexing clients, devices, and
> > +servers over the same socket is not supported in this version of the
> protocol,
> > +but a device ID field exists in the message header so that a future support
> can
> > +be added without a major version change.
> 
> There is a version negotiation mechanism for making protocol changes.
> I'm not sure there is any advantage to including the multiplexing header
> now? I suggest keeping it simple and dropping the unused header.

OK.

> 
> > +
> > +Authentication
> > +--------------
> > +For AF_UNIX, we rely on OS mandatory access controls on the socket
> files,
> > +therefore it is up to the management layer to set up the socket as
> required.
> > +Socket types than span guests or hosts will require a proper
> authentication
> > +mechanism. Defining that mechanism is deferred to a future version of
> the
> > +protocol.
> > +
> > +Request Concurrency
> > +-------------------
> > +There can be multiple outstanding requests per virtual device, e.g. a
> > +frame buffer where the guest does multiple stores to the virtual device.
> The
> > +server can execute and reorder non-conflicting requests in parallel,
> depending
> > +on the device semantics.
> 
> The word "request" has not been used up to this point in the spec. What
> does "request" mean here?
> 
> I'm not sure if this section is talking about vfio-user protocol message
> exchanges (e.g. a client->server message followed by a server->client
> message) or something else?

Indeed, this is confusing, we use the term "command" earlier in the
"VFIO Device Model" section, we'll replace the term "request"
with "command" throughout the specification.

> 
> What about the ordering semantics at the vfio-user protocol level? For
> example, if a client sends multiple VFIO_USER_DMA_MAP/UNMAP
> messages
> then the order matters. What are the rules? I wonder if maybe
> concurrency is a special case and really only applies to a subset of
> protocol commands?

All commands are executed in the order they were sent, regardless of whether a
reply is needed.

> 
> I'm not sure how a client would exploit parallelism in this protocol.
> Can you give an example of a case where there would be multiple commands
> pending on a single device?

For instance, a client can issue the following operations back to back without
waiting for the first two to complete:
1. map a DMA region 
2. trigger some device-specific operation that results in data being read into
   that DMA region, and
3. unmap the DMA region

> 
> > +
> > +Socket Disconnection Behavior
> > +-----------------------------
> > +The server and the client can disconnect from each other, either
> intentionally
> > +or unexpectedly. Both the client and the server need to know how to
> handle such
> > +events.
> > +
> > +Server Disconnection
> > +^^^^^^^^^^^^^^^^^^^^
> > +A server disconnecting from the client may indicate that:
> > +
> > +1) A virtual device has been restarted, either intentionally (e.g. because of
> a
> > +device update) or unintentionally (e.g. because of a crash). In any case,
> the
> > +virtual device will come back so the client should not do anything (e.g.
> simply
> > +reconnect and retry failed operations).
> > +
> > +2) A virtual device has been shut down with no intention to be restarted.
> > +
> > +It is impossible for the client to know whether or not a failure is
> > +intermittent or innocuous and should be retried, therefore the client
> should
> > +attempt to reconnect to the socket. Since an intentional server restart
> (e.g.
> > +due to an upgrade) might take some time, a reasonable timeout should
> be used.
> > +In cases where the disconnection is expected (e.g. the guest shutting
> down), no
> > +new requests will be sent anyway so this situation doesn't pose a
> problem. The
> > +control stack will clean up accordingly.
> > +
> > +Parametrizing this behaviour by having the virtual device advertise a
> 
> s/Parametrizing/Parameterizing/

OK.

> 
> > +reasonable reconnect is deferred to a future version of the protocol.
> 
> No mention is made of recovering state or how disconnect maps to VFIO
> device types (PCI, etc.). Does a disconnect mean that the device has
> been reset?

Regarding recovering state, I believe that since all the building blocks are
there and the client is pretty much the master in the vfio-user model, it's up
to the client how to deal with exceptional situations. For recovery to work, the
client will have to reconfigure IRQs, DMAs, etc., and the server will have to
persistently store device state.

Regarding how disconnect maps to VFIO device types, it depends on whether or not
the client/server can recover from it and continue operating. If this doesn't
happen (e.g. the server hasn't restarted, the client doesn't support
recovering), then the device cannot continue being operational, which I suppose
is an implementation detail of the client.

Do we need something more specific at this stage?

> 
> > +
> > +Client Disconnection
> > +^^^^^^^^^^^^^^^^^^^^
> > +The client disconnecting from the server primarily means that the QEMU
> process
> > +has exited. Currently this means that the guest is shut down so the device
> is
> > +no longer needed therefore the server can automatically exit. However,
> there
> > +can be cases where a client disconnect should not result in a server exit:
> > +
> > +1) A single server serving multiple clients.
> > +2) A multi-process QEMU upgrading itself step by step, which isn't yet
> > +   implemented.
> > +
> > +Therefore in order for the protocol to be forward compatible the server
> should
> > +take no action when the client disconnects. If anything happens to the
> client
> > +process the control stack will know about it and can clean up resources
> > +accordingly.
> > +
> > +Request Retry and Response Timeout
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +QEMU's VFIO retries certain operations if they fail. While this makes sense
> for
> > +real HW, we don't know for sure whether it makes sense for virtual
> devices. A
> > +failed request is a request that has been successfully sent and has been
> > +responded to with an error code. Failure to send the request in the first
> place
> > +(e.g. because the socket is disconnected) is a different type of error
> examined
> > +earlier in the disconnect section.
> > +
> > +Defining a retry and timeout scheme if deferred to a future version of the
> > +protocol.
> > +
> > +Commands
> > +--------
> > +The following table lists the VFIO message command IDs, and whether
> the
> > +message request is sent from the client or the server.
> > +
> > ++----------------------------------+---------+-------------------+
> > +| Name                             | Command | Request Direction |
> >
> ++==================================+=========+===========
> ========+
> > +| VFIO_USER_VERSION                | 1       | server ???????? client   |
> 
> Unicode issues? I see "????????" instead of the characters that were
> supposed to be here.

My bad, I selected UTF-8 when I ran git send-email, it's supposed to be "->".

> 
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_MAP                | 2       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_UNMAP              | 3       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_INFO        | 4       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_REGION_INFO | 5       | client ???????? server
> |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_IRQ_INFO    | 6       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_SET_IRQS        | 7       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_REGION_READ            | 8       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_REGION_WRITE           | 9       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_READ               | 10      | server ???????? client   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_READ               | 11      | server ???????? client   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_VM_INTERRUPT           | 12      | server ???????? client   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_DEVICE_RESET                | 13      | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> 
> Why is this command named VFIO_DEVICE_RESET and not
> VFIO_USER_DEVICE_RESET?

That's a mistake, it should be VFIO_USER_DEVICE_RESET.

> 
> > +
> > +Header
> > +------
> > +All messages are preceded by a 16 byte header that contains basic
> information
> 
> I suggest dropping "16 byte" here because the table already shows this
> and it is easy to forget to update the text when changing the table.

OK.

> 
> > +about the message. The header is followed by message-specific data
> described
> > +in the sections below.
> > +
> > ++----------------+--------+-------------+
> > +| Name           | Offset | Size        |
> > ++================+========+=============+
> > +| Device ID      | 0      | 2           |
> > ++----------------+--------+-------------+
> > +| Message ID     | 2      | 2           |
> > ++----------------+--------+-------------+
> > +| Command        | 4      | 4           |
> > ++----------------+--------+-------------+
> > +| Message size   | 8      | 4           |
> > ++----------------+--------+-------------+
> > +| Flags          | 12     | 4           |
> > ++----------------+--------+-------------+
> > +|                | +-----+------------+ |
> > +|                | | Bit | Definition | |
> > +|                | +=====+============+ |
> > +|                | | 0   | Reply      | |
> > +|                | +-----+------------+ |
> > +|                | | 1   | No_reply   | |
> > +|                | +-----+------------+ |
> > ++----------------+--------+-------------+
> > +| <message data> | 16     | variable    |
> > ++----------------+--------+-------------+
> > +
> > +* Device ID identifies the destination device of the message. This field is
> > +  reserved when the server only supports one device per socket.
> 
> Clearer:
> s/is reserved/must be 0/

OK.

> 
> > +* Message ID identifies the message, and is used in the message
> acknowledgement.
> 
> Are Message IDs global or per-device?

We're dropping the device ID field in the header therefore it's global.

> 
> > +* Command specifies the command to be executed, listed in the
> Command Table.
> > +* Message size contains the size of the entire message, including the
> header.
> > +* Flags contains attributes of the message:
> > +
> > +  * The reply bit differentiates request messages from reply messages. A
> reply
> > +    message acknowledges a previous request with the same message ID.
> > +  * No_reply indicates that no reply is needed for this request. This is
> > +    commonly used when multiple requests are sent, and only the last
> needs
> > +    acknowledgement.
> 
> Is_Reply and Dont_Reply are more self-explanatory to me, but this is
> probably subjective.
> 
> > +
> > +VFIO_USER_VERSION
> > +-----------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Device ID    | 0                      |
> > ++--------------+------------------------+
> 
> The Device ID is always 0 and this message is sent just once regardless
> of the number of devices? Please describe this explicitly so it's clear
> that this message is per-session and not per-device.

We're dropping the device ID from the header.

> 
> There could also be a special NO_DEVICE (0xff) value but I don't think
> it matters.
> 
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 1                      |
> > ++--------------+------------------------+
> > +| Message size | 16 + version length    |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Version      | JSON byte array        |
> > ++--------------+------------------------+
> > +
> > +This is the initial message sent by the server after the socket connection is
> > +established. The version is in JSON format, and the following objects must
> be
> > +included:
> > +
> > ++--------------+--------+---------------------------------------------------+
> > +| Name         | Type   | Description                                       |
> >
> ++==============+========+================================
> ===================+
> > +| version      | object | {???????major????????: <number>,
> ???????minor????????: <number>}            |
> 
> More Unicode issues.
> 
> > +|              |        | Version supported by the sender, e.g.
> ???????0.1????????.      |
> > ++--------------+--------+---------------------------------------------------+
> > +| type         | string | Fixed to ???????vfio-user????????.                             |
> > ++--------------+--------+---------------------------------------------------+
> > +| capabilities | array  | Reserved. Can be omitted for v0.1, otherwise must
> |
> > +|              |        | be empty.                                         |
> > ++--------------+--------+---------------------------------------------------+
> 
> s/array/array of strings/ ?

No, it's array were each element can be any object in JSON.

> 
> > +
> > +Versioning and Feature Support
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +Upon accepting a connection, the server must send a
> VFIO_USER_VERSION message
> > +proposing a protocol version and a set of capabilities. The client compares
> > +these with the versions and capabilities it supports and sends a
> > +VFIO_USER_VERSION reply according to the following rules.
> > +
> > +* The major version in the reply must be the same as proposed. If the
> client
> > +  does not support the proposed major, it closes the connection.
> > +* The minor version in the reply must be equal to or less than the minor
> > +  version proposed.
> > +* The capability list must be a subset of those proposed. If the client
> > +  requires a capability the server did not include, it closes the connection.
> > +* If type is not ???????vfio-user????????, the client closes the
> connection.
> 
> Is there a rationale for this field? In order to get to the point where
> this JSON is parsed we must already implement the vfio-user protocol
> (e.g. parse the header).

It was suggested to include it as a sanity check, we'll drop it.

> 
> > +
> > +The protocol major version will only change when incompatible protocol
> changes
> > +are made, such as changing the message format. The minor version may
> change
> > +when compatible changes are made, such as adding new messages or
> capabilities,
> > +Both the client and server must support all minor versions less than the
> > +maximum minor version it supports. E.g., an implementation that
> supports
> > +version 1.3 must also support 1.0 through 1.2.
> > +
> > +VFIO_USER_DMA_MAP
> > +-----------------
> > +
> > +VFIO_USER_DMA_UNMAP
> > +-------------------
> > +
> > +Message Format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Device ID    | 0                      |
> > ++--------------+------------------------+
> 
> Is this message per-session instead of per-device?
> 
> This has implications on IOMMU semantics. Two different devices sharing
> a connection have access to the same DMA memory?

Good point, it is per-session since we're dropping the device ID from the
header. I suppose this means that if in a future version we want to support
devices sharing the same connection the header will have to be extended to
include the device ID.

> 
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | MAP=2, UNMAP=3         |
> > ++--------------+------------------------+
> > +| Message size | 16 + table size        |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Table        | array of table entries |
> > ++--------------+------------------------+
> > +
> > +This message is sent by the client to the server to inform it of the guest
> > +memory regions the device can access. It must be sent before the device
> can
> > +perform any DMA to the guest. It is normally sent directly after the
> version
> > +handshake is completed, but may also occur when memory is added or
> subtracted
> > +in the guest.
> > +
> > +The table is an array of the following structure. This structure is 32 bytes
> > +in size, so the message size will be 16 + (# of table entries * 32). If a
> > +region being added can be directly mapped by the server, an array of file
> > +descriptors will be sent as part of the message meta-data. Each region
> entry
> > +will have a corresponding file descriptor. On AF_UNIX sockets, the file
> > +descriptors will be passed as SCM_RIGHTS type ancillary data.
> > +
> > +Table entry format
> > +^^^^^^^^^^^^^^^^^^
> > +
> > ++-------------+--------+-------------+
> > +| Name        | Offset | Size        |
> > ++=============+========+=============+
> > +| Address     | 0      | 8           |
> > ++-------------+--------+-------------+
> > +| Size        | 8      | 8           |
> > ++-------------+--------+-------------+
> > +| Offset      | 16     | 8           |
> > ++-------------+--------+-------------+
> > +| Protections | 24     | 4           |
> > ++-------------+--------+-------------+
> > +| Flags       | 28     | 4           |
> > ++-------------+--------+-------------+
> > +|             | +-----+------------+ |
> > +|             | | Bit | Definition | |
> > +|             | +=====+============+ |
> > +|             | | 0   | Mappable   | |
> > +|             | +-----+------------+ |
> > ++-------------+--------+-------------+
> > +
> > +* Address is the base DMA address of the region.
> > +* Size is the size of the region.
> > +* Offset is the file offset of the region with respect to the associated file
> > +  descriptor.
> > +* Protections are the region's protection attributes as encoded in
> > +  ``<sys/mman.h>``.
> > +* Flags contain the following region attributes:
> > +
> > +  * Mappable indicate the region can be mapped via the mmap() system
> call using
> > +    the file descriptor provided in the message meta-data.
> 
> What are the semantics of map/unmap with respect to mapping over
> existing regions, unmapping a subset of an existing region, etc?

The VFIO kernel driver returns EEXIST if mapping over an existing area
and allows unmapping a subset. I'd think we should follow their example.

> 
> What are the concerns for MAP/UNMAP while the device is operational and
> may be performing DMA? Should this command be combined into a single
> VFIO_USER_SET_DMA_REGIONS command with Flags Bit 1 indicating
> Add/Delete
> so that a single message can atomically add and delete DMA regions?

Regarding receiving a DMA unmap request while there is an ongoing DMA
transaction, once the server ACK's the DMA unmap then it must not touch that DMA
region. It is an implementation detail whether the server waits for the DMA to
finish or kills the DMA operation, which might not be possible at all (e.g part
of that region has been submitted for I/O to a physical device).

> 
> What is the format of the reply to this message?

It's just the header explained in the "Header" section, no additional data are
sent by the server, it simply ACK's the command.

> 
> > +
> > +VFIO_USER_DEVICE_GET_INFO
> > +-------------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+----------------------------+
> > +| Name         | Value                      |
> > ++==============+============================+
> > +| Device ID    | <ID>                       |
> > ++--------------+----------------------------+
> > +| Message ID   | <ID>                       |
> > ++--------------+----------------------------+
> > +| Command      | 4                          |
> > ++--------------+----------------------------+
> > +| Message size | 16 in request, 32 in reply |
> > ++--------------+----------------------------+
> > +| Flags        | Reply bit set in reply     |
> > ++--------------+----------------------------+
> > +| Device info  | VFIO device info           |
> > ++--------------+----------------------------+
> > +
> > +This message is sent by the client to the server to query for basic
> information
> > +about the device. Only the message header is needed in the request
> message.
> > +The VFIO device info structure is defined in ``<sys/vfio.h>`` (``struct
> > +vfio_device_info``).
> > +
> > +VFIO device info format
> > +^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++-------------+--------+--------------------------+
> > +| Name        | Offset | Size                     |
> > ++=============+========+==========================+
> > +| argsz       | 16     | 4                        |
> > ++-------------+--------+--------------------------+
> > +| flags       | 20     | 4                        |
> > ++-------------+--------+--------------------------+
> > +|             | +-----+-------------------------+ |
> > +|             | | Bit | Definition              | |
> > +|             | +=====+=========================+ |
> > +|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
> > +|             | +-----+-------------------------+ |
> > +|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
> > +|             | +-----+-------------------------+ |
> > ++-------------+--------+--------------------------+
> > +| num_regions | 24     | 4                        |
> > ++-------------+--------+--------------------------+
> > +| num_irqs    | 28     | 4                        |
> > ++-------------+--------+--------------------------+
> > +
> > +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
> > +  implementation.
> > +* flags contains the following device attributes.
> > +
> > +  * VFIO_DEVICE_FLAGS_RESET indicates the device supports the
> > +    VFIO_USER_DEVICE_RESET message.
> 
> Why is VFIO_USER_DEVICE_RESET support optional?

Because it is optional in VFIO, too.

> 
> > +  * VFIO_DEVICE_FLAGS_PCI indicates the device is a PCI device.
> > +
> > +* num_regions is the number of memory regions the device exposes.
> > +* num_irqs is the number of distinct interrupt types the device supports.
> > +
> > +This version of the protocol only supports PCI devices. Additional devices
> may
> > +be supported in future versions.
> > +
> > +VFIO_USER_DEVICE_GET_REGION_INFO
> > +--------------------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------+
> > +| Name         | Value            |
> > ++==============+==================+
> > +| Device ID    | <ID>             |
> > ++--------------+------------------+
> > +| Message ID   | <ID>             |
> > ++--------------+------------------+
> > +| Command      | 5                |
> > ++--------------+------------------+
> > +| Message size | 48 + any caps    |
> > ++--------------+------------------+
> 
> The client does not know how much space capabilities need when sending
> the request. Should the client send a 48-byte request and expect the
> server to reply with a 48+ byte response?

Correct.

> 
> > +DMA Read/Write Data
> > +^^^^^^^^^^^^^^^^^^^
> > +
> > ++---------+--------+----------+
> > +| Name    | Offset | Size     |
> > ++=========+========+==========+
> > +| Address | 16     | 8        |
> > ++---------+--------+----------+
> > +| Count   | 24     | 4        |
> > ++---------+--------+----------+
> > +| Data    | 28     | variable |
> > ++---------+--------+----------+
> > +
> > +* Address is the area of guest memory being accessed. This address must
> have
> > +  been exported to the server with a VFIO_USER_DMA_MAP message.
> > +* Count is the size of the data to be transferred.
> > +* Data is the data to be read or written.
> > +
> > +Address and count can also be accessed as ``struct iovec`` from
> ``<sys/uio.h>``.
> 
> This seems to be incorrect since the count field is 4 bytes but struct
> iovec::iov_len is size_t (64-bit on 64-bit hosts).

Indeed, I forgot about padding. We can remove the reference to struct iovec or
make count 8 bytes?

> 
> > +
> > +VFIO_USER_REGION_READ
> > +---------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Device ID    | <ID>                   |
> > ++--------------+------------------------+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 8                      |
> > ++--------------+------------------------+
> > +| Message size | 32 + data size         |
> > ++--------------+------------------------+
> > +| Flags Reply  | bit set in reply       |
> > ++--------------+------------------------+
> > +| Read info    | REGION read/write data |
> > ++--------------+------------------------+
> > +
> > +This request is sent from the client to the server to read from device
> memory.
> > +In the request messages, there will be no data, and the count field will be
> the
> > +amount of data to be read. The reply will include the data read, and its
> count
> > +field will be the amount of data read.
> > +
> > +VFIO_USER_REGION_WRITE
> > +----------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Device ID    | <ID>                   |
> > ++--------------+------------------------+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 9                      |
> > ++--------------+------------------------+
> > +| Message size | 32 + data size         |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Write info   | REGION read write data |
> > ++--------------+------------------------+
> > +
> > +This request is sent from the client to the server to write to device
> memory.
> > +The request message will contain the data to be written, and its count
> field
> > +will contain the amount of write data. The count field in the reply will be
> > +zero.
> > +
> > +VFIO_USER_DMA_READ
> > +------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+---------------------+
> > +| Name         | Value               |
> > ++==============+=====================+
> > +| Device ID    | <ID>                |
> > ++--------------+---------------------+
> > +| Message ID   | <ID>                |
> > ++--------------+---------------------+
> > +| Command      | 10                  |
> > ++--------------+---------------------+
> > +| Message size | 28 + data size      |
> > ++--------------+---------------------+
> > +| Flags Reply  | bit set in reply    |
> > ++--------------+---------------------+
> > +| DMA info     | DMA read/write data |
> > ++--------------+---------------------+
> > +
> > +This request is sent from the server to the client to read from guest
> memory.
> > +In the request messages, there will be no data, and the count field will be
> the
> > +amount of data to be read. The reply will include the data read, and its
> count
> > +field will be the amount of data read.
> > +
> > +VFIO_USER_DMA_WRITE
> > +-------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Device ID    | <ID>                   |
> > ++--------------+------------------------+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 11                     |
> > ++--------------+------------------------+
> > +| Message size | 28 + data size         |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| DMA info     | DMA read/write data    |
> > ++--------------+------------------------+
> > +
> > +This request is sent from the server to the client to write to guest
> memory.
> > +The request message will contain the data to be written, and its count
> field
> > +will contain the amount of write data. The count field in the reply will be
> > +zero.
> > +
> > +VFIO_USER_VM_INTERRUPT
> > +----------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++----------------+------------------------+
> > +| Name           | Value                  |
> > ++================+========================+
> > +| Device ID      | <ID>                   |
> > ++----------------+------------------------+
> > +| Message ID     | <ID>                   |
> > ++----------------+------------------------+
> > +| Command        | 12                     |
> > ++----------------+------------------------+
> > +| Message size   | 24                     |
> > ++----------------+------------------------+
> > +| Flags          | Reply bit set in reply |
> > ++----------------+------------------------+
> > +| Interrupt info | <interrupt>            |
> > ++----------------+------------------------+
> > +
> > +This request is sent from the server to the client to signal the device has
> > +raised an interrupt.
> > +
> > +Interrupt info format
> > +^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++----------+--------+------+
> > +| Name     | Offset | Size |
> > ++==========+========+======+
> > +| Index    | 16     | 4    |
> > ++----------+--------+------+
> > +| Subindex | 20     | 4    |
> > ++----------+--------+------+
> > +
> > +* Index is the interrupt index; it is the same value used in
> VFIO_USER_SET_IRQS.
> > +* Subindex is relative to the index, e.g., the vector number used in PCI
> MSI/X
> > +  type interrupts.
> > +
> > +VFIO_USER_DEVICE_RESET
> > +----------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Device ID    | <ID>                   |
> > ++--------------+------------------------+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 13                     |
> > ++--------------+------------------------+
> > +| Message size | 16                     |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +
> > +This request is sent from the client to the server to reset the device.
> 
> How are errors treated in any of these commands? For example, if memory
> addresses are out-of-bounds?

We'll add an error flag in the header and an error field to store a UNIX errno.
It will be unused in the command.


John and Thanos

Re: [PATCH] introduce VFIO-over-socket protocol specificaion
Posted by Stefan Hajnoczi 4 years, 1 month ago
On Wed, Jul 22, 2020 at 11:42:26AM +0000, Thanos Makatos wrote:
> > > diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-
> > socket.rst
> > > new file mode 100644
> > > index 0000000..723b944
> > > --- /dev/null
> > > +++ b/docs/devel/vfio-over-socket.rst
> > > @@ -0,0 +1,1135 @@
> > > +***************************************
> > > +VFIO-over-socket Protocol Specification
> > > +***************************************
> > > +
> > > +Version 0.1
> > 
> > Please include a reference to the section below explaining how
> > versioning works.
> 
> I'm not sure I understand, do you mean we should add something like the
> following (right below "Version 0.1"):
> 
> "Refer to section 1.2.3 on how versioning works."
> 
> ?

Yes, coming across this version number the reader has no idea about its
meaning and how the protocol is versioned. The spec then moves on to
other things. It would be helpful to reference the section that explains
how versioning works so that the reader knows where to go to understand
the meaning of the number.

> > What about the ordering semantics at the vfio-user protocol level? For
> > example, if a client sends multiple VFIO_USER_DMA_MAP/UNMAP
> > messages
> > then the order matters. What are the rules? I wonder if maybe
> > concurrency is a special case and really only applies to a subset of
> > protocol commands?
> 
> All commands are executed in the order they were sent, regardless of whether a
> reply is needed.
> 
> > 
> > I'm not sure how a client would exploit parallelism in this protocol.
> > Can you give an example of a case where there would be multiple commands
> > pending on a single device?
> 
> For instance, a client can issue the following operations back to back without
> waiting for the first two to complete:
> 1. map a DMA region 
> 2. trigger some device-specific operation that results in data being read into
>    that DMA region, and
> 3. unmap the DMA region

That is pipelining, but I don't see the ability to "reorder
non-conflicting requests in parallel". That example has no parallelism.

It's unclear to me what the spec means by concurrency and parallelism.

If the intention is just to allow pipelining then request identifiers
aren't really necessary. The client can keep track of which request will
complete next. So I'm wondering if there is some parallelism somewhere
that I'm missing...


> > 
> > > +
> > > +Socket Disconnection Behavior
> > > +-----------------------------
> > > +The server and the client can disconnect from each other, either
> > intentionally
> > > +or unexpectedly. Both the client and the server need to know how to
> > handle such
> > > +events.
> > > +
> > > +Server Disconnection
> > > +^^^^^^^^^^^^^^^^^^^^
> > > +A server disconnecting from the client may indicate that:
> > > +
> > > +1) A virtual device has been restarted, either intentionally (e.g. because of
> > a
> > > +device update) or unintentionally (e.g. because of a crash). In any case,
> > the
> > > +virtual device will come back so the client should not do anything (e.g.
> > simply
> > > +reconnect and retry failed operations).
> > > +
> > > +2) A virtual device has been shut down with no intention to be restarted.
> > > +
> > > +It is impossible for the client to know whether or not a failure is
> > > +intermittent or innocuous and should be retried, therefore the client
> > should
> > > +attempt to reconnect to the socket. Since an intentional server restart
> > (e.g.
> > > +due to an upgrade) might take some time, a reasonable timeout should
> > be used.
> > > +In cases where the disconnection is expected (e.g. the guest shutting
> > down), no
> > > +new requests will be sent anyway so this situation doesn't pose a
> > problem. The
> > > +control stack will clean up accordingly.
> > > +
> > > +Parametrizing this behaviour by having the virtual device advertise a
> > 
> > s/Parametrizing/Parameterizing/
> 
> OK.
> 
> > 
> > > +reasonable reconnect is deferred to a future version of the protocol.
> > 
> > No mention is made of recovering state or how disconnect maps to VFIO
> > device types (PCI, etc.). Does a disconnect mean that the device has
> > been reset?
> 
> Regarding recovering state, I believe that since all the building blocks are
> there and the client is pretty much the master in the vfio-user model, it's up
> to the client how to deal with exceptional situations. For recovery to work, the
> client will have to reconfigure IRQs, DMAs, etc., and the server will have to
> persistently store device state.
> 
> Regarding how disconnect maps to VFIO device types, it depends on whether or not
> the client/server can recover from it and continue operating. If this doesn't
> happen (e.g. the server hasn't restarted, the client doesn't support
> recovering), then the device cannot continue being operational, which I suppose
> is an implementation detail of the client.
> 
> Do we need something more specific at this stage?

This sounds like implementation-defined behavior. If the spec does not
cover this area then it might be hard to standardize it later without
breaking existing implementations.

Do you know how kernel VFIO behaves when the application is terminated
unexpectedly?

My intuition is that the device should be reset or at least fenced to
discard all DMA immediately. With that as the default behavior the
protocol could let implementations use protocol extensions for other
recovery behavior.

> > > +|              |        | Version supported by the sender, e.g.
> > ???????0.1????????.      |
> > > ++--------------+--------+---------------------------------------------------+
> > > +| type         | string | Fixed to ???????vfio-user????????.                             |
> > > ++--------------+--------+---------------------------------------------------+
> > > +| capabilities | array  | Reserved. Can be omitted for v0.1, otherwise must
> > |
> > > +|              |        | be empty.                                         |
> > > ++--------------+--------+---------------------------------------------------+
> > 
> > s/array/array of strings/ ?
> 
> No, it's array were each element can be any object in JSON.

Okay. Maybe there is a way to be explicit by saying "untyped array" or
adding "The type of the elements is not defined in this version of the
specification" to the description.

> 
> > 
> > > +
> > > +Versioning and Feature Support
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +Upon accepting a connection, the server must send a
> > VFIO_USER_VERSION message
> > > +proposing a protocol version and a set of capabilities. The client compares
> > > +these with the versions and capabilities it supports and sends a
> > > +VFIO_USER_VERSION reply according to the following rules.
> > > +
> > > +* The major version in the reply must be the same as proposed. If the
> > client
> > > +  does not support the proposed major, it closes the connection.
> > > +* The minor version in the reply must be equal to or less than the minor
> > > +  version proposed.
> > > +* The capability list must be a subset of those proposed. If the client
> > > +  requires a capability the server did not include, it closes the connection.
> > > +* If type is not ???????vfio-user????????, the client closes the
> > connection.
> > 
> > Is there a rationale for this field? In order to get to the point where
> > this JSON is parsed we must already implement the vfio-user protocol
> > (e.g. parse the header).
> 
> It was suggested to include it as a sanity check, we'll drop it.

It's fine to keep it. Many protocols have a banner or identifier early
on that requires some parsing. In this case it could be: this field
makes it easy for humans decoding a protocol conversation to identify
the protocol as vfio-user and also serves as a sanity check.

> > What are the semantics of map/unmap with respect to mapping over
> > existing regions, unmapping a subset of an existing region, etc?
> 
> The VFIO kernel driver returns EEXIST if mapping over an existing area
> and allows unmapping a subset. I'd think we should follow their example.

Sounds good.

> > 
> > What are the concerns for MAP/UNMAP while the device is operational and
> > may be performing DMA? Should this command be combined into a single
> > VFIO_USER_SET_DMA_REGIONS command with Flags Bit 1 indicating
> > Add/Delete
> > so that a single message can atomically add and delete DMA regions?
> 
> Regarding receiving a DMA unmap request while there is an ongoing DMA
> transaction, once the server ACK's the DMA unmap then it must not touch that DMA
> region. It is an implementation detail whether the server waits for the DMA to
> finish or kills the DMA operation, which might not be possible at all (e.g part
> of that region has been submitted for I/O to a physical device).
> 
> > 
> > What is the format of the reply to this message?
> 
> It's just the header explained in the "Header" section, no additional data are
> sent by the server, it simply ACK's the command.

Okay, this was a little unclear to me when reading the spec. Maybe it
would help to have a general explanation at the beginning stating that
the message format is used by both the request and the reply unless
stated otherwise.

> 
> > 
> > > +
> > > +VFIO_USER_DEVICE_GET_INFO
> > > +-------------------------
> > > +
> > > +Message format
> > > +^^^^^^^^^^^^^^
> > > +
> > > ++--------------+----------------------------+
> > > +| Name         | Value                      |
> > > ++==============+============================+
> > > +| Device ID    | <ID>                       |
> > > ++--------------+----------------------------+
> > > +| Message ID   | <ID>                       |
> > > ++--------------+----------------------------+
> > > +| Command      | 4                          |
> > > ++--------------+----------------------------+
> > > +| Message size | 16 in request, 32 in reply |
> > > ++--------------+----------------------------+
> > > +| Flags        | Reply bit set in reply     |
> > > ++--------------+----------------------------+
> > > +| Device info  | VFIO device info           |
> > > ++--------------+----------------------------+
> > > +
> > > +This message is sent by the client to the server to query for basic
> > information
> > > +about the device. Only the message header is needed in the request
> > message.
> > > +The VFIO device info structure is defined in ``<sys/vfio.h>`` (``struct
> > > +vfio_device_info``).
> > > +
> > > +VFIO device info format
> > > +^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-------------+--------+--------------------------+
> > > +| Name        | Offset | Size                     |
> > > ++=============+========+==========================+
> > > +| argsz       | 16     | 4                        |
> > > ++-------------+--------+--------------------------+
> > > +| flags       | 20     | 4                        |
> > > ++-------------+--------+--------------------------+
> > > +|             | +-----+-------------------------+ |
> > > +|             | | Bit | Definition              | |
> > > +|             | +=====+=========================+ |
> > > +|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
> > > +|             | +-----+-------------------------+ |
> > > +|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
> > > +|             | +-----+-------------------------+ |
> > > ++-------------+--------+--------------------------+
> > > +| num_regions | 24     | 4                        |
> > > ++-------------+--------+--------------------------+
> > > +| num_irqs    | 28     | 4                        |
> > > ++-------------+--------+--------------------------+
> > > +
> > > +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
> > > +  implementation.
> > > +* flags contains the following device attributes.
> > > +
> > > +  * VFIO_DEVICE_FLAGS_RESET indicates the device supports the
> > > +    VFIO_USER_DEVICE_RESET message.
> > 
> > Why is VFIO_USER_DEVICE_RESET support optional?
> 
> Because it is optional in VFIO, too.

I see :D

> > > +DMA Read/Write Data
> > > +^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++---------+--------+----------+
> > > +| Name    | Offset | Size     |
> > > ++=========+========+==========+
> > > +| Address | 16     | 8        |
> > > ++---------+--------+----------+
> > > +| Count   | 24     | 4        |
> > > ++---------+--------+----------+
> > > +| Data    | 28     | variable |
> > > ++---------+--------+----------+
> > > +
> > > +* Address is the area of guest memory being accessed. This address must
> > have
> > > +  been exported to the server with a VFIO_USER_DMA_MAP message.
> > > +* Count is the size of the data to be transferred.
> > > +* Data is the data to be read or written.
> > > +
> > > +Address and count can also be accessed as ``struct iovec`` from
> > ``<sys/uio.h>``.
> > 
> > This seems to be incorrect since the count field is 4 bytes but struct
> > iovec::iov_len is size_t (64-bit on 64-bit hosts).
> 
> Indeed, I forgot about padding. We can remove the reference to struct iovec or
> make count 8 bytes?

I suggest removing the reference to struct iovec.
Re: [PATCH] introduce VFIO-over-socket protocol specificaion
Posted by John G Johnson 4 years, 1 month ago
	Thanos is on vacation.  My comments embedded.

					JJ



> On Jul 29, 2020, at 5:48 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> 
> On Wed, Jul 22, 2020 at 11:42:26AM +0000, Thanos Makatos wrote:
>>>> diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-
>>> socket.rst
>>>> new file mode 100644
>>>> index 0000000..723b944
>>>> --- /dev/null
>>>> +++ b/docs/devel/vfio-over-socket.rst
>>>> @@ -0,0 +1,1135 @@
>>>> +***************************************
>>>> +VFIO-over-socket Protocol Specification
>>>> +***************************************
>>>> +
>>>> +Version 0.1
>>> 
>>> Please include a reference to the section below explaining how
>>> versioning works.
>> 
>> I'm not sure I understand, do you mean we should add something like the
>> following (right below "Version 0.1"):
>> 
>> "Refer to section 1.2.3 on how versioning works."
>> 
>> ?
> 
> Yes, coming across this version number the reader has no idea about its
> meaning and how the protocol is versioned. The spec then moves on to
> other things. It would be helpful to reference the section that explains
> how versioning works so that the reader knows where to go to understand
> the meaning of the number.
> 

	OK, we’ll add a forward reference


>>> What about the ordering semantics at the vfio-user protocol level? For
>>> example, if a client sends multiple VFIO_USER_DMA_MAP/UNMAP
>>> messages
>>> then the order matters. What are the rules? I wonder if maybe
>>> concurrency is a special case and really only applies to a subset of
>>> protocol commands?
>> 
>> All commands are executed in the order they were sent, regardless of whether a
>> reply is needed.
>> 
>>> 
>>> I'm not sure how a client would exploit parallelism in this protocol.
>>> Can you give an example of a case where there would be multiple commands
>>> pending on a single device?
>> 
>> For instance, a client can issue the following operations back to back without
>> waiting for the first two to complete:
>> 1. map a DMA region 
>> 2. trigger some device-specific operation that results in data being read into
>>   that DMA region, and
>> 3. unmap the DMA region
> 
> That is pipelining, but I don't see the ability to "reorder
> non-conflicting requests in parallel". That example has no parallelism.
> 
> It's unclear to me what the spec means by concurrency and parallelism.
> 
> If the intention is just to allow pipelining then request identifiers
> aren't really necessary. The client can keep track of which request will
> complete next. So I'm wondering if there is some parallelism somewhere
> that I'm missing…
> 

	The reason we have message IDs is so the client knows which request
is being acknowledged if it has more than one non-ack'd request outstanding.
Requests are executed in-order; the only time parallelism can happen is if
multiple client threads send requests in parallel.  A single thread can
pipeline requests, but those requests are not executed out of order by the
server.

	I’ll try to re-word it to be clearer.


> 
>>> 
>>>> +
>>>> +Socket Disconnection Behavior
>>>> +-----------------------------
>>>> +The server and the client can disconnect from each other, either
>>> intentionally
>>>> +or unexpectedly. Both the client and the server need to know how to
>>> handle such
>>>> +events.
>>>> +
>>>> +Server Disconnection
>>>> +^^^^^^^^^^^^^^^^^^^^
>>>> +A server disconnecting from the client may indicate that:
>>>> +
>>>> +1) A virtual device has been restarted, either intentionally (e.g. because of
>>> a
>>>> +device update) or unintentionally (e.g. because of a crash). In any case,
>>> the
>>>> +virtual device will come back so the client should not do anything (e.g.
>>> simply
>>>> +reconnect and retry failed operations).
>>>> +
>>>> +2) A virtual device has been shut down with no intention to be restarted.
>>>> +
>>>> +It is impossible for the client to know whether or not a failure is
>>>> +intermittent or innocuous and should be retried, therefore the client
>>> should
>>>> +attempt to reconnect to the socket. Since an intentional server restart
>>> (e.g.
>>>> +due to an upgrade) might take some time, a reasonable timeout should
>>> be used.
>>>> +In cases where the disconnection is expected (e.g. the guest shutting
>>> down), no
>>>> +new requests will be sent anyway so this situation doesn't pose a
>>> problem. The
>>>> +control stack will clean up accordingly.
>>>> +
>>>> +Parametrizing this behaviour by having the virtual device advertise a
>>> 
>>> s/Parametrizing/Parameterizing/
>> 
>> OK.
>> 
>>> 
>>>> +reasonable reconnect is deferred to a future version of the protocol.
>>> 
>>> No mention is made of recovering state or how disconnect maps to VFIO
>>> device types (PCI, etc.). Does a disconnect mean that the device has
>>> been reset?
>> 
>> Regarding recovering state, I believe that since all the building blocks are
>> there and the client is pretty much the master in the vfio-user model, it's up
>> to the client how to deal with exceptional situations. For recovery to work, the
>> client will have to reconfigure IRQs, DMAs, etc., and the server will have to
>> persistently store device state.
>> 
>> Regarding how disconnect maps to VFIO device types, it depends on whether or not
>> the client/server can recover from it and continue operating. If this doesn't
>> happen (e.g. the server hasn't restarted, the client doesn't support
>> recovering), then the device cannot continue being operational, which I suppose
>> is an implementation detail of the client.
>> 
>> Do we need something more specific at this stage?
> 
> This sounds like implementation-defined behavior. If the spec does not
> cover this area then it might be hard to standardize it later without
> breaking existing implementations.
> 
> Do you know how kernel VFIO behaves when the application is terminated
> unexpectedly?
> 
> My intuition is that the device should be reset or at least fenced to
> discard all DMA immediately. With that as the default behavior the
> protocol could let implementations use protocol extensions for other
> recovery behavior.
> 

	I’m not sure we’ll be able to improve on device reset, since
the client does not know what operations were completed before the
server died, and some region reads and writes may not be idempotent.  

	I’d like to give Thanos a chance to reply before we commit to
a change.


>>>> +|              |        | Version supported by the sender, e.g.
>>> ???????0.1????????.      |
>>>> ++--------------+--------+---------------------------------------------------+
>>>> +| type         | string | Fixed to ???????vfio-user????????.                             |
>>>> ++--------------+--------+---------------------------------------------------+
>>>> +| capabilities | array  | Reserved. Can be omitted for v0.1, otherwise must
>>> |
>>>> +|              |        | be empty.                                         |
>>>> ++--------------+--------+---------------------------------------------------+
>>> 
>>> s/array/array of strings/ ?
>> 
>> No, it's array were each element can be any object in JSON.
> 
> Okay. Maybe there is a way to be explicit by saying "untyped array" or
> adding "The type of the elements is not defined in this version of the
> specification" to the description.
> 
>> 
>>> 
>>>> +
>>>> +Versioning and Feature Support
>>>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> +Upon accepting a connection, the server must send a
>>> VFIO_USER_VERSION message
>>>> +proposing a protocol version and a set of capabilities. The client compares
>>>> +these with the versions and capabilities it supports and sends a
>>>> +VFIO_USER_VERSION reply according to the following rules.
>>>> +
>>>> +* The major version in the reply must be the same as proposed. If the
>>> client
>>>> +  does not support the proposed major, it closes the connection.
>>>> +* The minor version in the reply must be equal to or less than the minor
>>>> +  version proposed.
>>>> +* The capability list must be a subset of those proposed. If the client
>>>> +  requires a capability the server did not include, it closes the connection.
>>>> +* If type is not ???????vfio-user????????, the client closes the
>>> connection.
>>> 
>>> Is there a rationale for this field? In order to get to the point where
>>> this JSON is parsed we must already implement the vfio-user protocol
>>> (e.g. parse the header).
>> 
>> It was suggested to include it as a sanity check, we'll drop it.
> 
> It's fine to keep it. Many protocols have a banner or identifier early
> on that requires some parsing. In this case it could be: this field
> makes it easy for humans decoding a protocol conversation to identify
> the protocol as vfio-user and also serves as a sanity check.
> 

	I’d like to re-think the versioning a bit to avoid some of
the indeterminism.  I think it might be better to have the fields that
must exist to bootstrap the protocol (major & minor) be enumerated in
the version packet body.  Then, for each major/minor we can specify the
objects and their types that may exist in the JSON array.

	This list would initially only have the “type” string.



>>> What are the semantics of map/unmap with respect to mapping over
>>> existing regions, unmapping a subset of an existing region, etc?
>> 
>> The VFIO kernel driver returns EEXIST if mapping over an existing area
>> and allows unmapping a subset. I'd think we should follow their example.
> 
> Sounds good.
> 
>>> 
>>> What are the concerns for MAP/UNMAP while the device is operational and
>>> may be performing DMA? Should this command be combined into a single
>>> VFIO_USER_SET_DMA_REGIONS command with Flags Bit 1 indicating
>>> Add/Delete
>>> so that a single message can atomically add and delete DMA regions?
>> 
>> Regarding receiving a DMA unmap request while there is an ongoing DMA
>> transaction, once the server ACK's the DMA unmap then it must not touch that DMA
>> region. It is an implementation detail whether the server waits for the DMA to
>> finish or kills the DMA operation, which might not be possible at all (e.g part
>> of that region has been submitted for I/O to a physical device).
>> 
>>> 
>>> What is the format of the reply to this message?
>> 
>> It's just the header explained in the "Header" section, no additional data are
>> sent by the server, it simply ACK's the command.
> 
> Okay, this was a little unclear to me when reading the spec. Maybe it
> would help to have a general explanation at the beginning stating that
> the message format is used by both the request and the reply unless
> stated otherwise.
> 

	OK.

>> 
>>> 
>>>> +
>>>> +VFIO_USER_DEVICE_GET_INFO
>>>> +-------------------------
>>>> +
>>>> +Message format
>>>> +^^^^^^^^^^^^^^
>>>> +
>>>> ++--------------+----------------------------+
>>>> +| Name         | Value                      |
>>>> ++==============+============================+
>>>> +| Device ID    | <ID>                       |
>>>> ++--------------+----------------------------+
>>>> +| Message ID   | <ID>                       |
>>>> ++--------------+----------------------------+
>>>> +| Command      | 4                          |
>>>> ++--------------+----------------------------+
>>>> +| Message size | 16 in request, 32 in reply |
>>>> ++--------------+----------------------------+
>>>> +| Flags        | Reply bit set in reply     |
>>>> ++--------------+----------------------------+
>>>> +| Device info  | VFIO device info           |
>>>> ++--------------+----------------------------+
>>>> +
>>>> +This message is sent by the client to the server to query for basic
>>> information
>>>> +about the device. Only the message header is needed in the request
>>> message.
>>>> +The VFIO device info structure is defined in ``<sys/vfio.h>`` (``struct
>>>> +vfio_device_info``).
>>>> +
>>>> +VFIO device info format
>>>> +^^^^^^^^^^^^^^^^^^^^^^^
>>>> +
>>>> ++-------------+--------+--------------------------+
>>>> +| Name        | Offset | Size                     |
>>>> ++=============+========+==========================+
>>>> +| argsz       | 16     | 4                        |
>>>> ++-------------+--------+--------------------------+
>>>> +| flags       | 20     | 4                        |
>>>> ++-------------+--------+--------------------------+
>>>> +|             | +-----+-------------------------+ |
>>>> +|             | | Bit | Definition              | |
>>>> +|             | +=====+=========================+ |
>>>> +|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
>>>> +|             | +-----+-------------------------+ |
>>>> +|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
>>>> +|             | +-----+-------------------------+ |
>>>> ++-------------+--------+--------------------------+
>>>> +| num_regions | 24     | 4                        |
>>>> ++-------------+--------+--------------------------+
>>>> +| num_irqs    | 28     | 4                        |
>>>> ++-------------+--------+--------------------------+
>>>> +
>>>> +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
>>>> +  implementation.
>>>> +* flags contains the following device attributes.
>>>> +
>>>> +  * VFIO_DEVICE_FLAGS_RESET indicates the device supports the
>>>> +    VFIO_USER_DEVICE_RESET message.
>>> 
>>> Why is VFIO_USER_DEVICE_RESET support optional?
>> 
>> Because it is optional in VFIO, too.
> 
> I see :D
> 
>>>> +DMA Read/Write Data
>>>> +^^^^^^^^^^^^^^^^^^^
>>>> +
>>>> ++---------+--------+----------+
>>>> +| Name    | Offset | Size     |
>>>> ++=========+========+==========+
>>>> +| Address | 16     | 8        |
>>>> ++---------+--------+----------+
>>>> +| Count   | 24     | 4        |
>>>> ++---------+--------+----------+
>>>> +| Data    | 28     | variable |
>>>> ++---------+--------+----------+
>>>> +
>>>> +* Address is the area of guest memory being accessed. This address must
>>> have
>>>> +  been exported to the server with a VFIO_USER_DMA_MAP message.
>>>> +* Count is the size of the data to be transferred.
>>>> +* Data is the data to be read or written.
>>>> +
>>>> +Address and count can also be accessed as ``struct iovec`` from
>>> ``<sys/uio.h>``.
>>> 
>>> This seems to be incorrect since the count field is 4 bytes but struct
>>> iovec::iov_len is size_t (64-bit on 64-bit hosts).
>> 
>> Indeed, I forgot about padding. We can remove the reference to struct iovec or
>> make count 8 bytes?
> 
> I suggest removing the reference to struct iovec.

	OK




Re: [PATCH] introduce VFIO-over-socket protocol specificaion
Posted by John G Johnson 4 years, 1 month ago
	Updated doc to address comments.  Many changes are typos, but
some are more more substantive.

								JJ


Version title is a hyperlink the versioning section
Rewrote concurrency section to be less concurrent
Removed disconnection recovery section - disconnect now causes client device reset
Message type is now enumerated in flags to distinguish it from message attributes
argsz fields use VFIO definitions instead of being reserved
sys/vfio.h -> linux/vfio.h


diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
index 05a788b..ea99db6 100644
--- a/docs/devel/vfio-user.rst
+++ b/docs/devel/vfio-user.rst
@@ -2,9 +2,9 @@
 vfio-user Protocol Specification
 ********************************
 
------------
-Version 0.1
------------
+------------
+Version_ 0.1
+------------
 
 .. contents:: Table of Contents
 
@@ -207,21 +207,19 @@ protocol.
 
 Command Concurrency
 -------------------
-There can be multiple outstanding commands per virtual device, e.g. a frame
-buffer where the guest does multiple stores to the virtual device.  Commands
-must be executed in the order they are received.  The server can execute and
-reorder non-conflicting commands in parallel, depending on the device
-semantics.
-
-.. Note::
-   For instance, a client can issue the following operations back to back
-   without waiting for the first two to complete:
-
-   1. map a DMA region 
-   2. trigger some device-specific operation that results in data being read
-      into that DMA region, and
-   3. unmap the DMA region
-
+A client may pipeline multiple commands without waiting for previous command
+replies.  The server will process commands in the order they are received.
+A consequence of this is if a client issues a command with the *No_reply* bit,
+then subseqently issues a command without *No_reply*, the older command will
+have been processed before the reply to the younger command is sent by the
+server.  The client must be aware of the device's capability to process concurrent
+commands if pipelining is used.  For example, pipelining allows multiple client
+threads to concurently access device memory; the client must ensure these acceses
+obey device semantics.
+
+An example is a frame buffer device, where the device may allow concurrent access
+to different areas of video memory, but may have indeterminate behavior if concurrent
+acceses are performed to command or status registers.
 
 Socket Disconnection Behavior
 -----------------------------
@@ -234,29 +232,14 @@ Server Disconnection
 A server disconnecting from the client may indicate that:
 
 1) A virtual device has been restarted, either intentionally (e.g. because of a
-   device update) or unintentionally (e.g. because of a crash). In any case,
-   the virtual device will come back so the client should not do anything (e.g.
-   simply reconnect and retry failed operations).
+   device update) or unintentionally (e.g. because of a crash).
 2) A virtual device has been shut down with no intention to be restarted.
 
 It is impossible for the client to know whether or not a failure is
 intermittent or innocuous and should be retried, therefore the client should
-attempt to reconnect to the socket. Since an intentional server restart (e.g.
-due to an upgrade) might take some time, a reasonable timeout should be used.
-In cases where the disconnection is expected (e.g. the client exits), no new
-commands will be sent anyway so this situation does not pose a problem. The
-control stack will clean up accordingly.
-
-Parameterizing this behaviour by having the server advertise a reasonable
-reconnect is deferred to a future version of the protocol.
-
-Recovering state is implementation-specific. The protocol provides all the
-building blocks and does not enforce a particular mechanism:
-
-* Recovering state must be initiated by the client since it is the master in
-  the vfio-user model. Interrupts/DMA may have to be reconfigured,
-  unacknowledged requests may have to be re-submitted, etc.
-* The server may have to persistently store device state.
+reset the VFIO device when it detects the socket has been disconnected.
+Error recovery will be driven by the guest's device error handling
+behavior.
 
 Client Disconnection
 ^^^^^^^^^^^^^^^^^^^^
@@ -274,6 +257,12 @@ take no action when the client disconnects. If anything happens to the client
 the control stack will know about it and can clean up resources
 accordingly.
 
+Live Migration
+^^^^^^^^^^^^^^
+A future version of the protocol will support client live migration.  This action
+will require the socket to be quiesced before it is disconnected,  This mechanism
+will be defined when live migration support is added.
+
 Request Retry and Response Timeout
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 A failed command is a command that has been successfully sent and has been
@@ -335,48 +324,53 @@ All messages, both command messages and reply messages, are preceded by a
 header that contains basic information about the message. The header is
 followed by message-specific data described in the sections below.
 
-+----------------+--------+-------------+
-| Name           | Offset | Size        |
-+================+========+=============+
-| Message ID     | 0      | 2           |
-+----------------+--------+-------------+
-| Command        | 2      | 2           |
-+----------------+--------+-------------+
-| Message size   | 4      | 4           |
-+----------------+--------+-------------+
-| Flags          | 8      | 4           |
-+----------------+--------+-------------+
-|                | +-----+------------+ |
-|                | | Bit | Definition | |
-|                | +=====+============+ |
-|                | | 0   | Reply      | |
-|                | +-----+------------+ |
-|                | | 1   | No_reply   | |
-|                | +-----+------------+ |
-|                | | 2   | Error      | |
-|                | +-----+------------+ |
-+----------------+--------+-------------+
-| Error          | 12     | 4           |
-+----------------+--------+-------------+
-| <message data> | 16     | variable    |
-+----------------+--------+-------------+
-
-* *Message ID* identifies the message, and is used echoed in the reply message.
++----------------+---------+-------------+
+| Name           | Offset  | Size        |
++================+=========+=============+
+| Message ID     | 0       | 2           |
++----------------+---------+-------------+
+| Command        | 2       | 2           |
++----------------+---------+-------------+
+| Message size   | 4       | 4           |
++----------------+---------+-------------+
+| Flags          | 8       | 4           |
++----------------+---------+-------------+
+|                | +------+------------+ |
+|                | | Bit  | Definition | |
+|                | +======+============+ |
+|                | | 0-3  | Type       | |
+|                | +------+------------+ |
+|                | | 4    | No_reply   | |
+|                | +------+------------+ |
+|                | | 5    | Error      | |
+|                | +------+------------+ |
++----------------+---------+-------------+
+| Errno          | 12      | 4           |
++----------------+---------+-------------+
+| <message data> | 16      | variable    |
++----------------+---------+-------------+
+
+* *Message ID* identifies the message, and is echoed in the command's reply message.
 * *Command* specifies the command to be executed, listed in Commands_.
 * *Message size* contains the size of the entire message, including the header.
 * *Flags* contains attributes of the message:
 
-  * The reply bit differentiates command messages from reply messages. A reply
-    message acknowledges a previous command with the same message ID.
-  * No_reply indicates that no reply is needed for this command. This is
-    commonly used when multiple commands are sent, and only the last needs
+  * The *Type* bits indicate the message type.
+
+    *  *Command* (value 0x0) indicates a command message.
+    *  *Reply* (value 0x1) indicates a reply message acknowledging a previous
+       command with the same message ID.
+  * *No_reply* in a command message indicates that no reply is needed for this command.
+    This is commonly used when multiple commands are sent, and only the last needs
     acknowledgement.
+  * *Error* in a reply message indicates the command being acknowledged had
+    an error. In this case, the *Errno* field will be valid.
 
-* *Error* is a UNIX errno value set only in the reply message. It is reserved in
-  the command message.
+* *Errno* in a reply message is a UNIX errno value. It is reserved in a command message.
 
-Each command message in Commands_ must be replied to with a reply message,
-which consists of the header with the reply bit set, plus any additional data.
+Each command message in Commands_ must be replied to with a reply message, unless the
+message sets the *No_Reply* bit.  The reply consists of the header with the *Reply*
+bit set, plus any additional data.
 
 VFIO_USER_VERSION
 -----------------
@@ -395,7 +389,7 @@ Message format
 +--------------+------------------------+
 | Flags        | Reply bit set in reply |
 +--------------+------------------------+
-| Error        | 0/errno                |
+| Errno        | 0/errno                |
 +--------------+------------------------+
 | Version      | JSON byte array        |
 +--------------+------------------------+
@@ -415,6 +409,8 @@ included:
 |              |        | be empty.                                         |
 +--------------+--------+---------------------------------------------------+
 
+.. _Version:
+
 Versioning and Feature Support
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 Upon accepting a connection, the server must send a VFIO_USER_VERSION message
@@ -457,7 +453,7 @@ Message Format
 +--------------+------------------------+
 | Flags        | Reply bit set in reply |
 +--------------+------------------------+
-| Error        | 0/errno                |
+| Errno        | 0/errno                |
 +--------------+------------------------+
 | Table        | array of table entries |
 +--------------+------------------------+
@@ -543,14 +539,14 @@ Message format
 +--------------+----------------------------+
 | Flags        | Reply bit set in reply     |
 +--------------+----------------------------+
-| Error        | 0/errno                    |
+| Errno        | 0/errno                    |
 +--------------+----------------------------+
 | Device info  | VFIO device info           |
 +--------------+----------------------------+
 
 This command message is sent by the client to the server to query for basic
 information about the device. Only the message header is needed in the command
-message.  The VFIO device info structure is defined in ``<sys/vfio.h>``
+message.  The VFIO device info structure is defined in ``<linux/vfio.h>``
 (``struct vfio_device_info``).
 
 VFIO device info format
@@ -576,8 +572,7 @@ VFIO device info format
 | num_irqs    | 28     | 4                        |
 +-------------+--------+--------------------------+
 
-* *argz* is reserved in vfio-user, it is only used in the ioctl() VFIO
-  implementation.
+* *argsz* is the size of the VFIO device info structure.
 * *flags* contains the following device attributes.
 
   * VFIO_DEVICE_FLAGS_RESET indicates that the device supports the
@@ -607,14 +602,14 @@ Message format
 +--------------+------------------------+
 | Flags        | Reply bit set in reply |
 +--------------+------------------------+
-| Error        | 0/errno                |
+| Errno        | 0/errno                |
 +--------------+------------------------+
 | Region info  | VFIO region info       |
 +--------------+------------------------+
 
 This command message is sent by the client to the server to query for
 information about device memory regions. The VFIO region info structure is
-defined in ``<sys/vfio.h>`` (``struct vfio_region_info``). Since the client
+defined in ``<linux/vfio.h>`` (``struct vfio_region_info``). Since the client
 does not know the size of the capabilities, the size of the reply it should
 expect is 48 plus any capabilities whose size is indicated in the size field of
 the reply header.
@@ -650,8 +645,8 @@ VFIO region info format
 | offset     | 40     | 8                            |
 +------------+--------+------------------------------+
 
-* *argz* is reserved in vfio-user, it is only used in the ioctl() VFIO
-  implementation.
+* *argsz* is the size of the VFIO region info structure plus the
+  size of any region capabilities returned.
 * *flags* are attributes of the region:
 
   * *VFIO_REGION_INFO_FLAG_READ* allows client read access to the region.
@@ -668,7 +663,7 @@ VFIO region info format
 * *cap_offset* describes where additional region capabilities can be found.
   cap_offset is relative to the beginning of the VFIO region info structure.
   The data structure it points is a VFIO cap header defined in
-  ``<sys/vfio.h>``.
+  ``<linux/vfio.h>``.
 * *size* is the size of the region.
 * *offset* is the offset given to the mmap() system call for regions with the
   MMAP attribute. It is also used as the base offset when mapping a VFIO
@@ -679,7 +674,7 @@ VFIO Region capabilities
 The VFIO region information can also include a capabilities list. This list is
 similar to a PCI capability list - each entry has a common header that
 identifies a capability and where the next capability in the list can be found.
-The VFIO capability header format is defined in ``<sys/vfio.h>`` (``struct
+The VFIO capability header format is defined in ``<linux/vfio.h>`` (``struct
 vfio_info_cap_header``).
 
 VFIO cap header format
@@ -718,7 +713,7 @@ VFIO sparse mmap
 The only capability supported in this version of the protocol is for sparse
 mmap. This capability is defined when only a subrange of the region supports
 direct access by the client via mmap(). The VFIO sparse mmap area is defined in
-``<sys/vfio.h>`` (``struct vfio_region_sparse_mmap_area``).
+``<linux/vfio.h>`` (``struct vfio_region_sparse_mmap_area``).
 
 VFIO region info cap sparse mmap
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -742,7 +737,7 @@ VFIO region info cap sparse mmap
   the base offset given in the VFIO_USER_DEVICE_GET_REGION_INFO to form the
   offset argument of the subsequent mmap() call.
 
-The VFIO sparse mmap area is defined in ``<sys/vfio.h>`` (``struct
+The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct
 vfio_region_info_cap_sparse_mmap``).
 
 VFIO_USER_DEVICE_GET_IRQ_INFO
@@ -762,14 +757,14 @@ Message format
 +--------------+------------------------+
 | Flags        | Reply bit set in reply |
 +--------------+------------------------+
-| Error        | 0/errno                |
+| Errno        | 0/errno                |
 +--------------+------------------------+
 | IRQ info     | VFIO IRQ info          |
 +--------------+------------------------+
 
 This command message is sent by the client to the server to query for
 information about device interrupt types. The VFIO IRQ info structure is
-defined in ``<sys/vfio.h>`` (``struct vfio_irq_info``).
+defined in ``<linux/vfio.h>`` (``struct vfio_irq_info``).
 
 VFIO IRQ info format
 ^^^^^^^^^^^^^^^^^^^^
@@ -798,8 +793,7 @@ VFIO IRQ info format
 | count | 28     | 4                         |
 +-------+--------+---------------------------+
 
-* *argz* is reserved in vfio-user, it is only used in the ioctl() VFIO
-  implementation.
+* *argsz* is the size of the VFIO IRQ info structure.
 * *flags* defines IRQ attributes:
 
   * *VFIO_IRQ_INFO_EVENTFD* indicates the IRQ type can support server eventfd
@@ -834,17 +828,17 @@ Message format
 +--------------+------------------------+
 | Flags        | Reply bit set in reply |
 +--------------+------------------------+
-| Error        | 0/errno                |
+| Errno        | 0/errno                |
 +--------------+------------------------+
 | IRQ set      | VFIO IRQ set           |
 +--------------+------------------------+
 
 This command message is sent by the client to the server to set actions for
 device interrupt types. The VFIO IRQ set structure is defined in
-``<sys/vfio.h>`` (``struct vfio_irq_set``).
+``<linux/vfio.h>`` (``struct vfio_irq_set``).
 
-VFIO IRQ info format
-^^^^^^^^^^^^^^^^^^^^
+VFIO IRQ set format
+^^^^^^^^^^^^^^^^^^^
 
 +-------+--------+------------------------------+
 | Name  | Offset | Size                         |
@@ -878,8 +872,7 @@ VFIO IRQ info format
 | data  | 36     | variable                     |
 +-------+--------+------------------------------+
 
-* *argz* is reserved in vfio-user, it is only used in the ioctl() VFIO
-  implementation.
+* *argsz* is the size of the VFIO IRQ set structure, including any *data* field.
 * *flags* defines the action performed on the interrupt range. The DATA flags
   describe the data field sent in the message; the ACTION flags describe the
   action to be performed. The flags are mutually exclusive for both sets.
@@ -909,7 +902,8 @@ VFIO IRQ info format
 * *start* is the start of the sub-index being set.
 * *count* describes the number of sub-indexes being set. As a special case, a
   count of 0 with data flags of VFIO_IRQ_SET_DATA_NONE disables all interrupts
-  of the index data is an optional field included when the
+  of the index.
+* *data* is an optional field included when the
   VFIO_IRQ_SET_DATA_BOOL flag is present. It contains an array of booleans
   that specify whether the action is to be performed on the corresponding
   index. It's used when the action is only performed on a subset of the range
@@ -975,8 +969,6 @@ DMA Read/Write Data
 * *Count* is the size of the data to be transferred.
 * *Data* is the data to be read or written.
 
-Address and count can also be accessed as ``struct iovec`` from ``<sys/uio.h>``.
-
 VFIO_USER_REGION_READ
 ---------------------
 
@@ -994,7 +986,7 @@ Message format
 +--------------+------------------------+
 | Flags        | Reply bit set in reply |
 +--------------+------------------------+
-| Error        | 0/errno                |
+| Errno        | 0/errno                |
 +--------------+------------------------+
 | Read info    | REGION read/write data |
 +--------------+------------------------+
@@ -1021,7 +1013,7 @@ Message format
 +--------------+------------------------+
 | Flags        | Reply bit set in reply |
 +--------------+------------------------+
-| Error        | 0/errno                |
+| Errno        | 0/errno                |
 +--------------+------------------------+
 | Write info   | REGION read/write data |
 +--------------+------------------------+
@@ -1048,7 +1040,7 @@ Message format
 +--------------+------------------------+
 | Flags        | Reply bit set in reply |
 +--------------+------------------------+
-| Error        | 0/errno                |
+| Errno        | 0/errno                |
 +--------------+------------------------+
 | DMA info     | DMA read/write data    |
 +--------------+------------------------+
@@ -1075,7 +1067,7 @@ Message format
 +--------------+------------------------+
 | Flags        | Reply bit set in reply |
 +--------------+------------------------+
-| Error        | 0/errno                |
+| Errno        | 0/errno                |
 +--------------+------------------------+
 | DMA info     | DMA read/write data    |
 +--------------+------------------------+
@@ -1102,7 +1094,7 @@ Message format
 +----------------+------------------------+
 | Flags          | Reply bit set in reply |
 +----------------+------------------------+
-| Error          | 0/errno                |
+| Errno          | 0/errno                |
 +----------------+------------------------+
 | Interrupt info | <interrupt>            |
 +----------------+------------------------+
@@ -1143,7 +1135,7 @@ Message format
 +--------------+------------------------+
 | Flags        | Reply bit set in reply |
 +--------------+------------------------+
-| Error        | 0/errno                |
+| Errno        | 0/errno                |
 +--------------+------------------------+
 
 This command message is sent from the client to the server to reset the device.
********************************
vfio-user Protocol Specification
********************************

------------
Version_ 0.1
------------

.. contents:: Table of Contents

Introduction
============
vfio-user is a protocol that allows a device to be emulated in a separate
process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
of a generic VFIO device type, living inside the VMM, which we call the client,
and the core device implementation, living outside the VMM, which we call the
server.

The `Linux VFIO ioctl interface <https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_
been chosen as the base for this protocol for the following reasons:

1) It is a mature and stable API, backed by an extensively used framework.
2) The existing VFIO client implementation in QEMU (qemu/hw/vfio/) can be
   largely reused.

.. Note::
   In a proof of concept implementation it has been demonstrated that using VFIO
   over a UNIX domain socket is a viable option. vfio-user is designed with
   QEMU in mind, however it could be used by other client applications. The
   vfio-user protocol does not require that QEMU's VFIO client  implementation
   is used in QEMU.

None of the VFIO kernel modules are required for supporting the protocol,
neither in the client nor the server, only the source header files are used.

The main idea is to allow a virtual device to function in a separate process in
the same host over a UNIX domain socket. A UNIX domain socket (AF_UNIX) is
chosen because file descriptors can be trivially sent over it, which in turn
allows:

* Sharing of client memory for DMA with the server.
* Sharing of server memory with the client for fast MMIO.
* Efficient sharing of eventfd's for triggering interrupts.

Other socket types could be used which allow the server to run in a separate
guest in the same host (AF_VSOCK) or remotely (AF_INET). Theoretically the
underlying transport does not necessarily have to be a socket, however we do
not examine such alternatives. In this protocol version we focus on using a
UNIX domain socket and introduce basic support for the other two types of
sockets without considering performance implications.

While passing of file descriptors is desirable for performance reasons, it is
not necessary neither for the client nor for the server to support it in order
to implement the protocol. There is always an in-band, message-passing fall
back mechanism.

VFIO
====
VFIO is a framework that allows a physical device to be securely passed through
to a user space process; the device-specific kernel driver does not drive the
device at all.  Typically, the user space process is a VMM and the device is
passed through to it in order to achieve high performance. VFIO provides an API
and the required functionality in the kernel. QEMU has adopted VFIO to allow a
guest to directly access physical devices, instead of emulating them in
software.

vfio-user reuses the core VFIO concepts defined in its API, but implements them
as messages to be sent over a socket. It does not change the kernel-based VFIO
in any way, in fact none of the VFIO kernel modules need to be loaded to use
vfio-user. It is also possible for the client to concurrently use the current
kernel-based VFIO for one device, and vfio-user for another device.

VFIO Device Model
-----------------
A device under VFIO presents a standard interface to the user process. Many of
the VFIO operations in the existing interface use the ioctl() system call, and
references to the existing interface are called the ioctl() implementation in
this document.

The following sections describe the set of messages that implement the VFIO
interface over a socket. In many cases, the messages are direct translations of
data structures used in the ioctl() implementation. Messages derived from
ioctl()s will have a name derived from the ioctl() command name.  E.g., the
VFIO_GET_INFO ioctl() command becomes a VFIO_USER_GET_INFO message.  The
purpose of this reuse is to share as much code as feasible with the ioctl()
implementation.

Connection Initiation
^^^^^^^^^^^^^^^^^^^^^
After the client connects to the server, the initial server message is
VFIO_USER_VERSION to propose a protocol version and set of capabilities to
apply to the session. The client replies with a compatible version and set of
capabilities it supports, or closes the connection if it cannot support the
advertised version.

DMA Memory Configuration
^^^^^^^^^^^^^^^^^^^^^^^^
The client uses VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP messages to inform
the server of the valid DMA ranges that the server can access on behalf
of a device. DMA memory may be accessed by the server via VFIO_USER_DMA_READ
and VFIO_USER_DMA_WRITE messages over the socket.

An optimization for server access to client memory is for the client to provide
file descriptors the server can mmap() to directly access client memory. Note
that mmap() privileges cannot be revoked by the client, therefore file
descriptors should only be exported in environments where the client trusts the
server not to corrupt guest memory.

Device Information
^^^^^^^^^^^^^^^^^^
The client uses a VFIO_USER_DEVICE_GET_INFO message to query the server for
information about the device. This information includes:

* The device type and capabilities,
* the number of device regions, and
* the device presents to the client the number of interrupt types the device
  supports.

Region Information
^^^^^^^^^^^^^^^^^^
The client uses VFIO_USER_DEVICE_GET_REGION_INFO messages to query the server
for information about the device's memory regions. This information describes:

* Read and write permissions, whether it can be memory mapped, and whether it
  supports additional capabilities.
* Region index, size, and offset.

When a region can be mapped by the client, the server provides a file
descriptor which the client can mmap(). The server is responsible for polling
for client updates to memory mapped regions.

Region Capabilities
"""""""""""""""""""
Some regions have additional capabilities that cannot be described adequately
by the region info data structure. These capabilities are returned in the
region info reply in a list similar to PCI capabilities in a PCI device's
configuration space.

Sparse Regions
""""""""""""""
A region can be memory-mappable in whole or in part. When only a subset of a
region can be mapped by the client, a VFIO_REGION_INFO_CAP_SPARSE_MMAP
capability is included in the region info reply. This capability describes
which portions can be mapped by the client.

.. Note::
   For example, in a virtual NVMe controller, sparse regions can be used so
   that accesses to the NVMe registers (found in the beginning of BAR0) are
   trapped (an infrequent event), while allowing direct access to the doorbells
   (an extremely frequent event as every I/O submission requires a write to
   BAR0), found right after the NVMe registers in BAR0.

Interrupts
^^^^^^^^^^
The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the server for
the device's interrupt types. The interrupt types are specific to the bus the
device is attached to, and the client is expected to know the capabilities of
each interrupt type. The server can signal an interrupt either with
VFIO_USER_VM_INTERRUPT messages over the socket, or can directly inject
interrupts into the guest via an event file descriptor. The client configures
how the server signals an interrupt with VFIO_USER_SET_IRQS messages.

Device Read and Write
^^^^^^^^^^^^^^^^^^^^^
When the guest executes load or store operations to device memory, the client
forwards these operations to the server with VFIO_USER_REGION_READ or
VFIO_USER_REGION_WRITE messages. The server will reply with data from the
device on read operations or an acknowledgement on write operations.

DMA
^^^
When a device performs DMA accesses to guest memory, the server will forward
them to the client with VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages.
These messages can only be used to access guest memory the client has
configured into the server.

Protocol Specification
======================
To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed
with vfio_user or VFIO_USER. In revision 0.1, all data is in the little-endian
format, although this may be relaxed in future revision in cases where the
client and server are both big-endian. The messages are formatted for seamless
reuse of the native VFIO structs.

Socket
------

A server can serve:

1) one or more clients, and/or
2) one or more virtual devices, belonging to one or more clients.

The current protocol specification requires a dedicated socket per
client/server connection. It is a server-side implementation detail whether a
single server handles multiple virtual devices from the same or multiple
clients. The location of the socket is implementation-specific. Multiplexing
clients, devices, and servers over the same socket is not supported in this
version of the protocol.

Authentication
--------------
For AF_UNIX, we rely on OS mandatory access controls on the socket files,
therefore it is up to the management layer to set up the socket as required.
Socket types than span guests or hosts will require a proper authentication
mechanism. Defining that mechanism is deferred to a future version of the
protocol.

Command Concurrency
-------------------
A client may pipeline multiple commands without waiting for previous command
replies.  The server will process commands in the order they are received.
A consequence of this is if a client issues a command with the *No_reply* bit,
then subseqently issues a command without *No_reply*, the older command will
have been processed before the reply to the younger command is sent by the
server.  The client must be aware of the device's capability to process concurrent
commands if pipelining is used.  For example, pipelining allows multiple client
threads to concurently access device memory; the client must ensure these acceses
obey device semantics.

An example is a frame buffer device, where the device may allow concurrent access
to different areas of video memory, but may have indeterminate behavior if concurrent
acceses are performed to command or status registers.

Socket Disconnection Behavior
-----------------------------
The server and the client can disconnect from each other, either intentionally
or unexpectedly. Both the client and the server need to know how to handle such
events.

Server Disconnection
^^^^^^^^^^^^^^^^^^^^
A server disconnecting from the client may indicate that:

1) A virtual device has been restarted, either intentionally (e.g. because of a
   device update) or unintentionally (e.g. because of a crash).
2) A virtual device has been shut down with no intention to be restarted.

It is impossible for the client to know whether or not a failure is
intermittent or innocuous and should be retried, therefore the client should
reset the VFIO device when it detects the socket has been disconnected.
Error recovery will be driven by the guest's device error handling
behavior.

Client Disconnection
^^^^^^^^^^^^^^^^^^^^
The client disconnecting from the server primarily means that the client
has exited. Currently, this means that the guest is shut down so the device is
no longer needed therefore the server can automatically exit. However, there
can be cases where a client disconnection should not result in a server exit:

1) A single server serving multiple clients.
2) A multi-process QEMU upgrading itself step by step, which is not yet
   implemented.

Therefore in order for the protocol to be forward compatible the server should
take no action when the client disconnects. If anything happens to the client
the control stack will know about it and can clean up resources
accordingly.

Live Migration
^^^^^^^^^^^^^^
A future version of the protocol will support client live migration.  This action
will require the socket to be quiesced before it is disconnected,  This mechanism
will be defined when live migration support is added.

Request Retry and Response Timeout
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A failed command is a command that has been successfully sent and has been
responded to with an error code. Failure to send the command in the first place
(e.g. because the socket is disconnected) is a different type of error examined
earlier in the disconnect section.

.. Note::
   QEMU's VFIO retries certain operations if they fail. While this makes sense
   for real HW, we don't know for sure whether it makes sense for virtual
   devices.

Defining a retry and timeout scheme is deferred to a future version of the
protocol.

.. _Commands:

Commands
--------
The following table lists the VFIO message command IDs, and whether the
message command is sent from the client or the server.

+----------------------------------+---------+-------------------+
| Name                             | Command | Request Direction |
+==================================+=========+===================+
| VFIO_USER_VERSION                | 1       | server -> client  |
+----------------------------------+---------+-------------------+
| VFIO_USER_DMA_MAP                | 2       | client -> server  |
+----------------------------------+---------+-------------------+
| VFIO_USER_DMA_UNMAP              | 3       | client -> server  |
+----------------------------------+---------+-------------------+
| VFIO_USER_DEVICE_GET_INFO        | 4       | client -> server  |
+----------------------------------+---------+-------------------+
| VFIO_USER_DEVICE_GET_REGION_INFO | 5       | client -> server  |
+----------------------------------+---------+-------------------+
| VFIO_USER_DEVICE_GET_IRQ_INFO    | 6       | client -> server  |
+----------------------------------+---------+-------------------+
| VFIO_USER_DEVICE_SET_IRQS        | 7       | client -> server  |
+----------------------------------+---------+-------------------+
| VFIO_USER_REGION_READ            | 8       | client -> server  |
+----------------------------------+---------+-------------------+
| VFIO_USER_REGION_WRITE           | 9       | client -> server  |
+----------------------------------+---------+-------------------+
| VFIO_USER_DMA_READ               | 10      | server -> client  |
+----------------------------------+---------+-------------------+
| VFIO_USER_DMA_WRITE              | 11      | server -> client  |
+----------------------------------+---------+-------------------+
| VFIO_USER_VM_INTERRUPT           | 12      | server -> client  |
+----------------------------------+---------+-------------------+
| VFIO_USER_DEVICE_RESET           | 13      | client -> server  |
+----------------------------------+---------+-------------------+

.. Note:: Some VFIO defines cannot be reused since their values are
   architecture-specific (e.g. VFIO_IOMMU_MAP_DMA).

Header
------
All messages, both command messages and reply messages, are preceded by a
header that contains basic information about the message. The header is
followed by message-specific data described in the sections below.

+----------------+---------+-------------+
| Name           | Offset  | Size        |
+================+=========+=============+
| Message ID     | 0       | 2           |
+----------------+---------+-------------+
| Command        | 2       | 2           |
+----------------+---------+-------------+
| Message size   | 4       | 4           |
+----------------+---------+-------------+
| Flags          | 8       | 4           |
+----------------+---------+-------------+
|                | +------+------------+ |
|                | | Bit  | Definition | |
|                | +======+============+ |
|                | | 0-3  | Type       | |
|                | +------+------------+ |
|                | | 4    | No_reply   | |
|                | +------+------------+ |
|                | | 5    | Error      | |
|                | +------+------------+ |
+----------------+---------+-------------+
| Errno          | 12      | 4           |
+----------------+---------+-------------+
| <message data> | 16      | variable    |
+----------------+---------+-------------+

* *Message ID* identifies the message, and is echoed in the command's reply message.
* *Command* specifies the command to be executed, listed in Commands_.
* *Message size* contains the size of the entire message, including the header.
* *Flags* contains attributes of the message:

  * The *Type* bits indicate the message type.

    *  *Command* (value 0x0) indicates a command message.
    *  *Reply* (value 0x1) indicates a reply message acknowledging a previous
       command with the same message ID.
  * *No_reply* in a command message indicates that no reply is needed for this command.
    This is commonly used when multiple commands are sent, and only the last needs
    acknowledgement.
  * *Error* in a reply message indicates the command being acknowledged had
    an error. In this case, the *Errno* field will be valid.

* *Errno* in a reply message is a UNIX errno value. It is reserved in a command message.

Each command message in Commands_ must be replied to with a reply message, unless the
message sets the *No_Reply* bit.  The reply consists of the header with the *Reply*
bit set, plus any additional data.

VFIO_USER_VERSION
-----------------

Message format
^^^^^^^^^^^^^^

+--------------+------------------------+
| Name         | Value                  |
+==============+========================+
| Message ID   | <ID>                   |
+--------------+------------------------+
| Command      | 1                      |
+--------------+------------------------+
| Message size | 16 + version length    |
+--------------+------------------------+
| Flags        | Reply bit set in reply |
+--------------+------------------------+
| Errno        | 0/errno                |
+--------------+------------------------+
| Version      | JSON byte array        |
+--------------+------------------------+

This is the initial message sent by the server after the socket connection is
established. The version is in JSON format, and the following objects must be
included:

+--------------+--------+---------------------------------------------------+
| Name         | Type   | Description                                       |
+==============+========+===================================================+
| version      | object | ``{"major": <number>, "minor": <number>}``        |
|              |        |                                                   |
|              |        | Version supported by the sender, e.g. "0.1".      |
+--------------+--------+---------------------------------------------------+
| capabilities | array  | Reserved. Can be omitted for v0.1, otherwise must |
|              |        | be empty.                                         |
+--------------+--------+---------------------------------------------------+

.. _Version:

Versioning and Feature Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Upon accepting a connection, the server must send a VFIO_USER_VERSION message
proposing a protocol version and a set of capabilities. The client compares
these with the versions and capabilities it supports and sends a
VFIO_USER_VERSION reply according to the following rules.

* The major version in the reply must be the same as proposed. If the client
  does not support the proposed major, it closes the connection.
* The minor version in the reply must be equal to or less than the minor
  version proposed.
* The capability list must be a subset of those proposed. If the client
  requires a capability the server did not include, it closes the connection.

The protocol major version will only change when incompatible protocol changes
are made, such as changing the message format. The minor version may change
when compatible changes are made, such as adding new messages or capabilities,
Both the client and server must support all minor versions less than the
maximum minor version it supports. E.g., an implementation that supports
version 1.3 must also support 1.0 through 1.2.

When making a change to this specification, the protocol version number must
be included in the form "added in version X.Y"


VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP
-----------------------------------------

Message Format
^^^^^^^^^^^^^^

+--------------+------------------------+
| Name         | Value                  |
+==============+========================+
| Message ID   | <ID>                   |
+--------------+------------------------+
| Command      | MAP=2, UNMAP=3         |
+--------------+------------------------+
| Message size | 16 + table size        |
+--------------+------------------------+
| Flags        | Reply bit set in reply |
+--------------+------------------------+
| Errno        | 0/errno                |
+--------------+------------------------+
| Table        | array of table entries |
+--------------+------------------------+

This command message is sent by the client to the server to inform it of the
memory regions the server can access. It must be sent before the server can
perform any DMA to the client. It is normally sent directly after the version
handshake is completed, but may also occur when memory is added to or
subtracted from the client, or if the client uses a vIOMMU. If the client does
not expect the server to perform DMA then it does not need to send to the
server VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP commands. If the server does
not need to perform DMA the then it can ignore such commands but it must still
reply to them. The table is an array of the following structure.  This
structure is 32 bytes in size, so the message size is:
16 + (# of table entries * 32).

Table entry format
^^^^^^^^^^^^^^^^^^

+-------------+--------+-------------+
| Name        | Offset | Size        |
+=============+========+=============+
| Address     | 0      | 8           |
+-------------+--------+-------------+
| Size        | 8      | 8           |
+-------------+--------+-------------+
| Offset      | 16     | 8           |
+-------------+--------+-------------+
| Protections | 24     | 4           |
+-------------+--------+-------------+
| Flags       | 28     | 4           |
+-------------+--------+-------------+
|             | +-----+------------+ |
|             | | Bit | Definition | |
|             | +=====+============+ |
|             | | 0   | Mappable   | |
|             | +-----+------------+ |
+-------------+--------+-------------+

* *Address* is the base DMA address of the region.
* *Size* is the size of the region.
* *Offset* is the file offset of the region with respect to the associated file
  descriptor.
* *Protections* are the region's protection attributes as encoded in
  ``<sys/mman.h>``.
* *Flags* contain the following region attributes:

  * *Mappable* indicates that the region can be mapped via the mmap() system call
    using the file descriptor provided in the message meta-data.

VFIO_USER_DMA_MAP
"""""""""""""""""
If a DMA region being added can be directly mapped by the server, an array of
file descriptors must be sent as part of the message meta-data. Each region
entry must have a corresponding file descriptor. On AF_UNIX sockets, the file
descriptors must be passed as SCM_RIGHTS type ancillary data. Otherwise, if a
DMA region cannot be directly mapped by the server, it can be accessed by the
server using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, explained in
`Read and Write Operations`_. A command to map over an existing region must be
failed by the server with ``EEXIST`` set in error field in the reply.

VFIO_USER_DMA_UNMAP
"""""""""""""""""""
Upon receiving a VFIO_USER_DMA_UNMAP command, if the file descriptor is mapped
then the server must release all references to that DMA region before replying,
which includes potentially in flight DMA transactions. Removing a portion of a
DMA region is possible. 

VFIO_USER_DEVICE_GET_INFO
-------------------------

Message format
^^^^^^^^^^^^^^

+--------------+----------------------------+
| Name         | Value                      |
+==============+============================+
| Message ID   | <ID>                       |
+--------------+----------------------------+
| Command      | 4                          |
+--------------+----------------------------+
| Message size | 16 in command, 32 in reply |
+--------------+----------------------------+
| Flags        | Reply bit set in reply     |
+--------------+----------------------------+
| Errno        | 0/errno                    |
+--------------+----------------------------+
| Device info  | VFIO device info           |
+--------------+----------------------------+

This command message is sent by the client to the server to query for basic
information about the device. Only the message header is needed in the command
message.  The VFIO device info structure is defined in ``<linux/vfio.h>``
(``struct vfio_device_info``).

VFIO device info format
^^^^^^^^^^^^^^^^^^^^^^^

+-------------+--------+--------------------------+
| Name        | Offset | Size                     |
+=============+========+==========================+
| argsz       | 16     | 4                        |
+-------------+--------+--------------------------+
| flags       | 20     | 4                        |
+-------------+--------+--------------------------+
|             | +-----+-------------------------+ |
|             | | Bit | Definition              | |
|             | +=====+=========================+ |
|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
|             | +-----+-------------------------+ |
|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
|             | +-----+-------------------------+ |
+-------------+--------+--------------------------+
| num_regions | 24     | 4                        |
+-------------+--------+--------------------------+
| num_irqs    | 28     | 4                        |
+-------------+--------+--------------------------+

* *argsz* is the size of the VFIO device info structure.
* *flags* contains the following device attributes.

  * VFIO_DEVICE_FLAGS_RESET indicates that the device supports the
    VFIO_USER_DEVICE_RESET message.
  * VFIO_DEVICE_FLAGS_PCI indicates that the device is a PCI device.

* *num_regions* is the number of memory regions that the device exposes.
* *num_irqs* is the number of distinct interrupt types that the device supports.

This version of the protocol only supports PCI devices. Additional devices may
be supported in future versions.

VFIO_USER_DEVICE_GET_REGION_INFO
--------------------------------

Message format
^^^^^^^^^^^^^^

+--------------+------------------------+
| Name         | Value                  |
+==============+========================+
| Message ID   | <ID>                   |
+--------------+------------------------+
| Command      | 5                      |
+--------------+------------------------+
| Message size | 48 + any caps          |
+--------------+------------------------+
| Flags        | Reply bit set in reply |
+--------------+------------------------+
| Errno        | 0/errno                |
+--------------+------------------------+
| Region info  | VFIO region info       |
+--------------+------------------------+

This command message is sent by the client to the server to query for
information about device memory regions. The VFIO region info structure is
defined in ``<linux/vfio.h>`` (``struct vfio_region_info``). Since the client
does not know the size of the capabilities, the size of the reply it should
expect is 48 plus any capabilities whose size is indicated in the size field of
the reply header.

VFIO region info format
^^^^^^^^^^^^^^^^^^^^^^^

+------------+--------+------------------------------+
| Name       | Offset | Size                         |
+============+========+==============================+
| argsz      | 16     | 4                            |
+------------+--------+------------------------------+
| flags      | 20     | 4                            |
+------------+--------+------------------------------+
|            | +-----+-----------------------------+ |
|            | | Bit | Definition                  | |
|            | +=====+=============================+ |
|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
|            | +-----+-----------------------------+ |
|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
|            | +-----+-----------------------------+ |
|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
|            | +-----+-----------------------------+ |
|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
|            | +-----+-----------------------------+ |
+------------+--------+------------------------------+
| index      | 24     | 4                            |
+------------+--------+------------------------------+
| cap_offset | 28     | 4                            |
+------------+--------+------------------------------+
| size       | 32     | 8                            |
+------------+--------+------------------------------+
| offset     | 40     | 8                            |
+------------+--------+------------------------------+

* *argsz* is the size of the VFIO region info structure plus the
  size of any region capabilities returned.
* *flags* are attributes of the region:

  * *VFIO_REGION_INFO_FLAG_READ* allows client read access to the region.
  * *VFIO_REGION_INFO_FLAG_WRITE* allows client write access to the region.
  * *VFIO_REGION_INFO_FLAG_MMAP* specifies the client can mmap() the region.
    When this flag is set, the reply will include a file descriptor in its
    meta-data. On AF_UNIX sockets, the file descriptors will be passed as
    SCM_RIGHTS type ancillary data.
  * *VFIO_REGION_INFO_FLAG_CAPS* indicates additional capabilities found in the
    reply.

* *index* is the index of memory region being queried, it is the only field
  that is required to be set in the command message.
* *cap_offset* describes where additional region capabilities can be found.
  cap_offset is relative to the beginning of the VFIO region info structure.
  The data structure it points is a VFIO cap header defined in
  ``<linux/vfio.h>``.
* *size* is the size of the region.
* *offset* is the offset given to the mmap() system call for regions with the
  MMAP attribute. It is also used as the base offset when mapping a VFIO
  sparse mmap area, described below.

VFIO Region capabilities
^^^^^^^^^^^^^^^^^^^^^^^^
The VFIO region information can also include a capabilities list. This list is
similar to a PCI capability list - each entry has a common header that
identifies a capability and where the next capability in the list can be found.
The VFIO capability header format is defined in ``<linux/vfio.h>`` (``struct
vfio_info_cap_header``).

VFIO cap header format
^^^^^^^^^^^^^^^^^^^^^^

+---------+--------+------+
| Name    | Offset | Size |
+=========+========+======+
| id      | 0      | 2    |
+---------+--------+------+
| version | 2      | 2    |
+---------+--------+------+
| next    | 4      | 4    |
+---------+--------+------+

* *id* is the capability identity.
* *version* is a capability-specific version number.
* *next* specifies the offset of the next capability in the capability list. It
  is relative to the beginning of the VFIO region info structure.

VFIO sparse mmap
^^^^^^^^^^^^^^^^

+------------------+----------------------------------+
| Name             | Value                            |
+==================+==================================+
| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
+------------------+----------------------------------+
| version          | 0x1                              |
+------------------+----------------------------------+
| next             | <next>                           |
+------------------+----------------------------------+
| sparse mmap info | VFIO region info sparse mmap     |
+------------------+----------------------------------+

The only capability supported in this version of the protocol is for sparse
mmap. This capability is defined when only a subrange of the region supports
direct access by the client via mmap(). The VFIO sparse mmap area is defined in
``<linux/vfio.h>`` (``struct vfio_region_sparse_mmap_area``).

VFIO region info cap sparse mmap
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+----------+--------+------+
| Name     | Offset | Size |
+==========+========+======+
| nr_areas | 0      | 4    |
+----------+--------+------+
| reserved | 4      | 4    |
+----------+--------+------+
| offset   | 8      | 8    |
+----------+--------+------+
| size     | 16     | 9    |
+----------+--------+------+
| ...      |        |      |
+----------+--------+------+

* *nr_areas* is the number of sparse mmap areas in the region.
* *offset* and size describe a single area that can be mapped by the client.
  There will be nr_areas pairs of offset and size. The offset will be added to
  the base offset given in the VFIO_USER_DEVICE_GET_REGION_INFO to form the
  offset argument of the subsequent mmap() call.

The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct
vfio_region_info_cap_sparse_mmap``).

VFIO_USER_DEVICE_GET_IRQ_INFO
-----------------------------

Message format
^^^^^^^^^^^^^^

+--------------+------------------------+
| Name         | Value                  |
+==============+========================+
| Message ID   | <ID>                   |
+--------------+------------------------+
| Command      | 6                      |
+--------------+------------------------+
| Message size | 32                     |
+--------------+------------------------+
| Flags        | Reply bit set in reply |
+--------------+------------------------+
| Errno        | 0/errno                |
+--------------+------------------------+
| IRQ info     | VFIO IRQ info          |
+--------------+------------------------+

This command message is sent by the client to the server to query for
information about device interrupt types. The VFIO IRQ info structure is
defined in ``<linux/vfio.h>`` (``struct vfio_irq_info``).

VFIO IRQ info format
^^^^^^^^^^^^^^^^^^^^

+-------+--------+---------------------------+
| Name  | Offset | Size                      |
+=======+========+===========================+
| argsz | 16     | 4                         |
+-------+--------+---------------------------+
| flags | 20     | 4                         |
+-------+--------+---------------------------+
|       | +-----+--------------------------+ |
|       | | Bit | Definition               | |
|       | +=====+==========================+ |
|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
|       | +-----+--------------------------+ |
|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
|       | +-----+--------------------------+ |
|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
|       | +-----+--------------------------+ |
|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
|       | +-----+--------------------------+ |
+-------+--------+---------------------------+
| index | 24     | 4                         |
+-------+--------+---------------------------+
| count | 28     | 4                         |
+-------+--------+---------------------------+

* *argsz* is the size of the VFIO IRQ info structure.
* *flags* defines IRQ attributes:

  * *VFIO_IRQ_INFO_EVENTFD* indicates the IRQ type can support server eventfd
    signalling.
  * *VFIO_IRQ_INFO_MASKABLE* indicates that the IRQ type supports the MASK and
    UNMASK actions in a VFIO_USER_DEVICE_SET_IRQS message.
  * *VFIO_IRQ_INFO_AUTOMASKED* indicates the IRQ type masks itself after being
    triggered, and the client must send an UNMASK action to receive new
    interrupts.
  * *VFIO_IRQ_INFO_NORESIZE* indicates VFIO_USER_SET_IRQS operations setup
    interrupts as a set, and new sub-indexes cannot be enabled without disabling
    the entire type.

* index is the index of IRQ type being queried, it is the only field that is
  required to be set in the command message.
* count describes the number of interrupts of the queried type.

VFIO_USER_DEVICE_SET_IRQS
-------------------------

Message format
^^^^^^^^^^^^^^

+--------------+------------------------+
| Name         | Value                  |
+==============+========================+
| Message ID   | <ID>                   |
+--------------+------------------------+
| Command      | 7                      |
+--------------+------------------------+
| Message size | 36 + any data          |
+--------------+------------------------+
| Flags        | Reply bit set in reply |
+--------------+------------------------+
| Errno        | 0/errno                |
+--------------+------------------------+
| IRQ set      | VFIO IRQ set           |
+--------------+------------------------+

This command message is sent by the client to the server to set actions for
device interrupt types. The VFIO IRQ set structure is defined in
``<linux/vfio.h>`` (``struct vfio_irq_set``).

VFIO IRQ set format
^^^^^^^^^^^^^^^^^^^

+-------+--------+------------------------------+
| Name  | Offset | Size                         |
+=======+========+==============================+
| argsz | 16     | 4                            |
+-------+--------+------------------------------+
| flags | 20     | 4                            |
+-------+--------+------------------------------+
|       | +-----+-----------------------------+ |
|       | | Bit | Definition                  | |
|       | +=====+=============================+ |
|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
|       | +-----+-----------------------------+ |
|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
|       | +-----+-----------------------------+ |
|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
|       | +-----+-----------------------------+ |
|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
|       | +-----+-----------------------------+ |
|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
|       | +-----+-----------------------------+ |
|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
|       | +-----+-----------------------------+ |
+-------+--------+------------------------------+
| index | 24     | 4                            |
+-------+--------+------------------------------+
| start | 28     | 4                            |
+-------+--------+------------------------------+
| count | 32     | 4                            |
+-------+--------+------------------------------+
| data  | 36     | variable                     |
+-------+--------+------------------------------+

* *argsz* is the size of the VFIO IRQ set structure, including any *data* field.
* *flags* defines the action performed on the interrupt range. The DATA flags
  describe the data field sent in the message; the ACTION flags describe the
  action to be performed. The flags are mutually exclusive for both sets.

  * *VFIO_IRQ_SET_DATA_NONE* indicates there is no data field in the command.
    The action is performed unconditionally.
  * *VFIO_IRQ_SET_DATA_BOOL* indicates the data field is an array of boolean
    bytes. The action is performed if the corresponding boolean is true.
  * *VFIO_IRQ_SET_DATA_EVENTFD* indicates an array of event file descriptors
    was sent in the message meta-data. These descriptors will be signalled when
    the action defined by the action flags occurs. In AF_UNIX sockets, the
    descriptors are sent as SCM_RIGHTS type ancillary data.
  * *VFIO_IRQ_SET_ACTION_MASK* indicates a masking event. It can be used with
    VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to mask an interrupt, or
    with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the guest masks
    the interrupt.
  * *VFIO_IRQ_SET_ACTION_UNMASK* indicates an unmasking event. It can be used
    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to unmask an
    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the
    guest unmasks the interrupt.
  * *VFIO_IRQ_SET_ACTION_TRIGGER* indicates a triggering event. It can be used
    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to trigger an
    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the
    server triggers the interrupt.

* *index* is the index of IRQ type being setup.
* *start* is the start of the sub-index being set.
* *count* describes the number of sub-indexes being set. As a special case, a
  count of 0 with data flags of VFIO_IRQ_SET_DATA_NONE disables all interrupts
  of the index.
* *data* is an optional field included when the
  VFIO_IRQ_SET_DATA_BOOL flag is present. It contains an array of booleans
  that specify whether the action is to be performed on the corresponding
  index. It's used when the action is only performed on a subset of the range
  specified.

Not all interrupt types support every combination of data and action flags.
The client must know the capabilities of the device and IRQ index before it
sends a VFIO_USER_DEVICE_SET_IRQ message.

.. _Read and Write Operations:

Read and Write Operations
-------------------------

Not all I/O operations between the client and server can be done via direct
access of memory mapped with an mmap() call. In these cases, the client and
server use messages sent over the socket. It is expected that these operations
will have lower performance than direct access.

The client can access server memory with VFIO_USER_REGION_READ and
VFIO_USER_REGION_WRITE commands. These share a common data structure that
appears after the message header.

REGION Read/Write Data
^^^^^^^^^^^^^^^^^^^^^^

+--------+--------+----------+
| Name   | Offset | Size     |
+========+========+==========+
| Offset | 16     | 8        |
+--------+--------+----------+
| Region | 24     | 4        |
+--------+--------+----------+
| Count  | 28     | 4        |
+--------+--------+----------+
| Data   | 32     | variable |
+--------+--------+----------+

* *Offset* into the region being accessed.
* *Region* is the index of the region being accessed.
* *Count* is the size of the data to be transferred.
* *Data* is the data to be read or written.

The server can access client memory with VFIO_USER_DMA_READ and
VFIO_USER_DMA_WRITE messages. These also share a common data structure that
appears after the message header.

DMA Read/Write Data
^^^^^^^^^^^^^^^^^^^

+---------+--------+----------+
| Name    | Offset | Size     |
+=========+========+==========+
| Address | 16     | 8        |
+---------+--------+----------+
| Count   | 24     | 4        |
+---------+--------+----------+
| Data    | 28     | variable |
+---------+--------+----------+

* *Address* is the area of client memory being accessed. This address must have
  been previously exported to the server with a VFIO_USER_DMA_MAP message.
* *Count* is the size of the data to be transferred.
* *Data* is the data to be read or written.

VFIO_USER_REGION_READ
---------------------

Message format
^^^^^^^^^^^^^^

+--------------+------------------------+
| Name         | Value                  |
+==============+========================+
| Message ID   | <ID>                   |
+--------------+------------------------+
| Command      | 8                      |
+--------------+------------------------+
| Message size | 32 + data size         |
+--------------+------------------------+
| Flags        | Reply bit set in reply |
+--------------+------------------------+
| Errno        | 0/errno                |
+--------------+------------------------+
| Read info    | REGION read/write data |
+--------------+------------------------+

This command message is sent from the client to the server to read from server
memory.  In the command messages, there is no data, and the count is the amount
of data to be read. The reply message must include the data read, and its count
field is the amount of data read.

VFIO_USER_REGION_WRITE
----------------------

Message format
^^^^^^^^^^^^^^

+--------------+------------------------+
| Name         | Value                  |
+==============+========================+
| Message ID   | <ID>                   |
+--------------+------------------------+
| Command      | 9                      |
+--------------+------------------------+
| Message size | 32 + data size         |
+--------------+------------------------+
| Flags        | Reply bit set in reply |
+--------------+------------------------+
| Errno        | 0/errno                |
+--------------+------------------------+
| Write info   | REGION read/write data |
+--------------+------------------------+

This command message is sent from the client to the server to write to server
memory.  The command message must contain the data to be written, and its count
field must contain the amount of write data. The count field in the reply
message must be zero.

VFIO_USER_DMA_READ
------------------

Message format
^^^^^^^^^^^^^^

+--------------+------------------------+
| Name         | Value                  |
+==============+========================+
| Message ID   | <ID>                   |
+--------------+------------------------+
| Command      | 10                     |
+--------------+------------------------+
| Message size | 28 + data size         |
+--------------+------------------------+
| Flags        | Reply bit set in reply |
+--------------+------------------------+
| Errno        | 0/errno                |
+--------------+------------------------+
| DMA info     | DMA read/write data    |
+--------------+------------------------+

This command message is sent from the server to the client to read from client
memory.  In the command message, there is no data, and the count must will be
the amount of data to be read. The reply message must include the data read,
and its count field must be the amount of data read.

VFIO_USER_DMA_WRITE
-------------------

Message format
^^^^^^^^^^^^^^

+--------------+------------------------+
| Name         | Value                  |
+==============+========================+
| Message ID   | <ID>                   |
+--------------+------------------------+
| Command      | 11                     |
+--------------+------------------------+
| Message size | 28 + data size         |
+--------------+------------------------+
| Flags        | Reply bit set in reply |
+--------------+------------------------+
| Errno        | 0/errno                |
+--------------+------------------------+
| DMA info     | DMA read/write data    |
+--------------+------------------------+

This command message is sent from the server to the client to write to server
memory.  The command message must contain the data to be written, and its count
field must contain the amount of write data. The count field in the reply
message must be zero.

VFIO_USER_VM_INTERRUPT
----------------------

Message format
^^^^^^^^^^^^^^

+----------------+------------------------+
| Name           | Value                  |
+================+========================+
| Message ID     | <ID>                   |
+----------------+------------------------+
| Command        | 12                     |
+----------------+------------------------+
| Message size   | 24                     |
+----------------+------------------------+
| Flags          | Reply bit set in reply |
+----------------+------------------------+
| Errno          | 0/errno                |
+----------------+------------------------+
| Interrupt info | <interrupt>            |
+----------------+------------------------+

This command message is sent from the server to the client to signal the device
has raised an interrupt.

Interrupt info format
^^^^^^^^^^^^^^^^^^^^^

+-----------+--------+------+
| Name      | Offset | Size |
+===========+========+======+
| Index     | 16     | 4    |
+-----------+--------+------+
| Sub-index | 20     | 4    |
+-----------+--------+------+

* *Index* is the interrupt index; it is the same value used in
  VFIO_USER_SET_IRQS.
* *Sub-index* is relative to the index, e.g., the vector number used in PCI
  MSI/X type interrupts.

VFIO_USER_DEVICE_RESET
----------------------

Message format
^^^^^^^^^^^^^^

+--------------+------------------------+
| Name         | Value                  |
+==============+========================+
| Message ID   | <ID>                   |
+--------------+------------------------+
| Command      | 13                     |
+--------------+------------------------+
| Message size | 16                     |
+--------------+------------------------+
| Flags        | Reply bit set in reply |
+--------------+------------------------+
| Errno        | 0/errno                |
+--------------+------------------------+

This command message is sent from the client to the server to reset the device.

Appendices
==========

Unused VFIO ioctl() commands
----------------------------

The following VFIO commands do not have an equivalent vfio-user command:

* VFIO_GET_API_VERSION
* VFIO_CHECK_EXTENSION
* VFIO_SET_IOMMU
* VFIO_GROUP_GET_STATUS
* VFIO_GROUP_SET_CONTAINER
* VFIO_GROUP_UNSET_CONTAINER
* VFIO_GROUP_GET_DEVICE_FD
* VFIO_IOMMU_GET_INFO

However, once support for live migration for VFIO devices is finalized some
of the above commands may have to be handled by the client in their
corresponding vfio-user form. This will be addressed in a future protocol
version.

Live Migration
--------------
Currently live migration is not supported for devices passed through via VFIO,
therefore it is not supported for vfio-user, either. This is being actively
worked on in the `Add migration support for VFIO devices (v25) 
<https://lists.gnu.org/archive/html/qemu-devel/2020-06/msg06463.html>`_
patch series.

VFIO groups and containers
^^^^^^^^^^^^^^^^^^^^^^^^^^

The current VFIO implementation includes group and container idioms that
describe how a device relates to the host IOMMU. In the vfio-user
implementation, the IOMMU is implemented in SW by the client, and is not
visible to the server. The simplest idea would be that the client put each
device into its own group and container.

Backend Program Conventions
---------------------------

vfio-user backend program conventions are based on the vhost-user ones.

* The backend program must not daemonize itself.
* No assumptions must be made as to what access the backend program has on the
  system.
* File descriptors 0, 1 and 2 must exist, must have regular
  stdin/stdout/stderr semantics, and can be redirected.
* The backend program must honor the SIGTERM signal.
* The backend program must accept the following commands line options:

  * ``--socket-path=PATH``: path to UNIX domain socket,
  * ``--fd=FDNUM``: file descriptor for UNIX domain socket, incompatible with
    ``--socket-path``
* The backend program must be accompanied with a JSON file stored under
  ``/usr/share/vfio-user``.
Re: [PATCH] introduce VFIO-over-socket protocol specificaion
Posted by Alex Williamson 4 years, 1 month ago
On Thu, 16 Jul 2020 08:31:43 -0700
Thanos Makatos <thanos.makatos@nutanix.com> wrote:

> This patch introduces the VFIO-over-socket protocol specification, which
> is designed to allow devices to be emulated outside QEMU, in a separate
> process. VFIO-over-socket reuses the existing VFIO defines, structs and
> concepts.
> 
> It has been earlier discussed as an RFC in:
> "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> 
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
> ---
>  docs/devel/vfio-over-socket.rst | 1135 +++++++++++++++++++++++++++++++++++++++
>  1 files changed, 1135 insertions(+), 0 deletions(-)
>  create mode 100644 docs/devel/vfio-over-socket.rst
> 
> diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-socket.rst
> new file mode 100644
> index 0000000..723b944
> --- /dev/null
> +++ b/docs/devel/vfio-over-socket.rst
> @@ -0,0 +1,1135 @@
> +***************************************
> +VFIO-over-socket Protocol Specification
> +***************************************
> +
> +Version 0.1
> +
> +Introduction
> +============
> +VFIO-over-socket, also known as vfio-user, is a protocol that allows a device
> +to be virtualized in a separate process outside of QEMU. VFIO-over-socket
> +devices consist of a generic VFIO device type, living inside QEMU, which we
> +call the client, and the core device implementation, living outside QEMU, which
> +we call the server. VFIO-over-socket can be the main transport mechanism for
> +multi-process QEMU, however it can be used by other applications offering
> +device virtualization. Explaining the advantages of a
> +disaggregated/multi-process QEMU, and device virtualization outside QEMU in
> +general, is beyond the scope of this document.
> +
> +This document focuses on specifying the VFIO-over-socket protocol. VFIO has
> +been chosen for the following reasons:
> +
> +1) It is a mature and stable API, backed by an extensively used framework.
> +2) The existing VFIO client implementation (qemu/hw/vfio/) can be largely
> +   reused.
> +
> +In a proof of concept implementation it has been demonstrated that using VFIO
> +over a UNIX domain socket is a viable option. VFIO-over-socket is designed with
> +QEMU in mind, however it could be used by other client applications. The
> +VFIO-over-socket protocol does not require that QEMU's VFIO client
> +implementation is used in QEMU. None of the VFIO kernel modules are required
> +for supporting the protocol, neither in the client nor the server, only the
> +source header files are used.
> +
> +The main idea is to allow a virtual device to function in a separate process in
> +the same host over a UNIX domain socket. A UNIX domain socket (AF_UNIX) is
> +chosen because we can trivially send file descriptors over it, which in turn
> +allows:
> +
> +* Sharing of guest memory for DMA with the virtual device process.
> +* Sharing of virtual device memory with the guest for fast MMIO.
> +* Efficient sharing of eventfd's for triggering interrupts.
> +
> +However, other socket types could be used which allows the virtual device
> +process to run in a separate guest in the same host (AF_VSOCK) or remotely
> +(AF_INET). Theoretically the underlying transport doesn't necessarily have to
> +be a socket, however we don't examine such alternatives. In this document we
> +focus on using a UNIX domain socket and introduce basic support for the other
> +two types of sockets without considering performance implications.
> +
> +This document does not yet describe any internal details of the server-side
> +implementation, however QEMU's VFIO client implementation will have to be
> +adapted according to this protocol in order to support VFIO-over-socket virtual
> +devices.
> +
> +VFIO
> +====
> +VFIO is a framework that allows a physical device to be securely passed through
> +to a user space process; the kernel does not drive the device at all.
> +Typically, the user space process is a VM and the device is passed through to
> +it in order to achieve high performance. VFIO provides an API and the required
> +functionality in the kernel. QEMU has adopted VFIO to allow a guest virtual
> +machine to directly access physical devices, instead of emulating them in
> +software
> +
> +VFIO-over-socket reuses the core VFIO concepts defined in its API, but
> +implements them as messages to be sent over a UNIX-domain socket. It does not
> +change the kernel-based VFIO in any way, in fact none of the VFIO kernel
> +modules need to be loaded to use VFIO-over-socket. It is also possible for QEMU
> +to concurrently use the current kernel-based VFIO for one guest device, and use
> +VFIO-over-socket for another device in the same guest.
> +
> +VFIO Device Model
> +-----------------
> +A device under VFIO presents a standard VFIO model to the user process. Many
> +of the VFIO operations in the existing kernel model use the ioctl() system
> +call, and references to the existing model are called the ioctl()
> +implementation in this document.
> +
> +The following sections describe the set of messages that implement the VFIO
> +device model over a UNIX domain socket. In many cases, the messages are direct
> +translations of data structures used in the ioctl() implementation. Messages
> +derived from ioctl()s will have a name derived from the ioctl() command name.
> +E.g., the VFIO_GET_INFO ioctl() command becomes a VFIO_USER_GET_INFO message.
> +The purpose for this reuse is to share as much code as feasible with the
> +ioctl() implementation.
> +
> +Client and Server
> +^^^^^^^^^^^^^^^^^
> +The socket connects two processes together: a client process and a server
> +process. In the context of this document, the client process is the process
> +emulating a guest virtual machine, such as QEMU. The server process is a
> +process that provides device emulation.
> +
> +Connection Initiation
> +^^^^^^^^^^^^^^^^^^^^^
> +After the client connects to the server, the initial server message is
> +VFIO_USER_VERSION to propose a protocol version and set of capabilities to
> +apply to the session. The client replies with a compatible version and set of
> +capabilities it will support, or closes the connection if it cannot support the
> +advertised version.
> +
> +Guest Memory Configuration
> +^^^^^^^^^^^^^^^^^^^^^^^^^^
> +The client uses VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP messages to inform
> +the server of the valid guest DMA ranges that the server can access on behalf
> +of a device. Guest memory may be accessed by the server via VFIO_USER_DMA_READ
> +and VFIO_USER_DMA_WRITE messages over the socket.
> +
> +An optimization for server access to guest memory is for the client to provide
> +file descriptors the server can mmap() to directly access guest memory. Note
> +that mmap() privileges cannot be revoked by the client, therefore file
> +descriptors should only be exported in environments where the client trusts the
> +server not to corrupt guest memory.
> +
> +Device Information
> +^^^^^^^^^^^^^^^^^^
> +The client uses a VFIO_USER_DEVICE_GET_INFO message to query the server for
> +information about the device. This information includes:
> +
> +* The device type and capabilities,
> +* the number of memory regions, and
> +* the device presents to the guest the number of interrupt types the device
> +  supports.
> +
> +Region Information
> +^^^^^^^^^^^^^^^^^^
> +The client uses VFIO_USER_DEVICE_GET_REGION_INFO messages to query the server
> +for information about the device's memory regions. This information describes:
> +
> +* Read and write permissions, whether it can be memory mapped, and whether it
> +  supports additional capabilities.
> +* Region index, size, and offset.
> +
> +When a region can be mapped by the client, the server provides a file
> +descriptor which the client can mmap(). The server is responsible for polling
> +for client updates to memory mapped regions.
> +
> +Region Capabilities
> +"""""""""""""""""""
> +Some regions have additional capabilities that cannot be described adequately
> +by the region info data structure. These capabilities are returned in the
> +region info reply in a list similar to PCI capabilities in a PCI device's
> +configuration space. 
> +
> +Sparse Regions
> +""""""""""""""
> +A region can be memory-mappable in whole or in part. When only a subset of a
> +region can be mapped by the client, a VFIO_REGION_INFO_CAP_SPARSE_MMAP
> +capability is included in the region info reply. This capability describes
> +which portions can be mapped by the client.
> +
> +For example, in a virtual NVMe controller, sparse regions can be used so that
> +accesses to the NVMe registers (found in the beginning of BAR0) are trapped (an
> +infrequent an event), while allowing direct access to the doorbells (an
> +extremely frequent event as every I/O submission requires a write to BAR0),
> +found right after the NVMe registers in BAR0.
> +
> +Interrupts
> +^^^^^^^^^^
> +The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the server for
> +the device's interrupt types. The interrupt types are specific to the bus the
> +device is attached to, and the client is expected to know the capabilities of
> +each interrupt type. The server can signal an interrupt either with
> +VFIO_USER_VM_INTERRUPT messages over the socket, or can directly inject
> +interrupts into the guest via an event file descriptor. The client configures
> +how the server signals an interrupt with VFIO_USER_SET_IRQS messages.


Given that in-kernel VFIO uses eventfds exclusively for interrupts
to the client, what's the use case of the VFIO_USER_VM_INTERRUPT message
over the socket to generate an interrupt?  If a client wanted to know
about it, wouldn't they have used VFIO_USER_SET_IRQS to configure an
eventfd?

> +
> +Device Read and Write
> +^^^^^^^^^^^^^^^^^^^^^
> +When the guest executes load or store operations to device memory, the client
> +forwards these operations to the server with VFIO_USER_REGION_READ or
> +VFIO_USER_REGION_WRITE messages. The server will reply with data from the
> +device on read operations or an acknowledgement on write operations.
> +
> +DMA
> +^^^
> +When a device performs DMA accesses to guest memory, the server will forward
> +them to the client with VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages.
> +These messages can only be used to access guest memory the client has
> +configured into the server.
> +
> +Protocol Specification
> +======================
> +To distinguish from the base VFIO symbols, all VFIO-over-socket symbols are
> +prefixed with vfio_user or VFIO_USER. In revision 0.1, all data is in the
> +little-endian format, although this may be relaxed in future revision in cases
> +where the client and server are both big-endian. The messages are formatted
> +for seamless reuse of the native VFIO structs. A server can serve:
> +
> +1) multiple clients, and/or
> +2) multiple virtual devices, belonging to one or more clients.
> +
> +Therefore each message requires a header that uniquely identifies the virtual
> +device. It is a server-side implementation detail whether a single server
> +handles multiple virtual devices from the same or multiple guests.
>
> +Socket
> +------
> +A single UNIX domain socket is assumed to be used for each device. The location
> +of the socket is implementation-specific. Multiplexing clients, devices, and
> +servers over the same socket is not supported in this version of the protocol,
> +but a device ID field exists in the message header so that a future support can
> +be added without a major version change.


If we were to make use of that, we really start to stray from the
in-kernel model where a file descriptor is the representation of a
device to the user.  It also seems to invite server exploits if a
client can attempt to spoof devices used by other clients through their
socket.  How do you imagine this being used?


> +Authentication
> +--------------
> +For AF_UNIX, we rely on OS mandatory access controls on the socket files,
> +therefore it is up to the management layer to set up the socket as required.
> +Socket types than span guests or hosts will require a proper authentication


s/than/that/


> +mechanism. Defining that mechanism is deferred to a future version of the
> +protocol.
> +
> +Request Concurrency
> +-------------------
> +There can be multiple outstanding requests per virtual device, e.g. a
> +frame buffer where the guest does multiple stores to the virtual device. The
> +server can execute and reorder non-conflicting requests in parallel, depending
> +on the device semantics.
> +
> +Socket Disconnection Behavior
> +-----------------------------
> +The server and the client can disconnect from each other, either intentionally
> +or unexpectedly. Both the client and the server need to know how to handle such
> +events.
> +
> +Server Disconnection
> +^^^^^^^^^^^^^^^^^^^^
> +A server disconnecting from the client may indicate that:
> +
> +1) A virtual device has been restarted, either intentionally (e.g. because of a
> +device update) or unintentionally (e.g. because of a crash). In any case, the
> +virtual device will come back so the client should not do anything (e.g. simply
> +reconnect and retry failed operations).
> +
> +2) A virtual device has been shut down with no intention to be restarted.
> +
> +It is impossible for the client to know whether or not a failure is
> +intermittent or innocuous and should be retried, therefore the client should
> +attempt to reconnect to the socket. Since an intentional server restart (e.g.
> +due to an upgrade) might take some time, a reasonable timeout should be used.
> +In cases where the disconnection is expected (e.g. the guest shutting down), no
> +new requests will be sent anyway so this situation doesn't pose a problem. The
> +control stack will clean up accordingly.


The theory sounds reasonable, the practice sounds much more difficult.
For example if we're emulating a PCI device in the client, a blocked
read or write could be problematic to the guest.


> +Parametrizing this behaviour by having the virtual device advertise a
> +reasonable reconnect is deferred to a future version of the protocol.
> +
> +Client Disconnection
> +^^^^^^^^^^^^^^^^^^^^
> +The client disconnecting from the server primarily means that the QEMU process
> +has exited. Currently this means that the guest is shut down so the device is
> +no longer needed therefore the server can automatically exit. However, there
> +can be cases where a client disconnect should not result in a server exit:
> +
> +1) A single server serving multiple clients.
> +2) A multi-process QEMU upgrading itself step by step, which isn't yet
> +   implemented.
> +
> +Therefore in order for the protocol to be forward compatible the server should
> +take no action when the client disconnects. If anything happens to the client
> +process the control stack will know about it and can clean up resources
> +accordingly.


This seems to invite complication, if QEMU crashes, the device is gone
and the server should release the resources.  If we're in this mythical
second scenario, wouldn't it make sense to implement a suspend and
resume sequence such that the sever will exit or free resources unless
it's been told otherwise.  Maybe the suspend operation could also
indicate a timeout such that the server would consider the client dead
after some period and do cleanup.


> +Request Retry and Response Timeout
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +QEMU's VFIO retries certain operations if they fail. While this makes sense for
> +real HW, we don't know for sure whether it makes sense for virtual devices. A
> +failed request is a request that has been successfully sent and has been
> +responded to with an error code. Failure to send the request in the first place
> +(e.g. because the socket is disconnected) is a different type of error examined
> +earlier in the disconnect section.
> +
> +Defining a retry and timeout scheme if deferred to a future version of the
> +protocol.
> +
> +Commands
> +--------
> +The following table lists the VFIO message command IDs, and whether the
> +message request is sent from the client or the server.
> +
> ++----------------------------------+---------+-------------------+
> +| Name                             | Command | Request Direction |
> ++==================================+=========+===================+
> +| VFIO_USER_VERSION                | 1       | server → client   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_MAP                | 2       | client → server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_UNMAP              | 3       | client → server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_INFO        | 4       | client → server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_REGION_INFO | 5       | client → server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_IRQ_INFO    | 6       | client → server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_SET_IRQS        | 7       | client → server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_REGION_READ            | 8       | client → server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_REGION_WRITE           | 9       | client → server   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_READ               | 10      | server → client   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_READ               | 11      | server → client   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_USER_VM_INTERRUPT           | 12      | server → client   |
> ++----------------------------------+---------+-------------------+
> +| VFIO_DEVICE_RESET                | 13      | client → server   |
> ++----------------------------------+---------+-------------------+


The only one not to use _USER_?


> +
> +Header
> +------
> +All messages are preceded by a 16 byte header that contains basic information
> +about the message. The header is followed by message-specific data described
> +in the sections below.
> +
> ++----------------+--------+-------------+
> +| Name           | Offset | Size        |
> ++================+========+=============+
> +| Device ID      | 0      | 2           |
> ++----------------+--------+-------------+
> +| Message ID     | 2      | 2           |
> ++----------------+--------+-------------+
> +| Command        | 4      | 4           |
> ++----------------+--------+-------------+
> +| Message size   | 8      | 4           |
> ++----------------+--------+-------------+
> +| Flags          | 12     | 4           |
> ++----------------+--------+-------------+
> +|                | +-----+------------+ |
> +|                | | Bit | Definition | |
> +|                | +=====+============+ |
> +|                | | 0   | Reply      | |
> +|                | +-----+------------+ |
> +|                | | 1   | No_reply   | |
> +|                | +-----+------------+ | 
> ++----------------+--------+-------------+
> +| <message data> | 16     | variable    |
> ++----------------+--------+-------------+
> +
> +* Device ID identifies the destination device of the message. This field is
> +  reserved when the server only supports one device per socket.
> +* Message ID identifies the message, and is used in the message acknowledgement.


I don't find any protocol for managing this.  Is the client expected to
increment per message and wrap around?


> +* Command specifies the command to be executed, listed in the Command Table.
> +* Message size contains the size of the entire message, including the header.
> +* Flags contains attributes of the message:
> +
> +  * The reply bit differentiates request messages from reply messages. A reply
> +    message acknowledges a previous request with the same message ID.
> +  * No_reply indicates that no reply is needed for this request. This is
> +    commonly used when multiple requests are sent, and only the last needs
> +    acknowledgement.


I wish I could think of better terminology, I had to delete my question
of why we need two bits for flags that at a casual read appear to be
opposites ;)


> +
> +VFIO_USER_VERSION
> +-----------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | 0                      |
> ++--------------+------------------------+
> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | 1                      |
> ++--------------+------------------------+
> +| Message size | 16 + version length    |
> ++--------------+------------------------+
> +| Flags        | Reply bit set in reply |
> ++--------------+------------------------+
> +| Version      | JSON byte array        |
> ++--------------+------------------------+
> +
> +This is the initial message sent by the server after the socket connection is
> +established. The version is in JSON format, and the following objects must be
> +included:
> +
> ++--------------+--------+---------------------------------------------------+
> +| Name         | Type   | Description                                       |
> ++==============+========+===================================================+
> +| version      | object | {“major�: <number>, “minor�: <number>}            |
> +|              |        | Version supported by the sender, e.g. “0.1�.      |
> ++--------------+--------+---------------------------------------------------+
> +| type         | string | Fixed to “vfio-user�.                             |
> ++--------------+--------+---------------------------------------------------+
> +| capabilities | array  | Reserved. Can be omitted for v0.1, otherwise must |
> +|              |        | be empty.                                         |
> ++--------------+--------+---------------------------------------------------+
> +
> +Versioning and Feature Support
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +Upon accepting a connection, the server must send a VFIO_USER_VERSION message
> +proposing a protocol version and a set of capabilities. The client compares
> +these with the versions and capabilities it supports and sends a
> +VFIO_USER_VERSION reply according to the following rules.
> +
> +* The major version in the reply must be the same as proposed. If the client
> +  does not support the proposed major, it closes the connection.
> +* The minor version in the reply must be equal to or less than the minor
> +  version proposed.
> +* The capability list must be a subset of those proposed. If the client
> +  requires a capability the server did not include, it closes the connection.
> +* If type is not “vfio-user�, the client closes the connection.
> +
> +The protocol major version will only change when incompatible protocol changes
> +are made, such as changing the message format. The minor version may change
> +when compatible changes are made, such as adding new messages or capabilities,
> +Both the client and server must support all minor versions less than the
> +maximum minor version it supports. E.g., an implementation that supports
> +version 1.3 must also support 1.0 through 1.2.
> +
> +VFIO_USER_DMA_MAP
> +-----------------
> +
> +VFIO_USER_DMA_UNMAP
> +-------------------
> +
> +Message Format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | 0                      |
> ++--------------+------------------------+
> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | MAP=2, UNMAP=3         |
> ++--------------+------------------------+
> +| Message size | 16 + table size        |
> ++--------------+------------------------+
> +| Flags        | Reply bit set in reply |
> ++--------------+------------------------+
> +| Table        | array of table entries |
> ++--------------+------------------------+
> +
> +This message is sent by the client to the server to inform it of the guest
> +memory regions the device can access. It must be sent before the device can
> +perform any DMA to the guest. It is normally sent directly after the version
> +handshake is completed, but may also occur when memory is added or subtracted
> +in the guest.
> +
> +The table is an array of the following structure. This structure is 32 bytes
> +in size, so the message size will be 16 + (# of table entries * 32). If a
> +region being added can be directly mapped by the server, an array of file
> +descriptors will be sent as part of the message meta-data. Each region entry
> +will have a corresponding file descriptor. On AF_UNIX sockets, the file
> +descriptors will be passed as SCM_RIGHTS type ancillary data.
> +
> +Table entry format
> +^^^^^^^^^^^^^^^^^^
> +
> ++-------------+--------+-------------+
> +| Name        | Offset | Size        |
> ++=============+========+=============+
> +| Address     | 0      | 8           |
> ++-------------+--------+-------------+
> +| Size        | 8      | 8           |
> ++-------------+--------+-------------+
> +| Offset      | 16     | 8           |
> ++-------------+--------+-------------+
> +| Protections | 24     | 4           |
> ++-------------+--------+-------------+
> +| Flags       | 28     | 4           |
> ++-------------+--------+-------------+
> +|             | +-----+------------+ |
> +|             | | Bit | Definition | |
> +|             | +=====+============+ |
> +|             | | 0   | Mappable   | |
> +|             | +-----+------------+ |
> ++-------------+--------+-------------+
> +
> +* Address is the base DMA address of the region.
> +* Size is the size of the region.
> +* Offset is the file offset of the region with respect to the associated file
> +  descriptor.
> +* Protections are the region's protection attributes as encoded in
> +  ``<sys/mman.h>``.
> +* Flags contain the following region attributes:
> +
> +  * Mappable indicate the region can be mapped via the mmap() system call using
> +    the file descriptor provided in the message meta-data.
> +
> +VFIO_USER_DEVICE_GET_INFO
> +-------------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+----------------------------+
> +| Name         | Value                      |
> ++==============+============================+
> +| Device ID    | <ID>                       |


The previous commands were specified with Device ID 0, which made me
think that's the reserved value, but here we indicate it as a variable.
Does this imply that VFIO_USER_DMA_MAP/UNMAP are per socket rather than
per device?


> ++--------------+----------------------------+
> +| Message ID   | <ID>                       |
> ++--------------+----------------------------+
> +| Command      | 4                          |
> ++--------------+----------------------------+
> +| Message size | 16 in request, 32 in reply |
> ++--------------+----------------------------+
> +| Flags        | Reply bit set in reply     |
> ++--------------+----------------------------+
> +| Device info  | VFIO device info           |
> ++--------------+----------------------------+
> +
> +This message is sent by the client to the server to query for basic information
> +about the device. Only the message header is needed in the request message.
> +The VFIO device info structure is defined in ``<sys/vfio.h>`` (``struct


<linux/vfio.h>?


> +vfio_device_info``).
> +
> +VFIO device info format
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> ++-------------+--------+--------------------------+
> +| Name        | Offset | Size                     |
> ++=============+========+==========================+
> +| argsz       | 16     | 4                        |
> ++-------------+--------+--------------------------+
> +| flags       | 20     | 4                        |
> ++-------------+--------+--------------------------+
> +|             | +-----+-------------------------+ |
> +|             | | Bit | Definition              | |
> +|             | +=====+=========================+ |
> +|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
> +|             | +-----+-------------------------+ |
> +|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
> +|             | +-----+-------------------------+ |
> ++-------------+--------+--------------------------+
> +| num_regions | 24     | 4                        |
> ++-------------+--------+--------------------------+
> +| num_irqs    | 28     | 4                        |
> ++-------------+--------+--------------------------+
> +
> +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
> +  implementation.


This seems strange, why not use it throughout?  It seems this makes it
more difficult for the client if it wants to unwrap the reply and send
it through code common with in-kernel vfio usage.


> +* flags contains the following device attributes.
> +
> +  * VFIO_DEVICE_FLAGS_RESET indicates the device supports the
> +    VFIO_USER_DEVICE_RESET message.
> +  * VFIO_DEVICE_FLAGS_PCI indicates the device is a PCI device.
> +
> +* num_regions is the number of memory regions the device exposes.
> +* num_irqs is the number of distinct interrupt types the device supports.
> +
> +This version of the protocol only supports PCI devices. Additional devices may
> +be supported in future versions. 
> +
> +VFIO_USER_DEVICE_GET_REGION_INFO
> +--------------------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------+
> +| Name         | Value            |
> ++==============+==================+
> +| Device ID    | <ID>             |
> ++--------------+------------------+
> +| Message ID   | <ID>             |
> ++--------------+------------------+
> +| Command      | 5                | 
> ++--------------+------------------+
> +| Message size | 48 + any caps    |
> ++--------------+------------------+
> +| Flags Reply  | bit set in reply |
> ++--------------+------------------+
> +| Region info  | VFIO region info |
> ++--------------+------------------+
> +
> +This message is sent by the client to the server to query for information about
> +device memory regions. The VFIO region info structure is defined in
> +``<sys/vfio.h>`` (``struct vfio_region_info``).
> +
> +VFIO region info format
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> ++------------+--------+------------------------------+
> +| Name       | Offset | Size                         |
> ++============+========+==============================+
> +| argsz      | 16     | 4                            |
> ++------------+--------+------------------------------+
> +| flags      | 20     | 4                            |
> ++------------+--------+------------------------------+
> +|            | +-----+-----------------------------+ |
> +|            | | Bit | Definition                  | |
> +|            | +=====+=============================+ |
> +|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
> +|            | +-----+-----------------------------+ |
> +|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
> +|            | +-----+-----------------------------+ |
> +|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
> +|            | +-----+-----------------------------+ |
> +|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
> +|            | +-----+-----------------------------+ |
> ++------------+--------+------------------------------+
> +| index      | 24     | 4                            |
> ++------------+--------+------------------------------+
> +| cap_offset | 28     | 4                            |
> ++------------+--------+------------------------------+
> +| size       | 32     | 8                            |
> ++------------+--------+------------------------------+
> +| offset     | 40     | 8                            |
> ++------------+--------+------------------------------+
> +
> +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
> +  implementation.
> +* flags are attributes of the region:
> +
> +  * VFIO_REGION_INFO_FLAG_READ allows client read access to the region.
> +  * VFIO_REGION_INFO_FLAG_WRITE allows client write access region.
> +  * VFIO_REGION_INFO_FLAG_MMAP specifies the client can mmap() the region. When
> +    this flag is set, the reply will include a file descriptor in its meta-data.
> +    On AF_UNIX sockets, the file descriptors will be passed as SCM_RIGHTS type
> +    ancillary data.
> +  * VFIO_REGION_INFO_FLAG_CAPS indicates additional capabilities found in the
> +    reply.
> +
> +* index is the index of memory region being queried, it is the only field that
> +  is required to be set in the request message.
> +* cap_offset describes where additional region capabilities can be found.
> +  cap_offset is relative to the beginning of the VFIO region info structure.
> +  The data structure it points is a VFIO cap header defined in ``<sys/vfio.h>``.
> +* size is the size of the region.
> +* offset is the offset given to the mmap() system call for regions with the
> +  MMAP attribute. It is also used as the base offset when mapping a VFIO
> +  sparse mmap area, described below.
> +
> +VFIO Region capabilities
> +^^^^^^^^^^^^^^^^^^^^^^^^
> +The VFIO region information can also include a capabilities list. This list is
> +similar to a PCI capability list - each entry has a common header that
> +identifies a capability and where the next capability in the list can be found.
> +The VFIO capability header format is defined in ``<sys/vfio.h>`` (``struct
> +vfio_info_cap_header``).
> +
> +VFIO cap header format
> +^^^^^^^^^^^^^^^^^^^^^^
> +
> ++---------+--------+------+
> +| Name    | Offset | Size |
> ++=========+========+======+
> +| id      | 0      | 2    |
> ++---------+--------+------+
> +| version | 2      | 2    |
> ++---------+--------+------+
> +| next    | 4      | 4    |
> ++---------+--------+------+
> +
> +* id is the capability identity.
> +* version is a capability-specific version number.
> +* next specifies the offset of the next capability in the capability list. It
> +  is relative to the beginning of the VFIO region info structure.
> +
> +VFIO sparse mmap
> +^^^^^^^^^^^^^^^^
> +
> ++------------------+----------------------------------+
> +| Name             | Value                            |
> ++==================+==================================+
> +| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
> ++------------------+----------------------------------+
> +| version          | 0x1                              |
> ++------------------+----------------------------------+
> +| next             | <next>                           |
> ++------------------+----------------------------------+
> +| sparse mmap info | VFIO region info sparse mmap     |
> ++------------------+----------------------------------+
> +
> +The only capability supported in this version of the protocol is for sparse
> +mmap. This capability is defined when only a subrange of the region supports
> +direct access by the client via mmap(). The VFIO sparse mmap area is defined in
> +``<sys/vfio.h>`` (``struct vfio_region_sparse_mmap_area``).
> +
> +VFIO region info cap sparse mmap
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ++----------+--------+------+
> +| Name     | Offset | Size |
> ++==========+========+======+
> +| nr_areas | 0      | 4    |
> ++----------+--------+------+
> +| reserved | 4      | 4    |
> ++----------+--------+------+
> +| offset   | 8      | 8    |
> ++----------+--------+------+
> +| size     | 16     | 9    |
> ++----------+--------+------+
> +| ...      |        |      |
> ++----------+--------+------+
> +
> +* nr_areas is the number of sparse mmap areas in the region.
> +* offset and size describe a single area that can be mapped by the client.
> +  There will be nr_areas pairs of offset and size. The offset will be added to
> +  the base offset given in the VFIO_USER_DEVICE_GET_REGION_INFO to form the
> +  offset argument of the subsequent mmap() call.
> +
> +The VFIO sparse mmap area is defined in ``<sys/vfio.h>`` (``struct
> +vfio_region_info_cap_sparse_mmap``).
> +
> +VFIO_USER_DEVICE_GET_IRQ_INFO
> +-----------------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | <ID>                   |
> ++--------------+------------------------+
> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | 6                      |
> ++--------------+------------------------+
> +| Message size | 32                     |
> ++--------------+------------------------+
> +| Flags        | Reply bit set in reply |
> ++--------------+------------------------+
> +| IRQ info     | VFIO IRQ info          |
> ++--------------+------------------------+
> +
> +This message is sent by the client to the server to query for information about
> +device interrupt types. The VFIO IRQ info structure is defined in
> +``<sys/vfio.h>`` (``struct vfio_irq_info``).
> +
> +VFIO IRQ info format
> +^^^^^^^^^^^^^^^^^^^^
> +
> ++-------+--------+---------------------------+
> +| Name  | Offset | Size                      |
> ++=======+========+===========================+
> +| argsz | 16     | 4                         |
> ++-------+--------+---------------------------+
> +| flags | 20     | 4                         |
> ++-------+--------+---------------------------+
> +|       | +-----+--------------------------+ |
> +|       | | Bit | Definition               | |
> +|       | +=====+==========================+ |
> +|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
> +|       | +-----+--------------------------+ |
> +|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
> +|       | +-----+--------------------------+ |
> +|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
> +|       | +-----+--------------------------+ |
> +|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
> +|       | +-----+--------------------------+ |
> ++-------+--------+---------------------------+
> +| index | 24     | 4                         |
> ++-------+--------+---------------------------+
> +| count | 28     | 4                         |
> ++-------+--------+---------------------------+
> +
> +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
> +  implementation.
> +* flags defines IRQ attributes:
> +
> +  * VFIO_IRQ_INFO_EVENTFD indicates the IRQ type can support server eventfd
> +    signalling.
> +  * VFIO_IRQ_INFO_MASKABLE indicates that the IRQ type supports the MASK and
> +    UNMASK actions in a VFIO_USER_DEVICE_SET_IRQS message.
> +  * VFIO_IRQ_INFO_AUTOMASKED indicates the IRQ type masks itself after being
> +    triggered, and the client must send an UNMASK action to receive new
> +    interrupts.
> +  * VFIO_IRQ_INFO_NORESIZE indicates VFIO_USER_SET_IRQS operations setup
> +    interrupts as a set, and new subindexes cannot be enabled without disabling
> +    the entire type.
> +
> +* index is the index of IRQ type being queried, it is the only field that is
> +  required to be set in the request message.
> +* count describes the number of interrupts of the queried type.
> +
> +VFIO_USER_DEVICE_SET_IRQS
> +-------------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> +| Device ID    | <ID>                   |
> ++--------------+------------------------+
> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | 7                      |
> ++--------------+------------------------+
> +| Message size | 36 + any data          |
> ++--------------+------------------------+
> +| Flags        | Reply bit set in reply |
> ++--------------+------------------------+
> +| IRQ set      | VFIO IRQ set           |
> ++--------------+------------------------+
> +
> +This message is sent by the client to the server to set actions for device
> +interrupt types. The VFIO IRQ set structure is defined in ``<sys/vfio.h>``
> +(``struct vfio_irq_set``).
> +
> +VFIO IRQ info format
> +^^^^^^^^^^^^^^^^^^^^
> +
> ++-------+--------+------------------------------+
> +| Name  | Offset | Size                         |
> ++=======+========+==============================+
> +| argsz | 6      | 4                            |
> ++-------+--------+------------------------------+
> +| flags | 20     | 4                            |
> ++-------+--------+------------------------------+
> +|       | +-----+-----------------------------+ |
> +|       | | Bit | Definition                  | |
> +|       | +=====+=============================+ |
> +|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
> +|       | +-----+-----------------------------+ |
> +|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
> +|       | +-----+-----------------------------+ |
> +|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
> +|       | +-----+-----------------------------+ |
> +|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
> +|       | +-----+-----------------------------+ |
> +|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
> +|       | +-----+-----------------------------+ |
> +|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
> +|       | +-----+-----------------------------+ |
> ++-------+--------+------------------------------+
> +| index | 24     | 4                            |
> ++-------+--------+------------------------------+
> +| start | 28     | 4                            |
> ++-------+--------+------------------------------+
> +| count | 32     | 4                            |
> ++-------+--------+------------------------------+
> +| data  | 36     | variable                     |
> ++-------+--------+------------------------------+
> +
> +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
> +  implementation.
> +* flags defines the action performed on the interrupt range. The DATA flags
> +  describe the data field sent in the message; the ACTION flags describe the
> +  action to be performed. The flags are mutually exclusive for both sets.
> +
> +  * VFIO_IRQ_SET_DATA_NONE indicates there is no data field in the request. The
> +    action is performed unconditionally.
> +  * VFIO_IRQ_SET_DATA_BOOL indicates the data field is an array of boolean
> +    bytes. The action is performed if the corresponding boolean is true.
> +  * VFIO_IRQ_SET_DATA_EVENTFD indicates an array of event file descriptors was
> +    sent in the message meta-data. These descriptors will be signalled when the
> +    action defined by the action flags occurs. In AF_UNIX sockets, the
> +    descriptors are sent as SCM_RIGHTS type ancillary data.
> +  * VFIO_IRQ_SET_ACTION_MASK indicates a masking event. It can be used with
> +    VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to mask an interrupt, or
> +    with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the guest masks
> +    the interrupt. 
> +  * VFIO_IRQ_SET_ACTION_UNMASK indicates an unmasking event. It can be used
> +    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to unmask an
> +    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the
> +    guest unmasks the interrupt. 
> +  * VFIO_IRQ_SET_ACTION_TRIGGER indicates a triggering event. It can be used
> +    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to trigger an
> +    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the
> +    guest triggers the interrupt.
> +
> +* index is the index of IRQ type being setup.
> +* start is the start of the subindex being set.
> +* count describes the number of sub-indexes being set. As a special case, a
> +  count of 0 with data flags of VFIO_IRQ_SET_DATA_NONE disables all interrupts
> +  of the index data is an optional field included when the
                 ^
                 |
                 +-  Missing period? 

> +  VFIO_IRQ_SET_DATA_BOOL flag is present. It contains an array of booleans
> +  that specify whether the action is to be performed on the corresponding
> +  index. It's used when the action is only performed on a subset of the range
> +  specified.
> +
> +Not all interrupt types support every combination of data and action flags.
> +The client must know the capabilities of the device and IRQ index before it
> +sends a VFIO_USER_DEVICE_SET_IRQ message.
> +
> +Read and Write Operations
> +-------------------------
> +
> +Not all I/O operations between the client and server can be done via direct
> +access of memory mapped with an mmap() call. In these cases, the client and
> +server use messages sent over the socket. It is expected that these operations
> +will have lower performance than direct access.
> +
> +The client can access device memory with VFIO_USER_REGION_READ and
> +VFIO_USER_REGION_WRITE requests. These share a common data structure that
> +appears after the 16 byte message header. 
> +
> +REGION Read/Write Data
> +^^^^^^^^^^^^^^^^^^^^^^
> +
> ++--------+--------+----------+
> +| Name   | Offset | Size     |
> ++========+========+==========+
> +| Offset | 16     | 8        |
> ++--------+--------+----------+
> +| Region | 24     | 4        |
> ++--------+--------+----------+
> +| Count  | 28     | 4        |
> ++--------+--------+----------+
> +| Data   | 32     | variable |
> ++--------+--------+----------+
> +
> +* Offset into the region being accessed.
> +* Region is the index of the region being accessed.
> +* Count is the size of the data to be transferred.
> +* Data is the data to be read or written.
> +
> +The server can access guest memory with VFIO_USER_DMA_READ and
> +VFIO_USER_DMA_WRITE messages. These also share a common data structure that
> +appears after the 16 byte message header.
> +
> +DMA Read/Write Data
> +^^^^^^^^^^^^^^^^^^^
> +
> ++---------+--------+----------+
> +| Name    | Offset | Size     |
> ++=========+========+==========+
> +| Address | 16     | 8        |
> ++---------+--------+----------+
> +| Count   | 24     | 4        |
> ++---------+--------+----------+
> +| Data    | 28     | variable |
> ++---------+--------+----------+
> +
> +* Address is the area of guest memory being accessed. This address must have
> +  been exported to the server with a VFIO_USER_DMA_MAP message.
> +* Count is the size of the data to be transferred.
> +* Data is the data to be read or written.
> +
> +Address and count can also be accessed as ``struct iovec`` from ``<sys/uio.h>``.
> +
> +VFIO_USER_REGION_READ
> +---------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | <ID>                   |
> ++--------------+------------------------+
> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | 8                      |
> ++--------------+------------------------+
> +| Message size | 32 + data size         |
> ++--------------+------------------------+
> +| Flags Reply  | bit set in reply       |
> ++--------------+------------------------+
> +| Read info    | REGION read/write data |
> ++--------------+------------------------+
> +
> +This request is sent from the client to the server to read from device memory.
> +In the request messages, there will be no data, and the count field will be the
> +amount of data to be read. The reply will include the data read, and its count
> +field will be the amount of data read.
> +
> +VFIO_USER_REGION_WRITE
> +----------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | <ID>                   |
> ++--------------+------------------------+
> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | 9                      |
> ++--------------+------------------------+
> +| Message size | 32 + data size         |
> ++--------------+------------------------+
> +| Flags        | Reply bit set in reply |
> ++--------------+------------------------+
> +| Write info   | REGION read write data |
> ++--------------+------------------------+
> +
> +This request is sent from the client to the server to write to device memory.
> +The request message will contain the data to be written, and its count field
> +will contain the amount of write data. The count field in the reply will be
> +zero.
> +
> +VFIO_USER_DMA_READ
> +------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+---------------------+
> +| Name         | Value               |
> ++==============+=====================+
> +| Device ID    | <ID>                |
> ++--------------+---------------------+
> +| Message ID   | <ID>                |
> ++--------------+---------------------+
> +| Command      | 10                  |
> ++--------------+---------------------+
> +| Message size | 28 + data size      |
> ++--------------+---------------------+
> +| Flags Reply  | bit set in reply    |
> ++--------------+---------------------+
> +| DMA info     | DMA read/write data |
> ++--------------+---------------------+
> +
> +This request is sent from the server to the client to read from guest memory.
> +In the request messages, there will be no data, and the count field will be the
> +amount of data to be read. The reply will include the data read, and its count
> +field will be the amount of data read.
> +
> +VFIO_USER_DMA_WRITE
> +-------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | <ID>                   |
> ++--------------+------------------------+
> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | 11                     |
> ++--------------+------------------------+
> +| Message size | 28 + data size         |
> ++--------------+------------------------+
> +| Flags        | Reply bit set in reply |
> ++--------------+------------------------+
> +| DMA info     | DMA read/write data    |
> ++--------------+------------------------+
> +
> +This request is sent from the server to the client to write to guest memory.
> +The request message will contain the data to be written, and its count field
> +will contain the amount of write data. The count field in the reply will be
> +zero.
> +
> +VFIO_USER_VM_INTERRUPT
> +----------------------
> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++----------------+------------------------+
> +| Name           | Value                  |
> ++================+========================+
> +| Device ID      | <ID>                   |
> ++----------------+------------------------+
> +| Message ID     | <ID>                   |
> ++----------------+------------------------+
> +| Command        | 12                     |
> ++----------------+------------------------+
> +| Message size   | 24                     |
> ++----------------+------------------------+
> +| Flags          | Reply bit set in reply |
> ++----------------+------------------------+
> +| Interrupt info | <interrupt>            |
> ++----------------+------------------------+
> +
> +This request is sent from the server to the client to signal the device has
> +raised an interrupt.
> +
> +Interrupt info format
> +^^^^^^^^^^^^^^^^^^^^^
> +
> ++----------+--------+------+
> +| Name     | Offset | Size |
> ++==========+========+======+
> +| Index    | 16     | 4    |
> ++----------+--------+------+
> +| Subindex | 20     | 4    |
> ++----------+--------+------+
> +
> +* Index is the interrupt index; it is the same value used in VFIO_USER_SET_IRQS.
> +* Subindex is relative to the index, e.g., the vector number used in PCI MSI/X
> +  type interrupts.


I'm still not sure why we need this, but why is a reply required?
Wouldn't an edge interrupt not require a reply?  For a level interrupt,
does the reply imply an EOI?  But really, why are we sending interrupts
through anything other than eventfds?  SET_IRQS is used to configure
the active interrupt on the device, it allows the client to provide
eventfds, so when would we ever need to send an unsolicited,
unconfigured interrupt?


> +
> +VFIO_USER_DEVICE_RESET
> +----------------------


Aha, just a typo above to missing _USER_

Thanks,
Alex

> +
> +Message format
> +^^^^^^^^^^^^^^
> +
> ++--------------+------------------------+
> +| Name         | Value                  |
> ++==============+========================+
> +| Device ID    | <ID>                   |
> ++--------------+------------------------+
> +| Message ID   | <ID>                   |
> ++--------------+------------------------+
> +| Command      | 13                     |
> ++--------------+------------------------+
> +| Message size | 16                     |
> ++--------------+------------------------+
> +| Flags        | Reply bit set in reply |
> ++--------------+------------------------+
> +
> +This request is sent from the client to the server to reset the device.
> +
> +Appendices
> +==========
> +
> +Unused VFIO ioctl() commands
> +----------------------------
> +
> +The following commands must be handled by the client and not sent to the server:
> +
> +* VFIO_GET_API_VERSION
> +* VFIO_CHECK_EXTENSION
> +* VFIO_SET_IOMMU
> +* VFIO_GROUP_GET_STATUS
> +* VFIO_GROUP_SET_CONTAINER
> +* VFIO_GROUP_UNSET_CONTAINER
> +* VFIO_GROUP_GET_DEVICE_FD
> +* VFIO_IOMMU_GET_INFO
> +
> +However, once support for live migration for VFIO devices is finalized some
> +of the above commands might have to be handled by the client. This will be
> +addressed in a future protocol version.
> +
> +Live Migration
> +--------------
> +Currently live migration is not supported for devices passed through via VFIO,
> +therefore it is not supported for VFIO-over-socket, either. This is being
> +actively worked on in the "Add migration support for VFIO devices" (v25) patch
> +series.
> +
> +VFIO groups and containers
> +^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The current VFIO implementation includes group and container idioms that
> +describe how a device relates to the host IOMMU. In the VFIO over socket
> +implementation, the IOMMU is implemented in SW by the client, and isn't visible
> +to the server. The simplest idea is for the client is to put each device into
> +its own group and container.


Re: [PATCH] introduce VFIO-over-socket protocol specificaion
Posted by John G Johnson 4 years, 1 month ago
	Comments embedded.  Thanks for looking at it.

					JJ



> On Jul 29, 2020, at 4:51 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> On Thu, 16 Jul 2020 08:31:43 -0700
> Thanos Makatos <thanos.makatos@nutanix.com> wrote:
> 
>> This patch introduces the VFIO-over-socket protocol specification, which
>> is designed to allow devices to be emulated outside QEMU, in a separate
>> process. VFIO-over-socket reuses the existing VFIO defines, structs and
>> concepts.
>> 
>> It has been earlier discussed as an RFC in:
>> "RFC: use VFIO over a UNIX domain socket to implement device offloading"
>> 
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
>> ---
>> docs/devel/vfio-over-socket.rst | 1135 +++++++++++++++++++++++++++++++++++++++
>> 1 files changed, 1135 insertions(+), 0 deletions(-)
>> create mode 100644 docs/devel/vfio-over-socket.rst
>> 
>> diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-socket.rst
>> new file mode 100644
>> index 0000000..723b944
>> --- /dev/null
>> +++ b/docs/devel/vfio-over-socket.rst
>> @@ -0,0 +1,1135 @@
>> +***************************************
>> +VFIO-over-socket Protocol Specification
>> +***************************************
>> +
>> +Version 0.1
>> +
>> +Introduction
>> +============
>> +VFIO-over-socket, also known as vfio-user, is a protocol that allows a device
>> +to be virtualized in a separate process outside of QEMU. VFIO-over-socket
>> +devices consist of a generic VFIO device type, living inside QEMU, which we
>> +call the client, and the core device implementation, living outside QEMU, which
>> +we call the server. VFIO-over-socket can be the main transport mechanism for
>> +multi-process QEMU, however it can be used by other applications offering
>> +device virtualization. Explaining the advantages of a
>> +disaggregated/multi-process QEMU, and device virtualization outside QEMU in
>> +general, is beyond the scope of this document.
>> +
>> +This document focuses on specifying the VFIO-over-socket protocol. VFIO has
>> +been chosen for the following reasons:
>> +
>> +1) It is a mature and stable API, backed by an extensively used framework.
>> +2) The existing VFIO client implementation (qemu/hw/vfio/) can be largely
>> +   reused.
>> +
>> +In a proof of concept implementation it has been demonstrated that using VFIO
>> +over a UNIX domain socket is a viable option. VFIO-over-socket is designed with
>> +QEMU in mind, however it could be used by other client applications. The
>> +VFIO-over-socket protocol does not require that QEMU's VFIO client
>> +implementation is used in QEMU. None of the VFIO kernel modules are required
>> +for supporting the protocol, neither in the client nor the server, only the
>> +source header files are used.
>> +
>> +The main idea is to allow a virtual device to function in a separate process in
>> +the same host over a UNIX domain socket. A UNIX domain socket (AF_UNIX) is
>> +chosen because we can trivially send file descriptors over it, which in turn
>> +allows:
>> +
>> +* Sharing of guest memory for DMA with the virtual device process.
>> +* Sharing of virtual device memory with the guest for fast MMIO.
>> +* Efficient sharing of eventfd's for triggering interrupts.
>> +
>> +However, other socket types could be used which allows the virtual device
>> +process to run in a separate guest in the same host (AF_VSOCK) or remotely
>> +(AF_INET). Theoretically the underlying transport doesn't necessarily have to
>> +be a socket, however we don't examine such alternatives. In this document we
>> +focus on using a UNIX domain socket and introduce basic support for the other
>> +two types of sockets without considering performance implications.
>> +
>> +This document does not yet describe any internal details of the server-side
>> +implementation, however QEMU's VFIO client implementation will have to be
>> +adapted according to this protocol in order to support VFIO-over-socket virtual
>> +devices.
>> +
>> +VFIO
>> +====
>> +VFIO is a framework that allows a physical device to be securely passed through
>> +to a user space process; the kernel does not drive the device at all.
>> +Typically, the user space process is a VM and the device is passed through to
>> +it in order to achieve high performance. VFIO provides an API and the required
>> +functionality in the kernel. QEMU has adopted VFIO to allow a guest virtual
>> +machine to directly access physical devices, instead of emulating them in
>> +software
>> +
>> +VFIO-over-socket reuses the core VFIO concepts defined in its API, but
>> +implements them as messages to be sent over a UNIX-domain socket. It does not
>> +change the kernel-based VFIO in any way, in fact none of the VFIO kernel
>> +modules need to be loaded to use VFIO-over-socket. It is also possible for QEMU
>> +to concurrently use the current kernel-based VFIO for one guest device, and use
>> +VFIO-over-socket for another device in the same guest.
>> +
>> +VFIO Device Model
>> +-----------------
>> +A device under VFIO presents a standard VFIO model to the user process. Many
>> +of the VFIO operations in the existing kernel model use the ioctl() system
>> +call, and references to the existing model are called the ioctl()
>> +implementation in this document.
>> +
>> +The following sections describe the set of messages that implement the VFIO
>> +device model over a UNIX domain socket. In many cases, the messages are direct
>> +translations of data structures used in the ioctl() implementation. Messages
>> +derived from ioctl()s will have a name derived from the ioctl() command name.
>> +E.g., the VFIO_GET_INFO ioctl() command becomes a VFIO_USER_GET_INFO message.
>> +The purpose for this reuse is to share as much code as feasible with the
>> +ioctl() implementation.
>> +
>> +Client and Server
>> +^^^^^^^^^^^^^^^^^
>> +The socket connects two processes together: a client process and a server
>> +process. In the context of this document, the client process is the process
>> +emulating a guest virtual machine, such as QEMU. The server process is a
>> +process that provides device emulation.
>> +
>> +Connection Initiation
>> +^^^^^^^^^^^^^^^^^^^^^
>> +After the client connects to the server, the initial server message is
>> +VFIO_USER_VERSION to propose a protocol version and set of capabilities to
>> +apply to the session. The client replies with a compatible version and set of
>> +capabilities it will support, or closes the connection if it cannot support the
>> +advertised version.
>> +
>> +Guest Memory Configuration
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +The client uses VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP messages to inform
>> +the server of the valid guest DMA ranges that the server can access on behalf
>> +of a device. Guest memory may be accessed by the server via VFIO_USER_DMA_READ
>> +and VFIO_USER_DMA_WRITE messages over the socket.
>> +
>> +An optimization for server access to guest memory is for the client to provide
>> +file descriptors the server can mmap() to directly access guest memory. Note
>> +that mmap() privileges cannot be revoked by the client, therefore file
>> +descriptors should only be exported in environments where the client trusts the
>> +server not to corrupt guest memory.
>> +
>> +Device Information
>> +^^^^^^^^^^^^^^^^^^
>> +The client uses a VFIO_USER_DEVICE_GET_INFO message to query the server for
>> +information about the device. This information includes:
>> +
>> +* The device type and capabilities,
>> +* the number of memory regions, and
>> +* the device presents to the guest the number of interrupt types the device
>> +  supports.
>> +
>> +Region Information
>> +^^^^^^^^^^^^^^^^^^
>> +The client uses VFIO_USER_DEVICE_GET_REGION_INFO messages to query the server
>> +for information about the device's memory regions. This information describes:
>> +
>> +* Read and write permissions, whether it can be memory mapped, and whether it
>> +  supports additional capabilities.
>> +* Region index, size, and offset.
>> +
>> +When a region can be mapped by the client, the server provides a file
>> +descriptor which the client can mmap(). The server is responsible for polling
>> +for client updates to memory mapped regions.
>> +
>> +Region Capabilities
>> +"""""""""""""""""""
>> +Some regions have additional capabilities that cannot be described adequately
>> +by the region info data structure. These capabilities are returned in the
>> +region info reply in a list similar to PCI capabilities in a PCI device's
>> +configuration space. 
>> +
>> +Sparse Regions
>> +""""""""""""""
>> +A region can be memory-mappable in whole or in part. When only a subset of a
>> +region can be mapped by the client, a VFIO_REGION_INFO_CAP_SPARSE_MMAP
>> +capability is included in the region info reply. This capability describes
>> +which portions can be mapped by the client.
>> +
>> +For example, in a virtual NVMe controller, sparse regions can be used so that
>> +accesses to the NVMe registers (found in the beginning of BAR0) are trapped (an
>> +infrequent an event), while allowing direct access to the doorbells (an
>> +extremely frequent event as every I/O submission requires a write to BAR0),
>> +found right after the NVMe registers in BAR0.
>> +
>> +Interrupts
>> +^^^^^^^^^^
>> +The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the server for
>> +the device's interrupt types. The interrupt types are specific to the bus the
>> +device is attached to, and the client is expected to know the capabilities of
>> +each interrupt type. The server can signal an interrupt either with
>> +VFIO_USER_VM_INTERRUPT messages over the socket, or can directly inject
>> +interrupts into the guest via an event file descriptor. The client configures
>> +how the server signals an interrupt with VFIO_USER_SET_IRQS messages.
> 
> 
> Given that in-kernel VFIO uses eventfds exclusively for interrupts
> to the client, what's the use case of the VFIO_USER_VM_INTERRUPT message
> over the socket to generate an interrupt?  If a client wanted to know
> about it, wouldn't they have used VFIO_USER_SET_IRQS to configure an
> eventfd?
> 

	This is the fallback if the socket doesn’t support passing FDs.
We can take it out for this version, but, unlike multi-device, I think
it will be added soon when the client and server are in different VMs.


>> +
>> +Device Read and Write
>> +^^^^^^^^^^^^^^^^^^^^^
>> +When the guest executes load or store operations to device memory, the client
>> +forwards these operations to the server with VFIO_USER_REGION_READ or
>> +VFIO_USER_REGION_WRITE messages. The server will reply with data from the
>> +device on read operations or an acknowledgement on write operations.
>> +
>> +DMA
>> +^^^
>> +When a device performs DMA accesses to guest memory, the server will forward
>> +them to the client with VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages.
>> +These messages can only be used to access guest memory the client has
>> +configured into the server.
>> +
>> +Protocol Specification
>> +======================
>> +To distinguish from the base VFIO symbols, all VFIO-over-socket symbols are
>> +prefixed with vfio_user or VFIO_USER. In revision 0.1, all data is in the
>> +little-endian format, although this may be relaxed in future revision in cases
>> +where the client and server are both big-endian. The messages are formatted
>> +for seamless reuse of the native VFIO structs. A server can serve:
>> +
>> +1) multiple clients, and/or
>> +2) multiple virtual devices, belonging to one or more clients.
>> +
>> +Therefore each message requires a header that uniquely identifies the virtual
>> +device. It is a server-side implementation detail whether a single server
>> +handles multiple virtual devices from the same or multiple guests.
>> 
>> +Socket
>> +------
>> +A single UNIX domain socket is assumed to be used for each device. The location
>> +of the socket is implementation-specific. Multiplexing clients, devices, and
>> +servers over the same socket is not supported in this version of the protocol,
>> +but a device ID field exists in the message header so that a future support can
>> +be added without a major version change.
> 
> 
> If we were to make use of that, we really start to stray from the
> in-kernel model where a file descriptor is the representation of a
> device to the user.  It also seems to invite server exploits if a
> client can attempt to spoof devices used by other clients through their
> socket.  How do you imagine this being used?
> 

	There is no multiple device per socket implementation, so these
issues haven’t been fully hashed out, but the initial idea was only to
only support one socket per client and rely on OS MAC on the socket file
for authorization.  This is moot now, as we’ve dropped the device ID for
this revision.


> 
>> +Authentication
>> +--------------
>> +For AF_UNIX, we rely on OS mandatory access controls on the socket files,
>> +therefore it is up to the management layer to set up the socket as required.
>> +Socket types than span guests or hosts will require a proper authentication
> 
> 
> s/than/that/
> 

	OK

> 
>> +mechanism. Defining that mechanism is deferred to a future version of the
>> +protocol.
>> +
>> +Request Concurrency
>> +-------------------
>> +There can be multiple outstanding requests per virtual device, e.g. a
>> +frame buffer where the guest does multiple stores to the virtual device. The
>> +server can execute and reorder non-conflicting requests in parallel, depending
>> +on the device semantics.
>> +
>> +Socket Disconnection Behavior
>> +-----------------------------
>> +The server and the client can disconnect from each other, either intentionally
>> +or unexpectedly. Both the client and the server need to know how to handle such
>> +events.
>> +
>> +Server Disconnection
>> +^^^^^^^^^^^^^^^^^^^^
>> +A server disconnecting from the client may indicate that:
>> +
>> +1) A virtual device has been restarted, either intentionally (e.g. because of a
>> +device update) or unintentionally (e.g. because of a crash). In any case, the
>> +virtual device will come back so the client should not do anything (e.g. simply
>> +reconnect and retry failed operations).
>> +
>> +2) A virtual device has been shut down with no intention to be restarted.
>> +
>> +It is impossible for the client to know whether or not a failure is
>> +intermittent or innocuous and should be retried, therefore the client should
>> +attempt to reconnect to the socket. Since an intentional server restart (e.g.
>> +due to an upgrade) might take some time, a reasonable timeout should be used.
>> +In cases where the disconnection is expected (e.g. the guest shutting down), no
>> +new requests will be sent anyway so this situation doesn't pose a problem. The
>> +control stack will clean up accordingly.
> 
> 
> The theory sounds reasonable, the practice sounds much more difficult.
> For example if we're emulating a PCI device in the client, a blocked
> read or write could be problematic to the guest.
> 

	As I mention in my reply to Stefan, I’m skeptical we can do
better than a device reset on socket disconnect, but I want to give
Thanos a opportunity to reply.


> 
>> +Parametrizing this behaviour by having the virtual device advertise a
>> +reasonable reconnect is deferred to a future version of the protocol.
>> +
>> +Client Disconnection
>> +^^^^^^^^^^^^^^^^^^^^
>> +The client disconnecting from the server primarily means that the QEMU process
>> +has exited. Currently this means that the guest is shut down so the device is
>> +no longer needed therefore the server can automatically exit. However, there
>> +can be cases where a client disconnect should not result in a server exit:
>> +
>> +1) A single server serving multiple clients.
>> +2) A multi-process QEMU upgrading itself step by step, which isn't yet
>> +   implemented.
>> +
>> +Therefore in order for the protocol to be forward compatible the server should
>> +take no action when the client disconnects. If anything happens to the client
>> +process the control stack will know about it and can clean up resources
>> +accordingly.
> 
> 
> This seems to invite complication, if QEMU crashes, the device is gone
> and the server should release the resources.  If we're in this mythical
> second scenario, wouldn't it make sense to implement a suspend and
> resume sequence such that the sever will exit or free resources unless
> it's been told otherwise.  Maybe the suspend operation could also
> indicate a timeout such that the server would consider the client dead
> after some period and do cleanup.
> 

	Same as above


> 
>> +Request Retry and Response Timeout
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +QEMU's VFIO retries certain operations if they fail. While this makes sense for
>> +real HW, we don't know for sure whether it makes sense for virtual devices. A
>> +failed request is a request that has been successfully sent and has been
>> +responded to with an error code. Failure to send the request in the first place
>> +(e.g. because the socket is disconnected) is a different type of error examined
>> +earlier in the disconnect section.
>> +
>> +Defining a retry and timeout scheme if deferred to a future version of the
>> +protocol.
>> +
>> +Commands
>> +--------
>> +The following table lists the VFIO message command IDs, and whether the
>> +message request is sent from the client or the server.
>> +
>> ++----------------------------------+---------+-------------------+
>> +| Name                             | Command | Request Direction |
>> ++==================================+=========+===================+
>> +| VFIO_USER_VERSION                | 1       | server → client   |
>> ++----------------------------------+---------+-------------------+
>> +| VFIO_USER_DMA_MAP                | 2       | client → server   |
>> ++----------------------------------+---------+-------------------+
>> +| VFIO_USER_DMA_UNMAP              | 3       | client → server   |
>> ++----------------------------------+---------+-------------------+
>> +| VFIO_USER_DEVICE_GET_INFO        | 4       | client → server   |
>> ++----------------------------------+---------+-------------------+
>> +| VFIO_USER_DEVICE_GET_REGION_INFO | 5       | client → server   |
>> ++----------------------------------+---------+-------------------+
>> +| VFIO_USER_DEVICE_GET_IRQ_INFO    | 6       | client → server   |
>> ++----------------------------------+---------+-------------------+
>> +| VFIO_USER_DEVICE_SET_IRQS        | 7       | client → server   |
>> ++----------------------------------+---------+-------------------+
>> +| VFIO_USER_REGION_READ            | 8       | client → server   |
>> ++----------------------------------+---------+-------------------+
>> +| VFIO_USER_REGION_WRITE           | 9       | client → server   |
>> ++----------------------------------+---------+-------------------+
>> +| VFIO_USER_DMA_READ               | 10      | server → client   |
>> ++----------------------------------+---------+-------------------+
>> +| VFIO_USER_DMA_READ               | 11      | server → client   |
>> ++----------------------------------+---------+-------------------+
>> +| VFIO_USER_VM_INTERRUPT           | 12      | server → client   |
>> ++----------------------------------+---------+-------------------+
>> +| VFIO_DEVICE_RESET                | 13      | client → server   |
>> ++----------------------------------+---------+-------------------+
> 
> 
> The only one not to use _USER_?
> 

	That was a typo

> 
>> +
>> +Header
>> +------
>> +All messages are preceded by a 16 byte header that contains basic information
>> +about the message. The header is followed by message-specific data described
>> +in the sections below.
>> +
>> ++----------------+--------+-------------+
>> +| Name           | Offset | Size        |
>> ++================+========+=============+
>> +| Device ID      | 0      | 2           |
>> ++----------------+--------+-------------+
>> +| Message ID     | 2      | 2           |
>> ++----------------+--------+-------------+
>> +| Command        | 4      | 4           |
>> ++----------------+--------+-------------+
>> +| Message size   | 8      | 4           |
>> ++----------------+--------+-------------+
>> +| Flags          | 12     | 4           |
>> ++----------------+--------+-------------+
>> +|                | +-----+------------+ |
>> +|                | | Bit | Definition | |
>> +|                | +=====+============+ |
>> +|                | | 0   | Reply      | |
>> +|                | +-----+------------+ |
>> +|                | | 1   | No_reply   | |
>> +|                | +-----+------------+ | 
>> ++----------------+--------+-------------+
>> +| <message data> | 16     | variable    |
>> ++----------------+--------+-------------+
>> +
>> +* Device ID identifies the destination device of the message. This field is
>> +  reserved when the server only supports one device per socket.
>> +* Message ID identifies the message, and is used in the message acknowledgement.
> 
> 
> I don't find any protocol for managing this.  Is the client expected to
> increment per message and wrap around?
> 

	There is no explicit algorithm in the spec. Its only use is to
identify which request a reply is associated with, so the requestor only
needs to set it to a unique value among its outstanding requests.

	Incrementing it per request would work fine.


> 
>> +* Command specifies the command to be executed, listed in the Command Table.
>> +* Message size contains the size of the entire message, including the header.
>> +* Flags contains attributes of the message:
>> +
>> +  * The reply bit differentiates request messages from reply messages. A reply
>> +    message acknowledges a previous request with the same message ID.
>> +  * No_reply indicates that no reply is needed for this request. This is
>> +    commonly used when multiple requests are sent, and only the last needs
>> +    acknowledgement.
> 
> 
> I wish I could think of better terminology, I had to delete my question
> of why we need two bits for flags that at a casual read appear to be
> opposites ;)
> 

	Bits aren’t scarce here, so I could have a type field with explicit
REQUEST and REPLY encodings; then have a separate attributes field with NO_REPLY.
 


> 
>> +
>> +VFIO_USER_VERSION
>> +-----------------
>> +
>> +Message format
>> +^^^^^^^^^^^^^^
>> +
>> ++--------------+------------------------+
>> +| Name         | Value                  |
>> ++==============+========================+
>> +| Device ID    | 0                      |
>> ++--------------+------------------------+
>> +| Message ID   | <ID>                   |
>> ++--------------+------------------------+
>> +| Command      | 1                      |
>> ++--------------+------------------------+
>> +| Message size | 16 + version length    |
>> ++--------------+------------------------+
>> +| Flags        | Reply bit set in reply |
>> ++--------------+------------------------+
>> +| Version      | JSON byte array        |
>> ++--------------+------------------------+
>> +
>> +This is the initial message sent by the server after the socket connection is
>> +established. The version is in JSON format, and the following objects must be
>> +included:
>> +
>> ++--------------+--------+---------------------------------------------------+
>> +| Name         | Type   | Description                                       |
>> ++==============+========+===================================================+
>> +| version      | object | {“major�: <number>, “minor�: <number>}            |
>> +|              |        | Version supported by the sender, e.g. “0.1�.      |
>> ++--------------+--------+---------------------------------------------------+
>> +| type         | string | Fixed to “vfio-user�.                             |
>> ++--------------+--------+---------------------------------------------------+
>> +| capabilities | array  | Reserved. Can be omitted for v0.1, otherwise must |
>> +|              |        | be empty.                                         |
>> ++--------------+--------+---------------------------------------------------+
>> +
>> +Versioning and Feature Support
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +Upon accepting a connection, the server must send a VFIO_USER_VERSION message
>> +proposing a protocol version and a set of capabilities. The client compares
>> +these with the versions and capabilities it supports and sends a
>> +VFIO_USER_VERSION reply according to the following rules.
>> +
>> +* The major version in the reply must be the same as proposed. If the client
>> +  does not support the proposed major, it closes the connection.
>> +* The minor version in the reply must be equal to or less than the minor
>> +  version proposed.
>> +* The capability list must be a subset of those proposed. If the client
>> +  requires a capability the server did not include, it closes the connection.
>> +* If type is not “vfio-user�, the client closes the connection.
>> +
>> +The protocol major version will only change when incompatible protocol changes
>> +are made, such as changing the message format. The minor version may change
>> +when compatible changes are made, such as adding new messages or capabilities,
>> +Both the client and server must support all minor versions less than the
>> +maximum minor version it supports. E.g., an implementation that supports
>> +version 1.3 must also support 1.0 through 1.2.
>> +
>> +VFIO_USER_DMA_MAP
>> +-----------------
>> +
>> +VFIO_USER_DMA_UNMAP
>> +-------------------
>> +
>> +Message Format
>> +^^^^^^^^^^^^^^
>> +
>> ++--------------+------------------------+
>> +| Name         | Value                  |
>> ++==============+========================+
>> +| Device ID    | 0                      |
>> ++--------------+------------------------+
>> +| Message ID   | <ID>                   |
>> ++--------------+------------------------+
>> +| Command      | MAP=2, UNMAP=3         |
>> ++--------------+------------------------+
>> +| Message size | 16 + table size        |
>> ++--------------+------------------------+
>> +| Flags        | Reply bit set in reply |
>> ++--------------+------------------------+
>> +| Table        | array of table entries |
>> ++--------------+------------------------+
>> +
>> +This message is sent by the client to the server to inform it of the guest
>> +memory regions the device can access. It must be sent before the device can
>> +perform any DMA to the guest. It is normally sent directly after the version
>> +handshake is completed, but may also occur when memory is added or subtracted
>> +in the guest.
>> +
>> +The table is an array of the following structure. This structure is 32 bytes
>> +in size, so the message size will be 16 + (# of table entries * 32). If a
>> +region being added can be directly mapped by the server, an array of file
>> +descriptors will be sent as part of the message meta-data. Each region entry
>> +will have a corresponding file descriptor. On AF_UNIX sockets, the file
>> +descriptors will be passed as SCM_RIGHTS type ancillary data.
>> +
>> +Table entry format
>> +^^^^^^^^^^^^^^^^^^
>> +
>> ++-------------+--------+-------------+
>> +| Name        | Offset | Size        |
>> ++=============+========+=============+
>> +| Address     | 0      | 8           |
>> ++-------------+--------+-------------+
>> +| Size        | 8      | 8           |
>> ++-------------+--------+-------------+
>> +| Offset      | 16     | 8           |
>> ++-------------+--------+-------------+
>> +| Protections | 24     | 4           |
>> ++-------------+--------+-------------+
>> +| Flags       | 28     | 4           |
>> ++-------------+--------+-------------+
>> +|             | +-----+------------+ |
>> +|             | | Bit | Definition | |
>> +|             | +=====+============+ |
>> +|             | | 0   | Mappable   | |
>> +|             | +-----+------------+ |
>> ++-------------+--------+-------------+
>> +
>> +* Address is the base DMA address of the region.
>> +* Size is the size of the region.
>> +* Offset is the file offset of the region with respect to the associated file
>> +  descriptor.
>> +* Protections are the region's protection attributes as encoded in
>> +  ``<sys/mman.h>``.
>> +* Flags contain the following region attributes:
>> +
>> +  * Mappable indicate the region can be mapped via the mmap() system call using
>> +    the file descriptor provided in the message meta-data.
>> +
>> +VFIO_USER_DEVICE_GET_INFO
>> +-------------------------
>> +
>> +Message format
>> +^^^^^^^^^^^^^^
>> +
>> ++--------------+----------------------------+
>> +| Name         | Value                      |
>> ++==============+============================+
>> +| Device ID    | <ID>                       |
> 
> 
> The previous commands were specified with Device ID 0, which made me
> think that's the reserved value, but here we indicate it as a variable.
> Does this imply that VFIO_USER_DMA_MAP/UNMAP are per socket rather than
> per device?
> 

	I think MAP and UNMAP would have to be per-device, so the client
can support per-device IOMMU address spaces.


> 
>> ++--------------+----------------------------+
>> +| Message ID   | <ID>                       |
>> ++--------------+----------------------------+
>> +| Command      | 4                          |
>> ++--------------+----------------------------+
>> +| Message size | 16 in request, 32 in reply |
>> ++--------------+----------------------------+
>> +| Flags        | Reply bit set in reply     |
>> ++--------------+----------------------------+
>> +| Device info  | VFIO device info           |
>> ++--------------+----------------------------+
>> +
>> +This message is sent by the client to the server to query for basic information
>> +about the device. Only the message header is needed in the request message.
>> +The VFIO device info structure is defined in ``<sys/vfio.h>`` (``struct
> 
> 
> <linux/vfio.h>?
> 

	You’re right: vfio.h is installed in /usr/include/linux.


> 
>> +vfio_device_info``).
>> +
>> +VFIO device info format
>> +^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> ++-------------+--------+--------------------------+
>> +| Name        | Offset | Size                     |
>> ++=============+========+==========================+
>> +| argsz       | 16     | 4                        |
>> ++-------------+--------+--------------------------+
>> +| flags       | 20     | 4                        |
>> ++-------------+--------+--------------------------+
>> +|             | +-----+-------------------------+ |
>> +|             | | Bit | Definition              | |
>> +|             | +=====+=========================+ |
>> +|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
>> +|             | +-----+-------------------------+ |
>> +|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
>> +|             | +-----+-------------------------+ |
>> ++-------------+--------+--------------------------+
>> +| num_regions | 24     | 4                        |
>> ++-------------+--------+--------------------------+
>> +| num_irqs    | 28     | 4                        |
>> ++-------------+--------+--------------------------+
>> +
>> +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
>> +  implementation.
> 
> 
> This seems strange, why not use it throughout?  It seems this makes it
> more difficult for the client if it wants to unwrap the reply and send
> it through code common with in-kernel vfio usage.
> 

	I’d originally thought the size in the header is enough,
but the return argsz is used by vfio code, so the spec should be
changed.



> 
>> +* flags contains the following device attributes.
>> +
>> +  * VFIO_DEVICE_FLAGS_RESET indicates the device supports the
>> +    VFIO_USER_DEVICE_RESET message.
>> +  * VFIO_DEVICE_FLAGS_PCI indicates the device is a PCI device.
>> +
>> +* num_regions is the number of memory regions the device exposes.
>> +* num_irqs is the number of distinct interrupt types the device supports.
>> +
>> +This version of the protocol only supports PCI devices. Additional devices may
>> +be supported in future versions. 
>> +
>> +VFIO_USER_DEVICE_GET_REGION_INFO
>> +--------------------------------
>> +
>> +Message format
>> +^^^^^^^^^^^^^^
>> +
>> ++--------------+------------------+
>> +| Name         | Value            |
>> ++==============+==================+
>> +| Device ID    | <ID>             |
>> ++--------------+------------------+
>> +| Message ID   | <ID>             |
>> ++--------------+------------------+
>> +| Command      | 5                | 
>> ++--------------+------------------+
>> +| Message size | 48 + any caps    |
>> ++--------------+------------------+
>> +| Flags Reply  | bit set in reply |
>> ++--------------+------------------+
>> +| Region info  | VFIO region info |
>> ++--------------+------------------+
>> +
>> +This message is sent by the client to the server to query for information about
>> +device memory regions. The VFIO region info structure is defined in
>> +``<sys/vfio.h>`` (``struct vfio_region_info``).
>> +
>> +VFIO region info format
>> +^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> ++------------+--------+------------------------------+
>> +| Name       | Offset | Size                         |
>> ++============+========+==============================+
>> +| argsz      | 16     | 4                            |
>> ++------------+--------+------------------------------+
>> +| flags      | 20     | 4                            |
>> ++------------+--------+------------------------------+
>> +|            | +-----+-----------------------------+ |
>> +|            | | Bit | Definition                  | |
>> +|            | +=====+=============================+ |
>> +|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
>> +|            | +-----+-----------------------------+ |
>> +|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
>> +|            | +-----+-----------------------------+ |
>> +|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
>> +|            | +-----+-----------------------------+ |
>> +|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
>> +|            | +-----+-----------------------------+ |
>> ++------------+--------+------------------------------+
>> +| index      | 24     | 4                            |
>> ++------------+--------+------------------------------+
>> +| cap_offset | 28     | 4                            |
>> ++------------+--------+------------------------------+
>> +| size       | 32     | 8                            |
>> ++------------+--------+------------------------------+
>> +| offset     | 40     | 8                            |
>> ++------------+--------+------------------------------+
>> +
>> +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
>> +  implementation.
>> +* flags are attributes of the region:
>> +
>> +  * VFIO_REGION_INFO_FLAG_READ allows client read access to the region.
>> +  * VFIO_REGION_INFO_FLAG_WRITE allows client write access region.
>> +  * VFIO_REGION_INFO_FLAG_MMAP specifies the client can mmap() the region. When
>> +    this flag is set, the reply will include a file descriptor in its meta-data.
>> +    On AF_UNIX sockets, the file descriptors will be passed as SCM_RIGHTS type
>> +    ancillary data.
>> +  * VFIO_REGION_INFO_FLAG_CAPS indicates additional capabilities found in the
>> +    reply.
>> +
>> +* index is the index of memory region being queried, it is the only field that
>> +  is required to be set in the request message.
>> +* cap_offset describes where additional region capabilities can be found.
>> +  cap_offset is relative to the beginning of the VFIO region info structure.
>> +  The data structure it points is a VFIO cap header defined in ``<sys/vfio.h>``.
>> +* size is the size of the region.
>> +* offset is the offset given to the mmap() system call for regions with the
>> +  MMAP attribute. It is also used as the base offset when mapping a VFIO
>> +  sparse mmap area, described below.
>> +
>> +VFIO Region capabilities
>> +^^^^^^^^^^^^^^^^^^^^^^^^
>> +The VFIO region information can also include a capabilities list. This list is
>> +similar to a PCI capability list - each entry has a common header that
>> +identifies a capability and where the next capability in the list can be found.
>> +The VFIO capability header format is defined in ``<sys/vfio.h>`` (``struct
>> +vfio_info_cap_header``).
>> +
>> +VFIO cap header format
>> +^^^^^^^^^^^^^^^^^^^^^^
>> +
>> ++---------+--------+------+
>> +| Name    | Offset | Size |
>> ++=========+========+======+
>> +| id      | 0      | 2    |
>> ++---------+--------+------+
>> +| version | 2      | 2    |
>> ++---------+--------+------+
>> +| next    | 4      | 4    |
>> ++---------+--------+------+
>> +
>> +* id is the capability identity.
>> +* version is a capability-specific version number.
>> +* next specifies the offset of the next capability in the capability list. It
>> +  is relative to the beginning of the VFIO region info structure.
>> +
>> +VFIO sparse mmap
>> +^^^^^^^^^^^^^^^^
>> +
>> ++------------------+----------------------------------+
>> +| Name             | Value                            |
>> ++==================+==================================+
>> +| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
>> ++------------------+----------------------------------+
>> +| version          | 0x1                              |
>> ++------------------+----------------------------------+
>> +| next             | <next>                           |
>> ++------------------+----------------------------------+
>> +| sparse mmap info | VFIO region info sparse mmap     |
>> ++------------------+----------------------------------+
>> +
>> +The only capability supported in this version of the protocol is for sparse
>> +mmap. This capability is defined when only a subrange of the region supports
>> +direct access by the client via mmap(). The VFIO sparse mmap area is defined in
>> +``<sys/vfio.h>`` (``struct vfio_region_sparse_mmap_area``).
>> +
>> +VFIO region info cap sparse mmap
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> ++----------+--------+------+
>> +| Name     | Offset | Size |
>> ++==========+========+======+
>> +| nr_areas | 0      | 4    |
>> ++----------+--------+------+
>> +| reserved | 4      | 4    |
>> ++----------+--------+------+
>> +| offset   | 8      | 8    |
>> ++----------+--------+------+
>> +| size     | 16     | 9    |
>> ++----------+--------+------+
>> +| ...      |        |      |
>> ++----------+--------+------+
>> +
>> +* nr_areas is the number of sparse mmap areas in the region.
>> +* offset and size describe a single area that can be mapped by the client.
>> +  There will be nr_areas pairs of offset and size. The offset will be added to
>> +  the base offset given in the VFIO_USER_DEVICE_GET_REGION_INFO to form the
>> +  offset argument of the subsequent mmap() call.
>> +
>> +The VFIO sparse mmap area is defined in ``<sys/vfio.h>`` (``struct
>> +vfio_region_info_cap_sparse_mmap``).
>> +
>> +VFIO_USER_DEVICE_GET_IRQ_INFO
>> +-----------------------------
>> +
>> +Message format
>> +^^^^^^^^^^^^^^
>> +
>> ++--------------+------------------------+
>> +| Name         | Value                  |
>> ++==============+========================+
>> +| Device ID    | <ID>                   |
>> ++--------------+------------------------+
>> +| Message ID   | <ID>                   |
>> ++--------------+------------------------+
>> +| Command      | 6                      |
>> ++--------------+------------------------+
>> +| Message size | 32                     |
>> ++--------------+------------------------+
>> +| Flags        | Reply bit set in reply |
>> ++--------------+------------------------+
>> +| IRQ info     | VFIO IRQ info          |
>> ++--------------+------------------------+
>> +
>> +This message is sent by the client to the server to query for information about
>> +device interrupt types. The VFIO IRQ info structure is defined in
>> +``<sys/vfio.h>`` (``struct vfio_irq_info``).
>> +
>> +VFIO IRQ info format
>> +^^^^^^^^^^^^^^^^^^^^
>> +
>> ++-------+--------+---------------------------+
>> +| Name  | Offset | Size                      |
>> ++=======+========+===========================+
>> +| argsz | 16     | 4                         |
>> ++-------+--------+---------------------------+
>> +| flags | 20     | 4                         |
>> ++-------+--------+---------------------------+
>> +|       | +-----+--------------------------+ |
>> +|       | | Bit | Definition               | |
>> +|       | +=====+==========================+ |
>> +|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
>> +|       | +-----+--------------------------+ |
>> +|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
>> +|       | +-----+--------------------------+ |
>> +|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
>> +|       | +-----+--------------------------+ |
>> +|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
>> +|       | +-----+--------------------------+ |
>> ++-------+--------+---------------------------+
>> +| index | 24     | 4                         |
>> ++-------+--------+---------------------------+
>> +| count | 28     | 4                         |
>> ++-------+--------+---------------------------+
>> +
>> +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
>> +  implementation.
>> +* flags defines IRQ attributes:
>> +
>> +  * VFIO_IRQ_INFO_EVENTFD indicates the IRQ type can support server eventfd
>> +    signalling.
>> +  * VFIO_IRQ_INFO_MASKABLE indicates that the IRQ type supports the MASK and
>> +    UNMASK actions in a VFIO_USER_DEVICE_SET_IRQS message.
>> +  * VFIO_IRQ_INFO_AUTOMASKED indicates the IRQ type masks itself after being
>> +    triggered, and the client must send an UNMASK action to receive new
>> +    interrupts.
>> +  * VFIO_IRQ_INFO_NORESIZE indicates VFIO_USER_SET_IRQS operations setup
>> +    interrupts as a set, and new subindexes cannot be enabled without disabling
>> +    the entire type.
>> +
>> +* index is the index of IRQ type being queried, it is the only field that is
>> +  required to be set in the request message.
>> +* count describes the number of interrupts of the queried type.
>> +
>> +VFIO_USER_DEVICE_SET_IRQS
>> +-------------------------
>> +
>> +Message format
>> +^^^^^^^^^^^^^^
>> +
>> ++--------------+------------------------+
>> +| Name         | Value                  |
>> +| Device ID    | <ID>                   |
>> ++--------------+------------------------+
>> +| Message ID   | <ID>                   |
>> ++--------------+------------------------+
>> +| Command      | 7                      |
>> ++--------------+------------------------+
>> +| Message size | 36 + any data          |
>> ++--------------+------------------------+
>> +| Flags        | Reply bit set in reply |
>> ++--------------+------------------------+
>> +| IRQ set      | VFIO IRQ set           |
>> ++--------------+------------------------+
>> +
>> +This message is sent by the client to the server to set actions for device
>> +interrupt types. The VFIO IRQ set structure is defined in ``<sys/vfio.h>``
>> +(``struct vfio_irq_set``).
>> +
>> +VFIO IRQ info format
>> +^^^^^^^^^^^^^^^^^^^^
>> +
>> ++-------+--------+------------------------------+
>> +| Name  | Offset | Size                         |
>> ++=======+========+==============================+
>> +| argsz | 6      | 4                            |
>> ++-------+--------+------------------------------+
>> +| flags | 20     | 4                            |
>> ++-------+--------+------------------------------+
>> +|       | +-----+-----------------------------+ |
>> +|       | | Bit | Definition                  | |
>> +|       | +=====+=============================+ |
>> +|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
>> +|       | +-----+-----------------------------+ |
>> +|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
>> +|       | +-----+-----------------------------+ |
>> +|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
>> +|       | +-----+-----------------------------+ |
>> +|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
>> +|       | +-----+-----------------------------+ |
>> +|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
>> +|       | +-----+-----------------------------+ |
>> +|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
>> +|       | +-----+-----------------------------+ |
>> ++-------+--------+------------------------------+
>> +| index | 24     | 4                            |
>> ++-------+--------+------------------------------+
>> +| start | 28     | 4                            |
>> ++-------+--------+------------------------------+
>> +| count | 32     | 4                            |
>> ++-------+--------+------------------------------+
>> +| data  | 36     | variable                     |
>> ++-------+--------+------------------------------+
>> +
>> +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
>> +  implementation.
>> +* flags defines the action performed on the interrupt range. The DATA flags
>> +  describe the data field sent in the message; the ACTION flags describe the
>> +  action to be performed. The flags are mutually exclusive for both sets.
>> +
>> +  * VFIO_IRQ_SET_DATA_NONE indicates there is no data field in the request. The
>> +    action is performed unconditionally.
>> +  * VFIO_IRQ_SET_DATA_BOOL indicates the data field is an array of boolean
>> +    bytes. The action is performed if the corresponding boolean is true.
>> +  * VFIO_IRQ_SET_DATA_EVENTFD indicates an array of event file descriptors was
>> +    sent in the message meta-data. These descriptors will be signalled when the
>> +    action defined by the action flags occurs. In AF_UNIX sockets, the
>> +    descriptors are sent as SCM_RIGHTS type ancillary data.
>> +  * VFIO_IRQ_SET_ACTION_MASK indicates a masking event. It can be used with
>> +    VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to mask an interrupt, or
>> +    with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the guest masks
>> +    the interrupt. 
>> +  * VFIO_IRQ_SET_ACTION_UNMASK indicates an unmasking event. It can be used
>> +    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to unmask an
>> +    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the
>> +    guest unmasks the interrupt. 
>> +  * VFIO_IRQ_SET_ACTION_TRIGGER indicates a triggering event. It can be used
>> +    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to trigger an
>> +    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the
>> +    guest triggers the interrupt.
>> +
>> +* index is the index of IRQ type being setup.
>> +* start is the start of the subindex being set.
>> +* count describes the number of sub-indexes being set. As a special case, a
>> +  count of 0 with data flags of VFIO_IRQ_SET_DATA_NONE disables all interrupts
>> +  of the index data is an optional field included when the
>                 ^
>                 |
>                 +-  Missing period? 

	Yes

> 
>> +  VFIO_IRQ_SET_DATA_BOOL flag is present. It contains an array of booleans
>> +  that specify whether the action is to be performed on the corresponding
>> +  index. It's used when the action is only performed on a subset of the range
>> +  specified.
>> +
>> +Not all interrupt types support every combination of data and action flags.
>> +The client must know the capabilities of the device and IRQ index before it
>> +sends a VFIO_USER_DEVICE_SET_IRQ message.
>> +
>> +Read and Write Operations
>> +-------------------------
>> +
>> +Not all I/O operations between the client and server can be done via direct
>> +access of memory mapped with an mmap() call. In these cases, the client and
>> +server use messages sent over the socket. It is expected that these operations
>> +will have lower performance than direct access.
>> +
>> +The client can access device memory with VFIO_USER_REGION_READ and
>> +VFIO_USER_REGION_WRITE requests. These share a common data structure that
>> +appears after the 16 byte message header. 
>> +
>> +REGION Read/Write Data
>> +^^^^^^^^^^^^^^^^^^^^^^
>> +
>> ++--------+--------+----------+
>> +| Name   | Offset | Size     |
>> ++========+========+==========+
>> +| Offset | 16     | 8        |
>> ++--------+--------+----------+
>> +| Region | 24     | 4        |
>> ++--------+--------+----------+
>> +| Count  | 28     | 4        |
>> ++--------+--------+----------+
>> +| Data   | 32     | variable |
>> ++--------+--------+----------+
>> +
>> +* Offset into the region being accessed.
>> +* Region is the index of the region being accessed.
>> +* Count is the size of the data to be transferred.
>> +* Data is the data to be read or written.
>> +
>> +The server can access guest memory with VFIO_USER_DMA_READ and
>> +VFIO_USER_DMA_WRITE messages. These also share a common data structure that
>> +appears after the 16 byte message header.
>> +
>> +DMA Read/Write Data
>> +^^^^^^^^^^^^^^^^^^^
>> +
>> ++---------+--------+----------+
>> +| Name    | Offset | Size     |
>> ++=========+========+==========+
>> +| Address | 16     | 8        |
>> ++---------+--------+----------+
>> +| Count   | 24     | 4        |
>> ++---------+--------+----------+
>> +| Data    | 28     | variable |
>> ++---------+--------+----------+
>> +
>> +* Address is the area of guest memory being accessed. This address must have
>> +  been exported to the server with a VFIO_USER_DMA_MAP message.
>> +* Count is the size of the data to be transferred.
>> +* Data is the data to be read or written.
>> +
>> +Address and count can also be accessed as ``struct iovec`` from ``<sys/uio.h>``.
>> +
>> +VFIO_USER_REGION_READ
>> +---------------------
>> +
>> +Message format
>> +^^^^^^^^^^^^^^
>> +
>> ++--------------+------------------------+
>> +| Name         | Value                  |
>> ++==============+========================+
>> +| Device ID    | <ID>                   |
>> ++--------------+------------------------+
>> +| Message ID   | <ID>                   |
>> ++--------------+------------------------+
>> +| Command      | 8                      |
>> ++--------------+------------------------+
>> +| Message size | 32 + data size         |
>> ++--------------+------------------------+
>> +| Flags Reply  | bit set in reply       |
>> ++--------------+------------------------+
>> +| Read info    | REGION read/write data |
>> ++--------------+------------------------+
>> +
>> +This request is sent from the client to the server to read from device memory.
>> +In the request messages, there will be no data, and the count field will be the
>> +amount of data to be read. The reply will include the data read, and its count
>> +field will be the amount of data read.
>> +
>> +VFIO_USER_REGION_WRITE
>> +----------------------
>> +
>> +Message format
>> +^^^^^^^^^^^^^^
>> +
>> ++--------------+------------------------+
>> +| Name         | Value                  |
>> ++==============+========================+
>> +| Device ID    | <ID>                   |
>> ++--------------+------------------------+
>> +| Message ID   | <ID>                   |
>> ++--------------+------------------------+
>> +| Command      | 9                      |
>> ++--------------+------------------------+
>> +| Message size | 32 + data size         |
>> ++--------------+------------------------+
>> +| Flags        | Reply bit set in reply |
>> ++--------------+------------------------+
>> +| Write info   | REGION read write data |
>> ++--------------+------------------------+
>> +
>> +This request is sent from the client to the server to write to device memory.
>> +The request message will contain the data to be written, and its count field
>> +will contain the amount of write data. The count field in the reply will be
>> +zero.
>> +
>> +VFIO_USER_DMA_READ
>> +------------------
>> +
>> +Message format
>> +^^^^^^^^^^^^^^
>> +
>> ++--------------+---------------------+
>> +| Name         | Value               |
>> ++==============+=====================+
>> +| Device ID    | <ID>                |
>> ++--------------+---------------------+
>> +| Message ID   | <ID>                |
>> ++--------------+---------------------+
>> +| Command      | 10                  |
>> ++--------------+---------------------+
>> +| Message size | 28 + data size      |
>> ++--------------+---------------------+
>> +| Flags Reply  | bit set in reply    |
>> ++--------------+---------------------+
>> +| DMA info     | DMA read/write data |
>> ++--------------+---------------------+
>> +
>> +This request is sent from the server to the client to read from guest memory.
>> +In the request messages, there will be no data, and the count field will be the
>> +amount of data to be read. The reply will include the data read, and its count
>> +field will be the amount of data read.
>> +
>> +VFIO_USER_DMA_WRITE
>> +-------------------
>> +
>> +Message format
>> +^^^^^^^^^^^^^^
>> +
>> ++--------------+------------------------+
>> +| Name         | Value                  |
>> ++==============+========================+
>> +| Device ID    | <ID>                   |
>> ++--------------+------------------------+
>> +| Message ID   | <ID>                   |
>> ++--------------+------------------------+
>> +| Command      | 11                     |
>> ++--------------+------------------------+
>> +| Message size | 28 + data size         |
>> ++--------------+------------------------+
>> +| Flags        | Reply bit set in reply |
>> ++--------------+------------------------+
>> +| DMA info     | DMA read/write data    |
>> ++--------------+------------------------+
>> +
>> +This request is sent from the server to the client to write to guest memory.
>> +The request message will contain the data to be written, and its count field
>> +will contain the amount of write data. The count field in the reply will be
>> +zero.
>> +
>> +VFIO_USER_VM_INTERRUPT
>> +----------------------
>> +
>> +Message format
>> +^^^^^^^^^^^^^^
>> +
>> ++----------------+------------------------+
>> +| Name           | Value                  |
>> ++================+========================+
>> +| Device ID      | <ID>                   |
>> ++----------------+------------------------+
>> +| Message ID     | <ID>                   |
>> ++----------------+------------------------+
>> +| Command        | 12                     |
>> ++----------------+------------------------+
>> +| Message size   | 24                     |
>> ++----------------+------------------------+
>> +| Flags          | Reply bit set in reply |
>> ++----------------+------------------------+
>> +| Interrupt info | <interrupt>            |
>> ++----------------+------------------------+
>> +
>> +This request is sent from the server to the client to signal the device has
>> +raised an interrupt.
>> +
>> +Interrupt info format
>> +^^^^^^^^^^^^^^^^^^^^^
>> +
>> ++----------+--------+------+
>> +| Name     | Offset | Size |
>> ++==========+========+======+
>> +| Index    | 16     | 4    |
>> ++----------+--------+------+
>> +| Subindex | 20     | 4    |
>> ++----------+--------+------+
>> +
>> +* Index is the interrupt index; it is the same value used in VFIO_USER_SET_IRQS.
>> +* Subindex is relative to the index, e.g., the vector number used in PCI MSI/X
>> +  type interrupts.
> 
> 
> I'm still not sure why we need this, but why is a reply required?
> Wouldn't an edge interrupt not require a reply?  For a level interrupt,
> does the reply imply an EOI?  But really, why are we sending interrupts
> through anything other than eventfds?  SET_IRQS is used to configure
> the active interrupt on the device, it allows the client to provide
> eventfds, so when would we ever need to send an unsolicited,
> unconfigured interrupt?
> 

	This is a request from the server to the client.  If FDs can’t
be passed via the socket, the client will handle this request similarly
to how it handles non-KVM eventfds - asserting the interrupt to the VM
(e.g., a VFIO client using INTx would call vfio_intx_interrupt()) and
sending a VFIO_USER_SET_IRQS unmask request on EOI for level interrupts,
as currently is done is vfio_intx_eoi()


> 
>> +
>> +VFIO_USER_DEVICE_RESET
>> +----------------------
> 
> 
> Aha, just a typo above to missing _USER_
> 
> Thanks,
> Alex
> 
>> +
>> +Message format
>> +^^^^^^^^^^^^^^
>> +
>> ++--------------+------------------------+
>> +| Name         | Value                  |
>> ++==============+========================+
>> +| Device ID    | <ID>                   |
>> ++--------------+------------------------+
>> +| Message ID   | <ID>                   |
>> ++--------------+------------------------+
>> +| Command      | 13                     |
>> ++--------------+------------------------+
>> +| Message size | 16                     |
>> ++--------------+------------------------+
>> +| Flags        | Reply bit set in reply |
>> ++--------------+------------------------+
>> +
>> +This request is sent from the client to the server to reset the device.
>> +
>> +Appendices
>> +==========
>> +
>> +Unused VFIO ioctl() commands
>> +----------------------------
>> +
>> +The following commands must be handled by the client and not sent to the server:
>> +
>> +* VFIO_GET_API_VERSION
>> +* VFIO_CHECK_EXTENSION
>> +* VFIO_SET_IOMMU
>> +* VFIO_GROUP_GET_STATUS
>> +* VFIO_GROUP_SET_CONTAINER
>> +* VFIO_GROUP_UNSET_CONTAINER
>> +* VFIO_GROUP_GET_DEVICE_FD
>> +* VFIO_IOMMU_GET_INFO
>> +
>> +However, once support for live migration for VFIO devices is finalized some
>> +of the above commands might have to be handled by the client. This will be
>> +addressed in a future protocol version.
>> +
>> +Live Migration
>> +--------------
>> +Currently live migration is not supported for devices passed through via VFIO,
>> +therefore it is not supported for VFIO-over-socket, either. This is being
>> +actively worked on in the "Add migration support for VFIO devices" (v25) patch
>> +series.
>> +
>> +VFIO groups and containers
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +The current VFIO implementation includes group and container idioms that
>> +describe how a device relates to the host IOMMU. In the VFIO over socket
>> +implementation, the IOMMU is implemented in SW by the client, and isn't visible
>> +to the server. The simplest idea is for the client is to put each device into
>> +its own group and container.