This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory request TLPs to the hardware
design in order to perform DMA.

Note that there is some more work required on top of this series to get
message-based DMA to really work well:

* libvfio-user has a long-standing issue where socket communication gets
  messed up when messages are sent from both ends at the same time. See
  https://github.com/nutanix/libvfio-user/issues/279 for more details. I've
  been engaging there and a fix is in review.

* qemu currently breaks down DMA accesses into chunks of size 8 bytes at
  maximum, each of which will be handled in a separate vfio-user DMA request
  message. This is quite terrible for large DMA accesses, such as when nvme
  reads and writes page-sized blocks for example. Thus, I would like to
  improve qemu to be able to perform larger accesses, at least for indirect
  memory regions. I have something working locally, but since this will likely
  result in more involved surgery and discussion, I am leaving this to be
  addressed in a separate patch.

Changes from v1:

* Address Stefan's review comments. In particular, enforce an allocation limit
  and don't drop the map client callbacks given that map requests can fail
  when hitting size limits.
* libvfio-user version bump now included in the series.
* Tested as well on big-endian s390x. This uncovered another byte order issue
  in vfio-user server code that I've included a fix for.

Mattias Nissler (4):
  softmmu: Support concurrent bounce buffers
  Update subprojects/libvfio-user
  vfio-user: Message-based DMA support
  vfio-user: Fix config space access byte order

 hw/remote/trace-events        |  2 +
 hw/remote/vfio-user-obj.c     | 88 +++++++++++++++++++++++++++++++----
 include/sysemu/sysemu.h       |  2 +
 qemu-options.hx               | 27 +++++++++++
 softmmu/globals.c             |  1 +
 softmmu/physmem.c             | 84 ++++++++++++++++++---------------
 softmmu/vl.c                  |  6 +++
 subprojects/libvfio-user.wrap |  2 +-
 8 files changed, 165 insertions(+), 47 deletions(-)

-- 
2.34.1
When DMA memory can't be directly accessed, as is the case when running the
device model in a separate process without shareable DMA file descriptors,
bounce buffering is used.

It is not uncommon for device models to request mapping of several DMA regions
at the same time. Examples include:
 * net devices, e.g. when transmitting a packet that is split across several
   TX descriptors (observed with igb)
 * USB host controllers, when handling a packet with multiple data TRBs
   (observed with xhci)

Previously, qemu only provided a single bounce buffer and would fail DMA map
requests while the buffer was already in use. In turn, this would cause DMA
failures that ultimately manifest as hardware errors from the guest
perspective.

This change allocates DMA bounce buffers dynamically instead of supporting
only a single buffer. Thus, multiple DMA mappings work correctly also when RAM
can't be mmap()-ed.

The total bounce buffer allocation size is limited by a new command line
parameter. The default is 4096 bytes to match the previous maximum buffer
size. It is expected that suitable limits will vary quite a bit in practice
depending on device models and workloads.

Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
 include/sysemu/sysemu.h |  2 +
 qemu-options.hx         | 27 +++++++++++
 softmmu/globals.c       |  1 +
 softmmu/physmem.c       | 84 +++++++++++++++++++++++------------------
 softmmu/vl.c            |  6 +++
 5 files changed, 83 insertions(+), 37 deletions(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index XXXXXXX..XXXXXXX 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -XXX,XX +XXX,XX @@ extern int nb_option_roms;
 extern const char *prom_envs[MAX_PROM_ENVS];
 extern unsigned int nb_prom_envs;
 
+extern uint64_t max_bounce_buffer_size;
+
 /* serial ports */
 
 /* Return the Chardev for serial port i, or NULL if none */
diff --git a/qemu-options.hx b/qemu-options.hx
index XXXXXXX..XXXXXXX 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -XXX,XX +XXX,XX @@ SRST
 ERST
 #endif
 
+DEF("max-bounce-buffer-size", HAS_ARG,
+    QEMU_OPTION_max_bounce_buffer_size,
+    "-max-bounce-buffer-size size\n"
+    "                DMA bounce buffer size limit in bytes (default=4096)\n",
+    QEMU_ARCH_ALL)
+SRST
+``-max-bounce-buffer-size size``
+    Set the limit in bytes for DMA bounce buffer allocations.
+
+    DMA bounce buffers are used when device models request memory-mapped
+    access to memory regions that can't be directly mapped by the qemu
+    process, so the memory must be read or written to a temporary local buffer
+    for the device model to work with. This is the case e.g. for I/O memory
+    regions, and when running in multi-process mode without shared access to
+    memory.
+
+    Whether bounce buffering is necessary depends heavily on the device model
+    implementation. Some devices use explicit DMA read and write operations
+    which do not require bounce buffers. Some devices, notably storage, will
+    retry a failed DMA map request after bounce buffer space becomes available
+    again. Most other devices will bail when encountering map request
+    failures, which will typically appear to the guest as a hardware error.
+
+    Suitable bounce buffer size values depend on the workload and guest
+    configuration. A few kilobytes up to a few megabytes are common sizes
+    encountered in practice.
+ERST
+
 DEFHEADING()
 
 DEFHEADING(Generic object creation:)
diff --git a/softmmu/globals.c b/softmmu/globals.c
index XXXXXXX..XXXXXXX 100644
--- a/softmmu/globals.c
+++ b/softmmu/globals.c
@@ -XXX,XX +XXX,XX @@ const char *prom_envs[MAX_PROM_ENVS];
 uint8_t *boot_splash_filedata;
 int only_migratable; /* turn it off unless user states otherwise */
 int icount_align_option;
+uint64_t max_bounce_buffer_size = 4096;
 
 /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
  * little-endian "wire format" described in the SMBIOS 2.6 specification.
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index XXXXXXX..XXXXXXX 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -XXX,XX +XXX,XX @@
 #include "sysemu/dma.h"
 #include "sysemu/hostmem.h"
 #include "sysemu/hw_accel.h"
+#include "sysemu/sysemu.h"
 #include "sysemu/xen-mapcache.h"
 #include "trace/trace-root.h"
@@ -XXX,XX +XXX,XX @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
 
 typedef struct {
     MemoryRegion *mr;
-    void *buffer;
     hwaddr addr;
-    hwaddr len;
-    bool in_use;
+    size_t len;
+    uint8_t buffer[];
 } BounceBuffer;
 
-static BounceBuffer bounce;
+static size_t bounce_buffer_size;
 
 typedef struct MapClient {
     QEMUBH *bh;
@@ -XXX,XX +XXX,XX @@ void cpu_register_map_client(QEMUBH *bh)
     qemu_mutex_lock(&map_client_list_lock);
     client->bh = bh;
     QLIST_INSERT_HEAD(&map_client_list, client, link);
-    /* Write map_client_list before reading in_use.  */
+    /* Write map_client_list before reading bounce_buffer_size. */
     smp_mb();
-    if (!qatomic_read(&bounce.in_use)) {
+    if (qatomic_read(&bounce_buffer_size) < max_bounce_buffer_size) {
         cpu_notify_map_clients_locked();
     }
     qemu_mutex_unlock(&map_client_list_lock);
@@ -XXX,XX +XXX,XX @@ void *address_space_map(AddressSpace *as,
     RCU_READ_LOCK_GUARD();
     fv = address_space_to_flatview(as);
     mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
+    memory_region_ref(mr);
 
     if (!memory_access_is_direct(mr, is_write)) {
-        if (qatomic_xchg(&bounce.in_use, true)) {
+        size_t size = qatomic_add_fetch(&bounce_buffer_size, l);
+        if (size > max_bounce_buffer_size) {
+            size_t excess = size - max_bounce_buffer_size;
+            l -= excess;
+            qatomic_sub(&bounce_buffer_size, excess);
+        }
+
+        if (l == 0) {
             *plen = 0;
             return NULL;
         }
-        /* Avoid unbounded allocations */
-        l = MIN(l, TARGET_PAGE_SIZE);
-        bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
-        bounce.addr = addr;
-        bounce.len = l;
-        memory_region_ref(mr);
-        bounce.mr = mr;
+
+        BounceBuffer *bounce = g_malloc(l + sizeof(BounceBuffer));
+        bounce->mr = mr;
+        bounce->addr = addr;
+        bounce->len = l;
+
         if (!is_write) {
             flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
-                          bounce.buffer, l);
+                          bounce->buffer, l);
         }
 
         *plen = l;
-        return bounce.buffer;
+        return bounce->buffer;
     }
 
-
-    memory_region_ref(mr);
     *plen = flatview_extend_translation(fv, addr, len, mr, xlat,
                                         l, is_write, attrs);
     fuzz_dma_read_cb(addr, *plen, mr);
@@ -XXX,XX +XXX,XX @@ void *address_space_map(AddressSpace *as,
 void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
                          bool is_write, hwaddr access_len)
 {
-    if (buffer != bounce.buffer) {
-        MemoryRegion *mr;
-        ram_addr_t addr1;
+    MemoryRegion *mr;
+    ram_addr_t addr1;
+
+    mr = memory_region_from_host(buffer, &addr1);
+    if (mr == NULL) {
+        /*
+         * Must be a bounce buffer (unless the caller passed a pointer which
+         * wasn't returned by address_space_map, which is illegal).
+         */
+        BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
 
-        mr = memory_region_from_host(buffer, &addr1);
-        assert(mr != NULL);
         if (is_write) {
-            invalidate_and_set_dirty(mr, addr1, access_len);
-        }
-        if (xen_enabled()) {
-            xen_invalidate_map_cache_entry(buffer);
+            address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
+                                bounce->buffer, access_len);
         }
-        memory_region_unref(mr);
+
+        memory_region_unref(bounce->mr);
+        qatomic_sub(&bounce_buffer_size, bounce->len);
+        /* Write bounce_buffer_size before reading map_client_list. */
+        smp_mb();
+        cpu_notify_map_clients();
+        g_free(bounce);
         return;
     }
+
+    if (xen_enabled()) {
+        xen_invalidate_map_cache_entry(buffer);
+    }
     if (is_write) {
-        address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED,
-                            bounce.buffer, access_len);
-    }
-    qemu_vfree(bounce.buffer);
-    bounce.buffer = NULL;
-    memory_region_unref(bounce.mr);
-    /* Clear in_use before reading map_client_list.  */
-    qatomic_set_mb(&bounce.in_use, false);
-    cpu_notify_map_clients();
+        invalidate_and_set_dirty(mr, addr1, access_len);
+    }
 }
 
 void *cpu_physical_memory_map(hwaddr addr,
diff --git a/softmmu/vl.c b/softmmu/vl.c
index XXXXXXX..XXXXXXX 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -XXX,XX +XXX,XX @@ void qemu_init(int argc, char **argv)
                 exit(1);
 #endif
                 break;
+            case QEMU_OPTION_max_bounce_buffer_size:
+                if (qemu_strtosz(optarg, NULL, &max_bounce_buffer_size) < 0) {
+                    error_report("invalid -max-bounce-buffer-size value");
+                    exit(1);
+                }
+                break;
             case QEMU_OPTION_object:
                 object_option_parse(optarg);
                 break;
-- 
2.34.1
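For readers who want to see the accounting idea of this patch in isolation:
the stand-alone sketch below uses hypothetical names (bounce_alloc,
bounce_free, in_flight, limit) and plain C11 atomics rather than qemu's
qatomic helpers; it is only meant to illustrate the reserve/shrink/release
shape of the qatomic_add_fetch/qatomic_sub logic in address_space_map and
address_space_unmap above, not to reproduce the qemu code.

#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for max_bounce_buffer_size and the global counter. */
static const size_t limit = 4096;
static _Atomic size_t in_flight;

/* Reserve up to *len bytes, shrinking the request if it would exceed the
 * limit; returns NULL when nothing could be reserved (a caller would then
 * retry later, as qemu does via the map-client notification). */
static void *bounce_alloc(size_t *len)
{
    size_t after = atomic_fetch_add(&in_flight, *len) + *len;
    if (after > limit) {
        size_t excess = after - limit;
        excess = excess > *len ? *len : excess;
        atomic_fetch_sub(&in_flight, excess);
        *len -= excess;
    }
    return *len ? malloc(*len) : NULL;
}

static void bounce_free(void *buf, size_t len)
{
    free(buf);
    atomic_fetch_sub(&in_flight, len);
    /* This is where qemu calls cpu_notify_map_clients(). */
}

int main(void)
{
    size_t a = 4096, b = 512;
    void *p = bounce_alloc(&a);             /* granted in full           */
    void *q = bounce_alloc(&b);             /* limit reached -> shrunk to 0 */
    printf("first=%zu second=%zu\n", a, b); /* prints: first=4096 second=0  */
    bounce_free(p, a);
    (void)q;
    return 0;
}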
Brings in assorted bug fixes. In particular, "Fix address calculation for
message-based DMA" corrects a bug in DMA address calculation which is
necessary to get DMA across VFIO-user messages working.

Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
 subprojects/libvfio-user.wrap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap
index XXXXXXX..XXXXXXX 100644
--- a/subprojects/libvfio-user.wrap
+++ b/subprojects/libvfio-user.wrap
@@ -XXX,XX +XXX,XX @@
 [wrap-git]
 url = https://gitlab.com/qemu-project/libvfio-user.git
-revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7
+revision = cfb7d908dca025bdea6709801c5790863e902ef8
 depth = 1
-- 
2.34.1
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed via
the VFIO-user protocol. This installs an indirect memory region, which already
works for pci_dma_{read,write}, and pci_dma_map works thanks to the existing
DMA bounce buffering support.

Note that while simple scenarios work with this patch, there's a known race
condition in libvfio-user that will mess up the communication channel. See
https://github.com/nutanix/libvfio-user/issues/279 for details as well as a
proposed fix.

Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
 hw/remote/trace-events    |  2 +
 hw/remote/vfio-user-obj.c | 84 +++++++++++++++++++++++++++++++++++----
 2 files changed, 79 insertions(+), 7 deletions(-)

diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -XXX,XX +XXX,XX @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
+vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -XXX,XX +XXX,XX @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
     return count;
 }
 
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+                                unsigned size, MemTxAttrs attrs)
+{
+    MemoryRegion *region = opaque;
+    VfuObject *o = VFU_OBJECT(region->owner);
+    uint8_t buf[sizeof(uint64_t)];
+
+    trace_vfu_dma_read(region->addr + addr, size);
+
+    dma_sg_t *sg = alloca(dma_sg_size());
+    vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+    if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+        vfu_sgl_read(o->vfu_ctx, sg, 1, buf) != 0) {
+        return MEMTX_ERROR;
+    }
+
+    *val = ldn_he_p(buf, size);
+
+    return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+                                 unsigned size, MemTxAttrs attrs)
+{
+    MemoryRegion *region = opaque;
+    VfuObject *o = VFU_OBJECT(region->owner);
+    uint8_t buf[sizeof(uint64_t)];
+
+    trace_vfu_dma_write(region->addr + addr, size);
+
+    stn_he_p(buf, size, val);
+
+    dma_sg_t *sg = alloca(dma_sg_size());
+    vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+    if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+        vfu_sgl_write(o->vfu_ctx, sg, 1, buf) != 0) {
+        return MEMTX_ERROR;
+    }
+
+    return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+    .read_with_attrs = vfu_dma_read,
+    .write_with_attrs = vfu_dma_write,
+    .endianness = DEVICE_HOST_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+        .unaligned = true,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+};
+
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
     VfuObject *o = vfu_get_private(vfu_ctx);
@@ -XXX,XX +XXX,XX @@ static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
     g_autofree char *name = NULL;
     struct iovec *iov = &info->iova;
 
-    if (!info->vaddr) {
-        return;
-    }
-
     name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
-                           (uint64_t)info->vaddr);
+                           (uint64_t)iov->iov_base);
 
     subregion = g_new0(MemoryRegion, 1);
 
-    memory_region_init_ram_ptr(subregion, NULL, name,
-                               iov->iov_len, info->vaddr);
+    if (info->vaddr) {
+        memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+                                   iov->iov_len, info->vaddr);
+    } else {
+        /*
+         * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
+         * bytes at a time, and larger accesses are broken down. However,
+         * many/most DMA accesses are larger than 8 bytes and VFIO-user can
+         * handle large DMA accesses just fine, thus this size restriction
+         * unnecessarily hurts performance, in particular given that each
+         * access causes a round trip on the VFIO-user socket.
+         *
+         * TODO: Investigate how to plumb larger accesses through memory
+         * regions, possibly by amending MemoryRegionOps or by creating a new
+         * memory region type.
+         */
+        memory_region_init_io(subregion, OBJECT(o), &vfu_dma_ops, subregion,
+                              name, iov->iov_len);
+    }
 
     dma_as = pci_device_iommu_address_space(o->pci_dev);
 
-- 
2.34.1
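To put a number on the performance concern called out in the code comment
above: with MemoryRegionOps-based accesses capped at 8 bytes, every larger DMA
transfer is split into many vfio-user messages, each costing a socket round
trip. The stand-alone sketch below only counts the resulting messages for a
page-sized transfer; the helper name dma_messages is hypothetical and the loop
is a simplification, not how qemu's memory core actually iterates.

#include <stdint.h>
#include <stdio.h>

/* Count how many vfio-user DMA messages an indirect-region access of `len`
 * bytes at `addr` turns into when each access is capped at `max_access`
 * bytes (8 in the patch above). Illustrative only. */
static unsigned dma_messages(uint64_t addr, uint64_t len, unsigned max_access)
{
    unsigned msgs = 0;

    while (len > 0) {
        /* Largest power-of-two chunk that fits the cap, the remaining
         * length, and the current alignment. */
        unsigned chunk = max_access;
        while (chunk > len || (addr & (chunk - 1))) {
            chunk >>= 1;
        }
        addr += chunk;
        len -= chunk;
        msgs++;
    }
    return msgs;
}

int main(void)
{
    /* A page-sized nvme block transfer becomes 512 round trips. */
    printf("4096-byte DMA -> %u messages\n", dma_messages(0x1000, 4096, 8));
    return 0;
}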
PCI config space is little-endian, so on a big-endian host we need to perform
byte swaps for values as they are passed to and received from the generic PCI
config space access machinery.

Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
 hw/remote/vfio-user-obj.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -XXX,XX +XXX,XX @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
     while (bytes > 0) {
         len = (bytes > pci_access_width) ? pci_access_width : bytes;
         if (is_write) {
-            memcpy(&val, ptr, len);
+            val = ldn_le_p(ptr, len);
             pci_host_config_write_common(o->pci_dev, offset,
                                          pci_config_size(o->pci_dev),
                                          val, len);
@@ -XXX,XX +XXX,XX @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
         } else {
             val = pci_host_config_read_common(o->pci_dev, offset,
                                               pci_config_size(o->pci_dev),
                                               len);
-            memcpy(ptr, &val, len);
+            stn_le_p(ptr, len, val);
             trace_vfu_cfg_read(offset, val);
         }
         offset += len;
-- 
2.34.1
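As background for the fix above, the sketch below contrasts copying a
native-endian integer with memcpy against an explicit little-endian
load/store. The helpers load_le and store_le are simplified illustrations of
what qemu's ldn_le_p()/stn_le_p() accomplish, not their actual
implementations, and the vendor ID is just an example value.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified, size-dispatched little-endian load -- conceptually what
 * ldn_le_p() does. */
static uint64_t load_le(const uint8_t *p, unsigned len)
{
    uint64_t val = 0;
    for (unsigned i = 0; i < len; i++) {
        val |= (uint64_t)p[i] << (8 * i);
    }
    return val;
}

/* Simplified little-endian store, analogous to stn_le_p(). */
static void store_le(uint8_t *p, unsigned len, uint64_t val)
{
    for (unsigned i = 0; i < len; i++) {
        p[i] = val >> (8 * i);
    }
}

int main(void)
{
    /* Vendor ID 0x1b36 as it sits in the little-endian config space. */
    const uint8_t wire[2] = { 0x36, 0x1b };
    uint16_t native;
    uint8_t out[2];

    memcpy(&native, wire, sizeof(native));
    /* On a little-endian host both prints show 0x1b36; on big-endian s390x
     * the memcpy variant yields 0x361b, which is the bug the patch fixes. */
    printf("memcpy:  0x%04x\n", native);
    printf("load_le: 0x%04" PRIx64 "\n", load_le(wire, 2));

    store_le(out, 2, 0x1b36);           /* writes 0x36, 0x1b on any host */
    printf("store_le: %02x %02x\n", out[0], out[1]);
    return 0;
}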
This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory request TLPs to the hardware
design in order to perform DMA.

Note that more work is needed to make message-based DMA work well: qemu
currently breaks down DMA accesses into chunks of size 8 bytes at maximum,
each of which will be handled in a separate vfio-user DMA request message.
This is quite terrible for large DMA accesses, such as when nvme reads and
writes page-sized blocks for example. Thus, I would like to improve qemu to be
able to perform larger accesses, at least for indirect memory regions. I have
something working locally, but since this will likely result in more involved
surgery and discussion, I am leaving this to be addressed in a separate patch.

Changes from v1:

* Address Stefan's review comments. In particular, enforce an allocation limit
  and don't drop the map client callbacks given that map requests can fail
  when hitting size limits.
* libvfio-user version bump now included in the series.
* Tested as well on big-endian s390x. This uncovered another byte order issue
  in vfio-user server code that I've included a fix for.

Changes from v2:

* Add a preparatory patch to make bounce buffering an AddressSpace-specific
  concept.
* The total buffer size limit parameter is now per AddressSpace and can be
  configured for PCIDevice via a property.
* Store a magic value in first bytes of bounce buffer struct as a best effort
  measure to detect invalid pointers in address_space_unmap.

Changes from v3:

* libvfio-user now supports twin-socket mode which uses separate sockets for
  client->server and server->client commands, respectively. This addresses the
  concurrent command bug triggered by server->client DMA access commands. See
  https://github.com/nutanix/libvfio-user/issues/279 for details.
* Add missing teardown code in do_address_space_destroy.
* Fix bounce buffer size bookkeeping race condition.
* Generate unmap notification callbacks unconditionally.
* Some cosmetic fixes.

Changes from v4:

* Fix accidentally dropped memory_region_unref, control flow restored to match
  previous code to simplify review.
* Some cosmetic fixes.

Changes from v5:

* Unregister indirect memory region in libvfio-user dma_unregister callback.

Changes from v6:

* Rebase, resolve straightforward merge conflict in system/dma-helpers.c

Changes from v7:

* Rebase (applied cleanly)
* Restore various Reviewed-by and Tested-by tags that I failed to carry
  forward (I double-checked that the patches haven't changed since the
  reviewed version)

Changes from v8:

* Rebase (clean)
* Change bounce buffer size accounting to use uint32_t so it works also on
  hosts that don't support uint64_t atomics, such as mipsel. As a consequence
  overflows are a real concern now, so switch to a cmpxchg loop for allocating
  bounce buffer space (an illustrative sketch of such a loop follows this
  cover letter).

Changes from v9:

* Incorporate patch split and QEMU_MUTEX_GUARD change by philmd@linaro.org
* Use size_t instead of uint32_t for bounce buffer size accounting. The qdev
  property remains uint32_t though, so it has a consistent size regardless of
  host.

Changes from v10:

* Update the commit to uprev the libvfio-user subproject to the latest
  libvfio-user revision.
* Break out the "softmmu: Support concurrent bounce buffers" patch so this
  series only touches vfio-user code and can be picked up as is by Jag.

Changes from v11:

* Rebase (clean)
* Add a patch to fix DMA memory region reference leaks, which prevented
  vfio-user-server auto-shutdown from working.

Mattias Nissler (3):
  Update subprojects/libvfio-user
  vfio-user: Message-based DMA support
  vfio-user: Fix memory region reference accounting

 hw/remote/trace-events        |   2 +
 hw/remote/vfio-user-obj.c     | 108 +++++++++++++++++++++++++++++-----
 subprojects/libvfio-user.wrap |   2 +-
 3 files changed, 96 insertions(+), 16 deletions(-)

-- 
2.34.1
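As referenced under "Changes from v8" above, here is a stand-alone sketch of a
bounded, overflow-safe reservation done with a compare-and-swap loop. The
names (bounce_reserve, bounce_release, bounce_in_flight, bounce_limit) are
hypothetical and the code is only the general shape of such a loop, not the
code from the series; it differs from the add-and-clamp variant shown earlier
in that the counter can never exceed the limit or wrap around, even with a
32-bit type.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the per-address-space counter and limit. */
static _Atomic uint32_t bounce_in_flight;
static const uint32_t bounce_limit = 4096;

/* Reserve `want` bytes, possibly shrinking the request; returns the number
 * of bytes actually granted. */
static uint32_t bounce_reserve(uint32_t want)
{
    uint32_t old = atomic_load(&bounce_in_flight);

    for (;;) {
        uint32_t avail = bounce_limit - old;   /* old <= limit always holds */
        uint32_t grant = want < avail ? want : avail;
        if (grant == 0) {
            return 0;
        }
        /* Only commit if nobody else changed the counter in the meantime;
         * on failure `old` is reloaded and we retry. */
        if (atomic_compare_exchange_weak(&bounce_in_flight, &old,
                                         old + grant)) {
            return grant;
        }
    }
}

static void bounce_release(uint32_t len)
{
    atomic_fetch_sub(&bounce_in_flight, len);
}

int main(void)
{
    uint32_t a = bounce_reserve(4096);  /* granted in full           */
    uint32_t b = bounce_reserve(512);   /* limit reached -> 0 granted */
    printf("a=%u b=%u\n", a, b);
    bounce_release(a);
    return 0;
}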
Brings in assorted bug fixes. The following are of particular interest with
respect to message-based DMA support:

* bb308a2 "Fix address calculation for message-based DMA"
  Corrects a bug in DMA address calculation.

* 1569a37 "Pass server->client command over a separate socket pair"
  Adds support for separate sockets for either command direction, addressing a
  bug where libvfio-user gets confused if both client and server send commands
  concurrently.

Reviewed-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
 subprojects/libvfio-user.wrap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap
index XXXXXXX..XXXXXXX 100644
--- a/subprojects/libvfio-user.wrap
+++ b/subprojects/libvfio-user.wrap
@@ -XXX,XX +XXX,XX @@
 [wrap-git]
 url = https://gitlab.com/qemu-project/libvfio-user.git
-revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7
+revision = b1a156d86f55a8fa3f78ece5bee7748ec75e7b82
 depth = 1
-- 
2.34.1
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed via
the VFIO-user protocol. This installs an indirect memory region, which already
works for pci_dma_{read,write}, and pci_dma_map works thanks to the existing
DMA bounce buffering support.

Note that while simple scenarios work with this patch, there's a known race
condition in libvfio-user that will mess up the communication channel. See
https://github.com/nutanix/libvfio-user/issues/279 for details as well as a
proposed fix.

Reviewed-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
 hw/remote/trace-events    |   2 +
 hw/remote/vfio-user-obj.c | 100 ++++++++++++++++++++++++++++++++------
 2 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -XXX,XX +XXX,XX @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
+vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -XXX,XX +XXX,XX @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
     return count;
 }
 
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+                                unsigned size, MemTxAttrs attrs)
+{
+    MemoryRegion *region = opaque;
+    vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+    uint8_t buf[sizeof(uint64_t)];
+
+    trace_vfu_dma_read(region->addr + addr, size);
+
+    g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+    vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+    if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+        vfu_sgl_read(vfu_ctx, sg, 1, buf) != 0) {
+        return MEMTX_ERROR;
+    }
+
+    *val = ldn_he_p(buf, size);
+
+    return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+                                 unsigned size, MemTxAttrs attrs)
+{
+    MemoryRegion *region = opaque;
+    vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+    uint8_t buf[sizeof(uint64_t)];
+
+    trace_vfu_dma_write(region->addr + addr, size);
+
+    stn_he_p(buf, size, val);
+
+    g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+    vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+    if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+        vfu_sgl_write(vfu_ctx, sg, 1, buf) != 0) {
+        return MEMTX_ERROR;
+    }
+
+    return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+    .read_with_attrs = vfu_dma_read,
+    .write_with_attrs = vfu_dma_write,
+    .endianness = DEVICE_HOST_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+        .unaligned = true,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+};
+
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
     VfuObject *o = vfu_get_private(vfu_ctx);
@@ -XXX,XX +XXX,XX @@ static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
     g_autofree char *name = NULL;
     struct iovec *iov = &info->iova;
 
-    if (!info->vaddr) {
-        return;
-    }
-
     name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
-                           (uint64_t)info->vaddr);
+                           (uint64_t)iov->iov_base);
 
     subregion = g_new0(MemoryRegion, 1);
 
-    memory_region_init_ram_ptr(subregion, NULL, name,
-                               iov->iov_len, info->vaddr);
+    if (info->vaddr) {
+        memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+                                   iov->iov_len, info->vaddr);
+    } else {
+        /*
+         * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
+         * bytes at a time, and larger accesses are broken down. However,
+         * many/most DMA accesses are larger than 8 bytes and VFIO-user can
+         * handle large DMA accesses just fine, thus this size restriction
+         * unnecessarily hurts performance, in particular given that each
+         * access causes a round trip on the VFIO-user socket.
+         *
+         * TODO: Investigate how to plumb larger accesses through memory
+         * regions, possibly by amending MemoryRegionOps or by creating a new
+         * memory region type.
+         */
+        memory_region_init_io(subregion, OBJECT(o), &vfu_dma_ops, subregion,
+                              name, iov->iov_len);
+    }
 
     dma_as = pci_device_iommu_address_space(o->pci_dev);
 
@@ -XXX,XX +XXX,XX @@ static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
     VfuObject *o = vfu_get_private(vfu_ctx);
+    MemoryRegionSection mr_section;
     AddressSpace *dma_as = NULL;
-    MemoryRegion *mr = NULL;
-    ram_addr_t offset;
 
-    mr = memory_region_from_host(info->vaddr, &offset);
-    if (!mr) {
+    dma_as = pci_device_iommu_address_space(o->pci_dev);
+
+    mr_section =
+        memory_region_find(dma_as->root, (hwaddr)info->iova.iov_base, 1);
+    if (!mr_section.mr) {
         return;
     }
 
-    dma_as = pci_device_iommu_address_space(o->pci_dev);
-
-    memory_region_del_subregion(dma_as->root, mr);
+    memory_region_del_subregion(dma_as->root, mr_section.mr);
 
-    object_unparent((OBJECT(mr)));
+    object_unparent((OBJECT(mr_section.mr)));
 
     trace_vfu_dma_unregister((uint64_t)info->iova.iov_base);
 }
-- 
2.34.1
The memory regions created for DMA regions were leaking the original reference
that the object is initialized with. This happened since we insert the memory
region as a subregion, but don't keep the reference obtained when creating the
object. Thus, drop the reference after inserting the DMA memory region into
the address space.

This fixes auto-shutdown behavior: Due to the leaked references, the memory
regions would never be released, and indirectly keep the VFU object as their
owner alive. Thus, vfu_object_finalize didn't get invoked, and qemu wouldn't
terminate. With this fix, this is now working as originally intended.

Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
 hw/remote/vfio-user-obj.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -XXX,XX +XXX,XX @@ static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 
     memory_region_add_subregion(dma_as->root, (hwaddr)iov->iov_base, subregion);
 
+    /*
+     * Insertion into the address space grabbed a reference to keep the memory
+     * region alive. However, the memory region object was created with an
+     * original reference count of 1, so we must unref since we don't keep
+     * that reference.
+     */
+    memory_region_unref(subregion);
+
     trace_vfu_dma_register((uint64_t)iov->iov_base, iov->iov_len);
 }
-- 
2.34.1
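The reference-counting reasoning in this commit message can be illustrated
with a generic toy example, entirely independent of QEMU's QOM/MemoryRegion
machinery: an object starts at refcount 1, the container takes its own
reference, and the creator must drop its initial one or the object is never
released. All names below are made up for illustration.

#include <stdio.h>
#include <stdlib.h>

/* A toy refcounted object. */
typedef struct {
    int refcount;
    const char *name;
} Obj;

static Obj *obj_new(const char *name)
{
    Obj *o = malloc(sizeof(*o));
    o->refcount = 1;          /* creator holds the initial reference */
    o->name = name;
    return o;
}

static void obj_ref(Obj *o)
{
    o->refcount++;
}

static void obj_unref(Obj *o)
{
    if (--o->refcount == 0) {
        printf("%s released\n", o->name);
        free(o);
    }
}

int main(void)
{
    Obj *region = obj_new("dma-region");

    obj_ref(region);    /* the "address space" takes its own reference      */
    obj_unref(region);  /* creator drops its initial reference (the fix);
                         * without this call the object is never freed      */

    obj_unref(region);  /* unregister path drops the container's reference,
                         * so "dma-region released" is printed              */
    return 0;
}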