From: Fam Zheng
To: qemu-devel@nongnu.org
Date: Fri, 12 Jan 2018 16:55:47 +0800
Message-Id:
<20180112085555.14447-2-famz@redhat.com>
In-Reply-To: <20180112085555.14447-1-famz@redhat.com>
References: <20180112085555.14447-1-famz@redhat.com>
Subject: [Qemu-devel] [PATCH v5 1/9] stubs: Add stubs for ram block API
Cc: Kevin Wolf, Fam Zheng, qemu-block@nongnu.org, Markus Armbruster,
    Max Reitz, Keith Busch, alex.williamson@redhat.com, Stefan Hajnoczi,
    Paolo Bonzini, Karl Rister

These functions are referenced by block-obj-y code, but their definitions
live in obj-y, so stub them out to keep the linker happy.
Signed-off-by: Fam Zheng
Acked-by: Paolo Bonzini
Message-Id: <20180110091846.10699-2-famz@redhat.com>
Reviewed-by: Stefan Hajnoczi
---
 stubs/Makefile.objs |  1 +
 stubs/ram-block.c   | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)
 create mode 100644 stubs/ram-block.c

diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index 8cfe34328a..2d59d84091 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -42,3 +42,4 @@ stub-obj-y += vmgenid.o
 stub-obj-y += xen-common.o
 stub-obj-y += xen-hvm.o
 stub-obj-y += pci-host-piix.o
+stub-obj-y += ram-block.o
diff --git a/stubs/ram-block.c b/stubs/ram-block.c
new file mode 100644
index 0000000000..cfa5d8678f
--- /dev/null
+++ b/stubs/ram-block.c
@@ -0,0 +1,16 @@
+#include "qemu/osdep.h"
+#include "exec/ramlist.h"
+#include "exec/cpu-common.h"
+
+void ram_block_notifier_add(RAMBlockNotifier *n)
+{
+}
+
+void ram_block_notifier_remove(RAMBlockNotifier *n)
+{
+}
+
+int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
+{
+    return 0;
+}
-- 
2.14.3
From: Fam Zheng
To: qemu-devel@nongnu.org
Date: Fri, 12 Jan 2018 16:55:48 +0800
Message-Id: <20180112085555.14447-3-famz@redhat.com>
In-Reply-To: <20180112085555.14447-1-famz@redhat.com>
References: <20180112085555.14447-1-famz@redhat.com>
Subject: [Qemu-devel] [PATCH v5 2/9] util: Introduce vfio helpers
This is a library to manage the host VFIO interface, which can be used
to implement userspace device drivers in QEMU, such as NVMe or network
controllers.

Signed-off-by: Fam Zheng
Message-Id: <20180110091846.10699-3-famz@redhat.com>
Reviewed-by: Stefan Hajnoczi
---
 include/qemu/vfio-helpers.h |  33 ++
 util/Makefile.objs          |   1 +
 util/trace-events           |  11 +
 util/vfio-helpers.c         | 726 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 771 insertions(+)
 create mode 100644 include/qemu/vfio-helpers.h
 create mode 100644 util/vfio-helpers.c

diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
new file mode 100644
index 0000000000..ce7e7b057f
--- /dev/null
+++ b/include/qemu/vfio-helpers.h
@@ -0,0 +1,33 @@
+/*
+ * QEMU VFIO helpers
+ *
+ * Copyright 2016 - 2018 Red Hat, Inc.
+ *
+ * Authors:
+ *   Fam Zheng
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_VFIO_HELPERS_H
+#define QEMU_VFIO_HELPERS_H
+#include "qemu/typedefs.h"
+
+typedef struct QEMUVFIOState QEMUVFIOState;
+
+QEMUVFIOState *qemu_vfio_open_pci(const char *device, Error **errp);
+void qemu_vfio_close(QEMUVFIOState *s);
+int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
+                      bool temporary, uint64_t *iova_list);
+int qemu_vfio_dma_reset_temporary(QEMUVFIOState *s);
+void qemu_vfio_dma_unmap(QEMUVFIOState *s, void *host);
+void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index,
+                            uint64_t offset, uint64_t size,
+                            Error **errp);
+void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
+                             uint64_t offset, uint64_t size);
+int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
+                           int irq_type, Error **errp);
+
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 2973b0a323..3fb611631f 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -46,3 +46,4 @@ util-obj-y += qht.o
 util-obj-y += range.o
 util-obj-y += stats64.o
 util-obj-y += systemd.o
+util-obj-$(CONFIG_LINUX) += vfio-helpers.o
diff --git a/util/trace-events b/util/trace-events
index 025499f83f..2f57bf2337 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -59,3 +59,14 @@ lockcnt_futex_wake(const void *lockcnt) "lockcnt %p waking up one waiter"
 # util/qemu-thread-posix.c
 qemu_mutex_locked(void *lock) "locked mutex %p"
 qemu_mutex_unlocked(void *lock) "unlocked mutex %p"
+
+# util/vfio-helpers.c
+qemu_vfio_dma_reset_temporary(void *s) "s %p"
+qemu_vfio_ram_block_added(void *s, void *p, size_t size) "s %p host %p size 0x%zx"
+qemu_vfio_ram_block_removed(void *s, void *p, size_t size) "s %p host %p size 0x%zx"
+qemu_vfio_find_mapping(void *s, void *p) "s %p host %p"
+qemu_vfio_new_mapping(void *s, void *host, size_t size, int index, uint64_t iova) "s %p host %p size %zu index %d iova 0x%"PRIx64
+qemu_vfio_do_mapping(void *s, void *host, size_t size, uint64_t iova) "s %p host %p size %zu iova 0x%"PRIx64
+qemu_vfio_dma_map(void *s, void *host, size_t size, bool temporary, uint64_t *iova) "s %p host %p size %zu temporary %d iova %p"
+qemu_vfio_dma_map_invalid(void *s, void *mapping_host, size_t mapping_size, void *host, size_t size) "s %p mapping %p %zu requested %p %zu"
+qemu_vfio_dma_unmap(void *s, void *host) "s %p host %p"
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
new file mode 100644
index 0000000000..0660aaf2f7
--- /dev/null
+++ b/util/vfio-helpers.c
@@ -0,0 +1,726 @@
+/*
+ * VFIO utility
+ *
+ * Copyright 2016 - 2018 Red Hat, Inc.
+ *
+ * Authors:
+ *   Fam Zheng
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/cpu-common.h"
+#include "trace.h"
+#include "qemu/queue.h"
+#include "qemu/error-report.h"
+#include "standard-headers/linux/pci_regs.h"
+#include "qemu/event_notifier.h"
+#include "qemu/vfio-helpers.h"
+
+#define QEMU_VFIO_DEBUG 0
+
+#define QEMU_VFIO_IOVA_MIN 0x10000ULL
+/* XXX: Once VFIO exposes the iova bit width in the IOMMU capability interface,
+ * we can use a runtime limit; alternatively it's also possible to do platform
+ * specific detection by reading sysfs entries. Until then, 39 is a safe bet.
+ */
+#define QEMU_VFIO_IOVA_MAX (1ULL << 39)
+
+typedef struct {
+    /* Page aligned addr.
 */
+    void *host;
+    size_t size;
+    uint64_t iova;
+} IOVAMapping;
+
+struct QEMUVFIOState {
+    QemuMutex lock;
+
+    /* These fields are protected by BQL */
+    int container;
+    int group;
+    int device;
+    RAMBlockNotifier ram_notifier;
+    struct vfio_region_info config_region_info, bar_region_info[6];
+
+    /* These fields are protected by @lock */
+    /* VFIO's IO virtual address space is managed by splitting into a few
+     * sections:
+     *
+     * ---------------       <= 0
+     * |xxxxxxxxxxxxx|
+     * |-------------|       <= QEMU_VFIO_IOVA_MIN
+     * |             |
+     * |    Fixed    |
+     * |             |
+     * |-------------|       <= low_water_mark
+     * |             |
+     * |    Free     |
+     * |             |
+     * |-------------|       <= high_water_mark
+     * |             |
+     * |    Temp     |
+     * |             |
+     * |-------------|       <= QEMU_VFIO_IOVA_MAX
+     * |xxxxxxxxxxxxx|
+     * |xxxxxxxxxxxxx|
+     * ---------------
+     *
+     * - Addresses lower than QEMU_VFIO_IOVA_MIN are reserved as invalid;
+     *
+     * - Fixed mappings of HVAs are assigned "low" IOVAs in the range of
+     *   [QEMU_VFIO_IOVA_MIN, low_water_mark). Once allocated they will not be
+     *   reclaimed - low_water_mark never shrinks;
+     *
+     * - IOVAs in range [low_water_mark, high_water_mark) are free;
+     *
+     * - IOVAs in range [high_water_mark, QEMU_VFIO_IOVA_MAX) are volatile
+     *   mappings. At each qemu_vfio_dma_reset_temporary() call, the whole area
+     *   is recycled. The caller should make sure I/O's depending on these
+     *   mappings are completed before calling.
+     */
+    uint64_t low_water_mark;
+    uint64_t high_water_mark;
+    IOVAMapping *mappings;
+    int nr_mappings;
+};
+
+/**
+ * Find the group file by the PCI device address specified in @device, and
+ * return its path. The returned string is owned by the caller and should be
+ * g_free'd later.
+ */
+static char *sysfs_find_group_file(const char *device, Error **errp)
+{
+    char *sysfs_link;
+    char *sysfs_group;
+    char *p;
+    char *path = NULL;
+
+    sysfs_link = g_strdup_printf("/sys/bus/pci/devices/%s/iommu_group", device);
+    sysfs_group = g_malloc(PATH_MAX);
+    if (readlink(sysfs_link, sysfs_group, PATH_MAX - 1) == -1) {
+        error_setg_errno(errp, errno, "Failed to find iommu group sysfs path");
+        goto out;
+    }
+    p = strrchr(sysfs_group, '/');
+    if (!p) {
+        error_setg(errp, "Failed to find iommu group number");
+        goto out;
+    }
+
+    path = g_strdup_printf("/dev/vfio/%s", p + 1);
+out:
+    g_free(sysfs_link);
+    g_free(sysfs_group);
+    return path;
+}
+
+static inline void assert_bar_index_valid(QEMUVFIOState *s, int index)
+{
+    assert(index >= 0 && index < ARRAY_SIZE(s->bar_region_info));
+}
+
+static int qemu_vfio_pci_init_bar(QEMUVFIOState *s, int index, Error **errp)
+{
+    assert_bar_index_valid(s, index);
+    s->bar_region_info[index] = (struct vfio_region_info) {
+        .index = VFIO_PCI_BAR0_REGION_INDEX + index,
+        .argsz = sizeof(struct vfio_region_info),
+    };
+    if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, &s->bar_region_info[index])) {
+        error_setg_errno(errp, errno, "Failed to get BAR region info");
+        return -errno;
+    }
+
+    return 0;
+}
+
+/**
+ * Map a PCI BAR area.
+ */
+void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index,
+                            uint64_t offset, uint64_t size,
+                            Error **errp)
+{
+    void *p;
+    assert_bar_index_valid(s, index);
+    p = mmap(NULL, MIN(size, s->bar_region_info[index].size - offset),
+             PROT_READ | PROT_WRITE, MAP_SHARED,
+             s->device, s->bar_region_info[index].offset + offset);
+    if (p == MAP_FAILED) {
+        error_setg_errno(errp, errno, "Failed to map BAR region");
+        p = NULL;
+    }
+    return p;
+}
+
+/**
+ * Unmap a PCI BAR area.
+ */
+void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
+                             uint64_t offset, uint64_t size)
+{
+    if (bar) {
+        munmap(bar, MIN(size, s->bar_region_info[index].size - offset));
+    }
+}
+
+/**
+ * Initialize device IRQ with @irq_type and register an event notifier.
+ */
+int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
+                           int irq_type, Error **errp)
+{
+    int r;
+    struct vfio_irq_set *irq_set;
+    size_t irq_set_size;
+    struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
+
+    irq_info.index = irq_type;
+    if (ioctl(s->device, VFIO_DEVICE_GET_IRQ_INFO, &irq_info)) {
+        error_setg_errno(errp, errno, "Failed to get device interrupt info");
+        return -errno;
+    }
+    if (!(irq_info.flags & VFIO_IRQ_INFO_EVENTFD)) {
+        error_setg(errp, "Device interrupt doesn't support eventfd");
+        return -EINVAL;
+    }
+
+    irq_set_size = sizeof(*irq_set) + sizeof(int);
+    irq_set = g_malloc0(irq_set_size);
+
+    /* Get to a known IRQ state */
+    *irq_set = (struct vfio_irq_set) {
+        .argsz = irq_set_size,
+        .flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
+        .index = irq_info.index,
+        .start = 0,
+        .count = 1,
+    };
+
+    *(int *)&irq_set->data = event_notifier_get_fd(e);
+    r = ioctl(s->device, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (r) {
+        error_setg_errno(errp, errno, "Failed to setup device interrupt");
+        return -errno;
+    }
+    return 0;
+}
+
+static int qemu_vfio_pci_read_config(QEMUVFIOState *s, void *buf,
+                                     int size, int ofs)
+{
+    int ret;
+
+    do {
+        ret = pread(s->device, buf, size, s->config_region_info.offset + ofs);
+    } while (ret == -1 && errno == EINTR);
+    return ret == size ? 0 : -errno;
+}
+
+static int qemu_vfio_pci_write_config(QEMUVFIOState *s, void *buf, int size, int ofs)
+{
+    int ret;
+
+    do {
+        ret = pwrite(s->device, buf, size, s->config_region_info.offset + ofs);
+    } while (ret == -1 && errno == EINTR);
+    return ret == size ?
0 : -errno;
+}
+
+static int qemu_vfio_init_pci(QEMUVFIOState *s, const char *device,
+                              Error **errp)
+{
+    int ret;
+    int i;
+    uint16_t pci_cmd;
+    struct vfio_group_status group_status = { .argsz = sizeof(group_status) };
+    struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
+    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+    char *group_file = NULL;
+
+    /* Create a new container */
+    s->container = open("/dev/vfio/vfio", O_RDWR);
+
+    if (s->container == -1) {
+        error_setg_errno(errp, errno, "Failed to open /dev/vfio/vfio");
+        return -errno;
+    }
+    if (ioctl(s->container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
+        error_setg(errp, "Invalid VFIO version");
+        ret = -EINVAL;
+        goto fail_container;
+    }
+
+    if (!ioctl(s->container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
+        error_setg_errno(errp, errno, "VFIO IOMMU check failed");
+        ret = -EINVAL;
+        goto fail_container;
+    }
+
+    /* Open the group */
+    group_file = sysfs_find_group_file(device, errp);
+    if (!group_file) {
+        ret = -EINVAL;
+        goto fail_container;
+    }
+
+    s->group = open(group_file, O_RDWR);
+    if (s->group == -1) {
+        error_setg_errno(errp, errno, "Failed to open VFIO group file: %s",
+                         group_file);
+        g_free(group_file);
+        ret = -errno;
+        goto fail_container;
+    }
+    g_free(group_file);
+
+    /* Test that the group is viable and available */
+    if (ioctl(s->group, VFIO_GROUP_GET_STATUS, &group_status)) {
+        error_setg_errno(errp, errno, "Failed to get VFIO group status");
+        ret = -errno;
+        goto fail;
+    }
+
+    if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+        error_setg(errp, "VFIO group is not viable");
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    /* Add the group to the container */
+    if (ioctl(s->group, VFIO_GROUP_SET_CONTAINER, &s->container)) {
+        error_setg_errno(errp, errno, "Failed to add group to VFIO container");
+        ret = -errno;
+        goto fail;
+    }
+
+    /* Enable the IOMMU model we want */
+    if (ioctl(s->container,
VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)) {
+        error_setg_errno(errp, errno, "Failed to set VFIO IOMMU type");
+        ret = -errno;
+        goto fail;
+    }
+
+    /* Get additional IOMMU info */
+    if (ioctl(s->container, VFIO_IOMMU_GET_INFO, &iommu_info)) {
+        error_setg_errno(errp, errno, "Failed to get IOMMU info");
+        ret = -errno;
+        goto fail;
+    }
+
+    s->device = ioctl(s->group, VFIO_GROUP_GET_DEVICE_FD, device);
+
+    if (s->device < 0) {
+        error_setg_errno(errp, errno, "Failed to get device fd");
+        ret = -errno;
+        goto fail;
+    }
+
+    /* Test and setup the device */
+    if (ioctl(s->device, VFIO_DEVICE_GET_INFO, &device_info)) {
+        error_setg_errno(errp, errno, "Failed to get device info");
+        ret = -errno;
+        goto fail;
+    }
+
+    if (device_info.num_regions < VFIO_PCI_CONFIG_REGION_INDEX) {
+        error_setg(errp, "Invalid device regions");
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    s->config_region_info = (struct vfio_region_info) {
+        .index = VFIO_PCI_CONFIG_REGION_INDEX,
+        .argsz = sizeof(struct vfio_region_info),
+    };
+    if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, &s->config_region_info)) {
+        error_setg_errno(errp, errno, "Failed to get config region info");
+        ret = -errno;
+        goto fail;
+    }
+
+    for (i = 0; i < 6; i++) {
+        ret = qemu_vfio_pci_init_bar(s, i, errp);
+        if (ret) {
+            goto fail;
+        }
+    }
+
+    /* Enable bus master */
+    ret = qemu_vfio_pci_read_config(s, &pci_cmd, sizeof(pci_cmd), PCI_COMMAND);
+    if (ret) {
+        goto fail;
+    }
+    pci_cmd |= PCI_COMMAND_MASTER;
+    ret = qemu_vfio_pci_write_config(s, &pci_cmd, sizeof(pci_cmd), PCI_COMMAND);
+    if (ret) {
+        goto fail;
+    }
+    return 0;
+fail:
+    close(s->group);
+fail_container:
+    close(s->container);
+    return ret;
+}
+
+static void qemu_vfio_ram_block_added(RAMBlockNotifier *n,
+                                      void *host, size_t size)
+{
+    QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
+    trace_qemu_vfio_ram_block_added(s, host, size);
+    qemu_vfio_dma_map(s, host, size, false, NULL);
+}
+
+static void
qemu_vfio_ram_block_removed(RAMBlockNotifier *n,
+                                        void *host, size_t size)
+{
+    QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
+    if (host) {
+        trace_qemu_vfio_ram_block_removed(s, host, size);
+        qemu_vfio_dma_unmap(s, host);
+    }
+}
+
+static int qemu_vfio_init_ramblock(const char *block_name, void *host_addr,
+                                   ram_addr_t offset, ram_addr_t length,
+                                   void *opaque)
+{
+    int ret;
+    QEMUVFIOState *s = opaque;
+
+    if (!host_addr) {
+        return 0;
+    }
+    ret = qemu_vfio_dma_map(s, host_addr, length, false, NULL);
+    if (ret) {
+        fprintf(stderr, "qemu_vfio_init_ramblock: failed %p %ld\n",
+                host_addr, length);
+    }
+    return 0;
+}
+
+static void qemu_vfio_open_common(QEMUVFIOState *s)
+{
+    s->ram_notifier.ram_block_added = qemu_vfio_ram_block_added;
+    s->ram_notifier.ram_block_removed = qemu_vfio_ram_block_removed;
+    ram_block_notifier_add(&s->ram_notifier);
+    s->low_water_mark = QEMU_VFIO_IOVA_MIN;
+    s->high_water_mark = QEMU_VFIO_IOVA_MAX;
+    qemu_ram_foreach_block(qemu_vfio_init_ramblock, s);
+    qemu_mutex_init(&s->lock);
+}
+
+/**
+ * Open a PCI device, e.g. "0000:00:01.0".
+ */
+QEMUVFIOState *qemu_vfio_open_pci(const char *device, Error **errp)
+{
+    int r;
+    QEMUVFIOState *s = g_new0(QEMUVFIOState, 1);
+
+    r = qemu_vfio_init_pci(s, device, errp);
+    if (r) {
+        g_free(s);
+        return NULL;
+    }
+    qemu_vfio_open_common(s);
+    return s;
+}
+
+static void qemu_vfio_dump_mapping(IOVAMapping *m)
+{
+    if (QEMU_VFIO_DEBUG) {
+        printf("  vfio mapping %p %lx to %lx\n", m->host, m->size, m->iova);
+    }
+}
+
+static void qemu_vfio_dump_mappings(QEMUVFIOState *s)
+{
+    int i;
+
+    if (QEMU_VFIO_DEBUG) {
+        printf("vfio mappings\n");
+        for (i = 0; i < s->nr_mappings; ++i) {
+            qemu_vfio_dump_mapping(&s->mappings[i]);
+        }
+    }
+}
+
+/**
+ * Find the mapping entry that contains [host, host + size) and set @index to
+ * the position. If no entry contains it, @index is the position _after_ which
+ * to insert the new mapping.
IOW, it is the index of the largest element that
+ * is smaller than @host, or -1 if no such entry exists.
+ */
+static IOVAMapping *qemu_vfio_find_mapping(QEMUVFIOState *s, void *host,
+                                           int *index)
+{
+    IOVAMapping *p = s->mappings;
+    IOVAMapping *q = p ? p + s->nr_mappings - 1 : NULL;
+    IOVAMapping *mid;
+    trace_qemu_vfio_find_mapping(s, host);
+    if (!p) {
+        *index = -1;
+        return NULL;
+    }
+    while (true) {
+        mid = p + (q - p) / 2;
+        if (mid == p) {
+            break;
+        }
+        if (mid->host > host) {
+            q = mid;
+        } else if (mid->host < host) {
+            p = mid;
+        } else {
+            break;
+        }
+    }
+    if (mid->host > host) {
+        mid--;
+    } else if (mid < &s->mappings[s->nr_mappings - 1]
+               && (mid + 1)->host <= host) {
+        mid++;
+    }
+    *index = mid - &s->mappings[0];
+    if (mid >= &s->mappings[0] &&
+        mid->host <= host && mid->host + mid->size > host) {
+        assert(mid < &s->mappings[s->nr_mappings]);
+        return mid;
+    }
+    /* At this point *index + 1 is the right position to insert the new
+     * mapping.*/
+    return NULL;
+}
+
+/**
+ * Allocate an IOVA, create a new mapping record and insert it in @s.
+ */
+static IOVAMapping *qemu_vfio_add_mapping(QEMUVFIOState *s,
+                                          void *host, size_t size,
+                                          int index, uint64_t iova)
+{
+    int shift;
+    IOVAMapping m = {.host = host, .size = size, .iova = iova};
+    IOVAMapping *insert;
+
+    assert(QEMU_IS_ALIGNED(size, getpagesize()));
+    assert(QEMU_IS_ALIGNED(s->low_water_mark, getpagesize()));
+    assert(QEMU_IS_ALIGNED(s->high_water_mark, getpagesize()));
+    trace_qemu_vfio_new_mapping(s, host, size, index, iova);
+
+    assert(index >= 0);
+    s->nr_mappings++;
+    s->mappings = g_realloc_n(s->mappings, sizeof(s->mappings[0]),
+                              s->nr_mappings);
+    insert = &s->mappings[index];
+    shift = s->nr_mappings - index - 1;
+    if (shift) {
+        memmove(insert + 1, insert, shift * sizeof(s->mappings[0]));
+    }
+    *insert = m;
+    return insert;
+}
+
+/* Do the DMA mapping with VFIO.
 */
+static int qemu_vfio_do_mapping(QEMUVFIOState *s, void *host, size_t size,
+                                uint64_t iova)
+{
+    struct vfio_iommu_type1_dma_map dma_map = {
+        .argsz = sizeof(dma_map),
+        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
+        .iova = iova,
+        .vaddr = (uintptr_t)host,
+        .size = size,
+    };
+    trace_qemu_vfio_do_mapping(s, host, size, iova);
+
+    if (ioctl(s->container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
+        error_report("VFIO_MAP_DMA: %d", -errno);
+        return -errno;
+    }
+    return 0;
+}
+
+/**
+ * Undo the DMA mapping from @s with VFIO, and remove it from the mapping list.
+ */
+static void qemu_vfio_undo_mapping(QEMUVFIOState *s, IOVAMapping *mapping,
+                                   Error **errp)
+{
+    int index;
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = 0,
+        .iova = mapping->iova,
+        .size = mapping->size,
+    };
+
+    index = mapping - s->mappings;
+    assert(mapping->size > 0);
+    assert(QEMU_IS_ALIGNED(mapping->size, getpagesize()));
+    assert(index >= 0 && index < s->nr_mappings);
+    if (ioctl(s->container, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        error_setg(errp, "VFIO_UNMAP_DMA failed: %d", -errno);
+    }
+    memmove(mapping, &s->mappings[index + 1],
+            sizeof(s->mappings[0]) * (s->nr_mappings - index - 1));
+    s->nr_mappings--;
+    s->mappings = g_realloc_n(s->mappings, sizeof(s->mappings[0]),
+                              s->nr_mappings);
+}
+
+/* Check if the mapping list is (ascending) ordered.
 */
+static bool qemu_vfio_verify_mappings(QEMUVFIOState *s)
+{
+    int i;
+    if (QEMU_VFIO_DEBUG) {
+        for (i = 0; i < s->nr_mappings - 1; ++i) {
+            if (!(s->mappings[i].host < s->mappings[i + 1].host)) {
+                fprintf(stderr, "item %d not sorted!\n", i);
+                qemu_vfio_dump_mappings(s);
+                return false;
+            }
+            if (!(s->mappings[i].host + s->mappings[i].size <=
+                  s->mappings[i + 1].host)) {
+                fprintf(stderr, "item %d overlaps with next!\n", i);
+                qemu_vfio_dump_mappings(s);
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+/* Map the [host, host + size) area into a contiguous IOVA address space, and
+ * store the result in @iova if not NULL. The caller needs to make sure the
+ * area is aligned to page size, and mustn't overlap with existing mapping
+ * areas (split mapping status within this area is not allowed).
+ */
+int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
+                      bool temporary, uint64_t *iova)
+{
+    int ret = 0;
+    int index;
+    IOVAMapping *mapping;
+    uint64_t iova0;
+
+    assert(QEMU_PTR_IS_ALIGNED(host, getpagesize()));
+    assert(QEMU_IS_ALIGNED(size, getpagesize()));
+    trace_qemu_vfio_dma_map(s, host, size, temporary, iova);
+    qemu_mutex_lock(&s->lock);
+    mapping = qemu_vfio_find_mapping(s, host, &index);
+    if (mapping) {
+        iova0 = mapping->iova + ((uint8_t *)host - (uint8_t *)mapping->host);
+    } else {
+        if (s->high_water_mark - s->low_water_mark + 1 < size) {
+            ret = -ENOMEM;
+            goto out;
+        }
+        if (!temporary) {
+            iova0 = s->low_water_mark;
+            mapping = qemu_vfio_add_mapping(s, host, size, index + 1, iova0);
+            if (!mapping) {
+                ret = -ENOMEM;
+                goto out;
+            }
+            assert(qemu_vfio_verify_mappings(s));
+            ret = qemu_vfio_do_mapping(s, host, size, iova0);
+            if (ret) {
+                qemu_vfio_undo_mapping(s, mapping, NULL);
+                goto out;
+            }
+            s->low_water_mark += size;
+            qemu_vfio_dump_mappings(s);
+        } else {
+            iova0 = s->high_water_mark - size;
+            ret = qemu_vfio_do_mapping(s, host, size, iova0);
+            if (ret) {
+                goto out;
+            }
+            s->high_water_mark
-= size;
+        }
+    }
+    if (iova) {
+        *iova = iova0;
+    }
+out:
+    qemu_mutex_unlock(&s->lock);
+    return ret;
+}
+
+/* Reset the high watermark and free all "temporary" mappings. */
+int qemu_vfio_dma_reset_temporary(QEMUVFIOState *s)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = 0,
+        .iova = s->high_water_mark,
+        .size = QEMU_VFIO_IOVA_MAX - s->high_water_mark,
+    };
+    trace_qemu_vfio_dma_reset_temporary(s);
+    qemu_mutex_lock(&s->lock);
+    if (ioctl(s->container, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        error_report("VFIO_UNMAP_DMA: %d", -errno);
+        qemu_mutex_unlock(&s->lock);
+        return -errno;
+    }
+    s->high_water_mark = QEMU_VFIO_IOVA_MAX;
+    qemu_mutex_unlock(&s->lock);
+    return 0;
+}
+
+/* Unmap the whole area that was previously mapped with
+ * qemu_vfio_dma_map(). */
+void qemu_vfio_dma_unmap(QEMUVFIOState *s, void *host)
+{
+    int index = 0;
+    IOVAMapping *m;
+
+    if (!host) {
+        return;
+    }
+
+    trace_qemu_vfio_dma_unmap(s, host);
+    qemu_mutex_lock(&s->lock);
+    m = qemu_vfio_find_mapping(s, host, &index);
+    if (!m) {
+        goto out;
+    }
+    qemu_vfio_undo_mapping(s, m, NULL);
+out:
+    qemu_mutex_unlock(&s->lock);
+}
+
+static void qemu_vfio_reset(QEMUVFIOState *s)
+{
+    ioctl(s->device, VFIO_DEVICE_RESET);
+}
+
+/* Close and free the VFIO resources.
 */
+void qemu_vfio_close(QEMUVFIOState *s)
+{
+    int i;
+
+    if (!s) {
+        return;
+    }
+    for (i = 0; i < s->nr_mappings; ++i) {
+        qemu_vfio_undo_mapping(s, &s->mappings[i], NULL);
+    }
+    ram_block_notifier_remove(&s->ram_notifier);
+    qemu_vfio_reset(s);
+    close(s->device);
+    close(s->group);
+    close(s->container);
+}
-- 
2.14.3
From: Fam Zheng
To: qemu-devel@nongnu.org
Date: Fri, 12 Jan 2018 16:55:49 +0800
Message-Id: <20180112085555.14447-4-famz@redhat.com>
In-Reply-To: <20180112085555.14447-1-famz@redhat.com>
References: <20180112085555.14447-1-famz@redhat.com>
Subject: [Qemu-devel] [PATCH v5 3/9] block: Add VFIO based NVMe driver

This is a new protocol driver that exclusively opens a host NVMe
controller through VFIO. It achieves better latency than linux-aio by
completely bypassing the host kernel vfs/block layer.

    $rw-$bs-$iodepth     linux-aio    nvme://
    ----------------------------------------
    randread-4k-1        10.5k        21.6k
    randread-512k-1      745          1591
    randwrite-4k-1       30.7k        37.0k
    randwrite-512k-1     1945         1980

    (unit: IOPS)

The driver also integrates with the polling mechanism of the iothread.

This patch is co-authored by Paolo and me.
Signed-off-by: Paolo Bonzini
Signed-off-by: Fam Zheng
Message-Id: <20180110091846.10699-4-famz@redhat.com>
---
 MAINTAINERS         |    6 +
 block/Makefile.objs |    1 +
 block/nvme.c        | 1163 ++++++++++++++++++++++++++++++++++++++++++++++++
 block/trace-events  |   21 +
 4 files changed, 1191 insertions(+)
 create mode 100644 block/nvme.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4770f105d4..bd636a4bff 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1876,6 +1876,12 @@ L: qemu-block@nongnu.org
 S: Supported
 F: block/null.c
 
+NVMe Block Driver
+M: Fam Zheng
+L: qemu-block@nongnu.org
+S: Supported
+F: block/nvme*
+
 Bootdevice
 M: Gonglei
 S: Maintained
diff --git a/block/Makefile.objs b/block/Makefile.objs
index 6eaf78a046..4c7e9d84a7 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -11,6 +11,7 @@ block-obj-$(CONFIG_POSIX) += file-posix.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 block-obj-y += null.o mirror.o commit.o io.o
 block-obj-y += throttle-groups.o
+block-obj-$(CONFIG_LINUX) += nvme.o
 
 block-obj-y += nbd.o nbd-client.o sheepdog.o
 block-obj-$(CONFIG_LIBISCSI) += iscsi.o
diff --git a/block/nvme.c b/block/nvme.c
new file mode 100644
index 0000000000..97ab01686f
--- /dev/null
+++ b/block/nvme.c
@@ -0,0 +1,1163 @@
+/*
+ * NVMe block driver based on vfio
+ *
+ * Copyright 2016 - 2018 Red Hat, Inc.
+ *
+ * Authors:
+ *   Fam Zheng
+ *   Paolo Bonzini
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qstring.h"
+#include "qemu/error-report.h"
+#include "qemu/cutils.h"
+#include "qemu/vfio-helpers.h"
+#include "block/block_int.h"
+#include "trace.h"
+
+/* TODO: Move nvme spec definitions from hw/block/nvme.h into a separate file
+ * that doesn't depend on dma/pci headers.
+ */
+#include "sysemu/dma.h"
+#include "hw/pci/pci.h"
+#include "hw/block/block.h"
+#include "hw/block/nvme.h"
+
+#define NVME_SQ_ENTRY_BYTES 64
+#define NVME_CQ_ENTRY_BYTES 16
+#define NVME_QUEUE_SIZE 128
+#define NVME_BAR_SIZE 8192
+
+typedef struct {
+    int32_t  head, tail;
+    uint8_t  *queue;
+    uint64_t iova;
+    /* Hardware MMIO register */
+    volatile uint32_t *doorbell;
+} NVMeQueue;
+
+typedef struct {
+    BlockCompletionFunc *cb;
+    void *opaque;
+    int cid;
+    void *prp_list_page;
+    uint64_t prp_list_iova;
+    bool busy;
+} NVMeRequest;
+
+typedef struct {
+    CoQueue     free_req_queue;
+    QemuMutex   lock;
+
+    /* Fields protected by BQL */
+    int         index;
+    uint8_t     *prp_list_pages;
+
+    /* Fields protected by @lock */
+    NVMeQueue   sq, cq;
+    int         cq_phase;
+    NVMeRequest reqs[NVME_QUEUE_SIZE];
+    bool        busy;
+    int         need_kick;
+    int         inflight;
+} NVMeQueuePair;
+
+/* Memory mapped registers */
+typedef volatile struct {
+    uint64_t cap;
+    uint32_t vs;
+    uint32_t intms;
+    uint32_t intmc;
+    uint32_t cc;
+    uint32_t reserved0;
+    uint32_t csts;
+    uint32_t nssr;
+    uint32_t aqa;
+    uint64_t asq;
+    uint64_t acq;
+    uint32_t cmbloc;
+    uint32_t cmbsz;
+    uint8_t  reserved1[0xec0];
+    uint8_t  cmd_set_specfic[0x100];
+    uint32_t doorbells[];
+} QEMU_PACKED NVMeRegs;
+
+QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
+
+typedef struct {
+    AioContext *aio_context;
+    QEMUVFIOState *vfio;
+    NVMeRegs *regs;
+    /* The submission/completion queue pairs.
+     * [0]: admin queue.
+     * [1..]: io queues.
+     */
+    NVMeQueuePair **queues;
+    int nr_queues;
+    size_t page_size;
+    /* How many uint32_t elements does each doorbell entry take. */
+    size_t doorbell_scale;
+    bool write_cache_supported;
+    EventNotifier irq_notifier;
+    uint64_t nsze; /* Namespace size reported by identify command */
+    int nsid;      /* The namespace id to read/write data.
+                    */
+    uint64_t max_transfer;
+    int plugged;
+
+    CoMutex dma_map_lock;
+    CoQueue dma_flush_queue;
+
+    /* Total size of mapped qiov, accessed under dma_map_lock */
+    int dma_map_count;
+} BDRVNVMeState;
+
+#define NVME_BLOCK_OPT_DEVICE "device"
+#define NVME_BLOCK_OPT_NAMESPACE "namespace"
+
+static QemuOptsList runtime_opts = {
+    .name = "nvme",
+    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
+    .desc = {
+        {
+            .name = NVME_BLOCK_OPT_DEVICE,
+            .type = QEMU_OPT_STRING,
+            .help = "NVMe PCI device address",
+        },
+        {
+            .name = NVME_BLOCK_OPT_NAMESPACE,
+            .type = QEMU_OPT_NUMBER,
+            .help = "NVMe namespace",
+        },
+        { /* end of list */ }
+    },
+};
+
+static void nvme_init_queue(BlockDriverState *bs, NVMeQueue *q,
+                            int nentries, int entry_bytes, Error **errp)
+{
+    BDRVNVMeState *s = bs->opaque;
+    size_t bytes;
+    int r;
+
+    bytes = ROUND_UP(nentries * entry_bytes, s->page_size);
+    q->head = q->tail = 0;
+    q->queue = qemu_try_blockalign0(bs, bytes);
+
+    if (!q->queue) {
+        error_setg(errp, "Cannot allocate queue");
+        return;
+    }
+    r = qemu_vfio_dma_map(s->vfio, q->queue, bytes, false, &q->iova);
+    if (r) {
+        error_setg(errp, "Cannot map queue");
+    }
+}
+
+static void nvme_free_queue_pair(BlockDriverState *bs, NVMeQueuePair *q)
+{
+    qemu_vfree(q->prp_list_pages);
+    qemu_vfree(q->sq.queue);
+    qemu_vfree(q->cq.queue);
+    qemu_mutex_destroy(&q->lock);
+    g_free(q);
+}
+
+static void nvme_free_req_queue_cb(void *opaque)
+{
+    NVMeQueuePair *q = opaque;
+
+    while (qemu_co_enter_next(&q->free_req_queue)) {
+        /* Retry all pending requests */
+    }
+}
+
+static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
+                                             int idx, int size,
+                                             Error **errp)
+{
+    int i, r;
+    BDRVNVMeState *s = bs->opaque;
+    Error *local_err = NULL;
+    NVMeQueuePair *q = g_new0(NVMeQueuePair, 1);
+    uint64_t prp_list_iova;
+
+    qemu_mutex_init(&q->lock);
+    q->index = idx;
+    qemu_co_queue_init(&q->free_req_queue);
+    q->prp_list_pages = qemu_blockalign0(bs,
+                                         s->page_size * NVME_QUEUE_SIZE);
+    r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
+                          s->page_size * NVME_QUEUE_SIZE,
+                          false, &prp_list_iova);
+    if (r) {
+        goto fail;
+    }
+    for (i = 0; i < NVME_QUEUE_SIZE; i++) {
+        NVMeRequest *req = &q->reqs[i];
+        req->cid = i + 1;
+        req->prp_list_page = q->prp_list_pages + i * s->page_size;
+        req->prp_list_iova = prp_list_iova + i * s->page_size;
+    }
+    nvme_init_queue(bs, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        goto fail;
+    }
+    q->sq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale];
+
+    nvme_init_queue(bs, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        goto fail;
+    }
+    q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
+
+    return q;
+fail:
+    nvme_free_queue_pair(bs, q);
+    return NULL;
+}
+
+/* With q->lock */
+static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q)
+{
+    if (s->plugged || !q->need_kick) {
+        return;
+    }
+    trace_nvme_kick(s, q->index);
+    assert(!(q->sq.tail & 0xFF00));
+    /* Fence the write to submission queue entry before notifying the device. */
+    smp_wmb();
+    *q->sq.doorbell = cpu_to_le32(q->sq.tail);
+    q->inflight += q->need_kick;
+    q->need_kick = 0;
+}
+
+/* Find a free request element if any, otherwise:
+ *   a) if in coroutine context, try to wait for one to become available;
+ *   b) if not in coroutine, return NULL;
+ */
+static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
+{
+    int i;
+    NVMeRequest *req = NULL;
+
+    qemu_mutex_lock(&q->lock);
+    while (q->inflight + q->need_kick > NVME_QUEUE_SIZE - 2) {
+        /* We have to leave one slot empty as that is the full queue case
+         * (head == tail + 1).
+         */
+        if (qemu_in_coroutine()) {
+            trace_nvme_free_req_queue_wait(q);
+            qemu_mutex_unlock(&q->lock);
+            qemu_co_queue_wait(&q->free_req_queue, NULL);
+            qemu_mutex_lock(&q->lock);
+        } else {
+            qemu_mutex_unlock(&q->lock);
+            return NULL;
+        }
+    }
+    for (i = 0; i < NVME_QUEUE_SIZE; i++) {
+        if (!q->reqs[i].busy) {
+            q->reqs[i].busy = true;
+            req = &q->reqs[i];
+            break;
+        }
+    }
+    /* We have checked inflight and need_kick while holding q->lock, so one
+     * free req must be available. */
+    assert(req);
+    qemu_mutex_unlock(&q->lock);
+    return req;
+}
+
+static inline int nvme_translate_error(const NvmeCqe *c)
+{
+    uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
+    if (status) {
+        trace_nvme_error(le32_to_cpu(c->result),
+                         le16_to_cpu(c->sq_head),
+                         le16_to_cpu(c->sq_id),
+                         le16_to_cpu(c->cid),
+                         le16_to_cpu(status));
+    }
+    switch (status) {
+    case 0:
+        return 0;
+    case 1:
+        return -ENOSYS;
+    case 2:
+        return -EINVAL;
+    default:
+        return -EIO;
+    }
+}
+
+/* With q->lock */
+static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
+{
+    bool progress = false;
+    NVMeRequest *preq;
+    NVMeRequest req;
+    NvmeCqe *c;
+
+    trace_nvme_process_completion(s, q->index, q->inflight);
+    if (q->busy || s->plugged) {
+        trace_nvme_process_completion_queue_busy(s, q->index);
+        return false;
+    }
+    q->busy = true;
+    assert(q->inflight >= 0);
+    while (q->inflight) {
+        int16_t cid;
+        c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
+        if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
+            break;
+        }
+        q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
+        if (!q->cq.head) {
+            q->cq_phase = !q->cq_phase;
+        }
+        cid = le16_to_cpu(c->cid);
+        if (cid == 0 || cid > NVME_QUEUE_SIZE) {
+            fprintf(stderr, "Unexpected CID in completion queue: %" PRIu32 "\n",
+                    cid);
+            continue;
+        }
+        assert(cid <= NVME_QUEUE_SIZE);
+        trace_nvme_complete_command(s, q->index, cid);
+        preq = &q->reqs[cid - 1];
+        req = *preq;
+        assert(req.cid ==
+               cid);
+        assert(req.cb);
+        preq->busy = false;
+        preq->cb = preq->opaque = NULL;
+        qemu_mutex_unlock(&q->lock);
+        req.cb(req.opaque, nvme_translate_error(c));
+        qemu_mutex_lock(&q->lock);
+        c->cid = cpu_to_le16(0);
+        q->inflight--;
+        /* Flip Phase Tag bit. */
+        c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
+        progress = true;
+    }
+    if (progress) {
+        /* Notify the device so it can post more completions. */
+        smp_mb_release();
+        *q->cq.doorbell = cpu_to_le32(q->cq.head);
+        if (!qemu_co_queue_empty(&q->free_req_queue)) {
+            aio_bh_schedule_oneshot(s->aio_context, nvme_free_req_queue_cb, q);
+        }
+    }
+    q->busy = false;
+    return progress;
+}
+
+static void nvme_trace_command(const NvmeCmd *cmd)
+{
+    int i;
+
+    for (i = 0; i < 8; ++i) {
+        uint8_t *cmdp = (uint8_t *)cmd + i * 8;
+        trace_nvme_submit_command_raw(cmdp[0], cmdp[1], cmdp[2], cmdp[3],
+                                      cmdp[4], cmdp[5], cmdp[6], cmdp[7]);
+    }
+}
+
+static void nvme_submit_command(BDRVNVMeState *s, NVMeQueuePair *q,
+                                NVMeRequest *req,
+                                NvmeCmd *cmd, BlockCompletionFunc cb,
+                                void *opaque)
+{
+    assert(!req->cb);
+    req->cb = cb;
+    req->opaque = opaque;
+    cmd->cid = cpu_to_le32(req->cid);
+
+    trace_nvme_submit_command(s, q->index, req->cid);
+    nvme_trace_command(cmd);
+    qemu_mutex_lock(&q->lock);
+    memcpy((uint8_t *)q->sq.queue +
+           q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
+    q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
+    q->need_kick++;
+    nvme_kick(s, q);
+    nvme_process_completion(s, q);
+    qemu_mutex_unlock(&q->lock);
+}
+
+static void nvme_cmd_sync_cb(void *opaque, int ret)
+{
+    int *pret = opaque;
+    *pret = ret;
+}
+
+static int nvme_cmd_sync(BlockDriverState *bs, NVMeQueuePair *q,
+                         NvmeCmd *cmd)
+{
+    NVMeRequest *req;
+    BDRVNVMeState *s = bs->opaque;
+    int ret = -EINPROGRESS;
+    req = nvme_get_free_req(q);
+    if (!req) {
+        return -EBUSY;
+    }
+    nvme_submit_command(s, q, req, cmd, nvme_cmd_sync_cb, &ret);
+
+    BDRV_POLL_WHILE(bs, ret == -EINPROGRESS);
+    return ret;
+}
+
+static void nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
+{
+    BDRVNVMeState *s = bs->opaque;
+    NvmeIdCtrl *idctrl;
+    NvmeIdNs *idns;
+    uint8_t *resp;
+    int r;
+    uint64_t iova;
+    NvmeCmd cmd = {
+        .opcode = NVME_ADM_CMD_IDENTIFY,
+        .cdw10 = cpu_to_le32(0x1),
+    };
+
+    resp = qemu_try_blockalign0(bs, sizeof(NvmeIdCtrl));
+    if (!resp) {
+        error_setg(errp, "Cannot allocate buffer for identify response");
+        goto out;
+    }
+    idctrl = (NvmeIdCtrl *)resp;
+    idns = (NvmeIdNs *)resp;
+    r = qemu_vfio_dma_map(s->vfio, resp, sizeof(NvmeIdCtrl), true, &iova);
+    if (r) {
+        error_setg(errp, "Cannot map buffer for DMA");
+        goto out;
+    }
+    cmd.prp1 = cpu_to_le64(iova);
+
+    if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+        error_setg(errp, "Failed to identify controller");
+        goto out;
+    }
+
+    if (le32_to_cpu(idctrl->nn) < namespace) {
+        error_setg(errp, "Invalid namespace");
+        goto out;
+    }
+    s->write_cache_supported = le32_to_cpu(idctrl->vwc) & 0x1;
+    s->max_transfer = (idctrl->mdts ? 1 << idctrl->mdts : 0) * s->page_size;
+    /* For now the page list buffer per command is one page, to hold at most
+     * s->page_size / sizeof(uint64_t) entries.
+     */
+    s->max_transfer = MIN_NON_ZERO(s->max_transfer,
+                                   s->page_size / sizeof(uint64_t) * s->page_size);
+
+    memset(resp, 0, 4096);
+
+    cmd.cdw10 = 0;
+    cmd.nsid = cpu_to_le32(namespace);
+    if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+        error_setg(errp, "Failed to identify namespace");
+        goto out;
+    }
+
+    s->nsze = le64_to_cpu(idns->nsze);
+
+out:
+    qemu_vfio_dma_unmap(s->vfio, resp);
+    qemu_vfree(resp);
+}
+
+static bool nvme_poll_queues(BDRVNVMeState *s)
+{
+    bool progress = false;
+    int i;
+
+    for (i = 0; i < s->nr_queues; i++) {
+        NVMeQueuePair *q = s->queues[i];
+        qemu_mutex_lock(&q->lock);
+        while (nvme_process_completion(s, q)) {
+            /* Keep polling */
+            progress = true;
+        }
+        qemu_mutex_unlock(&q->lock);
+    }
+    return progress;
+}
+
+static void nvme_handle_event(EventNotifier *n)
+{
+    BDRVNVMeState *s = container_of(n, BDRVNVMeState, irq_notifier);
+
+    trace_nvme_handle_event(s);
+    aio_context_acquire(s->aio_context);
+    event_notifier_test_and_clear(n);
+    nvme_poll_queues(s);
+    aio_context_release(s->aio_context);
+}
+
+static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
+{
+    BDRVNVMeState *s = bs->opaque;
+    int n = s->nr_queues;
+    NVMeQueuePair *q;
+    NvmeCmd cmd;
+    int queue_size = NVME_QUEUE_SIZE;
+
+    q = nvme_create_queue_pair(bs, n, queue_size, errp);
+    if (!q) {
+        return false;
+    }
+    cmd = (NvmeCmd) {
+        .opcode = NVME_ADM_CMD_CREATE_CQ,
+        .prp1 = cpu_to_le64(q->cq.iova),
+        .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
+        .cdw11 = cpu_to_le32(0x3),
+    };
+    if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+        error_setg(errp, "Failed to create io queue [%d]", n);
+        nvme_free_queue_pair(bs, q);
+        return false;
+    }
+    cmd = (NvmeCmd) {
+        .opcode = NVME_ADM_CMD_CREATE_SQ,
+        .prp1 = cpu_to_le64(q->sq.iova),
+        .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
+        .cdw11 = cpu_to_le32(0x1 | (n << 16)),
+    };
+    if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+        error_setg(errp, "Failed to "
+                   "create io queue [%d]", n);
+        nvme_free_queue_pair(bs, q);
+        return false;
+    }
+    s->queues = g_renew(NVMeQueuePair *, s->queues, n + 1);
+    s->queues[n] = q;
+    s->nr_queues++;
+    return true;
+}
+
+static bool nvme_poll_cb(void *opaque)
+{
+    EventNotifier *e = opaque;
+    BDRVNVMeState *s = container_of(e, BDRVNVMeState, irq_notifier);
+    bool progress = false;
+
+    trace_nvme_poll_cb(s);
+    progress = nvme_poll_queues(s);
+    return progress;
+}
+
+static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
+                     Error **errp)
+{
+    BDRVNVMeState *s = bs->opaque;
+    int ret;
+    uint64_t cap;
+    uint64_t timeout_ms;
+    uint64_t deadline, now;
+    Error *local_err = NULL;
+
+    qemu_co_mutex_init(&s->dma_map_lock);
+    qemu_co_queue_init(&s->dma_flush_queue);
+    s->nsid = namespace;
+    s->aio_context = bdrv_get_aio_context(bs);
+    ret = event_notifier_init(&s->irq_notifier, 0);
+    if (ret) {
+        error_setg(errp, "Failed to init event notifier");
+        return ret;
+    }
+
+    s->vfio = qemu_vfio_open_pci(device, errp);
+    if (!s->vfio) {
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, NVME_BAR_SIZE, errp);
+    if (!s->regs) {
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    /* Perform initialize sequence as described in NVMe spec "7.6.1
+     * Initialization". */
+
+    cap = le64_to_cpu(s->regs->cap);
+    if (!(cap & (1ULL << 37))) {
+        error_setg(errp, "Device doesn't support NVMe command set");
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    s->page_size = MAX(4096, 1 << (12 + ((cap >> 48) & 0xF)));
+    s->doorbell_scale = (4 << (((cap >> 32) & 0xF))) / sizeof(uint32_t);
+    bs->bl.opt_mem_alignment = s->page_size;
+    timeout_ms = MIN(500 * ((cap >> 24) & 0xFF), 30000);
+
+    /* Reset device to get a clean state. */
+    s->regs->cc = cpu_to_le32(le32_to_cpu(s->regs->cc) & 0xFE);
+    /* Wait for CSTS.RDY = 0.
+     */
+    deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout_ms * 1000000ULL;
+    while (le32_to_cpu(s->regs->csts) & 0x1) {
+        if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
+            error_setg(errp, "Timeout while waiting for device to reset (%ld ms)",
+                       timeout_ms);
+            ret = -ETIMEDOUT;
+            goto fail;
+        }
+    }
+
+    /* Set up admin queue. */
+    s->queues = g_new(NVMeQueuePair *, 1);
+    s->nr_queues = 1;
+    s->queues[0] = nvme_create_queue_pair(bs, 0, NVME_QUEUE_SIZE, errp);
+    if (!s->queues[0]) {
+        ret = -EINVAL;
+        goto fail;
+    }
+    QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
+    s->regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE);
+    s->regs->asq = cpu_to_le64(s->queues[0]->sq.iova);
+    s->regs->acq = cpu_to_le64(s->queues[0]->cq.iova);
+
+    /* After setting up all control registers we can enable device now. */
+    s->regs->cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) |
+                              (ctz32(NVME_SQ_ENTRY_BYTES) << 16) |
+                              0x1);
+    /* Wait for CSTS.RDY = 1. */
+    now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+    deadline = now + timeout_ms * 1000000;
+    while (!(le32_to_cpu(s->regs->csts) & 0x1)) {
+        if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
+            error_setg(errp, "Timeout while waiting for device to start (%ld ms)",
+                       timeout_ms);
+            ret = -ETIMEDOUT;
+            goto fail_queue;
+        }
+    }
+
+    ret = qemu_vfio_pci_init_irq(s->vfio, &s->irq_notifier,
+                                 VFIO_PCI_MSIX_IRQ_INDEX, errp);
+    if (ret) {
+        goto fail_queue;
+    }
+    aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier,
+                           false, nvme_handle_event, nvme_poll_cb);
+
+    nvme_identify(bs, namespace, errp);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        ret = -EIO;
+        goto fail_handler;
+    }
+
+    /* Set up command queues.
+     */
+    if (!nvme_add_io_queue(bs, errp)) {
+        ret = -EIO;
+        goto fail_handler;
+    }
+    return 0;
+
+fail_handler:
+    aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier,
+                           false, NULL, NULL);
+fail_queue:
+    nvme_free_queue_pair(bs, s->queues[0]);
+fail:
+    g_free(s->queues);
+    qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs, 0, NVME_BAR_SIZE);
+    qemu_vfio_close(s->vfio);
+    event_notifier_cleanup(&s->irq_notifier);
+    return ret;
+}
+
+/* Parse a filename in the format of nvme://XXXX:XX:XX.X/X. Example:
+ *
+ *     nvme://0000:44:00.0/1
+ *
+ * where the "nvme://" is a fixed form of the protocol prefix, the middle part
+ * is the PCI address, and the last part is the namespace number starting from
+ * 1 according to the NVMe spec. */
+static void nvme_parse_filename(const char *filename, QDict *options,
+                                Error **errp)
+{
+    int pref = strlen("nvme://");
+
+    if (strlen(filename) > pref && !strncmp(filename, "nvme://", pref)) {
+        const char *tmp = filename + pref;
+        char *device;
+        const char *namespace;
+        unsigned long ns;
+        const char *slash = strchr(tmp, '/');
+        if (!slash) {
+            qdict_put(options, NVME_BLOCK_OPT_DEVICE,
+                      qstring_from_str(tmp));
+            return;
+        }
+        device = g_strndup(tmp, slash - tmp);
+        qdict_put(options, NVME_BLOCK_OPT_DEVICE, qstring_from_str(device));
+        g_free(device);
+        namespace = slash + 1;
+        if (*namespace && qemu_strtoul(namespace, NULL, 10, &ns)) {
+            error_setg(errp, "Invalid namespace '%s', positive number expected",
+                       namespace);
+            return;
+        }
+        qdict_put(options, NVME_BLOCK_OPT_NAMESPACE,
+                  qstring_from_str(*namespace ? namespace : "1"));
+    }
+}
+
+static int nvme_enable_disable_write_cache(BlockDriverState *bs, bool enable,
+                                           Error **errp)
+{
+    BDRVNVMeState *s = bs->opaque;
+    NvmeCmd cmd = {
+        .opcode = NVME_ADM_CMD_SET_FEATURES,
+        .nsid = cpu_to_le32(s->nsid),
+        .cdw10 = cpu_to_le32(0x06),
+        .cdw11 = cpu_to_le32(enable ?
+                             0x01 : 0x00),
+    };
+
+    if (enable && !s->write_cache_supported) {
+        error_setg(errp,
+                   "NVMe controller doesn't have volatile write cache");
+        return -EINVAL;
+    }
+    return nvme_cmd_sync(bs, s->queues[0], &cmd);
+}
+
+static int nvme_file_open(BlockDriverState *bs, QDict *options, int flags,
+                          Error **errp)
+{
+    const char *device;
+    QemuOpts *opts;
+    int namespace;
+
+    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &error_abort);
+    device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
+    if (!device) {
+        error_setg(errp, "'" NVME_BLOCK_OPT_DEVICE "' option is required");
+        qemu_opts_del(opts);
+        return -EINVAL;
+    }
+
+    namespace = qemu_opt_get_number(opts, NVME_BLOCK_OPT_NAMESPACE, 1);
+    nvme_init(bs, device, namespace, errp);
+
+    qemu_opts_del(opts);
+    bs->supported_write_flags = BDRV_REQ_FUA;
+    if (nvme_enable_disable_write_cache(bs, !(flags & BDRV_O_NOCACHE), errp)) {
+        return -EINVAL;
+    }
+    return 0;
+}
+
+static void nvme_close(BlockDriverState *bs)
+{
+    int i;
+    BDRVNVMeState *s = bs->opaque;
+
+    for (i = 0; i < s->nr_queues; ++i) {
+        nvme_free_queue_pair(bs, s->queues[i]);
+    }
+    aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier,
+                           false, NULL, NULL);
+    qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs, 0, NVME_BAR_SIZE);
+    qemu_vfio_close(s->vfio);
+}
+
+static int64_t nvme_getlength(BlockDriverState *bs)
+{
+    BDRVNVMeState *s = bs->opaque;
+
+    return s->nsze << BDRV_SECTOR_BITS;
+}
+
+/* Called with s->dma_map_lock */
+static coroutine_fn int nvme_cmd_unmap_qiov(BlockDriverState *bs,
+                                            QEMUIOVector *qiov)
+{
+    int r = 0;
+    BDRVNVMeState *s = bs->opaque;
+
+    s->dma_map_count -= qiov->size;
+    if (!s->dma_map_count && !qemu_co_queue_empty(&s->dma_flush_queue)) {
+        r = qemu_vfio_dma_reset_temporary(s->vfio);
+        if (!r) {
+            qemu_co_queue_restart_all(&s->dma_flush_queue);
+        }
+    }
+    return r;
+}
+
+/* Called with s->dma_map_lock */
+static coroutine_fn int
+nvme_cmd_map_qiov(BlockDriverState *bs, NvmeCmd *cmd,
+                  NVMeRequest *req, QEMUIOVector *qiov)
+{
+    BDRVNVMeState *s = bs->opaque;
+    uint64_t *pagelist = req->prp_list_page;
+    int i, j, r;
+    int entries = 0;
+
+    assert(qiov->size);
+    assert(QEMU_IS_ALIGNED(qiov->size, s->page_size));
+    assert(qiov->size / s->page_size <= s->page_size / sizeof(uint64_t));
+    for (i = 0; i < qiov->niov; ++i) {
+        bool retry = true;
+        uint64_t iova;
+try_map:
+        r = qemu_vfio_dma_map(s->vfio,
+                              qiov->iov[i].iov_base,
+                              qiov->iov[i].iov_len,
+                              true, &iova);
+        if (r == -ENOMEM && retry) {
+            retry = false;
+            trace_nvme_dma_flush_queue_wait(s);
+            if (s->dma_map_count) {
+                trace_nvme_dma_map_flush(s);
+                qemu_co_queue_wait(&s->dma_flush_queue, &s->dma_map_lock);
+            } else {
+                r = qemu_vfio_dma_reset_temporary(s->vfio);
+                if (r) {
+                    goto fail;
+                }
+            }
+            goto try_map;
+        }
+        if (r) {
+            goto fail;
+        }
+
+        for (j = 0; j < qiov->iov[i].iov_len / s->page_size; j++) {
+            pagelist[entries++] = iova + j * s->page_size;
+        }
+        trace_nvme_cmd_map_qiov_iov(s, i, qiov->iov[i].iov_base,
+                                    qiov->iov[i].iov_len / s->page_size);
+    }
+
+    s->dma_map_count += qiov->size;
+
+    assert(entries <= s->page_size / sizeof(uint64_t));
+    switch (entries) {
+    case 0:
+        abort();
+    case 1:
+        cmd->prp1 = cpu_to_le64(pagelist[0]);
+        cmd->prp2 = 0;
+        break;
+    case 2:
+        cmd->prp1 = cpu_to_le64(pagelist[0]);
+        cmd->prp2 = cpu_to_le64(pagelist[1]);
+        break;
+    default:
+        cmd->prp1 = cpu_to_le64(pagelist[0]);
+        cmd->prp2 = cpu_to_le64(req->prp_list_iova);
+        for (i = 0; i < entries - 1; ++i) {
+            pagelist[i] = cpu_to_le64(pagelist[i + 1]);
+        }
+        pagelist[entries - 1] = 0;
+        break;
+    }
+    trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
+    for (i = 0; i < entries; ++i) {
+        trace_nvme_cmd_map_qiov_pages(s, i, pagelist[i]);
+    }
+    return 0;
+fail:
+    /* No need to unmap [0 - i) iovs even if we've failed, since we don't
+     * increment s->dma_map_count.
+     * This is okay for fixed mapping memory areas
+     * because they are already mapped before calling this function; for
+     * temporary mappings, a later nvme_cmd_(un)map_qiov will reclaim by
+     * calling qemu_vfio_dma_reset_temporary when necessary. */
+    return r;
+}
+
+typedef struct {
+    Coroutine *co;
+    int ret;
+    AioContext *ctx;
+} NVMeCoData;
+
+static void nvme_rw_cb_bh(void *opaque)
+{
+    NVMeCoData *data = opaque;
+    qemu_coroutine_enter(data->co);
+}
+
+static void nvme_rw_cb(void *opaque, int ret)
+{
+    NVMeCoData *data = opaque;
+    data->ret = ret;
+    if (!data->co) {
+        /* The rw coroutine hasn't yielded, don't try to enter. */
+        return;
+    }
+    aio_bh_schedule_oneshot(data->ctx, nvme_rw_cb_bh, data);
+}
+
+static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
+                                            uint64_t offset, uint64_t bytes,
+                                            QEMUIOVector *qiov,
+                                            bool is_write,
+                                            int flags)
+{
+    int r;
+    BDRVNVMeState *s = bs->opaque;
+    NVMeQueuePair *ioq = s->queues[1];
+    NVMeRequest *req;
+    uint32_t cdw12 = (((bytes >> BDRV_SECTOR_BITS) - 1) & 0xFFFF) |
+                     (flags & BDRV_REQ_FUA ? 1 << 30 : 0);
+    NvmeCmd cmd = {
+        .opcode = is_write ?
+                  NVME_CMD_WRITE : NVME_CMD_READ,
+        .nsid = cpu_to_le32(s->nsid),
+        .cdw10 = cpu_to_le32((offset >> BDRV_SECTOR_BITS) & 0xFFFFFFFF),
+        .cdw11 = cpu_to_le32(((offset >> BDRV_SECTOR_BITS) >> 32) & 0xFFFFFFFF),
+        .cdw12 = cpu_to_le32(cdw12),
+    };
+    NVMeCoData data = {
+        .ctx = bdrv_get_aio_context(bs),
+        .ret = -EINPROGRESS,
+    };
+
+    trace_nvme_prw_aligned(s, is_write, offset, bytes, flags, qiov->niov);
+    assert(s->nr_queues > 1);
+    req = nvme_get_free_req(ioq);
+    assert(req);
+
+    qemu_co_mutex_lock(&s->dma_map_lock);
+    r = nvme_cmd_map_qiov(bs, &cmd, req, qiov);
+    qemu_co_mutex_unlock(&s->dma_map_lock);
+    if (r) {
+        req->busy = false;
+        return r;
+    }
+    nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+
+    data.co = qemu_coroutine_self();
+    while (data.ret == -EINPROGRESS) {
+        qemu_coroutine_yield();
+    }
+
+    qemu_co_mutex_lock(&s->dma_map_lock);
+    r = nvme_cmd_unmap_qiov(bs, qiov);
+    qemu_co_mutex_unlock(&s->dma_map_lock);
+    if (r) {
+        return r;
+    }
+
+    trace_nvme_rw_done(s, is_write, offset, bytes, data.ret);
+    return data.ret;
+}
+
+static inline bool nvme_qiov_aligned(BlockDriverState *bs,
+                                     const QEMUIOVector *qiov)
+{
+    int i;
+    BDRVNVMeState *s = bs->opaque;
+
+    for (i = 0; i < qiov->niov; ++i) {
+        if (!QEMU_PTR_IS_ALIGNED(qiov->iov[i].iov_base, s->page_size) ||
+            !QEMU_IS_ALIGNED(qiov->iov[i].iov_len, s->page_size)) {
+            trace_nvme_qiov_unaligned(qiov, i, qiov->iov[i].iov_base,
+                                      qiov->iov[i].iov_len, s->page_size);
+            return false;
+        }
+    }
+    return true;
+}
+
+static int nvme_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
+                       QEMUIOVector *qiov, bool is_write, int flags)
+{
+    BDRVNVMeState *s = bs->opaque;
+    int r;
+    uint8_t *buf = NULL;
+    QEMUIOVector local_qiov;
+
+    assert(QEMU_IS_ALIGNED(offset, s->page_size));
+    assert(QEMU_IS_ALIGNED(bytes, s->page_size));
+    assert(bytes <= s->max_transfer);
+    if (nvme_qiov_aligned(bs, qiov)) {
+        return nvme_co_prw_aligned(bs, offset, bytes, qiov, is_write,
s); + } + trace_nvme_prw_buffered(s, offset, bytes, qiov->niov, is_write); + buf =3D qemu_try_blockalign(bs, bytes); + + if (!buf) { + return -ENOMEM; + } + qemu_iovec_init(&local_qiov, 1); + if (is_write) { + qemu_iovec_to_buf(qiov, 0, buf, bytes); + } + qemu_iovec_add(&local_qiov, buf, bytes); + r =3D nvme_co_prw_aligned(bs, offset, bytes, &local_qiov, is_write, fl= ags); + qemu_iovec_destroy(&local_qiov); + if (!r && !is_write) { + qemu_iovec_from_buf(qiov, 0, buf, bytes); + } + qemu_vfree(buf); + return r; +} + +static coroutine_fn int nvme_co_preadv(BlockDriverState *bs, + uint64_t offset, uint64_t bytes, + QEMUIOVector *qiov, int flags) +{ + return nvme_co_prw(bs, offset, bytes, qiov, false, flags); +} + +static coroutine_fn int nvme_co_pwritev(BlockDriverState *bs, + uint64_t offset, uint64_t bytes, + QEMUIOVector *qiov, int flags) +{ + return nvme_co_prw(bs, offset, bytes, qiov, true, flags); +} + +static coroutine_fn int nvme_co_flush(BlockDriverState *bs) +{ + BDRVNVMeState *s =3D bs->opaque; + NVMeQueuePair *ioq =3D s->queues[1]; + NVMeRequest *req; + NvmeCmd cmd =3D { + .opcode =3D NVME_CMD_FLUSH, + .nsid =3D cpu_to_le32(s->nsid), + }; + NVMeCoData data =3D { + .ctx =3D bdrv_get_aio_context(bs), + .ret =3D -EINPROGRESS, + }; + + assert(s->nr_queues > 1); + req =3D nvme_get_free_req(ioq); + assert(req); + nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data); + + data.co =3D qemu_coroutine_self(); + if (data.ret =3D=3D -EINPROGRESS) { + qemu_coroutine_yield(); + } + + return data.ret; +} + + +static int nvme_reopen_prepare(BDRVReopenState *reopen_state, + BlockReopenQueue *queue, Error **errp) +{ + return 0; +} + +static int64_t coroutine_fn nvme_co_get_block_status(BlockDriverState *bs, + int64_t sector_num, + int nb_sectors, int *= pnum, + BlockDriverState **fi= le) +{ + *pnum =3D nb_sectors; + *file =3D bs; + + return BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_OFFSET_VALID | + (sector_num << BDRV_SECTOR_BITS); +} + +static void 
+nvme_refresh_filename(BlockDriverState *bs, QDict *opts)
+{
+    QINCREF(opts);
+    qdict_del(opts, "filename");
+
+    if (!qdict_size(opts)) {
+        snprintf(bs->exact_filename, sizeof(bs->exact_filename), "%s://",
+                 bs->drv->format_name);
+    }
+
+    qdict_put(opts, "driver", qstring_from_str(bs->drv->format_name));
+    bs->full_open_options = opts;
+}
+
+static void nvme_refresh_limits(BlockDriverState *bs, Error **errp)
+{
+    BDRVNVMeState *s = bs->opaque;
+
+    bs->bl.opt_mem_alignment = s->page_size;
+    bs->bl.request_alignment = s->page_size;
+    bs->bl.max_transfer = s->max_transfer;
+}
+
+static void nvme_detach_aio_context(BlockDriverState *bs)
+{
+    BDRVNVMeState *s = bs->opaque;
+
+    aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier,
+                           false, NULL, NULL);
+}
+
+static void nvme_attach_aio_context(BlockDriverState *bs,
+                                    AioContext *new_context)
+{
+    BDRVNVMeState *s = bs->opaque;
+
+    s->aio_context = new_context;
+    aio_set_event_notifier(new_context, &s->irq_notifier,
+                           false, nvme_handle_event, nvme_poll_cb);
+}
+
+static void nvme_aio_plug(BlockDriverState *bs)
+{
+    BDRVNVMeState *s = bs->opaque;
+    s->plugged++;
+}
+
+static void nvme_aio_unplug(BlockDriverState *bs)
+{
+    int i;
+    BDRVNVMeState *s = bs->opaque;
+    assert(s->plugged);
+    if (!--s->plugged) {
+        for (i = 1; i < s->nr_queues; i++) {
+            NVMeQueuePair *q = s->queues[i];
+            qemu_mutex_lock(&q->lock);
+            nvme_kick(s, q);
+            nvme_process_completion(s, q);
+            qemu_mutex_unlock(&q->lock);
+        }
+    }
+}
+
+static BlockDriver bdrv_nvme = {
+    .format_name              = "nvme",
+    .protocol_name            = "nvme",
+    .instance_size            = sizeof(BDRVNVMeState),
+
+    .bdrv_parse_filename      = nvme_parse_filename,
+    .bdrv_file_open           = nvme_file_open,
+    .bdrv_close               = nvme_close,
+    .bdrv_getlength           = nvme_getlength,
+
+    .bdrv_co_preadv           = nvme_co_preadv,
+    .bdrv_co_pwritev          = nvme_co_pwritev,
+    .bdrv_co_flush_to_disk    = nvme_co_flush,
+    .bdrv_reopen_prepare      = nvme_reopen_prepare,
+
+    .bdrv_co_get_block_status
+                              = nvme_co_get_block_status,
+
+    .bdrv_refresh_filename    = nvme_refresh_filename,
+    .bdrv_refresh_limits      = nvme_refresh_limits,
+
+    .bdrv_detach_aio_context  = nvme_detach_aio_context,
+    .bdrv_attach_aio_context  = nvme_attach_aio_context,
+
+    .bdrv_io_plug             = nvme_aio_plug,
+    .bdrv_io_unplug           = nvme_aio_unplug,
+};
+
+static void bdrv_nvme_init(void)
+{
+    bdrv_register(&bdrv_nvme);
+}
+
+block_init(bdrv_nvme_init);
diff --git a/block/trace-events b/block/trace-events
index 11c8d5f590..02dd80ff0c 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -124,3 +124,24 @@ vxhs_open_iio_open(const char *host) "Failed to connect to storage agent on host
 vxhs_parse_uri_hostinfo(char *host, int port) "Host: IP %s, Port %d"
 vxhs_close(char *vdisk_guid) "Closing vdisk %s"
 vxhs_get_creds(const char *cacert, const char *client_key, const char *client_cert) "cacert %s, client_key %s, client_cert %s"
+
+# block/nvme.c
+nvme_kick(void *s, int queue) "s %p queue %d"
+nvme_dma_flush_queue_wait(void *s) "s %p"
+nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %d sq_head %d sqid %d cid %d status 0x%x"
+nvme_process_completion(void *s, int index, int inflight) "s %p queue %d inflight %d"
+nvme_process_completion_queue_busy(void *s, int index) "s %p queue %d"
+nvme_complete_command(void *s, int index, int cid) "s %p queue %d cid %d"
+nvme_submit_command(void *s, int index, int cid) "s %p queue %d cid %d"
+nvme_submit_command_raw(int c0, int c1, int c2, int c3, int c4, int c5, int c6, int c7) "%02x %02x %02x %02x %02x %02x %02x %02x"
+nvme_handle_event(void *s) "s %p"
+nvme_poll_cb(void *s) "s %p"
+nvme_prw_aligned(void *s, int is_write, uint64_t offset, uint64_t bytes, int flags, int niov) "s %p is_write %d offset %"PRId64" bytes %"PRId64" flags %d niov %d"
+nvme_qiov_unaligned(const void *qiov, int n, void *base, size_t size, int align) "qiov %p n %d base %p size 0x%zx align 0x%x"
+nvme_prw_buffered(void *s, uint64_t
offset, uint64_t bytes, int niov, int is_write) "s %p offset %"PRId64" bytes %"PRId64" niov %d is_write %d"
+nvme_rw_done(void *s, int is_write, uint64_t offset, uint64_t bytes, int ret) "s %p is_write %d offset %"PRId64" bytes %"PRId64" ret %d"
+nvme_dma_map_flush(void *s) "s %p"
+nvme_free_req_queue_wait(void *q) "q %p"
+nvme_cmd_map_qiov(void *s, void *cmd, void *req, void *qiov, int entries) "s %p cmd %p req %p qiov %p entries %d"
+nvme_cmd_map_qiov_pages(void *s, int i, uint64_t page) "s %p page[%d] 0x%"PRIx64
+nvme_cmd_map_qiov_iov(void *s, int i, void *page, int pages) "s %p iov[%d] %p pages %d"
-- 
2.14.3

From nobody Sat Apr 27 00:47:14 2024
From: Fam Zheng
To: qemu-devel@nongnu.org
Date: Fri, 12 Jan 2018 16:55:50 +0800
Message-Id: <20180112085555.14447-5-famz@redhat.com>
In-Reply-To: <20180112085555.14447-1-famz@redhat.com>
References: <20180112085555.14447-1-famz@redhat.com>
Subject: [Qemu-devel] [PATCH v5 4/9] block: Introduce buf register API

Allow a block driver to map and unmap a buffer for later I/O, as a
performance hint.
Signed-off-by: Fam Zheng
Message-Id: <20180110091846.10699-5-famz@redhat.com>
Reviewed-by: Stefan Hajnoczi
---
 block/block-backend.c          | 10 ++++++++++
 block/io.c                     | 24 ++++++++++++++++++++++++
 include/block/block.h          | 11 ++++++++++-
 include/block/block_int.h      |  9 +++++++++
 include/sysemu/block-backend.h |  3 +++
 5 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index baef8e7abc..f66349c2c9 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2096,3 +2096,13 @@ static void blk_root_drained_end(BdrvChild *child)
         }
     }
 }
+
+void blk_register_buf(BlockBackend *blk, void *host, size_t size)
+{
+    bdrv_register_buf(blk_bs(blk), host, size);
+}
+
+void blk_unregister_buf(BlockBackend *blk, void *host)
+{
+    bdrv_unregister_buf(blk_bs(blk), host);
+}

diff --git a/block/io.c b/block/io.c
index 7ea402352e..89d0745e95 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2825,3 +2825,27 @@ void bdrv_io_unplug(BlockDriverState *bs)
         bdrv_io_unplug(child->bs);
     }
 }
+
+void bdrv_register_buf(BlockDriverState *bs, void *host, size_t size)
+{
+    BdrvChild *child;
+
+    if (bs->drv && bs->drv->bdrv_register_buf) {
+        bs->drv->bdrv_register_buf(bs, host, size);
+    }
+    QLIST_FOREACH(child, &bs->children, next) {
+        bdrv_register_buf(child->bs, host, size);
+    }
+}
+
+void bdrv_unregister_buf(BlockDriverState *bs, void *host)
+{
+    BdrvChild *child;
+
+    if (bs->drv && bs->drv->bdrv_unregister_buf) {
+        bs->drv->bdrv_unregister_buf(bs, host);
+    }
+    QLIST_FOREACH(child, &bs->children, next) {
+        bdrv_unregister_buf(child->bs, host);
+    }
+}

diff --git a/include/block/block.h b/include/block/block.h
index 9b12774ddf..2025d7ed19 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -631,5 +631,14 @@ void bdrv_del_child(BlockDriverState *parent, BdrvChild *child, Error **errp);
 
 bool bdrv_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
                                      uint32_t granularity, Error **errp);
-
+/**
+ *
+ *
bdrv_register_buf/bdrv_unregister_buf:
+ *
+ * Register/unregister a buffer for I/O. For example, VFIO drivers want to
+ * know the memory areas that will later be used for I/O, so that they can
+ * prepare IOMMU mappings etc. for better performance.
+ */
+void bdrv_register_buf(BlockDriverState *bs, void *host, size_t size);
+void bdrv_unregister_buf(BlockDriverState *bs, void *host);
 #endif

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 29cafa4236..99b9190627 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -446,6 +446,15 @@ struct BlockDriver {
                             const char *name, Error **errp);
 
+    /**
+     * Register/unregister a buffer for I/O. For example, the driver may want
+     * to know the memory areas that will later be used in iovs, so that it
+     * can set up IOMMU mappings with VFIO etc. for better performance. In
+     * the case of VFIO drivers, this callback is used to do DMA mapping for
+     * hot buffers.
+     */
+    void (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size);
+    void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host);
     QLIST_ENTRY(BlockDriver) list;
 };
 
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index c4e52a5fa3..92ab624fac 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -229,4 +229,7 @@ void blk_io_limits_enable(BlockBackend *blk, const char *group);
 void blk_io_limits_update_group(BlockBackend *blk, const char *group);
 void blk_set_force_allow_inactivate(BlockBackend *blk);
 
+void blk_register_buf(BlockBackend *blk, void *host, size_t size);
+void blk_unregister_buf(BlockBackend *blk, void *host);
+
 #endif
-- 
2.14.3

From nobody Sat Apr 27 00:47:14 2024
From: Fam Zheng
To: qemu-devel@nongnu.org
Date: Fri, 12 Jan 2018 16:55:51 +0800
Message-Id: <20180112085555.14447-6-famz@redhat.com>
In-Reply-To: <20180112085555.14447-1-famz@redhat.com>
References: <20180112085555.14447-1-famz@redhat.com>
Subject: [Qemu-devel] [PATCH v5 5/9] block/nvme: Implement .bdrv_(un)register_buf

Forward these two calls to the IOVA manager.

Signed-off-by: Fam Zheng
Message-Id: <20180110091846.10699-6-famz@redhat.com>
Reviewed-by: Stefan Hajnoczi
---
 block/nvme.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/block/nvme.c b/block/nvme.c
index 97ab01686f..a786f9e094 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1128,6 +1128,27 @@ static void nvme_aio_unplug(BlockDriverState *bs)
     }
 }
 
+static void nvme_register_buf(BlockDriverState *bs, void *host, size_t size)
+{
+    int ret;
+    BDRVNVMeState *s = bs->opaque;
+
+    ret = qemu_vfio_dma_map(s->vfio, host, size, false, NULL);
+    if (ret) {
+        /* FIXME: we may run out of IOVA addresses after repeated
+         * bdrv_register_buf/bdrv_unregister_buf, because nvme_vfio_dma_unmap
+         * doesn't reclaim addresses for fixed mappings.
*/
+        error_report("nvme_register_buf failed: %s", strerror(-ret));
+    }
+}
+
+static void nvme_unregister_buf(BlockDriverState *bs, void *host)
+{
+    BDRVNVMeState *s = bs->opaque;
+
+    qemu_vfio_dma_unmap(s->vfio, host);
+}
+
 static BlockDriver bdrv_nvme = {
     .format_name              = "nvme",
     .protocol_name            = "nvme",
@@ -1153,6 +1174,9 @@ static BlockDriver bdrv_nvme = {
 
     .bdrv_io_plug             = nvme_aio_plug,
     .bdrv_io_unplug           = nvme_aio_unplug,
+
+    .bdrv_register_buf        = nvme_register_buf,
+    .bdrv_unregister_buf      = nvme_unregister_buf,
 };
 
 static void bdrv_nvme_init(void)
-- 
2.14.3

From nobody Sat Apr 27 00:47:14 2024
From: Fam Zheng
To: qemu-devel@nongnu.org
Date: Fri, 12 Jan 2018 16:55:52 +0800
Message-Id: <20180112085555.14447-7-famz@redhat.com>
In-Reply-To: <20180112085555.14447-1-famz@redhat.com>
References: <20180112085555.14447-1-famz@redhat.com>
Subject: [Qemu-devel] [PATCH v5 6/9] qemu-img: Map bench buffer

Signed-off-by: Fam Zheng
Message-Id: <20180110091846.10699-7-famz@redhat.com>
Reviewed-by: Stefan Hajnoczi
---
 qemu-img.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/qemu-img.c b/qemu-img.c
index 68b375f998..28d0e4e9f8 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3862,6 +3862,7 @@ static int img_bench(int argc, char **argv)
     struct timeval t1,
t2;
     int i;
     bool force_share = false;
+    size_t buf_size;
 
     for (;;) {
         static const struct option long_options[] = {
@@ -4050,9 +4051,12 @@ static int img_bench(int argc, char **argv)
         printf("Sending flush every %d requests\n", flush_interval);
     }
 
-    data.buf = blk_blockalign(blk, data.nrreq * data.bufsize);
+    buf_size = data.nrreq * data.bufsize;
+    data.buf = blk_blockalign(blk, buf_size);
     memset(data.buf, pattern, data.nrreq * data.bufsize);
 
+    blk_register_buf(blk, data.buf, buf_size);
+
     data.qiov = g_new(QEMUIOVector, data.nrreq);
     for (i = 0; i < data.nrreq; i++) {
         qemu_iovec_init(&data.qiov[i], 1);
@@ -4073,6 +4077,9 @@ static int img_bench(int argc, char **argv)
           + ((double)(t2.tv_usec - t1.tv_usec) / 1000000));
 
 out:
+    if (data.buf) {
+        blk_unregister_buf(blk, data.buf);
+    }
     qemu_vfree(data.buf);
     blk_unref(blk);
 
-- 
2.14.3

From nobody Sat Apr 27 00:47:14 2024
From: Fam Zheng
To: qemu-devel@nongnu.org
Date: Fri, 12 Jan 2018 16:55:53 +0800
Message-Id: <20180112085555.14447-8-famz@redhat.com>
In-Reply-To: <20180112085555.14447-1-famz@redhat.com>
References: <20180112085555.14447-1-famz@redhat.com>
Subject: [Qemu-devel] [PATCH v5 7/9] block: Move NVMe constants to a separate header

Signed-off-by: Fam Zheng
Message-Id:
<20180110091846.10699-8-famz@redhat.com>
Reviewed-by: Stefan Hajnoczi
---
 block/nvme.c         |   7 +-
 hw/block/nvme.h      | 698 +------------------------------------------------
 include/block/nvme.h | 700 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 702 insertions(+), 703 deletions(-)
 create mode 100644 include/block/nvme.h

diff --git a/block/nvme.c b/block/nvme.c
index a786f9e094..2a61e50594 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -22,12 +22,7 @@
 #include "block/block_int.h"
 #include "trace.h"
 
-/* TODO: Move nvme spec definitions from hw/block/nvme.h into a separate file
- * that doesn't depend on dma/pci headers. */
-#include "sysemu/dma.h"
-#include "hw/pci/pci.h"
-#include "hw/block/block.h"
-#include "hw/block/nvme.h"
+#include "block/nvme.h"
 
 #define NVME_SQ_ENTRY_BYTES 64
 #define NVME_CQ_ENTRY_BYTES 16

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 6aab338ff5..59a1504018 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -1,703 +1,7 @@
 #ifndef HW_NVME_H
 #define HW_NVME_H
 #include "qemu/cutils.h"
-
-typedef struct NvmeBar {
-    uint64_t cap;
-    uint32_t vs;
-    uint32_t intms;
-    uint32_t intmc;
-    uint32_t cc;
-    uint32_t rsvd1;
-    uint32_t csts;
-    uint32_t nssrc;
-    uint32_t aqa;
-    uint64_t asq;
-    uint64_t acq;
-    uint32_t cmbloc;
-    uint32_t cmbsz;
-} NvmeBar;
-
-enum NvmeCapShift {
-    CAP_MQES_SHIFT     = 0,
-    CAP_CQR_SHIFT      = 16,
-    CAP_AMS_SHIFT      = 17,
-    CAP_TO_SHIFT       = 24,
-    CAP_DSTRD_SHIFT    = 32,
-    CAP_NSSRS_SHIFT    = 33,
-    CAP_CSS_SHIFT      = 37,
-    CAP_MPSMIN_SHIFT   = 48,
-    CAP_MPSMAX_SHIFT   = 52,
-};
-
-enum NvmeCapMask {
-    CAP_MQES_MASK      = 0xffff,
-    CAP_CQR_MASK       = 0x1,
-    CAP_AMS_MASK       = 0x3,
-    CAP_TO_MASK        = 0xff,
-    CAP_DSTRD_MASK     = 0xf,
-    CAP_NSSRS_MASK     = 0x1,
-    CAP_CSS_MASK       = 0xff,
-    CAP_MPSMIN_MASK    = 0xf,
-    CAP_MPSMAX_MASK    = 0xf,
-};
-
-#define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
-#define NVME_CAP_CQR(cap)   (((cap) >> CAP_CQR_SHIFT)    & CAP_CQR_MASK)
-#define NVME_CAP_AMS(cap)   (((cap) >> CAP_AMS_SHIFT)    & CAP_AMS_MASK)
-#define NVME_CAP_TO(cap)    (((cap) >> CAP_TO_SHIFT)     & CAP_TO_MASK)
-#define NVME_CAP_DSTRD(cap) (((cap) >> CAP_DSTRD_SHIFT)  & CAP_DSTRD_MASK)
-#define NVME_CAP_NSSRS(cap) (((cap) >> CAP_NSSRS_SHIFT)  & CAP_NSSRS_MASK)
-#define NVME_CAP_CSS(cap)   (((cap) >> CAP_CSS_SHIFT)    & CAP_CSS_MASK)
-#define NVME_CAP_MPSMIN(cap)(((cap) >> CAP_MPSMIN_SHIFT) & CAP_MPSMIN_MASK)
-#define NVME_CAP_MPSMAX(cap)(((cap) >> CAP_MPSMAX_SHIFT) & CAP_MPSMAX_MASK)
-
-#define NVME_CAP_SET_MQES(cap, val)   (cap |= (uint64_t)(val & CAP_MQES_MASK) \
-                                                           << CAP_MQES_SHIFT)
-#define NVME_CAP_SET_CQR(cap, val)    (cap |= (uint64_t)(val & CAP_CQR_MASK) \
-                                                           << CAP_CQR_SHIFT)
-#define NVME_CAP_SET_AMS(cap, val)    (cap |= (uint64_t)(val & CAP_AMS_MASK) \
-                                                           << CAP_AMS_SHIFT)
-#define NVME_CAP_SET_TO(cap, val)     (cap |= (uint64_t)(val & CAP_TO_MASK) \
-                                                           << CAP_TO_SHIFT)
-#define NVME_CAP_SET_DSTRD(cap, val)  (cap |= (uint64_t)(val & CAP_DSTRD_MASK) \
-                                                           << CAP_DSTRD_SHIFT)
-#define NVME_CAP_SET_NSSRS(cap, val)  (cap |= (uint64_t)(val & CAP_NSSRS_MASK) \
-                                                           << CAP_NSSRS_SHIFT)
-#define NVME_CAP_SET_CSS(cap, val)    (cap |= (uint64_t)(val & CAP_CSS_MASK) \
-                                                           << CAP_CSS_SHIFT)
-#define NVME_CAP_SET_MPSMIN(cap, val) (cap |= (uint64_t)(val & CAP_MPSMIN_MASK)\
                                                            << CAP_MPSMIN_SHIFT)
-#define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & CAP_MPSMAX_MASK)\
                                                            << CAP_MPSMAX_SHIFT)
-
-enum NvmeCcShift {
-    CC_EN_SHIFT     = 0,
-    CC_CSS_SHIFT    = 4,
-    CC_MPS_SHIFT    = 7,
-    CC_AMS_SHIFT    = 11,
-    CC_SHN_SHIFT    = 14,
-    CC_IOSQES_SHIFT = 16,
-    CC_IOCQES_SHIFT = 20,
-};
-
-enum NvmeCcMask {
-    CC_EN_MASK     = 0x1,
-    CC_CSS_MASK    = 0x7,
-    CC_MPS_MASK    = 0xf,
-    CC_AMS_MASK    = 0x7,
-    CC_SHN_MASK    = 0x3,
-    CC_IOSQES_MASK = 0xf,
-    CC_IOCQES_MASK = 0xf,
-};
-
-#define NVME_CC_EN(cc)     ((cc >> CC_EN_SHIFT)     & CC_EN_MASK)
-#define NVME_CC_CSS(cc)    ((cc >> CC_CSS_SHIFT)    & CC_CSS_MASK)
-#define NVME_CC_MPS(cc)    ((cc >> CC_MPS_SHIFT)    & CC_MPS_MASK)
-#define NVME_CC_AMS(cc)    ((cc >> CC_AMS_SHIFT)    & CC_AMS_MASK)
-#define NVME_CC_SHN(cc)    ((cc >> CC_SHN_SHIFT)    & CC_SHN_MASK)
-#define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
-#define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
-
-enum NvmeCstsShift {
-    CSTS_RDY_SHIFT   = 0,
-    CSTS_CFS_SHIFT   = 1,
-    CSTS_SHST_SHIFT  = 2,
-    CSTS_NSSRO_SHIFT = 4,
-};
-
-enum NvmeCstsMask {
-    CSTS_RDY_MASK   = 0x1,
-    CSTS_CFS_MASK   = 0x1,
-    CSTS_SHST_MASK  = 0x3,
-    CSTS_NSSRO_MASK = 0x1,
-};
-
-enum NvmeCsts {
-    NVME_CSTS_READY         = 1 << CSTS_RDY_SHIFT,
-    NVME_CSTS_FAILED        = 1 << CSTS_CFS_SHIFT,
-    NVME_CSTS_SHST_NORMAL   = 0 << CSTS_SHST_SHIFT,
-    NVME_CSTS_SHST_PROGRESS = 1 << CSTS_SHST_SHIFT,
-    NVME_CSTS_SHST_COMPLETE = 2 << CSTS_SHST_SHIFT,
-    NVME_CSTS_NSSRO         = 1 << CSTS_NSSRO_SHIFT,
-};
-
-#define NVME_CSTS_RDY(csts)   ((csts >> CSTS_RDY_SHIFT)   & CSTS_RDY_MASK)
-#define NVME_CSTS_CFS(csts)   ((csts >> CSTS_CFS_SHIFT)   & CSTS_CFS_MASK)
-#define NVME_CSTS_SHST(csts)  ((csts >> CSTS_SHST_SHIFT)  & CSTS_SHST_MASK)
-#define NVME_CSTS_NSSRO(csts) ((csts >> CSTS_NSSRO_SHIFT) & CSTS_NSSRO_MASK)
-
-enum NvmeAqaShift {
-    AQA_ASQS_SHIFT = 0,
-    AQA_ACQS_SHIFT = 16,
-};
-
-enum NvmeAqaMask {
-    AQA_ASQS_MASK = 0xfff,
-    AQA_ACQS_MASK = 0xfff,
-};
-
-#define NVME_AQA_ASQS(aqa) ((aqa >> AQA_ASQS_SHIFT) & AQA_ASQS_MASK)
-#define NVME_AQA_ACQS(aqa) ((aqa >> AQA_ACQS_SHIFT) & AQA_ACQS_MASK)
-
-enum NvmeCmblocShift {
-    CMBLOC_BIR_SHIFT  = 0,
-    CMBLOC_OFST_SHIFT = 12,
-};
-
-enum NvmeCmblocMask {
-    CMBLOC_BIR_MASK  = 0x7,
-    CMBLOC_OFST_MASK = 0xfffff,
-};
-
-#define NVME_CMBLOC_BIR(cmbloc) ((cmbloc >> CMBLOC_BIR_SHIFT) & \
-                                 CMBLOC_BIR_MASK)
-#define NVME_CMBLOC_OFST(cmbloc)((cmbloc >> CMBLOC_OFST_SHIFT) & \
-                                 CMBLOC_OFST_MASK)
-
-#define NVME_CMBLOC_SET_BIR(cmbloc, val) \
-    (cmbloc |= (uint64_t)(val & CMBLOC_BIR_MASK) << CMBLOC_BIR_SHIFT)
-#define NVME_CMBLOC_SET_OFST(cmbloc, val) \
-    (cmbloc |= (uint64_t)(val &
CMBLOC_OFST_MASK) << CMBLOC_OFST_SHIFT)
-
-enum NvmeCmbszShift {
-    CMBSZ_SQS_SHIFT   = 0,
-    CMBSZ_CQS_SHIFT   = 1,
-    CMBSZ_LISTS_SHIFT = 2,
-    CMBSZ_RDS_SHIFT   = 3,
-    CMBSZ_WDS_SHIFT   = 4,
-    CMBSZ_SZU_SHIFT   = 8,
-    CMBSZ_SZ_SHIFT    = 12,
-};
-
-enum NvmeCmbszMask {
-    CMBSZ_SQS_MASK   = 0x1,
-    CMBSZ_CQS_MASK   = 0x1,
-    CMBSZ_LISTS_MASK = 0x1,
-    CMBSZ_RDS_MASK   = 0x1,
-    CMBSZ_WDS_MASK   = 0x1,
-    CMBSZ_SZU_MASK   = 0xf,
-    CMBSZ_SZ_MASK    = 0xfffff,
-};
-
-#define NVME_CMBSZ_SQS(cmbsz)  ((cmbsz >> CMBSZ_SQS_SHIFT)   & CMBSZ_SQS_MASK)
-#define NVME_CMBSZ_CQS(cmbsz)  ((cmbsz >> CMBSZ_CQS_SHIFT)   & CMBSZ_CQS_MASK)
-#define NVME_CMBSZ_LISTS(cmbsz)((cmbsz >> CMBSZ_LISTS_SHIFT) & CMBSZ_LISTS_MASK)
-#define NVME_CMBSZ_RDS(cmbsz)  ((cmbsz >> CMBSZ_RDS_SHIFT)   & CMBSZ_RDS_MASK)
-#define NVME_CMBSZ_WDS(cmbsz)  ((cmbsz >> CMBSZ_WDS_SHIFT)   & CMBSZ_WDS_MASK)
-#define NVME_CMBSZ_SZU(cmbsz)  ((cmbsz >> CMBSZ_SZU_SHIFT)   & CMBSZ_SZU_MASK)
-#define NVME_CMBSZ_SZ(cmbsz)   ((cmbsz >> CMBSZ_SZ_SHIFT)    & CMBSZ_SZ_MASK)
-
-#define NVME_CMBSZ_SET_SQS(cmbsz, val) \
-    (cmbsz |= (uint64_t)(val & CMBSZ_SQS_MASK) << CMBSZ_SQS_SHIFT)
-#define NVME_CMBSZ_SET_CQS(cmbsz, val) \
-    (cmbsz |= (uint64_t)(val & CMBSZ_CQS_MASK) << CMBSZ_CQS_SHIFT)
-#define NVME_CMBSZ_SET_LISTS(cmbsz, val) \
-    (cmbsz |= (uint64_t)(val & CMBSZ_LISTS_MASK) << CMBSZ_LISTS_SHIFT)
-#define NVME_CMBSZ_SET_RDS(cmbsz, val) \
-    (cmbsz |= (uint64_t)(val & CMBSZ_RDS_MASK) << CMBSZ_RDS_SHIFT)
-#define NVME_CMBSZ_SET_WDS(cmbsz, val) \
-    (cmbsz |= (uint64_t)(val & CMBSZ_WDS_MASK) << CMBSZ_WDS_SHIFT)
-#define NVME_CMBSZ_SET_SZU(cmbsz, val) \
-    (cmbsz |= (uint64_t)(val & CMBSZ_SZU_MASK) << CMBSZ_SZU_SHIFT)
-#define NVME_CMBSZ_SET_SZ(cmbsz, val) \
-    (cmbsz |= (uint64_t)(val & CMBSZ_SZ_MASK) << CMBSZ_SZ_SHIFT)
-
-#define NVME_CMBSZ_GETSIZE(cmbsz) \
-    (NVME_CMBSZ_SZ(cmbsz) * (1 << (12 + 4 * NVME_CMBSZ_SZU(cmbsz))))
-
-typedef struct NvmeCmd {
-    uint8_t  opcode;
-    uint8_t  fuse;
-    uint16_t cid;
-    uint32_t nsid;
-    uint64_t res1;
-    uint64_t mptr;
-    uint64_t prp1;
-    uint64_t prp2;
-    uint32_t cdw10;
-    uint32_t cdw11;
-    uint32_t cdw12;
-    uint32_t cdw13;
-    uint32_t cdw14;
-    uint32_t cdw15;
-} NvmeCmd;
-
-enum NvmeAdminCommands {
-    NVME_ADM_CMD_DELETE_SQ      = 0x00,
-    NVME_ADM_CMD_CREATE_SQ      = 0x01,
-    NVME_ADM_CMD_GET_LOG_PAGE   = 0x02,
-    NVME_ADM_CMD_DELETE_CQ      = 0x04,
-    NVME_ADM_CMD_CREATE_CQ      = 0x05,
-    NVME_ADM_CMD_IDENTIFY       = 0x06,
-    NVME_ADM_CMD_ABORT          = 0x08,
-    NVME_ADM_CMD_SET_FEATURES   = 0x09,
-    NVME_ADM_CMD_GET_FEATURES   = 0x0a,
-    NVME_ADM_CMD_ASYNC_EV_REQ   = 0x0c,
-    NVME_ADM_CMD_ACTIVATE_FW    = 0x10,
-    NVME_ADM_CMD_DOWNLOAD_FW    = 0x11,
-    NVME_ADM_CMD_FORMAT_NVM     = 0x80,
-    NVME_ADM_CMD_SECURITY_SEND  = 0x81,
-    NVME_ADM_CMD_SECURITY_RECV  = 0x82,
-};
-
-enum NvmeIoCommands {
-    NVME_CMD_FLUSH        = 0x00,
-    NVME_CMD_WRITE        = 0x01,
-    NVME_CMD_READ         = 0x02,
-    NVME_CMD_WRITE_UNCOR  = 0x04,
-    NVME_CMD_COMPARE      = 0x05,
-    NVME_CMD_WRITE_ZEROS  = 0x08,
-    NVME_CMD_DSM          = 0x09,
-};
-
-typedef struct NvmeDeleteQ {
-    uint8_t  opcode;
-    uint8_t  flags;
-    uint16_t cid;
-    uint32_t rsvd1[9];
-    uint16_t qid;
-    uint16_t rsvd10;
-    uint32_t rsvd11[5];
-} NvmeDeleteQ;
-
-typedef struct NvmeCreateCq {
-    uint8_t  opcode;
-    uint8_t  flags;
-    uint16_t cid;
-    uint32_t rsvd1[5];
-    uint64_t prp1;
-    uint64_t rsvd8;
-    uint16_t cqid;
-    uint16_t qsize;
-    uint16_t cq_flags;
-    uint16_t irq_vector;
-    uint32_t rsvd12[4];
-} NvmeCreateCq;
-
-#define NVME_CQ_FLAGS_PC(cq_flags)  (cq_flags & 0x1)
-#define NVME_CQ_FLAGS_IEN(cq_flags) ((cq_flags >> 1) & 0x1)
-
-typedef struct NvmeCreateSq {
-    uint8_t  opcode;
-    uint8_t  flags;
-    uint16_t cid;
-    uint32_t rsvd1[5];
-    uint64_t prp1;
-    uint64_t rsvd8;
-    uint16_t sqid;
-    uint16_t qsize;
-    uint16_t sq_flags;
-    uint16_t cqid;
-    uint32_t rsvd12[4];
-} NvmeCreateSq;
-
-#define NVME_SQ_FLAGS_PC(sq_flags)    (sq_flags & 0x1)
-#define NVME_SQ_FLAGS_QPRIO(sq_flags) ((sq_flags >> 1) & 0x3)
-
-enum NvmeQueueFlags {
-    NVME_Q_PC           = 1,
-    NVME_Q_PRIO_URGENT  = 0,
-    NVME_Q_PRIO_HIGH    = 1,
-    NVME_Q_PRIO_NORMAL  = 2,
-    NVME_Q_PRIO_LOW     = 3,
-};
-
-typedef struct NvmeIdentify {
-    uint8_t  opcode;
-    uint8_t  flags;
-    uint16_t cid;
-    uint32_t nsid;
-    uint64_t rsvd2[2];
-    uint64_t prp1;
-    uint64_t prp2;
-    uint32_t cns;
-    uint32_t rsvd11[5];
-} NvmeIdentify;
-
-typedef struct NvmeRwCmd {
-    uint8_t  opcode;
-    uint8_t  flags;
-    uint16_t cid;
-    uint32_t nsid;
-    uint64_t rsvd2;
-    uint64_t mptr;
-    uint64_t prp1;
-    uint64_t prp2;
-    uint64_t slba;
-    uint16_t nlb;
-    uint16_t control;
-    uint32_t dsmgmt;
-    uint32_t reftag;
-    uint16_t apptag;
-    uint16_t appmask;
-} NvmeRwCmd;
-
-enum {
-    NVME_RW_LR                 = 1 << 15,
-    NVME_RW_FUA                = 1 << 14,
-    NVME_RW_DSM_FREQ_UNSPEC    = 0,
-    NVME_RW_DSM_FREQ_TYPICAL   = 1,
-    NVME_RW_DSM_FREQ_RARE      = 2,
-    NVME_RW_DSM_FREQ_READS     = 3,
-    NVME_RW_DSM_FREQ_WRITES    = 4,
-    NVME_RW_DSM_FREQ_RW        = 5,
-    NVME_RW_DSM_FREQ_ONCE      = 6,
-    NVME_RW_DSM_FREQ_PREFETCH  = 7,
-    NVME_RW_DSM_FREQ_TEMP      = 8,
-    NVME_RW_DSM_LATENCY_NONE   = 0 << 4,
-    NVME_RW_DSM_LATENCY_IDLE   = 1 << 4,
-    NVME_RW_DSM_LATENCY_NORM   = 2 << 4,
-    NVME_RW_DSM_LATENCY_LOW    = 3 << 4,
-    NVME_RW_DSM_SEQ_REQ        = 1 << 6,
-    NVME_RW_DSM_COMPRESSED     = 1 << 7,
-    NVME_RW_PRINFO_PRACT       = 1 << 13,
-    NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
-    NVME_RW_PRINFO_PRCHK_APP   = 1 << 11,
-    NVME_RW_PRINFO_PRCHK_REF   = 1 << 10,
-};
-
-typedef struct NvmeDsmCmd {
-    uint8_t  opcode;
-    uint8_t  flags;
-    uint16_t cid;
-    uint32_t nsid;
-    uint64_t rsvd2[2];
-    uint64_t prp1;
-    uint64_t prp2;
-    uint32_t nr;
-    uint32_t attributes;
-    uint32_t rsvd12[4];
-} NvmeDsmCmd;
-
-enum {
-    NVME_DSMGMT_IDR = 1 << 0,
-    NVME_DSMGMT_IDW = 1 << 1,
-    NVME_DSMGMT_AD  = 1 << 2,
-};
-
-typedef struct NvmeDsmRange {
-    uint32_t cattr;
-    uint32_t nlb;
-    uint64_t slba;
-} NvmeDsmRange;
-
-enum NvmeAsyncEventRequest {
-    NVME_AER_TYPE_ERROR                   = 0,
-    NVME_AER_TYPE_SMART                   = 1,
-    NVME_AER_TYPE_IO_SPECIFIC             = 6,
-    NVME_AER_TYPE_VENDOR_SPECIFIC         = 7,
-    NVME_AER_INFO_ERR_INVALID_SQ          = 0,
-    NVME_AER_INFO_ERR_INVALID_DB          = 1,
-
    NVME_AER_INFO_ERR_DIAG_FAIL           = 2,
-    NVME_AER_INFO_ERR_PERS_INTERNAL_ERR   = 3,
-    NVME_AER_INFO_ERR_TRANS_INTERNAL_ERR  = 4,
-    NVME_AER_INFO_ERR_FW_IMG_LOAD_ERR     = 5,
-    NVME_AER_INFO_SMART_RELIABILITY       = 0,
-    NVME_AER_INFO_SMART_TEMP_THRESH       = 1,
-    NVME_AER_INFO_SMART_SPARE_THRESH      = 2,
-};
-
-typedef struct NvmeAerResult {
-    uint8_t event_type;
-    uint8_t event_info;
-    uint8_t log_page;
-    uint8_t resv;
-} NvmeAerResult;
-
-typedef struct NvmeCqe {
-    uint32_t result;
-    uint32_t rsvd;
-    uint16_t sq_head;
-    uint16_t sq_id;
-    uint16_t cid;
-    uint16_t status;
-} NvmeCqe;
-
-enum NvmeStatusCodes {
-    NVME_SUCCESS                = 0x0000,
-    NVME_INVALID_OPCODE         = 0x0001,
-    NVME_INVALID_FIELD          = 0x0002,
-    NVME_CID_CONFLICT           = 0x0003,
-    NVME_DATA_TRAS_ERROR        = 0x0004,
-    NVME_POWER_LOSS_ABORT       = 0x0005,
-    NVME_INTERNAL_DEV_ERROR     = 0x0006,
-    NVME_CMD_ABORT_REQ          = 0x0007,
-    NVME_CMD_ABORT_SQ_DEL       = 0x0008,
-    NVME_CMD_ABORT_FAILED_FUSE  = 0x0009,
-    NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
-    NVME_INVALID_NSID           = 0x000b,
-    NVME_CMD_SEQ_ERROR          = 0x000c,
-    NVME_LBA_RANGE              = 0x0080,
-    NVME_CAP_EXCEEDED           = 0x0081,
-    NVME_NS_NOT_READY           = 0x0082,
-    NVME_NS_RESV_CONFLICT       = 0x0083,
-    NVME_INVALID_CQID           = 0x0100,
-    NVME_INVALID_QID            = 0x0101,
-    NVME_MAX_QSIZE_EXCEEDED     = 0x0102,
-    NVME_ACL_EXCEEDED           = 0x0103,
-    NVME_RESERVED               = 0x0104,
-    NVME_AER_LIMIT_EXCEEDED     = 0x0105,
-    NVME_INVALID_FW_SLOT        = 0x0106,
-    NVME_INVALID_FW_IMAGE       = 0x0107,
-    NVME_INVALID_IRQ_VECTOR     = 0x0108,
-    NVME_INVALID_LOG_ID         = 0x0109,
-    NVME_INVALID_FORMAT         = 0x010a,
-    NVME_FW_REQ_RESET           = 0x010b,
-    NVME_INVALID_QUEUE_DEL      = 0x010c,
-    NVME_FID_NOT_SAVEABLE       = 0x010d,
-    NVME_FID_NOT_NSID_SPEC      = 0x010f,
-    NVME_FW_REQ_SUSYSTEM_RESET  = 0x0110,
-    NVME_CONFLICTING_ATTRS      = 0x0180,
-    NVME_INVALID_PROT_INFO      = 0x0181,
-    NVME_WRITE_TO_RO            = 0x0182,
-    NVME_WRITE_FAULT            = 0x0280,
-    NVME_UNRECOVERED_READ       = 0x0281,
-    NVME_E2E_GUARD_ERROR        = 0x0282,
-    NVME_E2E_APP_ERROR          = 0x0283,
-    NVME_E2E_REF_ERROR          = 0x0284,
-    NVME_CMP_FAILURE            = 0x0285,
-    NVME_ACCESS_DENIED          = 0x0286,
-    NVME_MORE                   = 0x2000,
-    NVME_DNR                    = 0x4000,
-    NVME_NO_COMPLETE            = 0xffff,
-};
-
-typedef struct NvmeFwSlotInfoLog {
-    uint8_t afi;
-    uint8_t reserved1[7];
-    uint8_t frs1[8];
-    uint8_t frs2[8];
-    uint8_t frs3[8];
-    uint8_t frs4[8];
-    uint8_t frs5[8];
-    uint8_t frs6[8];
-    uint8_t frs7[8];
-    uint8_t reserved2[448];
-} NvmeFwSlotInfoLog;
-
-typedef struct NvmeErrorLog {
-    uint64_t error_count;
-    uint16_t sqid;
-    uint16_t cid;
-    uint16_t status_field;
-    uint16_t param_error_location;
-    uint64_t lba;
-    uint32_t nsid;
-    uint8_t  vs;
-    uint8_t  resv[35];
-} NvmeErrorLog;
-
-typedef struct NvmeSmartLog {
-    uint8_t  critical_warning;
-    uint8_t  temperature[2];
-    uint8_t  available_spare;
-    uint8_t  available_spare_threshold;
-    uint8_t  percentage_used;
-    uint8_t  reserved1[26];
-    uint64_t data_units_read[2];
-    uint64_t data_units_written[2];
-    uint64_t host_read_commands[2];
-    uint64_t host_write_commands[2];
-    uint64_t controller_busy_time[2];
-    uint64_t power_cycles[2];
-    uint64_t power_on_hours[2];
-    uint64_t unsafe_shutdowns[2];
-    uint64_t media_errors[2];
-    uint64_t number_of_error_log_entries[2];
-    uint8_t  reserved2[320];
-} NvmeSmartLog;
-
-enum NvmeSmartWarn {
-    NVME_SMART_SPARE                  = 1 << 0,
-    NVME_SMART_TEMPERATURE            = 1 << 1,
-    NVME_SMART_RELIABILITY            = 1 << 2,
-    NVME_SMART_MEDIA_READ_ONLY        = 1 << 3,
-    NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
-};
-
-enum LogIdentifier {
-    NVME_LOG_ERROR_INFO     = 0x01,
-    NVME_LOG_SMART_INFO     = 0x02,
-    NVME_LOG_FW_SLOT_INFO   = 0x03,
-};
-
-typedef struct NvmePSD {
-    uint16_t mp;
-    uint16_t reserved;
-    uint32_t enlat;
-    uint32_t exlat;
-    uint8_t  rrt;
-    uint8_t  rrl;
-    uint8_t  rwt;
-    uint8_t  rwl;
-    uint8_t  resv[16];
-} NvmePSD;
-
-typedef struct NvmeIdCtrl {
-    uint16_t vid;
-    uint16_t ssvid;
-    uint8_t  sn[20];
-    uint8_t  mn[40];
-    uint8_t  fr[8];
-    uint8_t  rab;
-    uint8_t  ieee[3];
-    uint8_t  cmic;
-    uint8_t  mdts;
-    uint8_t  rsvd255[178];
-    uint16_t oacs;
-    uint8_t  acl;
-    uint8_t  aerl;
-    uint8_t  frmw;
-    uint8_t  lpa;
-    uint8_t  elpe;
-    uint8_t  npss;
-    uint8_t  rsvd511[248];
-    uint8_t  sqes;
-    uint8_t  cqes;
-    uint16_t rsvd515;
-    uint32_t nn;
-    uint16_t oncs;
-    uint16_t fuses;
-    uint8_t  fna;
-    uint8_t  vwc;
-    uint16_t awun;
-    uint16_t awupf;
-    uint8_t  rsvd703[174];
-    uint8_t  rsvd2047[1344];
-    NvmePSD  psd[32];
-    uint8_t  vs[1024];
-} NvmeIdCtrl;
-
-enum NvmeIdCtrlOacs {
-    NVME_OACS_SECURITY  = 1 << 0,
-    NVME_OACS_FORMAT    = 1 << 1,
-    NVME_OACS_FW        = 1 << 2,
-};
-
-enum NvmeIdCtrlOncs {
-    NVME_ONCS_COMPARE       = 1 << 0,
-    NVME_ONCS_WRITE_UNCORR  = 1 << 1,
-    NVME_ONCS_DSM           = 1 << 2,
-    NVME_ONCS_WRITE_ZEROS   = 1 << 3,
-    NVME_ONCS_FEATURES      = 1 << 4,
-    NVME_ONCS_RESRVATIONS   = 1 << 5,
-};
-
-#define NVME_CTRL_SQES_MIN(sqes) ((sqes) & 0xf)
-#define NVME_CTRL_SQES_MAX(sqes) (((sqes) >> 4) & 0xf)
-#define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf)
-#define NVME_CTRL_CQES_MAX(cqes) (((cqes) >> 4) & 0xf)
-
-typedef struct NvmeFeatureVal {
-    uint32_t arbitration;
-    uint32_t power_mgmt;
-    uint32_t temp_thresh;
-    uint32_t err_rec;
-    uint32_t volatile_wc;
-    uint32_t num_queues;
-    uint32_t int_coalescing;
-    uint32_t *int_vector_config;
-    uint32_t write_atomicity;
-    uint32_t async_config;
-    uint32_t sw_prog_marker;
-} NvmeFeatureVal;
-
-#define NVME_ARB_AB(arb)     (arb & 0x7)
-#define NVME_ARB_LPW(arb)    ((arb >> 8) & 0xff)
-#define NVME_ARB_MPW(arb)    ((arb >> 16) & 0xff)
-#define NVME_ARB_HPW(arb)    ((arb >> 24) & 0xff)
-
-#define NVME_INTC_THR(intc)  (intc & 0xff)
-#define NVME_INTC_TIME(intc) ((intc >> 8) & 0xff)
-
-enum NvmeFeatureIds {
-    NVME_ARBITRATION              = 0x1,
-    NVME_POWER_MANAGEMENT         = 0x2,
-    NVME_LBA_RANGE_TYPE           = 0x3,
-    NVME_TEMPERATURE_THRESHOLD    = 0x4,
-    NVME_ERROR_RECOVERY           = 0x5,
-    NVME_VOLATILE_WRITE_CACHE     = 0x6,
-    NVME_NUMBER_OF_QUEUES         = 0x7,
-    NVME_INTERRUPT_COALESCING     = 0x8,
-    NVME_INTERRUPT_VECTOR_CONF    = 0x9,
-    NVME_WRITE_ATOMICITY          = 0xa,
-    NVME_ASYNCHRONOUS_EVENT_CONF  = 0xb,
-
NVME_SOFTWARE_PROGRESS_MARKER =3D 0x80 -}; - -typedef struct NvmeRangeType { - uint8_t type; - uint8_t attributes; - uint8_t rsvd2[14]; - uint64_t slba; - uint64_t nlb; - uint8_t guid[16]; - uint8_t rsvd48[16]; -} NvmeRangeType; - -typedef struct NvmeLBAF { - uint16_t ms; - uint8_t ds; - uint8_t rp; -} NvmeLBAF; - -typedef struct NvmeIdNs { - uint64_t nsze; - uint64_t ncap; - uint64_t nuse; - uint8_t nsfeat; - uint8_t nlbaf; - uint8_t flbas; - uint8_t mc; - uint8_t dpc; - uint8_t dps; - uint8_t res30[98]; - NvmeLBAF lbaf[16]; - uint8_t res192[192]; - uint8_t vs[3712]; -} NvmeIdNs; - -#define NVME_ID_NS_NSFEAT_THIN(nsfeat) ((nsfeat & 0x1)) -#define NVME_ID_NS_FLBAS_EXTENDED(flbas) ((flbas >> 4) & 0x1) -#define NVME_ID_NS_FLBAS_INDEX(flbas) ((flbas & 0xf)) -#define NVME_ID_NS_MC_SEPARATE(mc) ((mc >> 1) & 0x1) -#define NVME_ID_NS_MC_EXTENDED(mc) ((mc & 0x1)) -#define NVME_ID_NS_DPC_LAST_EIGHT(dpc) ((dpc >> 4) & 0x1) -#define NVME_ID_NS_DPC_FIRST_EIGHT(dpc) ((dpc >> 3) & 0x1) -#define NVME_ID_NS_DPC_TYPE_3(dpc) ((dpc >> 2) & 0x1) -#define NVME_ID_NS_DPC_TYPE_2(dpc) ((dpc >> 1) & 0x1) -#define NVME_ID_NS_DPC_TYPE_1(dpc) ((dpc & 0x1)) -#define NVME_ID_NS_DPC_TYPE_MASK 0x7 - -enum NvmeIdNsDps { - DPS_TYPE_NONE =3D 0, - DPS_TYPE_1 =3D 1, - DPS_TYPE_2 =3D 2, - DPS_TYPE_3 =3D 3, - DPS_TYPE_MASK =3D 0x7, - DPS_FIRST_EIGHT =3D 8, -}; - -static inline void _nvme_check_size(void) -{ - QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) !=3D 4); - QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) !=3D 16); - QEMU_BUILD_BUG_ON(sizeof(NvmeDsmRange) !=3D 16); - QEMU_BUILD_BUG_ON(sizeof(NvmeCmd) !=3D 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeDeleteQ) !=3D 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeCreateCq) !=3D 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeCreateSq) !=3D 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeIdentify) !=3D 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeRwCmd) !=3D 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeDsmCmd) !=3D 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeRangeType) !=3D 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) !=3D 64); - 
QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) !=3D 512); - QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) !=3D 512); - QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) !=3D 4096); - QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) !=3D 4096); -} +#include "block/nvme.h" =20 typedef struct NvmeAsyncEvent { QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry; diff --git a/include/block/nvme.h b/include/block/nvme.h new file mode 100644 index 0000000000..849a6f3fa3 --- /dev/null +++ b/include/block/nvme.h @@ -0,0 +1,700 @@ +#ifndef BLOCK_NVME_H +#define BLOCK_NVME_H + +typedef struct NvmeBar { + uint64_t cap; + uint32_t vs; + uint32_t intms; + uint32_t intmc; + uint32_t cc; + uint32_t rsvd1; + uint32_t csts; + uint32_t nssrc; + uint32_t aqa; + uint64_t asq; + uint64_t acq; + uint32_t cmbloc; + uint32_t cmbsz; +} NvmeBar; + +enum NvmeCapShift { + CAP_MQES_SHIFT =3D 0, + CAP_CQR_SHIFT =3D 16, + CAP_AMS_SHIFT =3D 17, + CAP_TO_SHIFT =3D 24, + CAP_DSTRD_SHIFT =3D 32, + CAP_NSSRS_SHIFT =3D 33, + CAP_CSS_SHIFT =3D 37, + CAP_MPSMIN_SHIFT =3D 48, + CAP_MPSMAX_SHIFT =3D 52, +}; + +enum NvmeCapMask { + CAP_MQES_MASK =3D 0xffff, + CAP_CQR_MASK =3D 0x1, + CAP_AMS_MASK =3D 0x3, + CAP_TO_MASK =3D 0xff, + CAP_DSTRD_MASK =3D 0xf, + CAP_NSSRS_MASK =3D 0x1, + CAP_CSS_MASK =3D 0xff, + CAP_MPSMIN_MASK =3D 0xf, + CAP_MPSMAX_MASK =3D 0xf, +}; + +#define NVME_CAP_MQES(cap) (((cap) >> CAP_MQES_SHIFT) & CAP_MQES_MASK) +#define NVME_CAP_CQR(cap) (((cap) >> CAP_CQR_SHIFT) & CAP_CQR_MASK) +#define NVME_CAP_AMS(cap) (((cap) >> CAP_AMS_SHIFT) & CAP_AMS_MASK) +#define NVME_CAP_TO(cap) (((cap) >> CAP_TO_SHIFT) & CAP_TO_MASK) +#define NVME_CAP_DSTRD(cap) (((cap) >> CAP_DSTRD_SHIFT) & CAP_DSTRD_MASK) +#define NVME_CAP_NSSRS(cap) (((cap) >> CAP_NSSRS_SHIFT) & CAP_NSSRS_MASK) +#define NVME_CAP_CSS(cap) (((cap) >> CAP_CSS_SHIFT) & CAP_CSS_MASK) +#define NVME_CAP_MPSMIN(cap)(((cap) >> CAP_MPSMIN_SHIFT) & CAP_MPSMIN_MASK) +#define NVME_CAP_MPSMAX(cap)(((cap) >> CAP_MPSMAX_SHIFT) & CAP_MPSMAX_MASK) + +#define NVME_CAP_SET_MQES(cap, val) (cap |=3D 
(uint64_t)(val & CAP_MQES_MASK) \ + << CAP_MQES_SHIFT) +#define NVME_CAP_SET_CQR(cap, val) (cap |= (uint64_t)(val & CAP_CQR_MASK) \ + << CAP_CQR_SHIFT) +#define NVME_CAP_SET_AMS(cap, val) (cap |= (uint64_t)(val & CAP_AMS_MASK) \ + << CAP_AMS_SHIFT) +#define NVME_CAP_SET_TO(cap, val) (cap |= (uint64_t)(val & CAP_TO_MASK) \ + << CAP_TO_SHIFT) +#define NVME_CAP_SET_DSTRD(cap, val) (cap |= (uint64_t)(val & CAP_DSTRD_MASK) \ + << CAP_DSTRD_SHIFT) +#define NVME_CAP_SET_NSSRS(cap, val) (cap |= (uint64_t)(val & CAP_NSSRS_MASK) \ + << CAP_NSSRS_SHIFT) +#define NVME_CAP_SET_CSS(cap, val) (cap |= (uint64_t)(val & CAP_CSS_MASK) \ + << CAP_CSS_SHIFT) +#define NVME_CAP_SET_MPSMIN(cap, val) (cap |= (uint64_t)(val & CAP_MPSMIN_MASK)\ + << CAP_MPSMIN_SHIFT) +#define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & CAP_MPSMAX_MASK)\ + << CAP_MPSMAX_SHIFT) + +enum NvmeCcShift { + CC_EN_SHIFT = 0, + CC_CSS_SHIFT = 4, + CC_MPS_SHIFT = 7, + CC_AMS_SHIFT = 11, + CC_SHN_SHIFT = 14, + CC_IOSQES_SHIFT = 16, + CC_IOCQES_SHIFT = 20, +}; + +enum NvmeCcMask { + CC_EN_MASK = 0x1, + CC_CSS_MASK = 0x7, + CC_MPS_MASK = 0xf, + CC_AMS_MASK = 0x7, + CC_SHN_MASK = 0x3, + CC_IOSQES_MASK = 0xf, + CC_IOCQES_MASK = 0xf, +}; + +#define NVME_CC_EN(cc) ((cc >> CC_EN_SHIFT) & CC_EN_MASK) +#define NVME_CC_CSS(cc) ((cc >> CC_CSS_SHIFT) & CC_CSS_MASK) +#define NVME_CC_MPS(cc) ((cc >> CC_MPS_SHIFT) & CC_MPS_MASK) +#define NVME_CC_AMS(cc) ((cc >> CC_AMS_SHIFT) & CC_AMS_MASK) +#define NVME_CC_SHN(cc) ((cc >> CC_SHN_SHIFT) & CC_SHN_MASK) +#define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK) +#define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK) + +enum NvmeCstsShift { + CSTS_RDY_SHIFT = 0, + CSTS_CFS_SHIFT = 1, + CSTS_SHST_SHIFT = 2, + CSTS_NSSRO_SHIFT = 4, +}; + +enum NvmeCstsMask { + CSTS_RDY_MASK = 0x1, + CSTS_CFS_MASK = 0x1, + CSTS_SHST_MASK = 0x3, + CSTS_NSSRO_MASK = 0x1, +}; + 
+enum NvmeCsts { + NVME_CSTS_READY = 1 << CSTS_RDY_SHIFT, + NVME_CSTS_FAILED = 1 << CSTS_CFS_SHIFT, + NVME_CSTS_SHST_NORMAL = 0 << CSTS_SHST_SHIFT, + NVME_CSTS_SHST_PROGRESS = 1 << CSTS_SHST_SHIFT, + NVME_CSTS_SHST_COMPLETE = 2 << CSTS_SHST_SHIFT, + NVME_CSTS_NSSRO = 1 << CSTS_NSSRO_SHIFT, +}; + +#define NVME_CSTS_RDY(csts) ((csts >> CSTS_RDY_SHIFT) & CSTS_RDY_MASK) +#define NVME_CSTS_CFS(csts) ((csts >> CSTS_CFS_SHIFT) & CSTS_CFS_MASK) +#define NVME_CSTS_SHST(csts) ((csts >> CSTS_SHST_SHIFT) & CSTS_SHST_MASK) +#define NVME_CSTS_NSSRO(csts) ((csts >> CSTS_NSSRO_SHIFT) & CSTS_NSSRO_MASK) + +enum NvmeAqaShift { + AQA_ASQS_SHIFT = 0, + AQA_ACQS_SHIFT = 16, +}; + +enum NvmeAqaMask { + AQA_ASQS_MASK = 0xfff, + AQA_ACQS_MASK = 0xfff, +}; + +#define NVME_AQA_ASQS(aqa) ((aqa >> AQA_ASQS_SHIFT) & AQA_ASQS_MASK) +#define NVME_AQA_ACQS(aqa) ((aqa >> AQA_ACQS_SHIFT) & AQA_ACQS_MASK) + +enum NvmeCmblocShift { + CMBLOC_BIR_SHIFT = 0, + CMBLOC_OFST_SHIFT = 12, +}; + +enum NvmeCmblocMask { + CMBLOC_BIR_MASK = 0x7, + CMBLOC_OFST_MASK = 0xfffff, +}; + +#define NVME_CMBLOC_BIR(cmbloc) ((cmbloc >> CMBLOC_BIR_SHIFT) & \ + CMBLOC_BIR_MASK) +#define NVME_CMBLOC_OFST(cmbloc)((cmbloc >> CMBLOC_OFST_SHIFT) & \ + CMBLOC_OFST_MASK) + +#define NVME_CMBLOC_SET_BIR(cmbloc, val) \ + (cmbloc |= (uint64_t)(val & CMBLOC_BIR_MASK) << CMBLOC_BIR_SHIFT) +#define NVME_CMBLOC_SET_OFST(cmbloc, val) \ + (cmbloc |= (uint64_t)(val & CMBLOC_OFST_MASK) << CMBLOC_OFST_SHIFT) + +enum NvmeCmbszShift { + CMBSZ_SQS_SHIFT = 0, + CMBSZ_CQS_SHIFT = 1, + CMBSZ_LISTS_SHIFT = 2, + CMBSZ_RDS_SHIFT = 3, + CMBSZ_WDS_SHIFT = 4, + CMBSZ_SZU_SHIFT = 8, + CMBSZ_SZ_SHIFT = 12, +}; + +enum NvmeCmbszMask { + CMBSZ_SQS_MASK = 0x1, + CMBSZ_CQS_MASK = 0x1, + CMBSZ_LISTS_MASK = 0x1, + CMBSZ_RDS_MASK = 0x1, + CMBSZ_WDS_MASK = 0x1, + CMBSZ_SZU_MASK = 0xf, + CMBSZ_SZ_MASK = 0xfffff, +}; + +#define NVME_CMBSZ_SQS(cmbsz) ((cmbsz >> CMBSZ_SQS_SHIFT) & CMBSZ_SQS_MASK) 
+#define NVME_CMBSZ_CQS(cmbsz) ((cmbsz >> CMBSZ_CQS_SHIFT) & CMBSZ_CQS_MASK) +#define NVME_CMBSZ_LISTS(cmbsz)((cmbsz >> CMBSZ_LISTS_SHIFT) & CMBSZ_LISTS_MASK) +#define NVME_CMBSZ_RDS(cmbsz) ((cmbsz >> CMBSZ_RDS_SHIFT) & CMBSZ_RDS_MASK) +#define NVME_CMBSZ_WDS(cmbsz) ((cmbsz >> CMBSZ_WDS_SHIFT) & CMBSZ_WDS_MASK) +#define NVME_CMBSZ_SZU(cmbsz) ((cmbsz >> CMBSZ_SZU_SHIFT) & CMBSZ_SZU_MASK) +#define NVME_CMBSZ_SZ(cmbsz) ((cmbsz >> CMBSZ_SZ_SHIFT) & CMBSZ_SZ_MASK) + +#define NVME_CMBSZ_SET_SQS(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_SQS_MASK) << CMBSZ_SQS_SHIFT) +#define NVME_CMBSZ_SET_CQS(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_CQS_MASK) << CMBSZ_CQS_SHIFT) +#define NVME_CMBSZ_SET_LISTS(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_LISTS_MASK) << CMBSZ_LISTS_SHIFT) +#define NVME_CMBSZ_SET_RDS(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_RDS_MASK) << CMBSZ_RDS_SHIFT) +#define NVME_CMBSZ_SET_WDS(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_WDS_MASK) << CMBSZ_WDS_SHIFT) +#define NVME_CMBSZ_SET_SZU(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_SZU_MASK) << CMBSZ_SZU_SHIFT) +#define NVME_CMBSZ_SET_SZ(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_SZ_MASK) << CMBSZ_SZ_SHIFT) + +#define NVME_CMBSZ_GETSIZE(cmbsz) \ + (NVME_CMBSZ_SZ(cmbsz) * (1 << (12 + 4 * NVME_CMBSZ_SZU(cmbsz)))) + +typedef struct NvmeCmd { + uint8_t opcode; + uint8_t fuse; + uint16_t cid; + uint32_t nsid; + uint64_t res1; + uint64_t mptr; + uint64_t prp1; + uint64_t prp2; + uint32_t cdw10; + uint32_t cdw11; + uint32_t cdw12; + uint32_t cdw13; + uint32_t cdw14; + uint32_t cdw15; +} NvmeCmd; + +enum NvmeAdminCommands { + NVME_ADM_CMD_DELETE_SQ = 0x00, + NVME_ADM_CMD_CREATE_SQ = 0x01, + NVME_ADM_CMD_GET_LOG_PAGE = 0x02, + NVME_ADM_CMD_DELETE_CQ = 0x04, + NVME_ADM_CMD_CREATE_CQ = 0x05, + NVME_ADM_CMD_IDENTIFY = 0x06, + NVME_ADM_CMD_ABORT = 0x08, + NVME_ADM_CMD_SET_FEATURES = 0x09, + NVME_ADM_CMD_GET_FEATURES = 0x0a, + 
NVME_ADM_CMD_ASYNC_EV_REQ =3D 0x0c, + NVME_ADM_CMD_ACTIVATE_FW =3D 0x10, + NVME_ADM_CMD_DOWNLOAD_FW =3D 0x11, + NVME_ADM_CMD_FORMAT_NVM =3D 0x80, + NVME_ADM_CMD_SECURITY_SEND =3D 0x81, + NVME_ADM_CMD_SECURITY_RECV =3D 0x82, +}; + +enum NvmeIoCommands { + NVME_CMD_FLUSH =3D 0x00, + NVME_CMD_WRITE =3D 0x01, + NVME_CMD_READ =3D 0x02, + NVME_CMD_WRITE_UNCOR =3D 0x04, + NVME_CMD_COMPARE =3D 0x05, + NVME_CMD_WRITE_ZEROS =3D 0x08, + NVME_CMD_DSM =3D 0x09, +}; + +typedef struct NvmeDeleteQ { + uint8_t opcode; + uint8_t flags; + uint16_t cid; + uint32_t rsvd1[9]; + uint16_t qid; + uint16_t rsvd10; + uint32_t rsvd11[5]; +} NvmeDeleteQ; + +typedef struct NvmeCreateCq { + uint8_t opcode; + uint8_t flags; + uint16_t cid; + uint32_t rsvd1[5]; + uint64_t prp1; + uint64_t rsvd8; + uint16_t cqid; + uint16_t qsize; + uint16_t cq_flags; + uint16_t irq_vector; + uint32_t rsvd12[4]; +} NvmeCreateCq; + +#define NVME_CQ_FLAGS_PC(cq_flags) (cq_flags & 0x1) +#define NVME_CQ_FLAGS_IEN(cq_flags) ((cq_flags >> 1) & 0x1) + +typedef struct NvmeCreateSq { + uint8_t opcode; + uint8_t flags; + uint16_t cid; + uint32_t rsvd1[5]; + uint64_t prp1; + uint64_t rsvd8; + uint16_t sqid; + uint16_t qsize; + uint16_t sq_flags; + uint16_t cqid; + uint32_t rsvd12[4]; +} NvmeCreateSq; + +#define NVME_SQ_FLAGS_PC(sq_flags) (sq_flags & 0x1) +#define NVME_SQ_FLAGS_QPRIO(sq_flags) ((sq_flags >> 1) & 0x3) + +enum NvmeQueueFlags { + NVME_Q_PC =3D 1, + NVME_Q_PRIO_URGENT =3D 0, + NVME_Q_PRIO_HIGH =3D 1, + NVME_Q_PRIO_NORMAL =3D 2, + NVME_Q_PRIO_LOW =3D 3, +}; + +typedef struct NvmeIdentify { + uint8_t opcode; + uint8_t flags; + uint16_t cid; + uint32_t nsid; + uint64_t rsvd2[2]; + uint64_t prp1; + uint64_t prp2; + uint32_t cns; + uint32_t rsvd11[5]; +} NvmeIdentify; + +typedef struct NvmeRwCmd { + uint8_t opcode; + uint8_t flags; + uint16_t cid; + uint32_t nsid; + uint64_t rsvd2; + uint64_t mptr; + uint64_t prp1; + uint64_t prp2; + uint64_t slba; + uint16_t nlb; + uint16_t control; + uint32_t dsmgmt; + uint32_t 
reftag; + uint16_t apptag; + uint16_t appmask; +} NvmeRwCmd; + +enum { + NVME_RW_LR =3D 1 << 15, + NVME_RW_FUA =3D 1 << 14, + NVME_RW_DSM_FREQ_UNSPEC =3D 0, + NVME_RW_DSM_FREQ_TYPICAL =3D 1, + NVME_RW_DSM_FREQ_RARE =3D 2, + NVME_RW_DSM_FREQ_READS =3D 3, + NVME_RW_DSM_FREQ_WRITES =3D 4, + NVME_RW_DSM_FREQ_RW =3D 5, + NVME_RW_DSM_FREQ_ONCE =3D 6, + NVME_RW_DSM_FREQ_PREFETCH =3D 7, + NVME_RW_DSM_FREQ_TEMP =3D 8, + NVME_RW_DSM_LATENCY_NONE =3D 0 << 4, + NVME_RW_DSM_LATENCY_IDLE =3D 1 << 4, + NVME_RW_DSM_LATENCY_NORM =3D 2 << 4, + NVME_RW_DSM_LATENCY_LOW =3D 3 << 4, + NVME_RW_DSM_SEQ_REQ =3D 1 << 6, + NVME_RW_DSM_COMPRESSED =3D 1 << 7, + NVME_RW_PRINFO_PRACT =3D 1 << 13, + NVME_RW_PRINFO_PRCHK_GUARD =3D 1 << 12, + NVME_RW_PRINFO_PRCHK_APP =3D 1 << 11, + NVME_RW_PRINFO_PRCHK_REF =3D 1 << 10, +}; + +typedef struct NvmeDsmCmd { + uint8_t opcode; + uint8_t flags; + uint16_t cid; + uint32_t nsid; + uint64_t rsvd2[2]; + uint64_t prp1; + uint64_t prp2; + uint32_t nr; + uint32_t attributes; + uint32_t rsvd12[4]; +} NvmeDsmCmd; + +enum { + NVME_DSMGMT_IDR =3D 1 << 0, + NVME_DSMGMT_IDW =3D 1 << 1, + NVME_DSMGMT_AD =3D 1 << 2, +}; + +typedef struct NvmeDsmRange { + uint32_t cattr; + uint32_t nlb; + uint64_t slba; +} NvmeDsmRange; + +enum NvmeAsyncEventRequest { + NVME_AER_TYPE_ERROR =3D 0, + NVME_AER_TYPE_SMART =3D 1, + NVME_AER_TYPE_IO_SPECIFIC =3D 6, + NVME_AER_TYPE_VENDOR_SPECIFIC =3D 7, + NVME_AER_INFO_ERR_INVALID_SQ =3D 0, + NVME_AER_INFO_ERR_INVALID_DB =3D 1, + NVME_AER_INFO_ERR_DIAG_FAIL =3D 2, + NVME_AER_INFO_ERR_PERS_INTERNAL_ERR =3D 3, + NVME_AER_INFO_ERR_TRANS_INTERNAL_ERR =3D 4, + NVME_AER_INFO_ERR_FW_IMG_LOAD_ERR =3D 5, + NVME_AER_INFO_SMART_RELIABILITY =3D 0, + NVME_AER_INFO_SMART_TEMP_THRESH =3D 1, + NVME_AER_INFO_SMART_SPARE_THRESH =3D 2, +}; + +typedef struct NvmeAerResult { + uint8_t event_type; + uint8_t event_info; + uint8_t log_page; + uint8_t resv; +} NvmeAerResult; + +typedef struct NvmeCqe { + uint32_t result; + uint32_t rsvd; + uint16_t sq_head; + uint16_t 
sq_id; + uint16_t cid; + uint16_t status; +} NvmeCqe; + +enum NvmeStatusCodes { + NVME_SUCCESS =3D 0x0000, + NVME_INVALID_OPCODE =3D 0x0001, + NVME_INVALID_FIELD =3D 0x0002, + NVME_CID_CONFLICT =3D 0x0003, + NVME_DATA_TRAS_ERROR =3D 0x0004, + NVME_POWER_LOSS_ABORT =3D 0x0005, + NVME_INTERNAL_DEV_ERROR =3D 0x0006, + NVME_CMD_ABORT_REQ =3D 0x0007, + NVME_CMD_ABORT_SQ_DEL =3D 0x0008, + NVME_CMD_ABORT_FAILED_FUSE =3D 0x0009, + NVME_CMD_ABORT_MISSING_FUSE =3D 0x000a, + NVME_INVALID_NSID =3D 0x000b, + NVME_CMD_SEQ_ERROR =3D 0x000c, + NVME_LBA_RANGE =3D 0x0080, + NVME_CAP_EXCEEDED =3D 0x0081, + NVME_NS_NOT_READY =3D 0x0082, + NVME_NS_RESV_CONFLICT =3D 0x0083, + NVME_INVALID_CQID =3D 0x0100, + NVME_INVALID_QID =3D 0x0101, + NVME_MAX_QSIZE_EXCEEDED =3D 0x0102, + NVME_ACL_EXCEEDED =3D 0x0103, + NVME_RESERVED =3D 0x0104, + NVME_AER_LIMIT_EXCEEDED =3D 0x0105, + NVME_INVALID_FW_SLOT =3D 0x0106, + NVME_INVALID_FW_IMAGE =3D 0x0107, + NVME_INVALID_IRQ_VECTOR =3D 0x0108, + NVME_INVALID_LOG_ID =3D 0x0109, + NVME_INVALID_FORMAT =3D 0x010a, + NVME_FW_REQ_RESET =3D 0x010b, + NVME_INVALID_QUEUE_DEL =3D 0x010c, + NVME_FID_NOT_SAVEABLE =3D 0x010d, + NVME_FID_NOT_NSID_SPEC =3D 0x010f, + NVME_FW_REQ_SUSYSTEM_RESET =3D 0x0110, + NVME_CONFLICTING_ATTRS =3D 0x0180, + NVME_INVALID_PROT_INFO =3D 0x0181, + NVME_WRITE_TO_RO =3D 0x0182, + NVME_WRITE_FAULT =3D 0x0280, + NVME_UNRECOVERED_READ =3D 0x0281, + NVME_E2E_GUARD_ERROR =3D 0x0282, + NVME_E2E_APP_ERROR =3D 0x0283, + NVME_E2E_REF_ERROR =3D 0x0284, + NVME_CMP_FAILURE =3D 0x0285, + NVME_ACCESS_DENIED =3D 0x0286, + NVME_MORE =3D 0x2000, + NVME_DNR =3D 0x4000, + NVME_NO_COMPLETE =3D 0xffff, +}; + +typedef struct NvmeFwSlotInfoLog { + uint8_t afi; + uint8_t reserved1[7]; + uint8_t frs1[8]; + uint8_t frs2[8]; + uint8_t frs3[8]; + uint8_t frs4[8]; + uint8_t frs5[8]; + uint8_t frs6[8]; + uint8_t frs7[8]; + uint8_t reserved2[448]; +} NvmeFwSlotInfoLog; + +typedef struct NvmeErrorLog { + uint64_t error_count; + uint16_t sqid; + uint16_t cid; + uint16_t 
status_field; + uint16_t param_error_location; + uint64_t lba; + uint32_t nsid; + uint8_t vs; + uint8_t resv[35]; +} NvmeErrorLog; + +typedef struct NvmeSmartLog { + uint8_t critical_warning; + uint8_t temperature[2]; + uint8_t available_spare; + uint8_t available_spare_threshold; + uint8_t percentage_used; + uint8_t reserved1[26]; + uint64_t data_units_read[2]; + uint64_t data_units_written[2]; + uint64_t host_read_commands[2]; + uint64_t host_write_commands[2]; + uint64_t controller_busy_time[2]; + uint64_t power_cycles[2]; + uint64_t power_on_hours[2]; + uint64_t unsafe_shutdowns[2]; + uint64_t media_errors[2]; + uint64_t number_of_error_log_entries[2]; + uint8_t reserved2[320]; +} NvmeSmartLog; + +enum NvmeSmartWarn { + NVME_SMART_SPARE =3D 1 << 0, + NVME_SMART_TEMPERATURE =3D 1 << 1, + NVME_SMART_RELIABILITY =3D 1 << 2, + NVME_SMART_MEDIA_READ_ONLY =3D 1 << 3, + NVME_SMART_FAILED_VOLATILE_MEDIA =3D 1 << 4, +}; + +enum LogIdentifier { + NVME_LOG_ERROR_INFO =3D 0x01, + NVME_LOG_SMART_INFO =3D 0x02, + NVME_LOG_FW_SLOT_INFO =3D 0x03, +}; + +typedef struct NvmePSD { + uint16_t mp; + uint16_t reserved; + uint32_t enlat; + uint32_t exlat; + uint8_t rrt; + uint8_t rrl; + uint8_t rwt; + uint8_t rwl; + uint8_t resv[16]; +} NvmePSD; + +typedef struct NvmeIdCtrl { + uint16_t vid; + uint16_t ssvid; + uint8_t sn[20]; + uint8_t mn[40]; + uint8_t fr[8]; + uint8_t rab; + uint8_t ieee[3]; + uint8_t cmic; + uint8_t mdts; + uint8_t rsvd255[178]; + uint16_t oacs; + uint8_t acl; + uint8_t aerl; + uint8_t frmw; + uint8_t lpa; + uint8_t elpe; + uint8_t npss; + uint8_t rsvd511[248]; + uint8_t sqes; + uint8_t cqes; + uint16_t rsvd515; + uint32_t nn; + uint16_t oncs; + uint16_t fuses; + uint8_t fna; + uint8_t vwc; + uint16_t awun; + uint16_t awupf; + uint8_t rsvd703[174]; + uint8_t rsvd2047[1344]; + NvmePSD psd[32]; + uint8_t vs[1024]; +} NvmeIdCtrl; + +enum NvmeIdCtrlOacs { + NVME_OACS_SECURITY =3D 1 << 0, + NVME_OACS_FORMAT =3D 1 << 1, + NVME_OACS_FW =3D 1 << 2, +}; + +enum 
NvmeIdCtrlOncs { + NVME_ONCS_COMPARE =3D 1 << 0, + NVME_ONCS_WRITE_UNCORR =3D 1 << 1, + NVME_ONCS_DSM =3D 1 << 2, + NVME_ONCS_WRITE_ZEROS =3D 1 << 3, + NVME_ONCS_FEATURES =3D 1 << 4, + NVME_ONCS_RESRVATIONS =3D 1 << 5, +}; + +#define NVME_CTRL_SQES_MIN(sqes) ((sqes) & 0xf) +#define NVME_CTRL_SQES_MAX(sqes) (((sqes) >> 4) & 0xf) +#define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf) +#define NVME_CTRL_CQES_MAX(cqes) (((cqes) >> 4) & 0xf) + +typedef struct NvmeFeatureVal { + uint32_t arbitration; + uint32_t power_mgmt; + uint32_t temp_thresh; + uint32_t err_rec; + uint32_t volatile_wc; + uint32_t num_queues; + uint32_t int_coalescing; + uint32_t *int_vector_config; + uint32_t write_atomicity; + uint32_t async_config; + uint32_t sw_prog_marker; +} NvmeFeatureVal; + +#define NVME_ARB_AB(arb) (arb & 0x7) +#define NVME_ARB_LPW(arb) ((arb >> 8) & 0xff) +#define NVME_ARB_MPW(arb) ((arb >> 16) & 0xff) +#define NVME_ARB_HPW(arb) ((arb >> 24) & 0xff) + +#define NVME_INTC_THR(intc) (intc & 0xff) +#define NVME_INTC_TIME(intc) ((intc >> 8) & 0xff) + +enum NvmeFeatureIds { + NVME_ARBITRATION =3D 0x1, + NVME_POWER_MANAGEMENT =3D 0x2, + NVME_LBA_RANGE_TYPE =3D 0x3, + NVME_TEMPERATURE_THRESHOLD =3D 0x4, + NVME_ERROR_RECOVERY =3D 0x5, + NVME_VOLATILE_WRITE_CACHE =3D 0x6, + NVME_NUMBER_OF_QUEUES =3D 0x7, + NVME_INTERRUPT_COALESCING =3D 0x8, + NVME_INTERRUPT_VECTOR_CONF =3D 0x9, + NVME_WRITE_ATOMICITY =3D 0xa, + NVME_ASYNCHRONOUS_EVENT_CONF =3D 0xb, + NVME_SOFTWARE_PROGRESS_MARKER =3D 0x80 +}; + +typedef struct NvmeRangeType { + uint8_t type; + uint8_t attributes; + uint8_t rsvd2[14]; + uint64_t slba; + uint64_t nlb; + uint8_t guid[16]; + uint8_t rsvd48[16]; +} NvmeRangeType; + +typedef struct NvmeLBAF { + uint16_t ms; + uint8_t ds; + uint8_t rp; +} NvmeLBAF; + +typedef struct NvmeIdNs { + uint64_t nsze; + uint64_t ncap; + uint64_t nuse; + uint8_t nsfeat; + uint8_t nlbaf; + uint8_t flbas; + uint8_t mc; + uint8_t dpc; + uint8_t dps; + uint8_t res30[98]; + NvmeLBAF lbaf[16]; + uint8_t 
res192[192]; + uint8_t vs[3712]; +} NvmeIdNs; + +#define NVME_ID_NS_NSFEAT_THIN(nsfeat) ((nsfeat & 0x1)) +#define NVME_ID_NS_FLBAS_EXTENDED(flbas) ((flbas >> 4) & 0x1) +#define NVME_ID_NS_FLBAS_INDEX(flbas) ((flbas & 0xf)) +#define NVME_ID_NS_MC_SEPARATE(mc) ((mc >> 1) & 0x1) +#define NVME_ID_NS_MC_EXTENDED(mc) ((mc & 0x1)) +#define NVME_ID_NS_DPC_LAST_EIGHT(dpc) ((dpc >> 4) & 0x1) +#define NVME_ID_NS_DPC_FIRST_EIGHT(dpc) ((dpc >> 3) & 0x1) +#define NVME_ID_NS_DPC_TYPE_3(dpc) ((dpc >> 2) & 0x1) +#define NVME_ID_NS_DPC_TYPE_2(dpc) ((dpc >> 1) & 0x1) +#define NVME_ID_NS_DPC_TYPE_1(dpc) ((dpc & 0x1)) +#define NVME_ID_NS_DPC_TYPE_MASK 0x7 + +enum NvmeIdNsDps { + DPS_TYPE_NONE =3D 0, + DPS_TYPE_1 =3D 1, + DPS_TYPE_2 =3D 2, + DPS_TYPE_3 =3D 3, + DPS_TYPE_MASK =3D 0x7, + DPS_FIRST_EIGHT =3D 8, +}; + +static inline void _nvme_check_size(void) +{ + QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) !=3D 4); + QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) !=3D 16); + QEMU_BUILD_BUG_ON(sizeof(NvmeDsmRange) !=3D 16); + QEMU_BUILD_BUG_ON(sizeof(NvmeCmd) !=3D 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeDeleteQ) !=3D 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeCreateCq) !=3D 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeCreateSq) !=3D 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeIdentify) !=3D 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeRwCmd) !=3D 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeDsmCmd) !=3D 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeRangeType) !=3D 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) !=3D 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) !=3D 512); + QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) !=3D 512); + QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) !=3D 4096); + QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) !=3D 4096); +} +#endif --=20 2.14.3 From nobody Sat Apr 27 00:47:14 2024 Delivered-To: importer@patchew.org Received-SPF: pass (zoho.com: domain of gnu.org designates 208.118.235.17 as permitted sender) client-ip=208.118.235.17; envelope-from=qemu-devel-bounces+importer=patchew.org@nongnu.org; helo=lists.gnu.org; Authentication-Results: mx.zohomail.com; 
From: Fam Zheng To: qemu-devel@nongnu.org Date: Fri, 12 Jan 2018 16:55:54 +0800 Message-Id: <20180112085555.14447-9-famz@redhat.com> In-Reply-To: <20180112085555.14447-1-famz@redhat.com> References: <20180112085555.14447-1-famz@redhat.com> 
Subject: [Qemu-devel] [PATCH v5 8/9] docs: Add section for NVMe VFIO driver Cc: Kevin Wolf, Fam Zheng, qemu-block@nongnu.org, Markus Armbruster, Max Reitz, Keith Busch, alex.williamson@redhat.com, Stefan Hajnoczi, Paolo Bonzini, Karl Rister Signed-off-by: Fam Zheng Message-Id: <20180110091846.10699-9-famz@redhat.com> Reviewed-by: Stefan Hajnoczi --- docs/qemu-block-drivers.texi | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/docs/qemu-block-drivers.texi b/docs/qemu-block-drivers.texi index 503c1847aa..cd74767ed3 100644 --- a/docs/qemu-block-drivers.texi +++ b/docs/qemu-block-drivers.texi @@ -785,6 +785,43 @@ warning: ssh server @code{ssh.example.com:22} does not support fsync With sufficiently new versions of libssh2 and OpenSSH, @code{fsync} is supported. +@node disk_images_nvme +@subsection NVMe disk images + +NVM Express (NVMe) storage controllers can be accessed directly by a userspace +driver in QEMU. This bypasses the host kernel file system and block layers +while retaining QEMU block layer functionalities, such as block jobs, I/O +throttling, image formats, etc. Disk I/O performance is typically higher than +with @code{-drive file=/dev/sda} using either thread pool or linux-aio. + +The controller will be exclusively used by the QEMU process once started. To be +able to share storage between multiple VMs and other applications on the host, +please use the file based protocols. 
+ +Before starting QEMU, bind the host NVMe controller to the host vfio-pci +driver. For example: + +@example +# modprobe vfio-pci +# lspci -n -s 0000:06:0d.0 +06:0d.0 0401: 1102:0002 (rev 08) +# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind +# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id + +# qemu-system-x86_64 -drive file=nvme://@var{host}:@var{bus}:@var{slot}.@var{func}/@var{namespace} +@end example + +Alternative syntax using properties: + +@example +qemu-system-x86_64 -drive file.driver=nvme,file.device=@var{host}:@var{bus}:@var{slot}.@var{func},file.namespace=@var{namespace} +@end example + +@var{host}:@var{bus}:@var{slot}.@var{func} is the NVMe controller's PCI device +address on the host. + +@var{namespace} is the NVMe namespace number, starting from 1. + @node disk_image_locking @subsection Disk image file locking -- 2.14.3 From nobody Sat Apr 27 00:47:15 2024 From: Fam Zheng To: qemu-devel@nongnu.org Date: Fri, 12 Jan 2018 16:55:55 +0800 Message-Id: <20180112085555.14447-10-famz@redhat.com> In-Reply-To: <20180112085555.14447-1-famz@redhat.com> References: <20180112085555.14447-1-famz@redhat.com> Subject: [Qemu-devel] [PATCH v5 9/9] qapi: Add NVMe driver options to the schema Cc: Kevin Wolf, Fam Zheng, qemu-block@nongnu.org, Markus Armbruster, Max Reitz, Keith Busch, alex.williamson@redhat.com, Stefan Hajnoczi, Paolo Bonzini, Karl Rister Signed-off-by: Fam Zheng Message-Id: 
<20180110091846.10699-10-famz@redhat.com> Reviewed-by: Stefan Hajnoczi --- qapi/block-core.json | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index e94a6881b2..bd16440dc7 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -2230,6 +2230,7 @@ # # @vxhs: Since 2.10 # @throttle: Since 2.11 +# @nvme: Since 2.12 # # Since: 2.9 ## @@ -2237,7 +2238,7 @@ 'data': [ 'blkdebug', 'blkverify', 'bochs', 'cloop', 'dmg', 'file', 'ftp', 'ftps', 'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi', 'luks', 'nbd', 'nfs', - 'null-aio', 'null-co', 'parallels', 'qcow', 'qcow2', 'qed', + 'null-aio', 'null-co', 'nvme', 'parallels', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd', 'replication', 'sheepdog', 'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] } @@ -2278,6 +2279,19 @@ { 'struct': 'BlockdevOptionsNull', 'data': { '*size': 'int', '*latency-ns': 'uint64' } } +## +# @BlockdevOptionsNVMe: +# +# Driver specific block device options for the NVMe backend. +# +# @device: controller address of the NVMe device. +# @namespace: namespace number of the device, starting from 1. +# +# Since: 2.12 +## +{ 'struct': 'BlockdevOptionsNVMe', + 'data': { 'device': 'str', 'namespace': 'int' } } + ## # @BlockdevOptionsVVFAT: # @@ -3183,6 +3197,7 @@ 'nfs': 'BlockdevOptionsNfs', 'null-aio': 'BlockdevOptionsNull', 'null-co': 'BlockdevOptionsNull', + 'nvme': 'BlockdevOptionsNVMe', 'parallels': 'BlockdevOptionsGenericFormat', 'qcow2': 'BlockdevOptionsQcow2', 'qcow': 'BlockdevOptionsQcow', -- 2.14.3
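The BlockdevOptionsNVMe struct added by patch 9/9 maps directly onto a QMP blockdev-add call. A minimal sketch of how a management client might build such a command; the node name and the PCI address (reused from the docs patch's example) are hypothetical, not taken from a real device:

```python
import json

def nvme_blockdev_add(node_name, device, namespace=1):
    """Build a QMP blockdev-add command for the 'nvme' driver.

    Per the schema above, 'device' is the NVMe controller's PCI address
    on the host and 'namespace' is the 1-based namespace number.
    """
    if namespace < 1:
        raise ValueError("NVMe namespace numbers start from 1")
    return {
        "execute": "blockdev-add",
        "arguments": {
            "driver": "nvme",
            "node-name": node_name,
            "device": device,
            "namespace": namespace,
        },
    }

# Example: attach namespace 1 of the controller at 0000:06:0d.0
cmd = nvme_blockdev_add("nvme0", "0000:06:0d.0")
print(json.dumps(cmd))
```

This mirrors the alternative -drive property syntax from the docs patch (file.driver=nvme, file.device=..., file.namespace=...), just expressed as a QMP JSON object.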