* [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device
@ 2017-07-05 13:36 Fam Zheng
  2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 1/6] stubs: Add stubs for ram block API Fam Zheng
  ` (7 more replies)
  0 siblings, 8 replies; 33+ messages in thread
From: Fam Zheng @ 2017-07-05 13:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Keith Busch, qemu-block, Fam Zheng, Kevin Wolf,
      Max Reitz, Stefan Hajnoczi, Karl Rister

v3: Rebase, small tweaks/fixes, and add locks to provide basic thread safety
    (basic because it is not yet thoroughly tested).

v2: - Implement "split vfio addr space" approach. [Paolo]
    - Add back 'device reset' in nvme_close(). [Paolo]
    - Better variable naming. [Stefan]
    - "Reuse" macro definitions from NVMe emulation code.
    - Rebase onto current master, which has polling by default, and update
      performance results accordingly.
    - Update MAINTAINERS.
    - Specify namespace in URI.
    - The sporadic I/O error from v1 "disappeared" in this version.
    - Tests done: qemu-img bench, fio, bonnie++, and installation of
      Ubuntu/Fedora/RHEL on QEMU emulated NVMe and an Intel P3700 card.

Fam Zheng (6):
  stubs: Add stubs for ram block API
  block: Add VFIO based NVMe driver
  block: Introduce bdrv_dma_map and bdrv_dma_unmap
  block/nvme: Implement .bdrv_dma_map and .bdrv_dma_unmap
  qemu-img: Map bench buffer
  block: Move NVMe spec definitions to a separate header

 MAINTAINERS                    |    6 +
 block/Makefile.objs            |    1 +
 block/block-backend.c          |   10 +
 block/io.c                     |   24 +
 block/nvme-vfio.c              |  703 +++++++++++++++++++++++++
 block/nvme-vfio.h              |   30 ++
 block/nvme.c                   | 1103 ++++++++++++++++++++++++++++++++++++++++
 block/nvme.h                   |  700 +++++++++++++++++++++++++
 block/trace-events             |   32 ++
 hw/block/nvme.h                |  698 +------------------------
 include/block/block.h          |    2 +
 include/block/block_int.h      |    4 +
 include/sysemu/block-backend.h |    3 +
 qemu-img.c                     |    9 +-
 stubs/Makefile.objs            |    1 +
 stubs/ram-block.c              |   16 +
 16 files changed, 2644 insertions(+), 698 deletions(-)
 create mode 100644 block/nvme-vfio.c
 create mode 100644 block/nvme-vfio.h
 create mode 100644 block/nvme.c
 create mode 100644 block/nvme.h
 create mode 100644 stubs/ram-block.c

-- 
2.9.4

^ permalink raw reply	[flat|nested] 33+ messages in thread
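For reference, a typical way to exercise the driver looks like the sketch
below. The PCI address and the "8086 0953" vendor/device ID pair are
placeholders (substitute your controller's values); the controller must
first be unbound from the host nvme driver and bound to vfio-pci, since
the new driver opens the device exclusively through VFIO:

  # modprobe vfio-pci
  # echo 0000:44:00.0 > /sys/bus/pci/devices/0000:44:00.0/driver/unbind
  # echo 8086 0953 > /sys/bus/pci/drivers/vfio-pci/new_id

  # qemu-img bench -t none -c 65536 nvme://0000:44:00.0/1

  # qemu-system-x86_64 ... \
      -drive file=nvme://0000:44:00.0/1,format=raw,if=none,id=drive0 \
      -device virtio-blk-pci,drive=drive0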
* [Qemu-devel] [PATCH v3 1/6] stubs: Add stubs for ram block API 2017-07-05 13:36 [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device Fam Zheng @ 2017-07-05 13:36 ` Fam Zheng 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 2/6] block: Add VFIO based NVMe driver Fam Zheng ` (6 subsequent siblings) 7 siblings, 0 replies; 33+ messages in thread From: Fam Zheng @ 2017-07-05 13:36 UTC (permalink / raw) To: qemu-devel Cc: Paolo Bonzini, Keith Busch, qemu-block, Fam Zheng, Kevin Wolf, Max Reitz, Stefan Hajnoczi, Karl Rister These functions will be wanted by block-obj-y but the actual definition is in obj-y, so stub them to keep the linker happy. Signed-off-by: Fam Zheng <famz@redhat.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> --- stubs/Makefile.objs | 1 + stubs/ram-block.c | 16 ++++++++++++++++ 2 files changed, 17 insertions(+) create mode 100644 stubs/ram-block.c diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs index f5b47bf..c93a800 100644 --- a/stubs/Makefile.objs +++ b/stubs/Makefile.objs @@ -39,3 +39,4 @@ stub-obj-y += pc_madt_cpu_entry.o stub-obj-y += vmgenid.o stub-obj-y += xen-common.o stub-obj-y += xen-hvm.o +stub-obj-y += ram-block.o diff --git a/stubs/ram-block.c b/stubs/ram-block.c new file mode 100644 index 0000000..cfa5d86 --- /dev/null +++ b/stubs/ram-block.c @@ -0,0 +1,16 @@ +#include "qemu/osdep.h" +#include "exec/ramlist.h" +#include "exec/cpu-common.h" + +void ram_block_notifier_add(RAMBlockNotifier *n) +{ +} + +void ram_block_notifier_remove(RAMBlockNotifier *n) +{ +} + +int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque) +{ + return 0; +} -- 2.9.4 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* [Qemu-devel] [PATCH v3 2/6] block: Add VFIO based NVMe driver
  2017-07-05 13:36 [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device Fam Zheng
  2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 1/6] stubs: Add stubs for ram block API Fam Zheng
@ 2017-07-05 13:36 ` Fam Zheng
  2017-07-06 17:38   ` Keith Busch
    ` (2 more replies)
  2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap Fam Zheng
  ` (5 subsequent siblings)
  7 siblings, 3 replies; 33+ messages in thread
From: Fam Zheng @ 2017-07-05 13:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Keith Busch, qemu-block, Fam Zheng, Kevin Wolf,
      Max Reitz, Stefan Hajnoczi, Karl Rister

This is a new protocol driver that exclusively opens a host NVMe
controller through VFIO. It achieves better latency than linux-aio by
completely bypassing the host kernel's vfs/block layer.

    $rw-$bs-$iodepth  linux-aio     nvme://
    ----------------------------------------
    randread-4k-1     8269          8851
    randread-512k-1   584           610
    randwrite-4k-1    28601         34649
    randwrite-512k-1  1809          1975

The driver also integrates with the polling mechanism of iothread.

This patch is co-authored by Paolo and me.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 MAINTAINERS         |    6 +
 block/Makefile.objs |    1 +
 block/nvme-vfio.c   |  703 +++++++++++++++++++++++++++++++++
 block/nvme-vfio.h   |   30 ++
 block/nvme.c        | 1091 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block/trace-events  |   32 ++
 6 files changed, 1863 insertions(+)
 create mode 100644 block/nvme-vfio.c
 create mode 100644 block/nvme-vfio.h
 create mode 100644 block/nvme.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 839f7ca..4cce80c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1746,6 +1746,12 @@ L: qemu-block@nongnu.org
 S: Supported
 F: block/null.c
 
+NVMe Block Driver
+M: Fam Zheng <famz@redhat.com>
+L: qemu-block@nongnu.org
+S: Supported
+F: block/nvme*
+
 Bootdevice
 M: Gonglei <arei.gonglei@huawei.com>
 S: Maintained
diff --git a/block/Makefile.objs b/block/Makefile.objs
index f9368b5..8866487 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -11,6 +11,7 @@ block-obj-$(CONFIG_POSIX) += file-posix.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 block-obj-y += null.o mirror.o commit.o io.o
 block-obj-y += throttle-groups.o
+block-obj-$(CONFIG_LINUX) += nvme.o nvme-vfio.o
 
 block-obj-y += nbd.o nbd-client.o sheepdog.o
 block-obj-$(CONFIG_LIBISCSI) += iscsi.o
diff --git a/block/nvme-vfio.c b/block/nvme-vfio.c
new file mode 100644
index 0000000..f030a82
--- /dev/null
+++ b/block/nvme-vfio.c
@@ -0,0 +1,703 @@
+/*
+ * NVMe VFIO interface
+ *
+ * Copyright 2016, 2017 Red Hat, Inc.
+ *
+ * Authors:
+ *   Fam Zheng <famz@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/cpu-common.h"
+#include "trace.h"
+#include "qemu/queue.h"
+#include "qemu/error-report.h"
+#include "standard-headers/linux/pci_regs.h"
+#include "qemu/event_notifier.h"
+#include "block/nvme-vfio.h"
+
+#define NVME_DEBUG 0
+
+#define NVME_VFIO_IOVA_MIN 0x10000ULL
+/* XXX: Once VFIO exposes the iova bit width in the IOMMU capability interface,
+ * we can use a runtime limit; alternatively it's also possible to do platform
+ * specific detection by reading sysfs entries. Until then, 39 is a safe bet.
+ **/ +#define NVME_VFIO_IOVA_MAX (1ULL << 39) + +typedef struct { + /* Page aligned addr. */ + void *host; + size_t size; + uint64_t iova; +} IOVAMapping; + +struct NVMeVFIOState { + int container; + int group; + int device; + RAMBlockNotifier ram_notifier; + struct vfio_region_info config_region_info, bar_region_info[6]; + + /* VFIO's IO virtual address space is managed by splitting into a few + * sections: + * + * --------------- <= 0 + * |xxxxxxxxxxxxx| + * |-------------| <= NVME_VFIO_IOVA_MIN + * | | + * | Fixed | + * | | + * |-------------| <= low_water_mark + * | | + * | Free | + * | | + * |-------------| <= high_water_mark + * | | + * | Temp | + * | | + * |-------------| <= NVME_VFIO_IOVA_MAX + * |xxxxxxxxxxxxx| + * |xxxxxxxxxxxxx| + * --------------- + * + * - Addresses lower than NVME_VFIO_IOVA_MIN are reserved as invalid; + * + * - Fixed mappings of HVAs are assigned "low" IOVAs in the range of + * [NVME_VFIO_IOVA_MIN, low_water_mark). Once allocated they will not be + * reclaimed - low_water_mark never shrinks; + * + * - IOVAs in range [low_water_mark, high_water_mark) are free; + * + * - IOVAs in range [high_water_mark, NVME_VFIO_IOVA_MAX) are volatile + * mappings. At each nvme_vfio_dma_reset_temporary() call, the whole area + * is recycled. The caller should make sure I/O's depending on these + * mappings are completed before calling. + **/ + uint64_t low_water_mark; + uint64_t high_water_mark; + IOVAMapping *mappings; + int nr_mappings; + QemuMutex lock; +}; + +/** Find group file and return the full path in @path by PCI device address + * @device. If succeeded, caller needs to g_free the returned path. */ +static int sysfs_find_group_file(const char *device, char **path, Error **errp) +{ + int ret; + char *sysfs_link = NULL; + char *sysfs_group = NULL; + char *p; + + sysfs_link = g_strdup_printf("/sys/bus/pci/devices/%s/iommu_group", + device); + sysfs_group = g_malloc(PATH_MAX); + ret = readlink(sysfs_link, sysfs_group, PATH_MAX - 1); + if (ret == -1) { + error_setg_errno(errp, errno, "Failed to find iommu group sysfs path"); + ret = -errno; + goto out; + } + ret = 0; + p = strrchr(sysfs_group, '/'); + if (!p) { + error_setg(errp, "Failed to find iommu group number"); + ret = -errno; + goto out; + } + + *path = g_strdup_printf("/dev/vfio/%s", p + 1); +out: + g_free(sysfs_link); + g_free(sysfs_group); + return ret; +} + +static int nvme_vfio_pci_init_bar(NVMeVFIOState *s, unsigned int index, + Error **errp) +{ + assert(index < ARRAY_SIZE(s->bar_region_info)); + s->bar_region_info[index] = (struct vfio_region_info) { + .index = VFIO_PCI_BAR0_REGION_INDEX + index, + .argsz = sizeof(struct vfio_region_info), + }; + if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, &s->bar_region_info[index])) { + error_setg_errno(errp, errno, "Failed to get BAR region info"); + return -errno; + } + + return 0; +} + +/** + * Map a PCI bar area. + */ +void *nvme_vfio_pci_map_bar(NVMeVFIOState *s, int index, Error **errp) +{ + void *p; + assert(index >= 0 && index < 6); + p = mmap(NULL, MIN(8192, s->bar_region_info[index].size), + PROT_READ | PROT_WRITE, MAP_SHARED, + s->device, s->bar_region_info[index].offset); + if (p == MAP_FAILED) { + error_setg_errno(errp, errno, "Failed to map BAR region"); + p = NULL; + } + return p; +} + +/** + * Unmap a PCI bar area. 
+ */
+void nvme_vfio_pci_unmap_bar(NVMeVFIOState *s, int index, void *bar)
+{
+    if (bar) {
+        munmap(bar, MIN(8192, s->bar_region_info[index].size));
+    }
+}
+
+/**
+ * Initialize device IRQ with @irq_type and register an event notifier.
+ */
+int nvme_vfio_pci_init_irq(NVMeVFIOState *s, EventNotifier *e,
+                           int irq_type, Error **errp)
+{
+    int r;
+    struct vfio_irq_set *irq_set;
+    size_t irq_set_size;
+    struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
+
+    irq_info.index = irq_type;
+    if (ioctl(s->device, VFIO_DEVICE_GET_IRQ_INFO, &irq_info)) {
+        error_setg_errno(errp, errno, "Failed to get device interrupt info");
+        return -errno;
+    }
+    if (!(irq_info.flags & VFIO_IRQ_INFO_EVENTFD)) {
+        error_setg(errp, "Device interrupt doesn't support eventfd");
+        return -EINVAL;
+    }
+
+    irq_set_size = sizeof(*irq_set) + sizeof(int);
+    irq_set = g_malloc0(irq_set_size);
+
+    /* Get to a known IRQ state */
+    *irq_set = (struct vfio_irq_set) {
+        .argsz = irq_set_size,
+        .flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
+        .index = irq_info.index,
+        .start = 0,
+        .count = 1,
+    };
+
+    *(int *)&irq_set->data = event_notifier_get_fd(e);
+    r = ioctl(s->device, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (r) {
+        error_setg_errno(errp, errno, "Failed to setup device interrupt");
+        return -errno;
+    }
+    return 0;
+}
+
+static int nvme_vfio_pci_read_config(NVMeVFIOState *s, void *buf,
+                                     int size, int ofs)
+{
+    if (pread(s->device, buf, size,
+              s->config_region_info.offset + ofs) == size) {
+        return 0;
+    }
+    return -1;
+}
+
+static int nvme_vfio_pci_write_config(NVMeVFIOState *s, void *buf, int size, int ofs)
+{
+    if (pwrite(s->device, buf, size,
+               s->config_region_info.offset + ofs) == size) {
+        return 0;
+    }
+
+    return -1;
+}
+
+static int nvme_vfio_init_pci(NVMeVFIOState *s, const char *device,
+                              Error **errp)
+{
+    int ret;
+    int i;
+    uint16_t pci_cmd;
+    struct vfio_group_status group_status = { .argsz = sizeof(group_status) };
+    struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
+    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+    char *group_file = NULL;
+
+    /* Create a new container */
+    s->container = open("/dev/vfio/vfio", O_RDWR);
+    if (s->container < 0) {
+        error_setg_errno(errp, errno, "Failed to open /dev/vfio/vfio");
+        ret = -errno;
+        goto out;
+    }
+
+    if (ioctl(s->container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
+        error_setg(errp, "Invalid VFIO version");
+        ret = -EINVAL;
+        goto out;
+    }
+
+    if (!ioctl(s->container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
+        error_setg_errno(errp, errno, "VFIO IOMMU check failed");
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* Open the group */
+    ret = sysfs_find_group_file(device, &group_file, errp);
+    if (ret) {
+        goto out;
+    }
+
+    s->group = open(group_file, O_RDWR);
+    g_free(group_file);
+    if (s->group <= 0) {
+        error_setg_errno(errp, errno, "Failed to open VFIO group file");
+        ret = -errno;
+        goto out;
+    }
+
+    /* Test the group is viable and available */
+    if (ioctl(s->group, VFIO_GROUP_GET_STATUS, &group_status)) {
+        error_setg_errno(errp, errno, "Failed to get VFIO group status");
+        ret = -errno;
+        goto out;
+    }
+
+    if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+        error_setg(errp, "VFIO group is not viable");
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* Add the group to the container */
+    if (ioctl(s->group, VFIO_GROUP_SET_CONTAINER, &s->container)) {
+        error_setg_errno(errp, errno, "Failed to add group to VFIO container");
+        ret = -errno;
+        goto out;
+    }
+
+    /* Enable the IOMMU model we want */
+    if (ioctl(s->container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)) {
+        error_setg_errno(errp, errno, "Failed to set VFIO IOMMU type");
+        ret = -errno;
+        goto out;
+    }
+
+    /* Get additional IOMMU info */
+    if (ioctl(s->container, VFIO_IOMMU_GET_INFO, &iommu_info)) {
+        error_setg_errno(errp, errno, "Failed to get IOMMU info");
+        ret = -errno;
+        goto out;
+    }
+
+    s->device = ioctl(s->group, VFIO_GROUP_GET_DEVICE_FD, device);
+
+    if (s->device < 0) {
+        error_setg_errno(errp, errno, "Failed to get device fd");
+        ret = -errno;
+        goto out;
+    }
+
+    /* Test and setup the device */
+    if (ioctl(s->device, VFIO_DEVICE_GET_INFO, &device_info)) {
+        error_setg_errno(errp, errno, "Failed to get device info");
+        ret = -errno;
+        goto out;
+    }
+
+    if (device_info.num_regions < VFIO_PCI_CONFIG_REGION_INDEX) {
+        error_setg(errp, "Invalid device regions");
+        ret = -EINVAL;
+        goto out;
+    }
+
+    s->config_region_info = (struct vfio_region_info) {
+        .index = VFIO_PCI_CONFIG_REGION_INDEX,
+        .argsz = sizeof(struct vfio_region_info),
+    };
+    if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, &s->config_region_info)) {
+        error_setg_errno(errp, errno, "Failed to get config region info");
+        ret = -errno;
+        goto out;
+    }
+
+    for (i = 0; i < 6; i++) {
+        ret = nvme_vfio_pci_init_bar(s, i, errp);
+        if (ret) {
+            goto out;
+        }
+    }
+
+    /* Enable bus master */
+    if (nvme_vfio_pci_read_config(s, &pci_cmd, sizeof(pci_cmd),
+                                  PCI_COMMAND) < 0) {
+        ret = -EINVAL;
+        goto out;
+    }
+    pci_cmd |= PCI_COMMAND_MASTER;
+    if (nvme_vfio_pci_write_config(s, &pci_cmd, sizeof(pci_cmd),
+                                   PCI_COMMAND) < 0) {
+        ret = -EINVAL;
+        goto out;
+    }
+out:
+    return ret;
+}
+
+static void nvme_vfio_ram_block_added(RAMBlockNotifier *n,
+                                      void *host, size_t size)
+{
+    NVMeVFIOState *s = container_of(n, NVMeVFIOState, ram_notifier);
+    trace_nvme_vfio_ram_block_added(host, size);
+    nvme_vfio_dma_map(s, host, size, false, NULL);
+}
+
+static void nvme_vfio_ram_block_removed(RAMBlockNotifier *n,
+                                        void *host, size_t size)
+{
+    NVMeVFIOState *s = container_of(n, NVMeVFIOState, ram_notifier);
+    if (host) {
+        trace_nvme_vfio_ram_block_removed(host, size);
+        nvme_vfio_dma_unmap(s, host);
+    }
+}
+
+static int nvme_vfio_init_ramblock(const char *block_name, void *host_addr,
+                                   ram_addr_t offset, ram_addr_t length,
+                                   void *opaque)
+{
+    int ret;
+    NVMeVFIOState *s = opaque;
+
+    if (!host_addr) {
+        return 0;
+    }
+    ret = nvme_vfio_dma_map(s, host_addr, length, false, NULL);
+    if (ret) {
+        fprintf(stderr, "nvme_vfio_init_ramblock: failed %p %ld\n",
+                host_addr, length);
+    }
+    return 0;
+}
+
+static void nvme_vfio_open_common(NVMeVFIOState *s)
+{
+    s->ram_notifier.ram_block_added = nvme_vfio_ram_block_added;
+    s->ram_notifier.ram_block_removed = nvme_vfio_ram_block_removed;
+    ram_block_notifier_add(&s->ram_notifier);
+    s->low_water_mark = NVME_VFIO_IOVA_MIN;
+    s->high_water_mark = NVME_VFIO_IOVA_MAX;
+    qemu_ram_foreach_block(nvme_vfio_init_ramblock, s);
+    qemu_mutex_init(&s->lock);
+}
+
+/**
+ * Open a PCI device, e.g. "0000:00:01.0".
+ */
+NVMeVFIOState *nvme_vfio_open_pci(const char *device, Error **errp)
+{
+    int r;
+    NVMeVFIOState *s = g_new0(NVMeVFIOState, 1);
+
+    r = nvme_vfio_init_pci(s, device, errp);
+    if (r) {
+        g_free(s);
+        return NULL;
+    }
+    nvme_vfio_open_common(s);
+    return s;
+}
+
+static void nvme_vfio_dump_mapping(IOVAMapping *m)
+{
+    if (NVME_DEBUG) {
+        printf(" vfio mapping %p %lx to %lx\n", m->host, m->size, m->iova);
+    }
+}
+
+static void nvme_vfio_dump_mappings(NVMeVFIOState *s)
+{
+    int i;
+
+    if (NVME_DEBUG) {
+        printf("vfio mappings\n");
+        for (i = 0; i < s->nr_mappings; ++i) {
+            nvme_vfio_dump_mapping(&s->mappings[i]);
+        }
+    }
+}
+
+/**
+ * Find the mapping entry that contains [host, host + size) and set @index to
+ * the position. If no entry contains it, @index is the position _after_ which
+ * to insert the new mapping. IOW, it is the index of the largest element that
+ * is smaller than @host, or -1 if no entry is.
+ */
+static IOVAMapping *nvme_vfio_find_mapping(NVMeVFIOState *s, void *host,
+                                           int *index)
+{
+    IOVAMapping *p = s->mappings;
+    IOVAMapping *q = p ? p + s->nr_mappings - 1 : NULL;
+    IOVAMapping *mid = p ? p + (q - p) / 2 : NULL;
+    trace_nvme_vfio_find_mapping(s, host);
+    if (!p) {
+        *index = -1;
+        return NULL;
+    }
+    while (true) {
+        mid = p + (q - p) / 2;
+        if (mid == p) {
+            break;
+        }
+        if (mid->host > host) {
+            q = mid;
+        } else if (mid->host < host) {
+            p = mid;
+        } else {
+            break;
+        }
+    }
+    if (mid->host > host) {
+        mid--;
+    } else if (mid < &s->mappings[s->nr_mappings - 1]
+               && (mid + 1)->host <= host) {
+        mid++;
+    }
+    *index = mid - &s->mappings[0];
+    if (mid >= &s->mappings[0] &&
+        mid->host <= host && mid->host + mid->size > host) {
+        assert(mid < &s->mappings[s->nr_mappings]);
+        return mid;
+    }
+    return NULL;
+}
+
+/**
+ * Allocate an IOVA, create a new mapping record, and insert it into @s.
+ */
+static IOVAMapping *nvme_vfio_add_mapping(NVMeVFIOState *s,
+                                          void *host, size_t size,
+                                          int index, uint64_t iova)
+{
+    int shift;
+    IOVAMapping m = {.host = host, .size = size, .iova = iova};
+    IOVAMapping *insert;
+
+    assert(QEMU_IS_ALIGNED(size, getpagesize()));
+    assert(QEMU_IS_ALIGNED(s->low_water_mark, getpagesize()));
+    assert(QEMU_IS_ALIGNED(s->high_water_mark, getpagesize()));
+    trace_nvme_vfio_new_mapping(s, host, size, index, iova);
+
+    assert(index >= 0);
+    s->nr_mappings++;
+    s->mappings = g_realloc_n(s->mappings, sizeof(s->mappings[0]),
+                              s->nr_mappings);
+    insert = &s->mappings[index];
+    shift = s->nr_mappings - index - 1;
+    if (shift) {
+        memmove(insert + 1, insert, shift * sizeof(s->mappings[0]));
+    }
+    *insert = m;
+    return insert;
+}
+
+/* Do the DMA mapping with VFIO. */
+static int nvme_vfio_do_mapping(NVMeVFIOState *s, void *host, size_t size,
+                                uint64_t iova)
+{
+    struct vfio_iommu_type1_dma_map dma_map = {
+        .argsz = sizeof(dma_map),
+        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
+        .iova = iova,
+        .vaddr = (uintptr_t)host,
+        .size = size,
+    };
+    trace_nvme_vfio_do_mapping(s, host, size, iova);
+
+    if (ioctl(s->container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
+        error_report("VFIO_MAP_DMA: %d", -errno);
+        return -errno;
+    }
+    return 0;
+}
+
+/**
+ * Undo the DMA mapping from @s with VFIO, and remove from mapping list.
+ */
+static void nvme_vfio_undo_mapping(NVMeVFIOState *s, IOVAMapping *mapping,
+                                   Error **errp)
+{
+    int index;
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = 0,
+        .iova = mapping->iova,
+        .size = mapping->size,
+    };
+
+    index = mapping - s->mappings;
+    assert(mapping->size > 0);
+    assert(QEMU_IS_ALIGNED(mapping->size, getpagesize()));
+    assert(index >= 0 && index < s->nr_mappings);
+    if (ioctl(s->container, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        error_setg(errp, "VFIO_UNMAP_DMA failed: %d", -errno);
+    }
+    memmove(mapping, &s->mappings[index + 1],
+            sizeof(s->mappings[0]) * (s->nr_mappings - index - 1));
+    s->nr_mappings--;
+    s->mappings = g_realloc_n(s->mappings, sizeof(s->mappings[0]),
+                              s->nr_mappings);
+}
+
+/* Check if the mapping list is (ascending) ordered. */
+static bool nvme_vfio_verify_mappings(NVMeVFIOState *s)
+{
+    int i;
+    if (NVME_DEBUG) {
+        for (i = 0; i < s->nr_mappings - 1; ++i) {
+            if (!(s->mappings[i].host < s->mappings[i + 1].host)) {
+                fprintf(stderr, "item %d not sorted!\n", i);
+                nvme_vfio_dump_mappings(s);
+                return false;
+            }
+            if (!(s->mappings[i].host + s->mappings[i].size <=
+                  s->mappings[i + 1].host)) {
+                fprintf(stderr, "item %d overlap with next!\n", i);
+                nvme_vfio_dump_mappings(s);
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+/* Map [host, host + size) area into a contiguous IOVA address space, and store
+ * the result in @iova if not NULL. The area must be aligned to page size, and
+ * mustn't overlap with existing mapping areas.
+ */
+int nvme_vfio_dma_map(NVMeVFIOState *s, void *host, size_t size,
+                      bool temporary, uint64_t *iova)
+{
+    int ret = 0;
+    int index;
+    IOVAMapping *mapping;
+    uint64_t iova0;
+
+    assert(QEMU_PTR_IS_ALIGNED(host, getpagesize()));
+    assert(QEMU_IS_ALIGNED(size, getpagesize()));
+    trace_nvme_vfio_dma_map(s, host, size, temporary, iova);
+    qemu_mutex_lock(&s->lock);
+    mapping = nvme_vfio_find_mapping(s, host, &index);
+    if (mapping) {
+        iova0 = mapping->iova + ((uint8_t *)host - (uint8_t *)mapping->host);
+    } else {
+        if (s->high_water_mark - s->low_water_mark + 1 < size) {
+            ret = -ENOMEM;
+            goto out;
+        }
+        if (!temporary) {
+            iova0 = s->low_water_mark;
+            mapping = nvme_vfio_add_mapping(s, host, size, index + 1, iova0);
+            if (!mapping) {
+                ret = -ENOMEM;
+                goto out;
+            }
+            assert(nvme_vfio_verify_mappings(s));
+            ret = nvme_vfio_do_mapping(s, host, size, iova0);
+            if (ret) {
+                nvme_vfio_undo_mapping(s, mapping, NULL);
+                goto out;
+            }
+            s->low_water_mark += size;
+            nvme_vfio_dump_mappings(s);
+        } else {
+            iova0 = s->high_water_mark - size;
+            ret = nvme_vfio_do_mapping(s, host, size, iova0);
+            if (ret) {
+                goto out;
+            }
+            s->high_water_mark -= size;
+        }
+    }
+    if (iova) {
+        *iova = iova0;
+    }
+out:
+    qemu_mutex_unlock(&s->lock);
+    return ret;
+}
+
+/* Reset the high watermark and free all "temporary" mappings. */
+int nvme_vfio_dma_reset_temporary(NVMeVFIOState *s)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = 0,
+        .iova = s->high_water_mark,
+        .size = NVME_VFIO_IOVA_MAX - s->high_water_mark,
+    };
+    trace_nvme_vfio_dma_reset_temporary(s);
+    qemu_mutex_lock(&s->lock);
+    if (ioctl(s->container, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        error_report("VFIO_UNMAP_DMA: %d", -errno);
+        qemu_mutex_unlock(&s->lock);
+        return -errno;
+    }
+    s->high_water_mark = NVME_VFIO_IOVA_MAX;
+    qemu_mutex_unlock(&s->lock);
+    return 0;
+}
+
+/* Unmap the whole area that was previously mapped with
+ * nvme_vfio_dma_map().
+ */
+void nvme_vfio_dma_unmap(NVMeVFIOState *s, void *host)
+{
+    int index = 0;
+    IOVAMapping *m;
+
+    if (!host) {
+        return;
+    }
+
+    trace_nvme_vfio_dma_unmap(s, host);
+    qemu_mutex_lock(&s->lock);
+    m = nvme_vfio_find_mapping(s, host, &index);
+    if (!m) {
+        goto out;
+    }
+    nvme_vfio_undo_mapping(s, m, NULL);
+out:
+    qemu_mutex_unlock(&s->lock);
+}
+
+static void nvme_vfio_reset(NVMeVFIOState *s)
+{
+    ioctl(s->device, VFIO_DEVICE_RESET);
+}
+
+/* Close and free the VFIO resources. */
+void nvme_vfio_close(NVMeVFIOState *s)
+{
+    if (!s) {
+        return;
+    }
+    while (s->nr_mappings) {
+        nvme_vfio_undo_mapping(s, &s->mappings[s->nr_mappings - 1], NULL);
+    }
+    ram_block_notifier_remove(&s->ram_notifier);
+    nvme_vfio_reset(s);
+    close(s->device);
+    close(s->group);
+    close(s->container);
+}
diff --git a/block/nvme-vfio.h b/block/nvme-vfio.h
new file mode 100644
index 0000000..2d5840b
--- /dev/null
+++ b/block/nvme-vfio.h
@@ -0,0 +1,30 @@
+/*
+ * NVMe VFIO interface
+ *
+ * Copyright 2016, 2017 Red Hat, Inc.
+ *
+ * Authors:
+ *   Fam Zheng <famz@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_VFIO_H
+#define QEMU_VFIO_H
+
+#include "qemu/queue.h"
+
+typedef struct NVMeVFIOState NVMeVFIOState;
+
+NVMeVFIOState *nvme_vfio_open_pci(const char *device, Error **errp);
+void nvme_vfio_close(NVMeVFIOState *s);
+int nvme_vfio_dma_map(NVMeVFIOState *s, void *host, size_t size,
+                      bool temporary, uint64_t *iova_list);
+int nvme_vfio_dma_reset_temporary(NVMeVFIOState *s);
+void nvme_vfio_dma_unmap(NVMeVFIOState *s, void *host);
+void *nvme_vfio_pci_map_bar(NVMeVFIOState *s, int index, Error **errp);
+void nvme_vfio_pci_unmap_bar(NVMeVFIOState *s, int index, void *bar);
+int nvme_vfio_pci_init_irq(NVMeVFIOState *s, EventNotifier *e,
+                           int irq_type, Error **errp);
+
+#endif
diff --git a/block/nvme.c b/block/nvme.c
new file mode 100644
index 0000000..eb999a1
--- /dev/null
+++ b/block/nvme.c
@@ -0,0 +1,1091 @@
+/*
+ * NVMe block driver based on vfio
+ *
+ * Copyright 2016, 2017 Red Hat, Inc.
+ *
+ * Authors:
+ *   Fam Zheng <famz@redhat.com>
+ *   Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qstring.h"
+#include "qemu/error-report.h"
+#include "qemu/cutils.h"
+#include "block/block_int.h"
+#include "block/nvme-vfio.h"
+#include "trace.h"
+
+/* TODO: Move nvme spec definitions from hw/block/nvme.h into a separate file
+ * that doesn't depend on dma/pci headers.
*/ +#include "sysemu/dma.h" +#include "hw/pci/pci.h" +#include "hw/block/block.h" +#include "hw/block/nvme.h" + +#define NVME_SQ_ENTRY_BYTES 64 +#define NVME_CQ_ENTRY_BYTES 16 +#define NVME_QUEUE_SIZE 128 + +typedef struct { + int32_t head, tail; + uint8_t *queue; + uint64_t iova; + volatile uint32_t *doorbell; +} NVMeQueue; + +typedef struct { + BlockCompletionFunc *cb; + void *opaque; + int cid; + void *prp_list_page; + uint64_t prp_list_iova; + bool busy; +} NVMeRequest; + +typedef struct { + int index; + NVMeQueue sq, cq; + int cq_phase; + uint8_t *prp_list_pages; + uint64_t prp_list_base_iova; + NVMeRequest reqs[NVME_QUEUE_SIZE]; + CoQueue free_req_queue; + bool busy; + int need_kick; + int inflight; + QemuMutex lock; +} NVMeQueuePair; + +typedef volatile struct { + uint64_t cap; + uint32_t vs; + uint32_t intms; + uint32_t intmc; + uint32_t cc; + uint32_t reserved0; + uint32_t csts; + uint32_t nssr; + uint32_t aqa; + uint64_t asq; + uint64_t acq; + uint32_t cmbloc; + uint32_t cmbsz; + uint8_t reserved1[0xec0]; + uint8_t cmd_set_specfic[0x100]; + uint32_t doorbells[]; +} QEMU_PACKED NVMeRegs; + +QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000); + +typedef struct { + AioContext *aio_context; + NVMeVFIOState *vfio; + NVMeRegs *regs; + /* The submission/completion queue pairs. + * [0]: admin queue. + * [1..]: io queues. + */ + NVMeQueuePair **queues; + int nr_queues; + size_t page_size; + /* How many uint32_t elements does each doorbell entry take. */ + size_t doorbell_scale; + bool write_cache; + EventNotifier irq_notifier; + uint64_t nsze; /* Namespace size reported by identify command */ + int nsid; /* The namespace id to read/write data. */ + uint64_t max_transfer; + int plugged; + + CoMutex dma_map_lock; + CoQueue dma_flush_queue; + + /* Total inflight */ + int inflight; +} BDRVNVMeState; + +#define NVME_BLOCK_OPT_DEVICE "device" +#define NVME_BLOCK_OPT_NAMESPACE "namespace" + +static QemuOptsList runtime_opts = { + .name = "nvme", + .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head), + .desc = { + { + .name = NVME_BLOCK_OPT_DEVICE, + .type = QEMU_OPT_STRING, + .help = "NVMe PCI device address", + }, + { + .name = NVME_BLOCK_OPT_NAMESPACE, + .type = QEMU_OPT_NUMBER, + .help = "NVMe namespace", + }, + { /* end of list */ } + }, +}; + +static void nvme_init_queue(BlockDriverState *bs, NVMeQueue *q, + int nentries, int entry_bytes, Error **errp) +{ + BDRVNVMeState *s = bs->opaque; + size_t bytes; + int r; + + bytes = ROUND_UP(nentries * entry_bytes, s->page_size); + q->head = q->tail = 0; + q->queue = qemu_try_blockalign0(bs, bytes); + + if (!q->queue) { + error_setg(errp, "Cannot allocate queue"); + return; + } + r = nvme_vfio_dma_map(s->vfio, q->queue, bytes, false, &q->iova); + if (r) { + error_setg(errp, "Cannot map queue"); + } +} + +static void nvme_free_queue_pair(BlockDriverState *bs, NVMeQueuePair *q) +{ + qemu_vfree(q->prp_list_pages); + qemu_vfree(q->sq.queue); + qemu_vfree(q->cq.queue); + g_free(q); +} + +static void nvme_free_req_queue_cb(void *opaque) +{ + NVMeQueuePair *q = opaque; + + qemu_co_enter_next(&q->free_req_queue); +} + +static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs, + int idx, int size, + Error **errp) +{ + int i, r; + BDRVNVMeState *s = bs->opaque; + Error *local_err = NULL; + NVMeQueuePair *q = g_new0(NVMeQueuePair, 1); + uint64_t prp_list_iova; + + qemu_mutex_init(&q->lock); + q->index = idx; + qemu_co_queue_init(&q->free_req_queue); + q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_QUEUE_SIZE); + r = 
nvme_vfio_dma_map(s->vfio, q->prp_list_pages, + s->page_size * NVME_QUEUE_SIZE, + false, &prp_list_iova); + if (r) { + goto fail; + } + for (i = 0; i < NVME_QUEUE_SIZE; i++) { + NVMeRequest *req = &q->reqs[i]; + req->cid = i + 1; + req->prp_list_page = q->prp_list_pages + i * s->page_size; + req->prp_list_iova = prp_list_iova + i * s->page_size; + } + nvme_init_queue(bs, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err); + if (local_err) { + error_propagate(errp, local_err); + goto fail; + } + q->sq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale]; + + nvme_init_queue(bs, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err); + if (local_err) { + error_propagate(errp, local_err); + goto fail; + } + q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1]; + + return q; +fail: + nvme_free_queue_pair(bs, q); + return NULL; +} + +/* With q->lock */ +static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q) +{ + if (s->plugged || !q->need_kick) { + return; + } + trace_nvme_kick(s, q->index); + assert(!(q->sq.tail & 0xFF00)); + /* Fence the write to submission queue entry before notifying the device. */ + smp_wmb(); + *q->sq.doorbell = cpu_to_le32(q->sq.tail); + q->inflight += q->need_kick; + s->inflight += q->need_kick; + q->need_kick = 0; +} + +static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q) +{ + int i; + NVMeRequest *req = NULL; + + qemu_mutex_lock(&q->lock); + while (q->inflight + q->need_kick > NVME_QUEUE_SIZE - 2) { + /* We have to leave one slot empty as that is the full queue case (head + * == tail + 1). */ + trace_nvme_free_req_queue_wait(q); + qemu_mutex_unlock(&q->lock); + qemu_co_queue_wait(&q->free_req_queue, NULL); + qemu_mutex_lock(&q->lock); + } + for (i = 0; i < NVME_QUEUE_SIZE; i++) { + if (!q->reqs[i].busy) { + q->reqs[i].busy = true; + req = &q->reqs[i]; + break; + } + } + assert(req); + qemu_mutex_unlock(&q->lock); + return req; +} + +static inline int nvme_translate_error(const NvmeCqe *c) +{ + uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF; + if (status) { + trace_nvme_error(c->result, c->sq_head, c->sq_id, c->cid, status); + } + switch (status) { + case 0: + return 0; + case 1: + return -ENOSYS; + case 2: + return -EINVAL; + default: + return -EIO; + } +} + +/* With q->lock */ +static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q) +{ + bool progress = false; + NVMeRequest *req; + NvmeCqe *c; + + trace_nvme_process_completion(s, q->index, q->inflight); + if (q->busy || s->plugged) { + trace_nvme_process_completion_queue_busy(s, q->index); + return false; + } + q->busy = true; + assert(q->inflight >= 0); + while (q->inflight) { + c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES]; + if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) { + break; + } + q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE; + if (!q->cq.head) { + q->cq_phase = !q->cq_phase; + } + if (c->cid == 0 || c->cid > NVME_QUEUE_SIZE) { + fprintf(stderr, "Unexpected CID in completion queue: %" PRIu32 "\n", + c->cid); + continue; + } + assert(c->cid <= NVME_QUEUE_SIZE); + trace_nvme_complete_command(s, q->index, c->cid); + req = &q->reqs[c->cid - 1]; + assert(req->cid == c->cid); + assert(req->cb); + req->cb(req->opaque, nvme_translate_error(c)); + req->cb = req->opaque = NULL; + req->busy = false; + if (!qemu_co_queue_empty(&q->free_req_queue)) { + aio_bh_schedule_oneshot(s->aio_context, nvme_free_req_queue_cb, q); + } + c->cid = 0; + q->inflight--; + s->inflight--; + /* Flip Phase Tag bit. 
*/ + c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1); + progress = true; + } + if (progress) { + /* Notify the device so it can post more completions. */ + smp_mb_release(); + *q->cq.doorbell = cpu_to_le32(q->cq.head); + } + q->busy = false; + return progress; +} + +static void nvme_trace_command(const NvmeCmd *cmd) +{ + int i; + + for (i = 0; i < 8; ++i) { + uint8_t *cmdp = (uint8_t *)cmd + i * 8; + trace_nvme_submit_command_raw(cmdp[0], cmdp[1], cmdp[2], cmdp[3], + cmdp[4], cmdp[5], cmdp[6], cmdp[7]); + } +} + +static void nvme_submit_command(BDRVNVMeState *s, NVMeQueuePair *q, + NVMeRequest *req, + NvmeCmd *cmd, BlockCompletionFunc cb, + void *opaque) +{ + assert(!req->cb); + req->cb = cb; + req->opaque = opaque; + cmd->cid = cpu_to_le32(req->cid); + + trace_nvme_submit_command(s, q->index, req->cid); + nvme_trace_command(cmd); + qemu_mutex_lock(&q->lock); + memcpy((uint8_t *)q->sq.queue + + q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd)); + q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE; + q->need_kick++; + nvme_kick(s, q); + nvme_process_completion(s, q); + qemu_mutex_unlock(&q->lock); +} + +static void nvme_cmd_sync_cb(void *opaque, int ret) +{ + int *pret = opaque; + *pret = ret; +} + +static int nvme_cmd_sync(BlockDriverState *bs, NVMeQueuePair *q, + NvmeCmd *cmd) +{ + NVMeRequest *req; + BDRVNVMeState *s = bs->opaque; + int ret = -EINPROGRESS; + req = nvme_get_free_req(q); + if (!req) { + return -EBUSY; + } + nvme_submit_command(s, q, req, cmd, nvme_cmd_sync_cb, &ret); + + BDRV_POLL_WHILE(bs, ret == -EINPROGRESS); + return ret; +} + +static bool nvme_identify(BlockDriverState *bs, int namespace, Error **errp) +{ + BDRVNVMeState *s = bs->opaque; + uint8_t *resp; + int r; + uint64_t iova; + NvmeCmd cmd = { + .opcode = NVME_ADM_CMD_IDENTIFY, + .cdw10 = cpu_to_le32(0x1), + }; + + resp = qemu_try_blockalign0(bs, 4096); + if (!resp) { + error_setg(errp, "Cannot allocate buffer for identify response"); + return false; + } + r = nvme_vfio_dma_map(s->vfio, resp, 4096, true, &iova); + if (r) { + error_setg(errp, "Cannot map buffer for DMA"); + goto fail; + } + cmd.prp1 = cpu_to_le64(iova); + + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) { + error_setg(errp, "Failed to identify controller"); + goto fail; + } + + if (le32_to_cpu(*(uint32_t *)&resp[516]) < namespace) { + error_setg(errp, "Invalid namespace"); + goto fail; + } + s->write_cache = le32_to_cpu(resp[525]) & 0x1; + s->max_transfer = (resp[77] ? 1 << resp[77] : 0) * s->page_size; + /* For now the page list buffer per command is one page, to hold at most + * s->page_size / sizeof(uint64_t) entries. 
*/ + s->max_transfer = MIN_NON_ZERO(s->max_transfer, + s->page_size / sizeof(uint64_t) * s->page_size); + + memset((char *)resp, 0, 4096); + + cmd.cdw10 = 0; + cmd.nsid = namespace; + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) { + error_setg(errp, "Failed to identify namespace"); + goto fail; + } + + s->nsze = le64_to_cpu(*(uint64_t *)&resp[0]); + + nvme_vfio_dma_unmap(s->vfio, resp); + qemu_vfree(resp); + return true; +fail: + qemu_vfree(resp); + return false; +} + +static bool nvme_poll_queues(BDRVNVMeState *s) +{ + bool progress = false; + int i; + + for (i = 0; i < s->nr_queues; i++) { + NVMeQueuePair *q = s->queues[i]; + qemu_mutex_lock(&q->lock); + while (nvme_process_completion(s, q)) { + /* Keep polling */ + progress = true; + } + qemu_mutex_unlock(&q->lock); + } + return progress; +} + +static void nvme_handle_event(EventNotifier *n) +{ + BDRVNVMeState *s = container_of(n, BDRVNVMeState, irq_notifier); + + trace_nvme_handle_event(s); + aio_context_acquire(s->aio_context); + event_notifier_test_and_clear(n); + nvme_poll_queues(s); + aio_context_release(s->aio_context); +} + +static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp) +{ + BDRVNVMeState *s = bs->opaque; + int n = s->nr_queues; + NVMeQueuePair *q; + NvmeCmd cmd; + int queue_size = NVME_QUEUE_SIZE; + + q = nvme_create_queue_pair(bs, n, queue_size, errp); + if (!q) { + return false; + } + cmd = (NvmeCmd) { + .opcode = NVME_ADM_CMD_CREATE_CQ, + .prp1 = cpu_to_le64(q->cq.iova), + .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)), + .cdw11 = cpu_to_le32(0x3), + }; + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) { + error_setg(errp, "Failed to create io queue [%d]", n); + nvme_free_queue_pair(bs, q); + return false; + } + cmd = (NvmeCmd) { + .opcode = NVME_ADM_CMD_CREATE_SQ, + .prp1 = cpu_to_le64(q->sq.iova), + .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)), + .cdw11 = cpu_to_le32(0x1 | (n << 16)), + }; + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) { + error_setg(errp, "Failed to create io queue [%d]", n); + nvme_free_queue_pair(bs, q); + return false; + } + s->queues = g_renew(NVMeQueuePair *, s->queues, n + 1); + s->queues[n] = q; + s->nr_queues++; + return true; +} + +static bool nvme_poll_cb(void *opaque) +{ + EventNotifier *e = opaque; + BDRVNVMeState *s = container_of(e, BDRVNVMeState, irq_notifier); + bool progress = false; + + aio_context_acquire(s->aio_context); + trace_nvme_poll_cb(s); + progress = nvme_poll_queues(s); + aio_context_release(s->aio_context); + return progress; +} + +static int nvme_init(BlockDriverState *bs, const char *device, int namespace, + Error **errp) +{ + BDRVNVMeState *s = bs->opaque; + int ret; + uint64_t cap; + uint64_t timeout_ms; + uint64_t deadline, now; + + qemu_co_mutex_init(&s->dma_map_lock); + qemu_co_queue_init(&s->dma_flush_queue); + s->nsid = namespace; + s->aio_context = qemu_get_current_aio_context(); + ret = event_notifier_init(&s->irq_notifier, 0); + if (ret) { + error_setg(errp, "Failed to init event notifier"); + return ret; + } + + s->vfio = nvme_vfio_open_pci(device, errp); + if (!s->vfio) { + ret = -EINVAL; + goto fail; + } + + s->regs = nvme_vfio_pci_map_bar(s->vfio, 0, errp); + if (!s->regs) { + ret = -EINVAL; + goto fail; + } + + /* Perform initialize sequence as described in NVMe spec "7.6.1 + * Initialization". 
*/ + + cap = le64_to_cpu(s->regs->cap); + if (!(cap & (1ULL << 37))) { + error_setg(errp, "Device doesn't support NVMe command set"); + ret = -EINVAL; + goto fail; + } + + s->page_size = MAX(4096, 1 << (12 + ((cap >> 48) & 0xF))); + s->doorbell_scale = (4 << (((cap >> 32) & 0xF))) / sizeof(uint32_t); + bs->bl.opt_mem_alignment = s->page_size; + timeout_ms = MIN(500 * ((cap >> 24) & 0xFF), 30000); + + /* Reset device to get a clean state. */ + s->regs->cc = cpu_to_le32(le32_to_cpu(s->regs->cc) & 0xFE); + /* Wait for CSTS.RDY = 0. */ + deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout_ms * 1000000ULL; + while (le32_to_cpu(s->regs->csts) & 0x1) { + if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) { + error_setg(errp, "Timeout while waiting for device to reset (%ld ms)", + timeout_ms); + ret = -ETIMEDOUT; + goto fail; + } + } + + /* Set up admin queue. */ + s->queues = g_new(NVMeQueuePair *, 1); + s->nr_queues = 1; + s->queues[0] = nvme_create_queue_pair(bs, 0, NVME_QUEUE_SIZE, errp); + if (!s->queues[0]) { + ret = -EINVAL; + goto fail; + } + QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000); + s->regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE); + s->regs->asq = cpu_to_le64(s->queues[0]->sq.iova); + s->regs->acq = cpu_to_le64(s->queues[0]->cq.iova); + + /* After setting up all control registers we can enable device now. */ + s->regs->cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) | + (ctz32(NVME_SQ_ENTRY_BYTES) << 16) | + 0x1); + /* Wait for CSTS.RDY = 1. */ + now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); + deadline = now + timeout_ms * 1000000; + while (!(le32_to_cpu(s->regs->csts) & 0x1)) { + if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) { + error_setg(errp, "Timeout while waiting for device to start (%ld ms)", + timeout_ms); + ret = -ETIMEDOUT; + goto fail_queue; + } + } + + ret = nvme_vfio_pci_init_irq(s->vfio, &s->irq_notifier, + VFIO_PCI_MSIX_IRQ_INDEX, errp); + if (ret) { + goto fail_queue; + } + aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier, + false, nvme_handle_event, nvme_poll_cb); + + if (!nvme_identify(bs, namespace, errp)) { + ret = -EIO; + goto fail_handler; + } + + /* Set up command queues. */ + if (!nvme_add_io_queue(bs, errp)) { + ret = -EIO; + goto fail_handler; + } + return 0; + +fail_handler: + aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier, + false, NULL, NULL); +fail_queue: + nvme_free_queue_pair(bs, s->queues[0]); +fail: + nvme_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs); + nvme_vfio_close(s->vfio); + event_notifier_cleanup(&s->irq_notifier); + return ret; +} + +/* Parse a filename in the format of nvme://XXXX:XX:XX.X/X. Example: + * + * nvme://0000:44:00.0/1 + * + * where the "nvme://" is a fixed form of the protocol prefix, the middle part + * is the PCI address, and the last part is the namespace number starting from + * 1 according to the NVMe spec. 
*/ +static void nvme_parse_filename(const char *filename, QDict *options, + Error **errp) +{ + int pref = strlen("nvme://"); + + if (strlen(filename) > pref && !strncmp(filename, "nvme://", pref)) { + const char *tmp = filename + pref; + char *device; + const char *namespace; + unsigned long ns; + const char *slash = strchr(tmp, '/'); + if (!slash) { + qdict_put(options, NVME_BLOCK_OPT_DEVICE, + qstring_from_str(tmp)); + return; + } + device = g_strndup(tmp, slash - tmp); + qdict_put(options, NVME_BLOCK_OPT_DEVICE, qstring_from_str(device)); + g_free(device); + namespace = slash + 1; + if (*namespace && qemu_strtoul(namespace, NULL, 10, &ns)) { + error_setg(errp, "Invalid namespace '%s', positive number expected", + namespace); + return; + } + qdict_put(options, NVME_BLOCK_OPT_NAMESPACE, + qstring_from_str(*namespace ? namespace : "1")); + } +} + +static int nvme_file_open(BlockDriverState *bs, QDict *options, int flags, + Error **errp) +{ + const char *device; + QemuOpts *opts; + int namespace; + + opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort); + qemu_opts_absorb_qdict(opts, options, &error_abort); + device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE); + if (!device) { + error_setg(errp, "'" NVME_BLOCK_OPT_DEVICE "' option is required"); + return -EINVAL; + } + + namespace = qemu_opt_get_number(opts, NVME_BLOCK_OPT_NAMESPACE, 1); + nvme_init(bs, device, namespace, errp); + + qemu_opts_del(opts); + bs->supported_write_flags = BDRV_REQ_FUA; + return 0; +} + +static void nvme_close(BlockDriverState *bs) +{ + int i; + BDRVNVMeState *s = bs->opaque; + + for (i = 0; i < s->nr_queues; ++i) { + nvme_free_queue_pair(bs, s->queues[i]); + } + aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier, + false, NULL, NULL); + nvme_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs); + nvme_vfio_close(s->vfio); +} + +static int64_t nvme_getlength(BlockDriverState *bs) +{ + BDRVNVMeState *s = bs->opaque; + + return s->nsze << BDRV_SECTOR_BITS; +} + +static coroutine_fn int nvme_cmd_unmap_qiov(BlockDriverState *bs, + QEMUIOVector *qiov) +{ + int r = 0; + BDRVNVMeState *s = bs->opaque; + + if (!s->inflight && !qemu_co_queue_empty(&s->dma_flush_queue)) { + r = nvme_vfio_dma_reset_temporary(s->vfio); + qemu_co_queue_restart_all(&s->dma_flush_queue); + } + return r; +} + +static coroutine_fn int nvme_cmd_map_qiov(BlockDriverState *bs, NvmeCmd *cmd, + NVMeRequest *req, QEMUIOVector *qiov) +{ + BDRVNVMeState *s = bs->opaque; + uint64_t *pagelist = req->prp_list_page; + int i, j, r; + int entries = 0; + + assert(qiov->size); + assert(QEMU_IS_ALIGNED(qiov->size, s->page_size)); + assert(qiov->size / s->page_size <= s->page_size / sizeof(uint64_t)); + for (i = 0; i < qiov->niov; ++i) { + bool retry = true; + uint64_t iova; + qemu_co_mutex_lock(&s->dma_map_lock); +try_map: + r = nvme_vfio_dma_map(s->vfio, + qiov->iov[i].iov_base, + qiov->iov[i].iov_len, + true, &iova); + if (r == -ENOMEM && retry) { + retry = false; + trace_nvme_dma_flush_queue_wait(s); + if (s->inflight) { + trace_nvme_dma_map_flush(s); + qemu_co_queue_wait(&s->dma_flush_queue, &s->dma_map_lock); + } else { + r = nvme_vfio_dma_reset_temporary(s->vfio); + if (r) { + return r; + } + } + goto try_map; + } + qemu_co_mutex_unlock(&s->dma_map_lock); + if (r) { + return r; + } + + for (j = 0; j < qiov->iov[i].iov_len / s->page_size; j++) { + pagelist[entries++] = iova + j * s->page_size; + } + trace_nvme_cmd_map_qiov_iov(s, i, qiov->iov[i].iov_base, + qiov->iov[i].iov_len / s->page_size); + } + + assert(entries <= s->page_size / 
sizeof(uint64_t)); + switch (entries) { + case 0: + abort(); + case 1: + cmd->prp1 = cpu_to_le64(pagelist[0]); + cmd->prp2 = 0; + break; + case 2: + cmd->prp1 = cpu_to_le64(pagelist[0]); + cmd->prp2 = cpu_to_le64(pagelist[1]);; + break; + default: + cmd->prp1 = cpu_to_le64(pagelist[0]); + cmd->prp2 = cpu_to_le64(req->prp_list_iova); + for (i = 0; i < entries - 1; ++i) { + pagelist[i] = cpu_to_le64(pagelist[i + 1]); + } + pagelist[entries - 1] = 0; + break; + } + trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries); + for (i = 0; i < entries; ++i) { + trace_nvme_cmd_map_qiov_pages(s, i, pagelist[i]); + } + return 0; +} + +typedef struct { + Coroutine *co; + int ret; + AioContext *ctx; +} NVMeCoData; + +static void nvme_rw_cb_bh(void *opaque) +{ + NVMeCoData *data = opaque; + qemu_coroutine_enter(data->co); +} + +static void nvme_rw_cb(void *opaque, int ret) +{ + NVMeCoData *data = opaque; + data->ret = ret; + if (!data->co) { + /* The rw coroutine hasn't yielded, don't try to enter. */ + return; + } + aio_bh_schedule_oneshot(data->ctx, nvme_rw_cb_bh, data); +} + +static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs, + uint64_t offset, uint64_t bytes, + QEMUIOVector *qiov, + bool is_write, + int flags) +{ + int r; + BDRVNVMeState *s = bs->opaque; + NVMeQueuePair *ioq = s->queues[1]; + NVMeRequest *req; + uint32_t cdw12 = (((bytes >> BDRV_SECTOR_BITS) - 1) & 0xFFFF) | + (flags & BDRV_REQ_FUA ? 1 << 30 : 0); + NvmeCmd cmd = { + .opcode = is_write ? NVME_CMD_WRITE : NVME_CMD_READ, + .nsid = cpu_to_le32(s->nsid), + .cdw10 = cpu_to_le32((offset >> BDRV_SECTOR_BITS) & 0xFFFFFFFF), + .cdw11 = cpu_to_le32(((offset >> BDRV_SECTOR_BITS) >> 32) & 0xFFFFFFFF), + .cdw12 = cpu_to_le32(cdw12), + }; + NVMeCoData data = { + .ctx = bdrv_get_aio_context(bs), + .ret = -EINPROGRESS, + }; + + trace_nvme_prw_aligned(s, is_write, offset, bytes, flags, qiov->niov); + assert(s->nr_queues > 1); + req = nvme_get_free_req(ioq); + + r = nvme_cmd_map_qiov(bs, &cmd, req, qiov); + if (r) { + req->busy = false; + return r; + } + nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data); + + data.co = qemu_coroutine_self(); + while (data.ret == -EINPROGRESS) { + qemu_coroutine_yield(); + } + + r = nvme_cmd_unmap_qiov(bs, qiov); + if (r) { + return r; + } + + trace_nvme_rw_done(s, is_write, offset, bytes, data.ret); + return data.ret; +} + +static inline bool nvme_qiov_aligned(BlockDriverState *bs, + const QEMUIOVector *qiov) +{ + int i; + BDRVNVMeState *s = bs->opaque; + + for (i = 0; i < qiov->niov; ++i) { + if (!QEMU_PTR_IS_ALIGNED(qiov->iov[i].iov_base, s->page_size) || + !QEMU_IS_ALIGNED(qiov->iov[i].iov_len, s->page_size)) { + trace_nvme_qiov_unaligned(qiov, i, qiov->iov[i].iov_base, + qiov->iov[i].iov_len, s->page_size); + return false; + } + } + return true; +} + +static int nvme_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes, + QEMUIOVector *qiov, bool is_write, int flags) +{ + BDRVNVMeState *s = bs->opaque; + int r; + uint8_t *buf = NULL; + QEMUIOVector local_qiov; + + assert(QEMU_IS_ALIGNED(offset, s->page_size)); + assert(QEMU_IS_ALIGNED(bytes, s->page_size)); + assert(bytes <= s->max_transfer); + if (nvme_qiov_aligned(bs, qiov)) { + return nvme_co_prw_aligned(bs, offset, bytes, qiov, is_write, flags); + } + trace_nvme_prw_buffered(s, offset, bytes, qiov->niov, is_write); + buf = qemu_try_blockalign(bs, bytes); + + if (!buf) { + return -ENOMEM; + } + qemu_iovec_init(&local_qiov, 1); + if (is_write) { + qemu_iovec_to_buf(qiov, 0, buf, bytes); + } + qemu_iovec_add(&local_qiov, buf, bytes); + 
r = nvme_co_prw_aligned(bs, offset, bytes, &local_qiov, is_write, flags); + qemu_iovec_destroy(&local_qiov); + if (!r && !is_write) { + qemu_iovec_from_buf(qiov, 0, buf, bytes); + } + qemu_vfree(buf); + return r; +} + +static coroutine_fn int nvme_co_preadv(BlockDriverState *bs, + uint64_t offset, uint64_t bytes, + QEMUIOVector *qiov, int flags) +{ + return nvme_co_prw(bs, offset, bytes, qiov, false, flags); +} + +static coroutine_fn int nvme_co_pwritev(BlockDriverState *bs, + uint64_t offset, uint64_t bytes, + QEMUIOVector *qiov, int flags) +{ + return nvme_co_prw(bs, offset, bytes, qiov, true, flags); +} + +static coroutine_fn int nvme_co_flush(BlockDriverState *bs) +{ + BDRVNVMeState *s = bs->opaque; + NVMeQueuePair *ioq = s->queues[1]; + NVMeRequest *req; + NvmeCmd cmd = { + .opcode = NVME_CMD_FLUSH, + .nsid = cpu_to_le32(s->nsid), + }; + NVMeCoData data = { + .ctx = bdrv_get_aio_context(bs), + .ret = -EINPROGRESS, + }; + + assert(s->nr_queues > 1); + req = nvme_get_free_req(ioq); + nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data); + + data.co = qemu_coroutine_self(); + if (data.ret == -EINPROGRESS) { + qemu_coroutine_yield(); + } + + return data.ret; +} + + +static int nvme_reopen_prepare(BDRVReopenState *reopen_state, + BlockReopenQueue *queue, Error **errp) +{ + return 0; +} + +static int64_t coroutine_fn nvme_co_get_block_status(BlockDriverState *bs, + int64_t sector_num, + int nb_sectors, int *pnum, + BlockDriverState **file) +{ + *pnum = nb_sectors; + *file = bs; + + return BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_OFFSET_VALID | + (sector_num << BDRV_SECTOR_BITS); +} + +static void nvme_refresh_filename(BlockDriverState *bs, QDict *opts) +{ + QINCREF(opts); + qdict_del(opts, "filename"); + + if (!qdict_size(opts)) { + snprintf(bs->exact_filename, sizeof(bs->exact_filename), "%s://", + bs->drv->format_name); + } + + qdict_put(opts, "driver", qstring_from_str(bs->drv->format_name)); + bs->full_open_options = opts; +} + +static void nvme_refresh_limits(BlockDriverState *bs, Error **errp) +{ + BDRVNVMeState *s = bs->opaque; + + bs->bl.opt_mem_alignment = s->page_size; + bs->bl.request_alignment = s->page_size; + bs->bl.max_transfer = s->max_transfer; +} + +static void nvme_detach_aio_context(BlockDriverState *bs) +{ + BDRVNVMeState *s = bs->opaque; + + aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier, + false, NULL, NULL); +} + +static void nvme_attach_aio_context(BlockDriverState *bs, + AioContext *new_context) +{ + BDRVNVMeState *s = bs->opaque; + + s->aio_context = new_context; + aio_set_event_notifier(new_context, &s->irq_notifier, + false, nvme_handle_event, nvme_poll_cb); +} + +static void nvme_aio_plug(BlockDriverState *bs) +{ + BDRVNVMeState *s = bs->opaque; + s->plugged++; +} + +static void nvme_aio_unplug(BlockDriverState *bs) +{ + int i; + BDRVNVMeState *s = bs->opaque; + assert(s->plugged); + if (!--s->plugged) { + for (i = 1; i < s->nr_queues; i++) { + NVMeQueuePair *q = s->queues[i]; + qemu_mutex_lock(&q->lock); + nvme_kick(s, q); + nvme_process_completion(s, q); + qemu_mutex_unlock(&q->lock); + } + } +} + +static BlockDriver bdrv_nvme = { + .format_name = "nvme", + .protocol_name = "nvme", + .instance_size = sizeof(BDRVNVMeState), + + .bdrv_parse_filename = nvme_parse_filename, + .bdrv_file_open = nvme_file_open, + .bdrv_close = nvme_close, + .bdrv_getlength = nvme_getlength, + + .bdrv_co_preadv = nvme_co_preadv, + .bdrv_co_pwritev = nvme_co_pwritev, + .bdrv_co_flush_to_disk = nvme_co_flush, + .bdrv_reopen_prepare = nvme_reopen_prepare, + + 
.bdrv_co_get_block_status = nvme_co_get_block_status, + + .bdrv_refresh_filename = nvme_refresh_filename, + .bdrv_refresh_limits = nvme_refresh_limits, + + .bdrv_detach_aio_context = nvme_detach_aio_context, + .bdrv_attach_aio_context = nvme_attach_aio_context, + + .bdrv_io_plug = nvme_aio_plug, + .bdrv_io_unplug = nvme_aio_unplug, +}; + +static void bdrv_nvme_init(void) +{ + bdrv_register(&bdrv_nvme); +} + +block_init(bdrv_nvme_init); diff --git a/block/trace-events b/block/trace-events index 752de6a..3637d00 100644 --- a/block/trace-events +++ b/block/trace-events @@ -124,3 +124,35 @@ vxhs_open_iio_open(const char *host) "Failed to connect to storage agent on host vxhs_parse_uri_hostinfo(char *host, int port) "Host: IP %s, Port %d" vxhs_close(char *vdisk_guid) "Closing vdisk %s" vxhs_get_creds(const char *cacert, const char *client_key, const char *client_cert) "cacert %s, client_key %s, client_cert %s" + +# block/nvme.c +nvme_kick(void *s, int queue) "s %p queue %d" +nvme_dma_flush_queue_wait(void *s) "s %p" +nvme_vfio_ram_block_removed(void *s, size_t size) "host %p size %zu" +nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %d sq_head %d sqid %d cid %d status %x" +nvme_process_completion(void *s, int index, int inflight) "s %p queue %d inflight %d" +nvme_process_completion_queue_busy(void *s, int index) "s %p queue %d" +nvme_complete_command(void *s, int index, int cid) "s %p queue %d cid %d" +nvme_submit_command(void *s, int index, int cid) "s %p queue %d cid %d" +nvme_submit_command_raw(int c0, int c1, int c2, int c3, int c4, int c5, int c6, int c7) "%02x %02x %02x %02x %02x %02x %02x %02x" +nvme_handle_event(void *s) "s %p" +nvme_poll_cb(void *s) "s %p" +nvme_prw_aligned(void *s, int is_write, uint64_t offset, uint64_t bytes, int flags, int niov) "s %p is_write %d offset %"PRId64" bytes %"PRId64" flags %d niov %d" +nvme_qiov_unaligned(const void *qiov, int n, void *base, size_t size, int align) "qiov %p n %d base %p size 0x%zx align 0x%x" +nvme_prw_buffered(void *s, uint64_t offset, uint64_t bytes, int niov, int is_write) "s %p offset %"PRId64" bytes %"PRId64" niov %d is_write %d" +nvme_rw_done(void *s, int is_write, uint64_t offset, uint64_t bytes, int ret) "s %p is_write %d offset %"PRId64" bytes %"PRId64" ret %d" +nvme_dma_map_flush(void *s) "s %p" +nvme_free_req_queue_wait(void *q) "q %p" +nvme_cmd_map_qiov(void *s, void *cmd, void *req, void *qiov, int entries) "s %p cmd %p req %p qiov %p entries %d" +nvme_cmd_map_qiov_pages(void *s, int i, uint64_t page) "s %p page[%d] %"PRIx64 +nvme_cmd_map_qiov_iov(void *s, int i, void *page, int pages) "s %p iov[%d] %p pages %d" + +# block/nvme-vfio.c +nvme_vfio_dma_reset_temporary(void *s) "s %p" +nvme_vfio_ram_block_added(void *p, size_t size) "host %p size %zu" +nvme_vfio_find_mapping(void *s, void *p) "s %p host %p" +nvme_vfio_new_mapping(void *s, void *host, size_t size, int index, uint64_t iova) "s %p host %p size %zu index %d iova %"PRIx64 +nvme_vfio_do_mapping(void *s, void *host, size_t size, uint64_t iova) "s %p host %p size %zu iova %"PRIx64 +nvme_vfio_dma_map(void *s, void *host, size_t size, bool temporary, uint64_t *iova) "s %p host %p size %zu temporary %d iova %p" +nvme_vfio_dma_map_invalid(void *s, void *mapping_host, size_t mapping_size, void *host, size_t size) "s %p mapping %p %zu requested %p %zu" +nvme_vfio_dma_unmap(void *s, void *host) "s %p host %p" -- 2.9.4 ^ permalink raw reply related [flat|nested] 33+ messages in thread
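The "split vfio addr space" approach mentioned in the v2 changelog is the
core of nvme-vfio.c above. Stripped of the VFIO ioctls, the mapping list,
and the locking, the IOVA accounting reduces to two water marks. The
standalone model below is a sketch with hypothetical names, not code from
the patch:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IOVA_MIN 0x10000ULL
#define IOVA_MAX (1ULL << 39)

typedef struct {
    uint64_t low_water_mark;   /* next free IOVA for fixed mappings */
    uint64_t high_water_mark;  /* temporary mappings live at and above this */
} IOVAAllocator;

static void iova_init(IOVAAllocator *a)
{
    a->low_water_mark = IOVA_MIN;
    a->high_water_mark = IOVA_MAX;
}

/* Carve size bytes out of the free area between the two water marks.
 * Fixed mappings grow upwards from the bottom and are never reclaimed;
 * temporary mappings grow downwards from the top. */
static bool iova_alloc(IOVAAllocator *a, uint64_t size, bool temporary,
                       uint64_t *iova)
{
    if (a->high_water_mark - a->low_water_mark < size) {
        return false;                 /* free area exhausted */
    }
    if (temporary) {
        a->high_water_mark -= size;
        *iova = a->high_water_mark;
    } else {
        *iova = a->low_water_mark;
        a->low_water_mark += size;    /* never shrinks */
    }
    return true;
}

/* Reclaim every temporary mapping at once, as
 * nvme_vfio_dma_reset_temporary() does with one VFIO_IOMMU_UNMAP_DMA
 * call spanning [high_water_mark, IOVA_MAX). */
static void iova_reset_temporary(IOVAAllocator *a)
{
    a->high_water_mark = IOVA_MAX;
}

int main(void)
{
    IOVAAllocator a;
    uint64_t fixed, temp;

    iova_init(&a);
    assert(iova_alloc(&a, 0x10000, false, &fixed));  /* from the bottom */
    assert(iova_alloc(&a, 0x10000, true, &temp));    /* from the top */
    printf("fixed @ 0x%llx, temp @ 0x%llx\n",
           (unsigned long long)fixed, (unsigned long long)temp);
    iova_reset_temporary(&a);
    return 0;
}

The asymmetry is deliberate: fixed mappings (guest RAM registered via the
RAM block notifier) stay for the life of the device, so allocation is a
cheap bump of low_water_mark, while the short-lived bounce-buffer mappings
above high_water_mark can all be torn down with a single ioctl once
in-flight I/O has drained.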
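Likewise, the Phase Tag logic in nvme_process_completion() is easier to
follow in isolation. The toy model below (hypothetical names, host-endian,
no real device) shows why the consumer flips its expected phase each time
the head pointer wraps around the queue:

#include <stdint.h>
#include <stdio.h>

#define CQ_SIZE 8

typedef struct {
    uint16_t cid;
    uint16_t status;            /* bit 0 is the Phase Tag */
} Cqe;

typedef struct {
    Cqe queue[CQ_SIZE];
    int head;
    int phase;                  /* phase of entries already consumed */
} ConsumerCq;

typedef struct {
    int tail;
    int phase;                  /* phase stamped on newly posted entries */
} DeviceCq;

/* The "device" posts a completion at tail with its current phase. */
static void device_post(ConsumerCq *cq, DeviceCq *d, uint16_t cid)
{
    cq->queue[d->tail].cid = cid;
    cq->queue[d->tail].status = d->phase;
    d->tail = (d->tail + 1) % CQ_SIZE;
    if (d->tail == 0) {
        d->phase = !d->phase;   /* wrapped: stamp the opposite phase */
    }
}

/* Consume new entries: an entry is new only if its Phase Tag differs
 * from the phase of the lap already consumed, which is exactly the
 * check nvme_process_completion() performs. */
static int cq_poll(ConsumerCq *cq)
{
    int progress = 0;

    for (;;) {
        Cqe *c = &cq->queue[cq->head];
        if ((c->status & 0x1) == cq->phase) {
            break;                  /* stale entry, nothing new */
        }
        cq->head = (cq->head + 1) % CQ_SIZE;
        if (cq->head == 0) {
            cq->phase = !cq->phase; /* wrapped: expect flipped tags now */
        }
        printf("completed cid %d\n", c->cid);
        progress++;
    }
    return progress;
}

int main(void)
{
    ConsumerCq cq = { .head = 0, .phase = 0 };
    DeviceCq dev = { .tail = 0, .phase = 1 };
    uint16_t cid;

    for (cid = 1; cid <= 3; cid++) {
        device_post(&cq, &dev, cid);
    }
    printf("first poll: %d\n", cq_poll(&cq));   /* consumes 3 */
    for (cid = 4; cid <= 9; cid++) {            /* crosses the wrap */
        device_post(&cq, &dev, cid);
    }
    printf("second poll: %d\n", cq_poll(&cq));  /* consumes 6 */
    return 0;
}

No "valid" flag is needed in the entries themselves: because the device
writes each lap with the opposite phase bit, an old entry from the
previous lap can never be mistaken for a fresh one, even though the queue
memory is never cleared.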
* Re: [Qemu-devel] [PATCH v3 2/6] block: Add VFIO based NVMe driver 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 2/6] block: Add VFIO based NVMe driver Fam Zheng @ 2017-07-06 17:38 ` Keith Busch 2017-07-06 23:27 ` Fam Zheng 2017-07-07 10:06 ` [Qemu-devel] [Qemu-block] " Paolo Bonzini 2017-07-07 17:15 ` [Qemu-devel] " Stefan Hajnoczi 2017-07-10 14:55 ` Stefan Hajnoczi 2 siblings, 2 replies; 33+ messages in thread From: Keith Busch @ 2017-07-06 17:38 UTC (permalink / raw) To: Fam Zheng Cc: qemu-devel, Paolo Bonzini, qemu-block, Kevin Wolf, Max Reitz, Stefan Hajnoczi, Karl Rister On Wed, Jul 05, 2017 at 09:36:31PM +0800, Fam Zheng wrote: > This is a new protocol driver that exclusively opens a host NVMe > controller through VFIO. It achieves better latency than linux-aio by > completely bypassing host kernel vfs/block layer. > > $rw-$bs-$iodepth linux-aio nvme:// > ---------------------------------------- > randread-4k-1 8269 8851 > randread-512k-1 584 610 > randwrite-4k-1 28601 34649 > randwrite-512k-1 1809 1975 > > The driver also integrates with the polling mechanism of iothread. > > This patch is co-authored by Paolo and me. > > Signed-off-by: Fam Zheng <famz@redhat.com> I haven't much time to do a thorough review, but in the brief time so far the implementation looks fine to me. I am wondering, though, if an NVMe vfio driver can be done as its own program that qemu can link to. The SPDK driver comes to mind as such an example, but it may create undesirable dependencies. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH v3 2/6] block: Add VFIO based NVMe driver 2017-07-06 17:38 ` Keith Busch @ 2017-07-06 23:27 ` Fam Zheng 2017-07-07 10:06 ` [Qemu-devel] [Qemu-block] " Paolo Bonzini 1 sibling, 0 replies; 33+ messages in thread From: Fam Zheng @ 2017-07-06 23:27 UTC (permalink / raw) To: Keith Busch Cc: qemu-devel, Paolo Bonzini, qemu-block, Kevin Wolf, Max Reitz, Stefan Hajnoczi, Karl Rister On Thu, 07/06 13:38, Keith Busch wrote: > On Wed, Jul 05, 2017 at 09:36:31PM +0800, Fam Zheng wrote: > > This is a new protocol driver that exclusively opens a host NVMe > > controller through VFIO. It achieves better latency than linux-aio by > > completely bypassing host kernel vfs/block layer. > > > > $rw-$bs-$iodepth linux-aio nvme:// > > ---------------------------------------- > > randread-4k-1 8269 8851 > > randread-512k-1 584 610 > > randwrite-4k-1 28601 34649 > > randwrite-512k-1 1809 1975 > > > > The driver also integrates with the polling mechanism of iothread. > > > > This patch is co-authored by Paolo and me. > > > > Signed-off-by: Fam Zheng <famz@redhat.com> > > I haven't much time to do a thorough review, but in the brief time so > far the implementation looks fine to me. Thanks for taking a look! > > I am wondering, though, if an NVMe vfio driver can be done as its own > program that qemu can link to. The SPDK driver comes to mind as such an > example, but it may create undesirable dependencies. Yes, good question. I will take a look at the current SPDK driver codebase to see if it can be linked this way. When I started this work, SPDK didn't work with guest memory, because it required apps to use its own hugepage-powered allocators. This may have changed, because I know it gained a vhost-user-scsi implementation (but that is a different story, together with vhost-user-blk). Fam ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [Qemu-block] [PATCH v3 2/6] block: Add VFIO based NVMe driver 2017-07-06 17:38 ` Keith Busch 2017-07-06 23:27 ` Fam Zheng @ 2017-07-07 10:06 ` Paolo Bonzini 1 sibling, 0 replies; 33+ messages in thread From: Paolo Bonzini @ 2017-07-07 10:06 UTC (permalink / raw) To: Keith Busch, Fam Zheng Cc: Kevin Wolf, qemu-block, qemu-devel, Max Reitz, Stefan Hajnoczi, Karl Rister On 06/07/2017 19:38, Keith Busch wrote: > On Wed, Jul 05, 2017 at 09:36:31PM +0800, Fam Zheng wrote: >> This is a new protocol driver that exclusively opens a host NVMe >> controller through VFIO. It achieves better latency than linux-aio by >> completely bypassing host kernel vfs/block layer. >> >> $rw-$bs-$iodepth linux-aio nvme:// >> ---------------------------------------- >> randread-4k-1 8269 8851 >> randread-512k-1 584 610 >> randwrite-4k-1 28601 34649 >> randwrite-512k-1 1809 1975 >> >> The driver also integrates with the polling mechanism of iothread. >> >> This patch is co-authored by Paolo and me. >> >> Signed-off-by: Fam Zheng <famz@redhat.com> > > I haven't much time to do a thorough review, but in the brief time so > far the implementation looks fine to me. > > I am wondering, though, if an NVMe vfio driver can be done as its own > program that qemu can link to. The SPDK driver comes to mind as such an > example, but it may create undesirable dependencies. I think there's room for both (and for PCI passthrough too). SPDK as "its own program" is what vhost-user-blk provides, in the end. This driver is simpler for developers to test than SPDK. For cloud providers that want to provide a stable guest ABI but also want a faster interface for high-performance PCI SSDs, it offers a different performance/ABI stability/power consumption tradeoff than either PCI passthrough or SPDK's poll-mode driver. The driver is also useful when tuning the QEMU event loop, because its higher performance makes it easier to see some second-order effects that appear at higher queue depths (e.g. faster driver -> more guest interrupts -> lower performance). Paolo ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH v3 2/6] block: Add VFIO based NVMe driver 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 2/6] block: Add VFIO based NVMe driver Fam Zheng 2017-07-06 17:38 ` Keith Busch @ 2017-07-07 17:15 ` Stefan Hajnoczi 2017-07-10 14:55 ` Stefan Hajnoczi 2 siblings, 0 replies; 33+ messages in thread From: Stefan Hajnoczi @ 2017-07-07 17:15 UTC (permalink / raw) To: Fam Zheng Cc: qemu-devel, Paolo Bonzini, Keith Busch, qemu-block, Kevin Wolf, Max Reitz, Karl Rister [-- Attachment #1: Type: text/plain, Size: 19753 bytes --] On Wed, Jul 05, 2017 at 09:36:31PM +0800, Fam Zheng wrote: > diff --git a/block/nvme-vfio.c b/block/nvme-vfio.c > new file mode 100644 > index 0000000..f030a82 > --- /dev/null > +++ b/block/nvme-vfio.c > @@ -0,0 +1,703 @@ > +/* > + * NVMe VFIO interface As far as I can tell nothing in this file is related to NVMe. This is purely a VFIO utility library. If someone wanted to write a VFIO NetClient, they could reuse these functions. Should they be generic from the start? > +struct NVMeVFIOState { > + int container; > + int group; > + int device; > + RAMBlockNotifier ram_notifier; > + struct vfio_region_info config_region_info, bar_region_info[6]; > + > + /* VFIO's IO virtual address space is managed by splitting into a few > + * sections: > + * > + * --------------- <= 0 > + * |xxxxxxxxxxxxx| > + * |-------------| <= NVME_VFIO_IOVA_MIN > + * | | > + * | Fixed | > + * | | > + * |-------------| <= low_water_mark > + * | | > + * | Free | > + * | | > + * |-------------| <= high_water_mark > + * | | > + * | Temp | > + * | | > + * |-------------| <= NVME_VFIO_IOVA_MAX > + * |xxxxxxxxxxxxx| > + * |xxxxxxxxxxxxx| > + * --------------- > + * > + * - Addresses lower than NVME_VFIO_IOVA_MIN are reserved as invalid; > + * > + * - Fixed mappings of HVAs are assigned "low" IOVAs in the range of > + * [NVME_VFIO_IOVA_MIN, low_water_mark). Once allocated they will not be > + * reclaimed - low_water_mark never shrinks; > + * > + * - IOVAs in range [low_water_mark, high_water_mark) are free; > + * > + * - IOVAs in range [high_water_mark, NVME_VFIO_IOVA_MAX) are volatile > + * mappings. At each nvme_vfio_dma_reset_temporary() call, the whole area > + * is recycled. The caller should make sure I/O's depending on these > + * mappings are completed before calling. > + **/ > + uint64_t low_water_mark; > + uint64_t high_water_mark; > + IOVAMapping *mappings; > + int nr_mappings; > + QemuMutex lock; Please document what the lock protects. > +}; > + > +/** Find group file and return the full path in @path by PCI device address > + * @device. If succeeded, caller needs to g_free the returned path. */ > +static int sysfs_find_group_file(const char *device, char **path, Error **errp) > +{ > + int ret; > + char *sysfs_link = NULL; > + char *sysfs_group = NULL; > + char *p; > + > + sysfs_link = g_strdup_printf("/sys/bus/pci/devices/%s/iommu_group", > + device); > + sysfs_group = g_malloc(PATH_MAX); > + ret = readlink(sysfs_link, sysfs_group, PATH_MAX - 1); > + if (ret == -1) { > + error_setg_errno(errp, errno, "Failed to find iommu group sysfs path"); > + ret = -errno; > + goto out; > + } > + ret = 0; > + p = strrchr(sysfs_group, '/'); > + if (!p) { > + error_setg(errp, "Failed to find iommu group number"); > + ret = -errno; strrchr() doesn't set errno so this is likely to be 0. I'm not sure why this function returns int. It seems simpler to return char *path instead. > +/** > + * Map a PCI bar area. 
> + */ > +void *nvme_vfio_pci_map_bar(NVMeVFIOState *s, int index, Error **errp) > +{ > + void *p; > + assert(index >= 0 && index < 6); nvme_vfio_pci_init_bar() says: assert(index < ARRAY_SIZE(s->bar_region_info)); I think they are trying to test for the same thing but are doing it in different ways. It would be nicer to avoid repetition: static inline void assert_bar_index_valid(NVMeVFIOState *s, int index) { assert(index >= 0 && index < ARRAY_SIZE(s->bar_region_info)); } > +static int nvme_vfio_pci_write_config(NVMeVFIOState *s, void *buf, int size, int ofs) > +{ > + if (pwrite(s->device, buf, size, > + s->config_region_info.offset + ofs) == size) { > + return 0; > + } > + > + return -1; > +} I'm not sure if it's safe to assume pread()/pwrite() do not return EINTR. It would be a shame for vfio initialization to fail because a signal arrived at an inconvenient time. > +static int nvme_vfio_init_pci(NVMeVFIOState *s, const char *device, > + Error **errp) > +{ > + int ret; > + int i; > + uint16_t pci_cmd; > + struct vfio_group_status group_status = { .argsz = sizeof(group_status) }; > + struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) }; > + struct vfio_device_info device_info = { .argsz = sizeof(device_info) }; > + char *group_file = NULL; > + > + /* Create a new container */ > + s->container = open("/dev/vfio/vfio", O_RDWR); > + > + if (ioctl(s->container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) { > + error_setg(errp, "Invalid VFIO version"); > + ret = -EINVAL; > + goto out; > + } > + > + if (!ioctl(s->container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) { > + error_setg_errno(errp, errno, "VFIO IOMMU check failed"); > + ret = -EINVAL; > + goto out; > + } > + > + /* Open the group */ > + ret = sysfs_find_group_file(device, &group_file, errp); > + if (ret) { > + goto out; > + } > + > + s->group = open(group_file, O_RDWR); > + g_free(group_file); > + if (s->group <= 0) { > + error_setg_errno(errp, errno, "Failed to open VFIO group file"); > + ret = -errno; > + goto out; > + } > + > + /* Test the group is viable and available */ > + if (ioctl(s->group, VFIO_GROUP_GET_STATUS, &group_status)) { > + error_setg_errno(errp, errno, "Failed to get VFIO group status"); > + ret = -errno; > + goto out; > + } > + > + if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) { > + error_setg(errp, "VFIO group is not viable"); > + ret = -EINVAL; > + goto out; > + } > + > + /* Add the group to the container */ > + if (ioctl(s->group, VFIO_GROUP_SET_CONTAINER, &s->container)) { > + error_setg_errno(errp, errno, "Failed to add group to VFIO container"); > + ret = -errno; > + goto out; > + } > + > + /* Enable the IOMMU model we want */ > + if (ioctl(s->container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)) { > + error_setg_errno(errp, errno, "Failed to set VFIO IOMMU type"); > + ret = -errno; > + goto out; > + } > + > + /* Get additional IOMMU info */ > + if (ioctl(s->container, VFIO_IOMMU_GET_INFO, &iommu_info)) { > + error_setg_errno(errp, errno, "Failed to get IOMMU info"); > + ret = -errno; > + goto out; > + } > + > + s->device = ioctl(s->group, VFIO_GROUP_GET_DEVICE_FD, device); > + > + if (s->device < 0) { > + error_setg_errno(errp, errno, "Failed to get device fd"); > + ret = -errno; > + goto out; > + } > + > + /* Test and setup the device */ > + if (ioctl(s->device, VFIO_DEVICE_GET_INFO, &device_info)) { > + error_setg_errno(errp, errno, "Failed to get device info"); > + ret = -errno; > + goto out; > + } > + > + if (device_info.num_regions < VFIO_PCI_CONFIG_REGION_INDEX) { > +
error_setg(errp, "Invalid device regions"); > + ret = -EINVAL; > + goto out; > + } > + > + s->config_region_info = (struct vfio_region_info) { > + .index = VFIO_PCI_CONFIG_REGION_INDEX, > + .argsz = sizeof(struct vfio_region_info), > + }; > + if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, &s->config_region_info)) { > + error_setg_errno(errp, errno, "Failed to get config region info"); > + ret = -errno; > + goto out; > + } > + > + for (i = 0; i < 6; i++) { > + ret = nvme_vfio_pci_init_bar(s, i, errp); > + if (ret) { > + goto out; > + } > + } > + > + /* Enable bus master */ > + if (nvme_vfio_pci_read_config(s, &pci_cmd, sizeof(pci_cmd), > + PCI_COMMAND) < 0) { > + goto out; > + } > + pci_cmd |= PCI_COMMAND_MASTER; > + if (nvme_vfio_pci_write_config(s, &pci_cmd, sizeof(pci_cmd), > + PCI_COMMAND) < 0) { > + goto out; > + } > +out: > + return ret; Missing if (ret < 0) { close(foo); ... } cleanup in the error case. > +} > + > +static void nvme_vfio_ram_block_added(RAMBlockNotifier *n, > + void *host, size_t size) > +{ > + NVMeVFIOState *s = container_of(n, NVMeVFIOState, ram_notifier); > + trace_nvme_vfio_ram_block_added(host, size); Please include "s %p" s in the trace event so multiple NVMe adapters can be differentiated from each other. All trace events should include s. > +/** > + * Find the mapping entry that contains [host, host + size) and set @index to > + * the position. If no entry contains it, @index is the position _after_ which > + * to insert the new mapping. IOW, it is the index of the largest element that > + * is smaller than @host, or -1 if no entry is. > + */ > +static IOVAMapping *nvme_vfio_find_mapping(NVMeVFIOState *s, void *host, > + int *index) > +{ > + IOVAMapping *p = s->mappings; > + IOVAMapping *q = p ? p + s->nr_mappings - 1 : NULL; > + IOVAMapping *mid = p ? p + (q - p) / 2 : NULL; This value is never used because mid is recalculated in the while loop. > + trace_nvme_vfio_find_mapping(s, host); > + if (!p) { > + *index = -1; > + return NULL; > + } > + while (true) { > + mid = p + (q - p) / 2; > + if (mid == p) { > + break; > + } > + if (mid->host > host) { > + q = mid; > + } else if (mid->host < host) { > + p = mid; > + } else { > + break; > + } > + } > + if (mid->host > host) { > + mid--; > + } else if (mid < &s->mappings[s->nr_mappings - 1] > + && (mid + 1)->host <= host) { > + mid++; > + } > + *index = mid - &s->mappings[0]; > + if (mid >= &s->mappings[0] && > + mid->host <= host && mid->host + mid->size > host) { > + assert(mid < &s->mappings[s->nr_mappings]); > + return mid; > + } > + return NULL; A junk *index value may be produced when we return NULL. Consider these inputs: mappings[] = {{.host = 0x2000}} nr_mappings = 1 host = 0x1000 The result is: *index = &s->mappings[-1] - &s->mappings[0] > +/* Map [host, host + size) area into a contiguous IOVA address space, and store > + * the result in @iova if not NULL. The area must be aligned to page size, and > + * mustn't overlap with existing mapping areas. > + */ > +int nvme_vfio_dma_map(NVMeVFIOState *s, void *host, size_t size, > + bool temporary, uint64_t *iova) This function assumes that the mapping status is constant for the entire range [host, host + size). It does not handle split mappings. For example: 1. [host, host + 4K) is mapped but [host + 4K, host + size) is not mapped. 2. [host, host + 4K) is not mapped but [host + 4K, host + size) is mapped. 3. [host, host + 4K) is mapped temporary but [host + 4K, host + size) is mapped !temporary. (The iova space would not be contiguous.) 
Is it safe to assume none of these can happen? > +{ > + int ret = 0; > + int index; > + IOVAMapping *mapping; > + uint64_t iova0; > + > + assert(QEMU_PTR_IS_ALIGNED(host, getpagesize())); > + assert(QEMU_IS_ALIGNED(size, getpagesize())); > + trace_nvme_vfio_dma_map(s, host, size, temporary, iova); > + qemu_mutex_lock(&s->lock); > + mapping = nvme_vfio_find_mapping(s, host, &index); > + if (mapping) { > + iova0 = mapping->iova + ((uint8_t *)host - (uint8_t *)mapping->host); > + } else { > + if (s->high_water_mark - s->low_water_mark + 1 < size) { > + ret = -ENOMEM; > + goto out; > + } > + if (!temporary) { > + iova0 = s->low_water_mark; > + mapping = nvme_vfio_add_mapping(s, host, size, index + 1, iova0); > + if (!mapping) { > + ret = -ENOMEM; > + goto out; > + } > + assert(nvme_vfio_verify_mappings(s)); > + ret = nvme_vfio_do_mapping(s, host, size, iova0); > + if (ret) { > + nvme_vfio_undo_mapping(s, mapping, NULL); > + goto out; > + } > + s->low_water_mark += size; > + nvme_vfio_dump_mappings(s); > + } else { > + iova0 = s->high_water_mark - size; > + ret = nvme_vfio_do_mapping(s, host, size, iova0); > + if (ret) { > + goto out; > + } > + s->high_water_mark -= size; > + } > + } > + if (iova) { > + *iova = iova0; > + } > + qemu_mutex_unlock(&s->lock); > +out: > + return ret; > +} > + > +/* Reset the high watermark and free all "temporary" mappings. */ > +int nvme_vfio_dma_reset_temporary(NVMeVFIOState *s) > +{ > + struct vfio_iommu_type1_dma_unmap unmap = { > + .argsz = sizeof(unmap), > + .flags = 0, > + .iova = s->high_water_mark, > + .size = NVME_VFIO_IOVA_MAX - s->high_water_mark, > + }; > + trace_nvme_vfio_dma_reset_temporary(s); > + qemu_mutex_lock(&s->lock); > + if (ioctl(s->container, VFIO_IOMMU_UNMAP_DMA, &unmap)) { > + error_report("VFIO_UNMAP_DMA: %d", -errno); > + return -errno; Missing qemu_mutex_unlock(&s->lock). > + } > + s->high_water_mark = NVME_VFIO_IOVA_MAX; > + qemu_mutex_lock(&s->lock); s/lock/unlock/ > diff --git a/block/nvme-vfio.h b/block/nvme-vfio.h > new file mode 100644 > index 0000000..2d5840b > --- /dev/null > +++ b/block/nvme-vfio.h > @@ -0,0 +1,30 @@ > +/* > + * NVMe VFIO interface > + * > + * Copyright 2016, 2017 Red Hat, Inc. > + * > + * Authors: > + * Fam Zheng <famz@redhat.com> > + * > + * This work is licensed under the terms of the GNU GPL, version 2 or later. > + * See the COPYING file in the top-level directory. > + */ > + > +#ifndef QEMU_VFIO_H > +#define QEMU_VFIO_H > +#include "qemu/queue.h" Is "qemu/queue.h" needed by this header? Error, bool, uint64_t, and EventNotifier are used, so additional headers should probably be included. > +typedef struct { > + int index; > + NVMeQueue sq, cq; > + int cq_phase; > + uint8_t *prp_list_pages; > + uint64_t prp_list_base_iova; > + NVMeRequest reqs[NVME_QUEUE_SIZE]; > + CoQueue free_req_queue; > + bool busy; > + int need_kick; > + int inflight; > + QemuMutex lock; > +} NVMeQueuePair; What does lock protect? > +static void nvme_free_queue_pair(BlockDriverState *bs, NVMeQueuePair *q) > +{ > + qemu_vfree(q->prp_list_pages); > + qemu_vfree(q->sq.queue); > + qemu_vfree(q->cq.queue); qemu_mutex_destroy(&q->lock); > +static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q) Missing coroutine_fn since this function calls qemu_co_queue_wait(). > +{ > + int i; > + NVMeRequest *req = NULL; > + > + qemu_mutex_lock(&q->lock); > + while (q->inflight + q->need_kick > NVME_QUEUE_SIZE - 2) { > + /* We have to leave one slot empty as that is the full queue case (head > + * == tail + 1). 
*/ > + trace_nvme_free_req_queue_wait(q); > + qemu_mutex_unlock(&q->lock); > + qemu_co_queue_wait(&q->free_req_queue, NULL); > + qemu_mutex_lock(&q->lock); > + } > + for (i = 0; i < NVME_QUEUE_SIZE; i++) { > + if (!q->reqs[i].busy) { > + q->reqs[i].busy = true; > + req = &q->reqs[i]; > + break; > + } > + } > + assert(req); > + qemu_mutex_unlock(&q->lock); This code takes q->lock but actually relies on coroutine cooperative scheduling to avoid failing assert(req). This bothers me a little because it means there are undocumented locking assumptions. > + return req; > +} > + > +static inline int nvme_translate_error(const NvmeCqe *c) > +{ > + uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF; > + if (status) { > + trace_nvme_error(c->result, c->sq_head, c->sq_id, c->cid, status); Should c's fields be byteswapped? > + } > + switch (status) { > + case 0: > + return 0; > + case 1: > + return -ENOSYS; > + case 2: > + return -EINVAL; > + default: > + return -EIO; > + } > +} > + > +/* With q->lock */ > +static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q) > +{ > + bool progress = false; > + NVMeRequest *req; > + NvmeCqe *c; > + > + trace_nvme_process_completion(s, q->index, q->inflight); > + if (q->busy || s->plugged) { > + trace_nvme_process_completion_queue_busy(s, q->index); > + return false; > + } > + q->busy = true; > + assert(q->inflight >= 0); > + while (q->inflight) { > + c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES]; > + if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) { > + break; > + } > + q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE; > + if (!q->cq.head) { > + q->cq_phase = !q->cq_phase; > + } > + if (c->cid == 0 || c->cid > NVME_QUEUE_SIZE) { Will c->cid > NVME_QUEUE_SIZE work on big-endian hosts? Looks like le16_to_cpu(c->cid) is missing. There are more instances below. > + fprintf(stderr, "Unexpected CID in completion queue: %" PRIu32 "\n", > + c->cid); > + continue; > + } > + assert(c->cid <= NVME_QUEUE_SIZE); > + trace_nvme_complete_command(s, q->index, c->cid); > + req = &q->reqs[c->cid - 1]; > + assert(req->cid == c->cid); > + assert(req->cb); > + req->cb(req->opaque, nvme_translate_error(c)); The user callback is invoked with q->lock held? This could have a performance impact or risk deadlocks if the callback touches this BDS. > + req->cb = req->opaque = NULL; > + req->busy = false; > + if (!qemu_co_queue_empty(&q->free_req_queue)) { > + aio_bh_schedule_oneshot(s->aio_context, nvme_free_req_queue_cb, q); > + } The relationship between waiting coroutines and completion processing seems strange to me: A new oneshot BH is scheduled for each processed completion. There may only be one queued coroutine waiting so a lot of these BHs are wasted. We hold q->lock so we cannot expect q->free_req_queue to empty itself while we're still running. What I'm wondering is whether it's better to schedule the BH in if (progress) below. At the moment nvme_free_req_queue_cb() will only enter 1 waiting coroutine, so that would need to be adjusted to ensure more waiting coroutines are woken if multiple reqs completed. Maybe it's simpler to keep doing spurious notifications but it's worth at least considering the idea of notifying from if (progress). [I ran out of time here. Will review more later.] [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
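On the pread()/pwrite() concern raised in the review above: POSIX allows both calls to fail with EINTR and to return short counts, so the usual remedy is a retry wrapper. A minimal sketch (the helper name pread_full is illustrative, not from the series; an analogous pwrite variant would follow the same pattern):

    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Retry pread() on EINTR and continue after short reads, so a stray
     * signal during VFIO setup cannot fail the whole initialization. */
    static ssize_t pread_full(int fd, void *buf, size_t count, off_t offset)
    {
        size_t done = 0;

        while (done < count) {
            ssize_t r = pread(fd, (char *)buf + done, count - done,
                              offset + done);
            if (r < 0 && errno == EINTR) {
                continue;       /* interrupted before transferring data */
            }
            if (r < 0) {
                return -1;      /* real error, errno is set by pread() */
            }
            if (r == 0) {
                break;          /* unexpected EOF: return bytes read so far */
            }
            done += (size_t)r;
        }
        return (ssize_t)done;
    }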
* Re: [Qemu-devel] [PATCH v3 2/6] block: Add VFIO based NVMe driver 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 2/6] block: Add VFIO based NVMe driver Fam Zheng 2017-07-06 17:38 ` Keith Busch 2017-07-07 17:15 ` [Qemu-devel] " Stefan Hajnoczi @ 2017-07-10 14:55 ` Stefan Hajnoczi 2017-07-12 2:14 ` Fam Zheng 2 siblings, 1 reply; 33+ messages in thread From: Stefan Hajnoczi @ 2017-07-10 14:55 UTC (permalink / raw) To: Fam Zheng Cc: qemu-devel, Paolo Bonzini, Keith Busch, qemu-block, Kevin Wolf, Max Reitz, Karl Rister [-- Attachment #1: Type: text/plain, Size: 4249 bytes --] On Wed, Jul 05, 2017 at 09:36:31PM +0800, Fam Zheng wrote: > +static bool nvme_identify(BlockDriverState *bs, int namespace, Error **errp) > +{ > + BDRVNVMeState *s = bs->opaque; > + uint8_t *resp; > + int r; > + uint64_t iova; > + NvmeCmd cmd = { > + .opcode = NVME_ADM_CMD_IDENTIFY, > + .cdw10 = cpu_to_le32(0x1), > + }; > + > + resp = qemu_try_blockalign0(bs, 4096); Is it possible to use struct NvmeIdCtrl to make this code clearer and eliminate the hardcoded sizes/offsets? > + if (!resp) { > + error_setg(errp, "Cannot allocate buffer for identify response"); > + return false; > + } > + r = nvme_vfio_dma_map(s->vfio, resp, 4096, true, &iova); > + if (r) { > + error_setg(errp, "Cannot map buffer for DMA"); > + goto fail; > + } > + cmd.prp1 = cpu_to_le64(iova); > + > + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) { > + error_setg(errp, "Failed to identify controller"); > + goto fail; > + } > + > + if (le32_to_cpu(*(uint32_t *)&resp[516]) < namespace) { > + error_setg(errp, "Invalid namespace"); > + goto fail; > + } > + s->write_cache = le32_to_cpu(resp[525]) & 0x1; > + s->max_transfer = (resp[77] ? 1 << resp[77] : 0) * s->page_size; > + /* For now the page list buffer per command is one page, to hold at most > + * s->page_size / sizeof(uint64_t) entries. */ > + s->max_transfer = MIN_NON_ZERO(s->max_transfer, > + s->page_size / sizeof(uint64_t) * s->page_size); > + > + memset((char *)resp, 0, 4096); > + > + cmd.cdw10 = 0; > + cmd.nsid = namespace; > + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) { > + error_setg(errp, "Failed to identify namespace"); > + goto fail; > + } > + > + s->nsze = le64_to_cpu(*(uint64_t *)&resp[0]); > + > + nvme_vfio_dma_unmap(s->vfio, resp); > + qemu_vfree(resp); > + return true; > +fail: > + qemu_vfree(resp); > + return false; nvme_vfio_dma_unmap() is not called in the error path. > +static coroutine_fn int nvme_cmd_map_qiov(BlockDriverState *bs, NvmeCmd *cmd, > + NVMeRequest *req, QEMUIOVector *qiov) > +{ > + BDRVNVMeState *s = bs->opaque; > + uint64_t *pagelist = req->prp_list_page; > + int i, j, r; > + int entries = 0; > + > + assert(qiov->size); > + assert(QEMU_IS_ALIGNED(qiov->size, s->page_size)); > + assert(qiov->size / s->page_size <= s->page_size / sizeof(uint64_t)); > + for (i = 0; i < qiov->niov; ++i) { > + bool retry = true; > + uint64_t iova; > + qemu_co_mutex_lock(&s->dma_map_lock); > +try_map: > + r = nvme_vfio_dma_map(s->vfio, > + qiov->iov[i].iov_base, > + qiov->iov[i].iov_len, > + true, &iova); > + if (r == -ENOMEM && retry) { > + retry = false; > + trace_nvme_dma_flush_queue_wait(s); > + if (s->inflight) { > + trace_nvme_dma_map_flush(s); > + qemu_co_queue_wait(&s->dma_flush_queue, &s->dma_map_lock); > + } else { > + r = nvme_vfio_dma_reset_temporary(s->vfio); > + if (r) { > + return r; dma_map_lock is held here! 
> +static int nvme_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes, > + QEMUIOVector *qiov, bool is_write, int flags) > +{ > + BDRVNVMeState *s = bs->opaque; > + int r; > + uint8_t *buf = NULL; > + QEMUIOVector local_qiov; > + > + assert(QEMU_IS_ALIGNED(offset, s->page_size)); > + assert(QEMU_IS_ALIGNED(bytes, s->page_size)); > + assert(bytes <= s->max_transfer); Who guarantees max_transfer? I think request alignment is enforced by block/io.c but there is no generic max_transfer handling code, so this assertion can be triggered by the guest. Please handle it as a genuine request error instead of using an assertion. > +static int nvme_reopen_prepare(BDRVReopenState *reopen_state, > + BlockReopenQueue *queue, Error **errp) > +{ > + return 0; > +} What is the purpose of this dummy .bdrv_reopen_prepare() implementation? [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
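For the dma_map_lock leak flagged above, the fix is to drop the CoMutex on the early-return path. A sketch of the corrected branch (the surrounding lines are from the quoted code; only the unlock is new):

    r = nvme_vfio_dma_reset_temporary(s->vfio);
    if (r) {
        /* dma_map_lock was taken before try_map; release it here or the
         * next request deadlocks in qemu_co_mutex_lock() */
        qemu_co_mutex_unlock(&s->dma_map_lock);
        return r;
    }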
* Re: [Qemu-devel] [PATCH v3 2/6] block: Add VFIO based NVMe driver 2017-07-10 14:55 ` Stefan Hajnoczi @ 2017-07-12 2:14 ` Fam Zheng 2017-07-12 10:49 ` Stefan Hajnoczi 0 siblings, 1 reply; 33+ messages in thread From: Fam Zheng @ 2017-07-12 2:14 UTC (permalink / raw) To: Stefan Hajnoczi Cc: qemu-devel, Paolo Bonzini, Keith Busch, qemu-block, Kevin Wolf, Max Reitz, Karl Rister On Mon, 07/10 15:55, Stefan Hajnoczi wrote: > On Wed, Jul 05, 2017 at 09:36:31PM +0800, Fam Zheng wrote: > > +static bool nvme_identify(BlockDriverState *bs, int namespace, Error **errp) > > +{ > > + BDRVNVMeState *s = bs->opaque; > > + uint8_t *resp; > > + int r; > > + uint64_t iova; > > + NvmeCmd cmd = { > > + .opcode = NVME_ADM_CMD_IDENTIFY, > > + .cdw10 = cpu_to_le32(0x1), > > + }; > > + > > + resp = qemu_try_blockalign0(bs, 4096); > > Is it possible to use struct NvmeIdCtrl to make this code clearer and > eliminate the hardcoded sizes/offsets? Yes, will do. > > > + if (!resp) { > > + error_setg(errp, "Cannot allocate buffer for identify response"); > > + return false; > > + } > > + r = nvme_vfio_dma_map(s->vfio, resp, 4096, true, &iova); > > + if (r) { > > + error_setg(errp, "Cannot map buffer for DMA"); > > + goto fail; > > + } > > + cmd.prp1 = cpu_to_le64(iova); > > + > > + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) { > > + error_setg(errp, "Failed to identify controller"); > > + goto fail; > > + } > > + > > + if (le32_to_cpu(*(uint32_t *)&resp[516]) < namespace) { > > + error_setg(errp, "Invalid namespace"); > > + goto fail; > > + } > > + s->write_cache = le32_to_cpu(resp[525]) & 0x1; > > + s->max_transfer = (resp[77] ? 1 << resp[77] : 0) * s->page_size; > > + /* For now the page list buffer per command is one page, to hold at most > > + * s->page_size / sizeof(uint64_t) entries. */ > > + s->max_transfer = MIN_NON_ZERO(s->max_transfer, > > + s->page_size / sizeof(uint64_t) * s->page_size); > > + > > + memset((char *)resp, 0, 4096); > > + > > + cmd.cdw10 = 0; > > + cmd.nsid = namespace; > > + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) { > > + error_setg(errp, "Failed to identify namespace"); > > + goto fail; > > + } > > + > > + s->nsze = le64_to_cpu(*(uint64_t *)&resp[0]); > > + > > + nvme_vfio_dma_unmap(s->vfio, resp); > > + qemu_vfree(resp); > > + return true; > > +fail: > > + qemu_vfree(resp); > > + return false; > > nvme_vfio_dma_unmap() is not called in the error path. Will fix. > > > +static coroutine_fn int nvme_cmd_map_qiov(BlockDriverState *bs, NvmeCmd *cmd, > > + NVMeRequest *req, QEMUIOVector *qiov) > > +{ > > + BDRVNVMeState *s = bs->opaque; > > + uint64_t *pagelist = req->prp_list_page; > > + int i, j, r; > > + int entries = 0; > > + > > + assert(qiov->size); > > + assert(QEMU_IS_ALIGNED(qiov->size, s->page_size)); > > + assert(qiov->size / s->page_size <= s->page_size / sizeof(uint64_t)); > > + for (i = 0; i < qiov->niov; ++i) { > > + bool retry = true; > > + uint64_t iova; > > + qemu_co_mutex_lock(&s->dma_map_lock); > > +try_map: > > + r = nvme_vfio_dma_map(s->vfio, > > + qiov->iov[i].iov_base, > > + qiov->iov[i].iov_len, > > + true, &iova); > > + if (r == -ENOMEM && retry) { > > + retry = false; > > + trace_nvme_dma_flush_queue_wait(s); > > + if (s->inflight) { > > + trace_nvme_dma_map_flush(s); > > + qemu_co_queue_wait(&s->dma_flush_queue, &s->dma_map_lock); > > + } else { > > + r = nvme_vfio_dma_reset_temporary(s->vfio); > > + if (r) { > > + return r; > > dma_map_lock is held here! Oops, will fix. 
> > > +static int nvme_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes, > > + QEMUIOVector *qiov, bool is_write, int flags) > > +{ > > + BDRVNVMeState *s = bs->opaque; > > + int r; > > + uint8_t *buf = NULL; > > + QEMUIOVector local_qiov; > > + > > + assert(QEMU_IS_ALIGNED(offset, s->page_size)); > > + assert(QEMU_IS_ALIGNED(bytes, s->page_size)); > > + assert(bytes <= s->max_transfer); > > Who guarantees max_transfer? I think request alignment is enforced by > block/io.c but there is no generic max_transfer handling code, so this > assertion can be triggered by the guest. Please handle it as a genuine > request error instead of using an assertion. There has been one since 04ed95f4843281e292d93018d56d4b14705f9f2c, see the code around max_transfer in block/io.c:bdrv_aligned_*. > > > +static int nvme_reopen_prepare(BDRVReopenState *reopen_state, > > + BlockReopenQueue *queue, Error **errp) > > +{ > > + return 0; > > +} > > What is the purpose of this dummy .bdrv_reopen_prepare() implementation? This is necessary for block jobs to work; other drivers provide dummy implementations as well. Fam ^ permalink raw reply [flat|nested] 33+ messages in thread
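For readers following the max_transfer exchange: since the commit Fam cites, the generic block layer splits oversized requests before they reach the driver, which is why the assertion cannot be triggered by guest-chosen sizes. A simplified sketch of the fragmentation idea (not the actual block/io.c code):

    #include <stdint.h>

    /* Issue a request in pieces no larger than max_transfer; a limit of 0
     * means "unlimited", matching how the block layer treats it. */
    static void issue_fragmented(uint64_t offset, uint64_t bytes,
                                 uint64_t max_transfer)
    {
        while (bytes > 0) {
            uint64_t num = bytes;

            if (max_transfer && num > max_transfer) {
                num = max_transfer;
            }
            /* the driver request for [offset, offset + num) goes here */
            offset += num;
            bytes -= num;
        }
    }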
* Re: [Qemu-devel] [PATCH v3 2/6] block: Add VFIO based NVMe driver 2017-07-12 2:14 ` Fam Zheng @ 2017-07-12 10:49 ` Stefan Hajnoczi 0 siblings, 0 replies; 33+ messages in thread From: Stefan Hajnoczi @ 2017-07-12 10:49 UTC (permalink / raw) To: Fam Zheng Cc: qemu-devel, Paolo Bonzini, Keith Busch, qemu-block, Kevin Wolf, Max Reitz, Karl Rister [-- Attachment #1: Type: text/plain, Size: 1583 bytes --] On Wed, Jul 12, 2017 at 10:14:48AM +0800, Fam Zheng wrote: > On Mon, 07/10 15:55, Stefan Hajnoczi wrote: > > On Wed, Jul 05, 2017 at 09:36:31PM +0800, Fam Zheng wrote: > > > +static int nvme_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes, > > > + QEMUIOVector *qiov, bool is_write, int flags) > > > +{ > > > + BDRVNVMeState *s = bs->opaque; > > > + int r; > > > + uint8_t *buf = NULL; > > > + QEMUIOVector local_qiov; > > > + > > > + assert(QEMU_IS_ALIGNED(offset, s->page_size)); > > > + assert(QEMU_IS_ALIGNED(bytes, s->page_size)); > > > + assert(bytes <= s->max_transfer); > > > > Who guarantees max_transfer? I think request alignment is enforced by > > block/io.c but there is no generic max_transfer handling code, so this > > assertion can be triggered by the guest. Please handle it as a genuine > > request error instead of using an assertion. > > There has been one since 04ed95f4843281e292d93018d56d4b14705f9f2c, see the code > around max_transfer in block/io.c:bdrv_aligned_*. Thanks for pointing that out! > > > > > +static int nvme_reopen_prepare(BDRVReopenState *reopen_state, > > > + BlockReopenQueue *queue, Error **errp) > > > +{ > > > + return 0; > > > +} > > > > What is the purpose of this dummy .bdrv_reopen_prepare() implementation? > > This is necessary for block jobs to work; other drivers provide dummy > implementations as well. Please include a comment similar to what the other drivers with dummy implementations do. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
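A possible wording for the comment Stefan requests, modeled on the reasoning in Fam's reply (suggested text only, not from the series):

    static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
                                   BlockReopenQueue *queue, Error **errp)
    {
        /* Trivially accept the reopen: no driver state depends on the
         * open flags, and block jobs need .bdrv_reopen_prepare to be
         * implemented, so an empty stub is enough. */
        return 0;
    }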
* [Qemu-devel] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap 2017-07-05 13:36 [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device Fam Zheng 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 1/6] stubs: Add stubs for ram block API Fam Zheng 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 2/6] block: Add VFIO based NVMe driver Fam Zheng @ 2017-07-05 13:36 ` Fam Zheng 2017-07-10 14:57 ` Stefan Hajnoczi 2017-07-10 15:07 ` Stefan Hajnoczi 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 4/6] block/nvme: Implement .bdrv_dma_map and .bdrv_dma_unmap Fam Zheng ` (4 subsequent siblings) 7 siblings, 2 replies; 33+ messages in thread From: Fam Zheng @ 2017-07-05 13:36 UTC (permalink / raw) To: qemu-devel Cc: Paolo Bonzini, Keith Busch, qemu-block, Fam Zheng, Kevin Wolf, Max Reitz, Stefan Hajnoczi, Karl Rister Allow block driver to map and unmap a buffer for later I/O, as a performance hint. Signed-off-by: Fam Zheng <famz@redhat.com> --- block/block-backend.c | 10 ++++++++++ block/io.c | 24 ++++++++++++++++++++++++ include/block/block.h | 2 ++ include/block/block_int.h | 4 ++++ include/sysemu/block-backend.h | 3 +++ 5 files changed, 43 insertions(+) diff --git a/block/block-backend.c b/block/block-backend.c index 0df3457..784b936 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -1974,3 +1974,13 @@ static void blk_root_drained_end(BdrvChild *child) } } } + +void blk_dma_map(BlockBackend *blk, void *host, size_t size) +{ + bdrv_dma_map(blk_bs(blk), host, size); +} + +void blk_dma_unmap(BlockBackend *blk, void *host) +{ + bdrv_dma_unmap(blk_bs(blk), host); +} diff --git a/block/io.c b/block/io.c index 2de7c77..988e4db 100644 --- a/block/io.c +++ b/block/io.c @@ -2537,3 +2537,27 @@ void bdrv_io_unplug(BlockDriverState *bs) bdrv_io_unplug(child->bs); } } + +void bdrv_dma_map(BlockDriverState *bs, void *host, size_t size) +{ + BdrvChild *child; + + if (bs->drv && bs->drv->bdrv_dma_map) { + bs->drv->bdrv_dma_map(bs, host, size); + } + QLIST_FOREACH(child, &bs->children, next) { + bdrv_dma_map(child->bs, host, size); + } +} + +void bdrv_dma_unmap(BlockDriverState *bs, void *host) +{ + BdrvChild *child; + + if (bs->drv && bs->drv->bdrv_dma_unmap) { + bs->drv->bdrv_dma_unmap(bs, host); + } + QLIST_FOREACH(child, &bs->children, next) { + bdrv_dma_unmap(child->bs, host); + } +} diff --git a/include/block/block.h b/include/block/block.h index 4c149ad..f59b50a 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -624,4 +624,6 @@ void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child, Error **errp); void bdrv_del_child(BlockDriverState *parent, BdrvChild *child, Error **errp); +void bdrv_dma_map(BlockDriverState *bs, void *host, size_t size); +void bdrv_dma_unmap(BlockDriverState *bs, void *host); #endif diff --git a/include/block/block_int.h b/include/block/block_int.h index 15fa602..4092669 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -381,6 +381,10 @@ struct BlockDriver { uint64_t parent_perm, uint64_t parent_shared, uint64_t *nperm, uint64_t *nshared); + /* Map and unmap a buffer for I/O, as a performance hint to the + * driver. 
*/ + void (*bdrv_dma_map)(BlockDriverState *bs, void *host, size_t size); + void (*bdrv_dma_unmap)(BlockDriverState *bs, void *host); QLIST_ENTRY(BlockDriver) list; }; diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h index 1e05281..5f7ccdb 100644 --- a/include/sysemu/block-backend.h +++ b/include/sysemu/block-backend.h @@ -239,4 +239,7 @@ void blk_io_limits_disable(BlockBackend *blk); void blk_io_limits_enable(BlockBackend *blk, const char *group); void blk_io_limits_update_group(BlockBackend *blk, const char *group); +void blk_dma_map(BlockBackend *blk, void *host, size_t size); +void blk_dma_unmap(BlockBackend *blk, void *host); + #endif -- 2.9.4 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap Fam Zheng @ 2017-07-10 14:57 ` Stefan Hajnoczi 2017-07-10 15:07 ` Stefan Hajnoczi 1 sibling, 0 replies; 33+ messages in thread From: Stefan Hajnoczi @ 2017-07-10 14:57 UTC (permalink / raw) To: Fam Zheng Cc: qemu-devel, Paolo Bonzini, Keith Busch, qemu-block, Kevin Wolf, Max Reitz, Karl Rister [-- Attachment #1: Type: text/plain, Size: 795 bytes --] On Wed, Jul 05, 2017 at 09:36:32PM +0800, Fam Zheng wrote: > diff --git a/include/block/block_int.h b/include/block/block_int.h > index 15fa602..4092669 100644 > --- a/include/block/block_int.h > +++ b/include/block/block_int.h > @@ -381,6 +381,10 @@ struct BlockDriver { > uint64_t parent_perm, uint64_t parent_shared, > uint64_t *nperm, uint64_t *nshared); > > + /* Map and unmap a buffer for I/O, as a performance hint to the > + * driver. */ > + void (*bdrv_dma_map)(BlockDriverState *bs, void *host, size_t size); > + void (*bdrv_dma_unmap)(BlockDriverState *bs, void *host); It's unclear what this API does and how to use it correctly. Please flesh out the doc comments a bit to explain its use. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
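One way the doc comment could be fleshed out, folding in the semantics that surface later in this thread (long-lived mappings, pairing of map and unmap); suggested wording only:

    /* Map and unmap a buffer for later I/O, as a performance hint to the
     * driver.  The driver may pin the buffer for the life of the mapping,
     * so only long-lived I/O buffers should be registered, and each
     * bdrv_dma_map() call should be paired with one bdrv_dma_unmap() for
     * the same @host pointer.  Drivers that cannot use the hint may leave
     * these callbacks NULL. */
    void (*bdrv_dma_map)(BlockDriverState *bs, void *host, size_t size);
    void (*bdrv_dma_unmap)(BlockDriverState *bs, void *host);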
* Re: [Qemu-devel] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap Fam Zheng 2017-07-10 14:57 ` Stefan Hajnoczi @ 2017-07-10 15:07 ` Stefan Hajnoczi 2017-07-10 15:08 ` Paolo Bonzini 1 sibling, 1 reply; 33+ messages in thread From: Stefan Hajnoczi @ 2017-07-10 15:07 UTC (permalink / raw) To: Fam Zheng Cc: qemu-devel, Paolo Bonzini, Keith Busch, qemu-block, Kevin Wolf, Max Reitz, Karl Rister [-- Attachment #1: Type: text/plain, Size: 557 bytes --] On Wed, Jul 05, 2017 at 09:36:32PM +0800, Fam Zheng wrote: > Allow block driver to map and unmap a buffer for later I/O, as a performance > hint. The name blk_dma_map() is confusing since other "dma" APIs like dma_addr_t and dma_blk_io() deal with guest physical addresses instead of host addresses. They are about DMA to/from guest RAM. Have you considered hiding this cached mapping in block/nvme.c so that it isn't exposed? block/nvme.c could keep the last buffer mapped and callers would get the performance benefit without a new blk_dma_map() API. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap 2017-07-10 15:07 ` Stefan Hajnoczi @ 2017-07-10 15:08 ` Paolo Bonzini 2017-07-11 10:05 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi 0 siblings, 1 reply; 33+ messages in thread From: Paolo Bonzini @ 2017-07-10 15:08 UTC (permalink / raw) To: Stefan Hajnoczi, Fam Zheng Cc: qemu-devel, Keith Busch, qemu-block, Kevin Wolf, Max Reitz, Karl Rister [-- Attachment #1: Type: text/plain, Size: 731 bytes --] On 10/07/2017 17:07, Stefan Hajnoczi wrote: > On Wed, Jul 05, 2017 at 09:36:32PM +0800, Fam Zheng wrote: >> Allow block driver to map and unmap a buffer for later I/O, as a performance >> hint. > The name blk_dma_map() is confusing since other "dma" APIs like > dma_addr_t and dma_blk_io() deal with guest physical addresses instead > of host addresses. They are about DMA to/from guest RAM. > > Have you considered hiding this cached mapping in block/nvme.c so that > it isn't exposed? block/nvme.c could keep the last buffer mapped and > callers would get the performance benefit without a new blk_dma_map() > API. One buffer is enough for qemu-img bench, but not for more complex cases (e.g. fio). Paolo [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [Qemu-block] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap 2017-07-10 15:08 ` Paolo Bonzini @ 2017-07-11 10:05 ` Stefan Hajnoczi 2017-07-11 10:28 ` Paolo Bonzini 0 siblings, 1 reply; 33+ messages in thread From: Stefan Hajnoczi @ 2017-07-11 10:05 UTC (permalink / raw) To: Paolo Bonzini Cc: Stefan Hajnoczi, Fam Zheng, Kevin Wolf, qemu-block, qemu-devel, Max Reitz, Keith Busch, Karl Rister [-- Attachment #1: Type: text/plain, Size: 872 bytes --] On Mon, Jul 10, 2017 at 05:08:56PM +0200, Paolo Bonzini wrote: > On 10/07/2017 17:07, Stefan Hajnoczi wrote: > > On Wed, Jul 05, 2017 at 09:36:32PM +0800, Fam Zheng wrote: > >> Allow block driver to map and unmap a buffer for later I/O, as a performance > >> hint. > > The name blk_dma_map() is confusing since other "dma" APIs like > > dma_addr_t and dma_blk_io() deal with guest physical addresses instead > > of host addresses. They are about DMA to/from guest RAM. > > > > Have you considered hiding this cached mapping in block/nvme.c so that > > it isn't exposed? block/nvme.c could keep the last buffer mapped and > > callers would get the performance benefit without a new blk_dma_map() > > API. > > One buffer is enough for qemu-img bench, but not for more complex cases > (e.g. fio). I don't see any other blk_dma_map() callers. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [Qemu-block] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap 2017-07-11 10:05 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi @ 2017-07-11 10:28 ` Paolo Bonzini 2017-07-12 1:07 ` Fam Zheng 0 siblings, 1 reply; 33+ messages in thread From: Paolo Bonzini @ 2017-07-11 10:28 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Stefan Hajnoczi, Fam Zheng, Kevin Wolf, qemu-block, qemu-devel, Max Reitz, Keith Busch, Karl Rister [-- Attachment #1: Type: text/plain, Size: 1046 bytes --] On 11/07/2017 12:05, Stefan Hajnoczi wrote: > On Mon, Jul 10, 2017 at 05:08:56PM +0200, Paolo Bonzini wrote: >> On 10/07/2017 17:07, Stefan Hajnoczi wrote: >>> On Wed, Jul 05, 2017 at 09:36:32PM +0800, Fam Zheng wrote: >>>> Allow block driver to map and unmap a buffer for later I/O, as a performance >>>> hint. >>> The name blk_dma_map() is confusing since other "dma" APIs like >>> dma_addr_t and dma_blk_io() deal with guest physical addresses instead >>> of host addresses. They are about DMA to/from guest RAM. >>> >>> Have you considered hiding this cached mapping in block/nvme.c so that >>> it isn't exposed? block/nvme.c could keep the last buffer mapped and >>> callers would get the performance benefit without a new blk_dma_map() >>> API. >> >> One buffer is enough for qemu-img bench, but not for more complex cases >> (e.g. fio). > > I don't see any other blk_dma_map() callers. Indeed, the fio plugin is not part of this series, but it also used blk_dma_map. Without it, performance is awful. Paolo [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [Qemu-block] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap 2017-07-11 10:28 ` Paolo Bonzini @ 2017-07-12 1:07 ` Fam Zheng 2017-07-12 14:03 ` Paolo Bonzini 0 siblings, 1 reply; 33+ messages in thread From: Fam Zheng @ 2017-07-12 1:07 UTC (permalink / raw) To: Paolo Bonzini Cc: Stefan Hajnoczi, Stefan Hajnoczi, Kevin Wolf, qemu-block, qemu-devel, Max Reitz, Keith Busch, Karl Rister On Tue, 07/11 12:28, Paolo Bonzini wrote: > On 11/07/2017 12:05, Stefan Hajnoczi wrote: > > On Mon, Jul 10, 2017 at 05:08:56PM +0200, Paolo Bonzini wrote: > >> On 10/07/2017 17:07, Stefan Hajnoczi wrote: > >>> On Wed, Jul 05, 2017 at 09:36:32PM +0800, Fam Zheng wrote: > >>>> Allow block driver to map and unmap a buffer for later I/O, as a performance > >>>> hint. > >>> The name blk_dma_map() is confusing since other "dma" APIs like > >>> dma_addr_t and dma_blk_io() deal with guest physical addresses instead > >>> of host addresses. They are about DMA to/from guest RAM. > >>> > >>> Have you considered hiding this cached mapping in block/nvme.c so that > >>> it isn't exposed? block/nvme.c could keep the last buffer mapped and > >>> callers would get the performance benefit without a new blk_dma_map() > >>> API. > >> > >> One buffer is enough for qemu-img bench, but not for more complex cases > >> (e.g. fio). > > > > I don't see any other blk_dma_map() callers. > > Indeed, the fio plugin is not part of this series, but it also used > blk_dma_map. Without it, performance is awful. How many buffers does fio use, typically? If it's not too many, block/nvme.c can cache the last N buffers. I'm with Stefan that hiding the mapping logic from block layer callers makes a nicer API, especially since it makes it much easier for qemu-img to maintain good performance across subcommands. Fam ^ permalink raw reply [flat|nested] 33+ messages in thread
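To make the "cache the last N buffers" idea concrete, a sketch of what such a cache inside block/nvme.c might look like; all names here are illustrative and none of this is in the series:

    #include <stddef.h>
    #include <stdint.h>

    #define MAP_CACHE_SIZE 8    /* "N": small, as fio reuses few buffers */

    typedef struct {
        void *host;
        size_t size;
        uint64_t iova;
    } CachedMapping;

    typedef struct {
        CachedMapping entries[MAP_CACHE_SIZE];
        int next_victim;        /* round-robin eviction cursor */
    } MapCache;

    /* Return the cached entry covering [host, host + size), if any. */
    static CachedMapping *map_cache_lookup(MapCache *c, void *host,
                                           size_t size)
    {
        for (int i = 0; i < MAP_CACHE_SIZE; i++) {
            CachedMapping *m = &c->entries[i];

            if (m->host && (char *)host >= (char *)m->host &&
                (char *)host + size <= (char *)m->host + m->size) {
                return m;
            }
        }
        return NULL;
    }

    /* Pick a slot to reuse; the caller unmaps the old entry first. */
    static CachedMapping *map_cache_evict(MapCache *c)
    {
        CachedMapping *m = &c->entries[c->next_victim];

        c->next_victim = (c->next_victim + 1) % MAP_CACHE_SIZE;
        return m;
    }

As the reply below notes, the weakness is queue depth: once more than N distinct buffers are in flight the cache thrashes, which is one argument for keeping an explicit registration API.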
* Re: [Qemu-devel] [Qemu-block] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap 2017-07-12 1:07 ` Fam Zheng @ 2017-07-12 14:03 ` Paolo Bonzini 2017-07-14 13:37 ` Stefan Hajnoczi 0 siblings, 1 reply; 33+ messages in thread From: Paolo Bonzini @ 2017-07-12 14:03 UTC (permalink / raw) To: Fam Zheng Cc: Stefan Hajnoczi, Stefan Hajnoczi, Kevin Wolf, qemu-block, qemu-devel, Max Reitz, Keith Busch, Karl Rister On 12/07/2017 03:07, Fam Zheng wrote: > On Tue, 07/11 12:28, Paolo Bonzini wrote: >> On 11/07/2017 12:05, Stefan Hajnoczi wrote: >>> On Mon, Jul 10, 2017 at 05:08:56PM +0200, Paolo Bonzini wrote: >>>> On 10/07/2017 17:07, Stefan Hajnoczi wrote: >>>>> On Wed, Jul 05, 2017 at 09:36:32PM +0800, Fam Zheng wrote: >>>>>> Allow block driver to map and unmap a buffer for later I/O, as a performance >>>>>> hint. >>>>> The name blk_dma_map() is confusing since other "dma" APIs like >>>>> dma_addr_t and dma_blk_io() deal with guest physical addresses instead >>>>> of host addresses. They are about DMA to/from guest RAM. >>>>> >>>>> Have you considered hiding this cached mapping in block/nvme.c so that >>>>> it isn't exposed? block/nvme.c could keep the last buffer mapped and >>>>> callers would get the performance benefit without a new blk_dma_map() >>>>> API. >>>> >>>> One buffer is enough for qemu-img bench, but not for more complex cases >>>> (e.g. fio). >>> >>> I don't see any other blk_dma_map() callers. >> >> Indeed, the fio plugin is not part of this series, but it also used >> blk_dma_map. Without it, performance is awful. > > How many buffers does fio use, typically? If it's not too many, block/nvme.c can > cache the last N buffers. I'm with Stefan that hiding the mapping logic from > block layer callers makes a nicer API, especially such that qemu-img is much > easier to maintain good performance across subcommmands. It depends on the queue depth. I think the API addition is necessary, otherwise we wouldn't have added the RAMBlockNotifier which is a layering violation that does the same thing (create permanent HVA->IOVA mappings). In fact, the RAMBlockNotifier could be moved out of nvme.c and made to use blk_dma_map/unmap, though I'm not proposing to do it now. I don't think qemu-img convert and dd are impacted by IOMMU map/unmap as heavily as bench, because they operate with queue depth 1. But adding map/unmap there would not be hard. Paolo ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [Qemu-block] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap 2017-07-12 14:03 ` Paolo Bonzini @ 2017-07-14 13:37 ` Stefan Hajnoczi 2017-07-14 13:46 ` Paolo Bonzini 0 siblings, 1 reply; 33+ messages in thread From: Stefan Hajnoczi @ 2017-07-14 13:37 UTC (permalink / raw) To: Paolo Bonzini Cc: Fam Zheng, Stefan Hajnoczi, Kevin Wolf, qemu-block, qemu-devel, Max Reitz, Keith Busch, Karl Rister [-- Attachment #1: Type: text/plain, Size: 2399 bytes --] On Wed, Jul 12, 2017 at 04:03:57PM +0200, Paolo Bonzini wrote: > On 12/07/2017 03:07, Fam Zheng wrote: > > On Tue, 07/11 12:28, Paolo Bonzini wrote: > >> On 11/07/2017 12:05, Stefan Hajnoczi wrote: > >>> On Mon, Jul 10, 2017 at 05:08:56PM +0200, Paolo Bonzini wrote: > >>>> On 10/07/2017 17:07, Stefan Hajnoczi wrote: > >>>>> On Wed, Jul 05, 2017 at 09:36:32PM +0800, Fam Zheng wrote: > >>>>>> Allow block driver to map and unmap a buffer for later I/O, as a performance > >>>>>> hint. > >>>>> The name blk_dma_map() is confusing since other "dma" APIs like > >>>>> dma_addr_t and dma_blk_io() deal with guest physical addresses instead > >>>>> of host addresses. They are about DMA to/from guest RAM. > >>>>> > >>>>> Have you considered hiding this cached mapping in block/nvme.c so that > >>>>> it isn't exposed? block/nvme.c could keep the last buffer mapped and > >>>>> callers would get the performance benefit without a new blk_dma_map() > >>>>> API. > >>>> > >>>> One buffer is enough for qemu-img bench, but not for more complex cases > >>>> (e.g. fio). > >>> > >>> I don't see any other blk_dma_map() callers. > >> > >> Indeed, the fio plugin is not part of this series, but it also used > >> blk_dma_map. Without it, performance is awful. > > > > How many buffers does fio use, typically? If it's not too many, block/nvme.c can > > cache the last N buffers. I'm with Stefan that hiding the mapping logic from > > block layer callers makes a nicer API, especially such that qemu-img is much > > easier to maintain good performance across subcommmands. > > It depends on the queue depth. > > I think the API addition is necessary, otherwise we wouldn't have added > the RAMBlockNotifier which is a layering violation that does the same > thing (create permanent HVA->IOVA mappings). In fact, the > RAMBlockNotifier could be moved out of nvme.c and made to use > blk_dma_map/unmap, though I'm not proposing to do it now. > > I don't think qemu-img convert and dd are impacted by IOMMU map/unmap as > heavily as bench, because they operate with queue depth 1. But adding > map/unmap there would not be hard. I'm not against an API existing for this. I would just ask: 1. It's documented so the purpose and semantics are clear. 2. The name cannot be confused with dma-helpers.c APIs. Maybe blk_register_buf() or blk_add_buf_hint()? [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [Qemu-block] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap 2017-07-14 13:37 ` Stefan Hajnoczi @ 2017-07-14 13:46 ` Paolo Bonzini 0 siblings, 0 replies; 33+ messages in thread From: Paolo Bonzini @ 2017-07-14 13:46 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Fam Zheng, Stefan Hajnoczi, Kevin Wolf, qemu-block, qemu-devel, Max Reitz, Keith Busch, Karl Rister [-- Attachment #1: Type: text/plain, Size: 2581 bytes --] On 14/07/2017 15:37, Stefan Hajnoczi wrote: > On Wed, Jul 12, 2017 at 04:03:57PM +0200, Paolo Bonzini wrote: >> On 12/07/2017 03:07, Fam Zheng wrote: >>> On Tue, 07/11 12:28, Paolo Bonzini wrote: >>>> On 11/07/2017 12:05, Stefan Hajnoczi wrote: >>>>> On Mon, Jul 10, 2017 at 05:08:56PM +0200, Paolo Bonzini wrote: >>>>>> On 10/07/2017 17:07, Stefan Hajnoczi wrote: >>>>>>> On Wed, Jul 05, 2017 at 09:36:32PM +0800, Fam Zheng wrote: >>>>>>>> Allow block driver to map and unmap a buffer for later I/O, as a performance >>>>>>>> hint. >>>>>>> The name blk_dma_map() is confusing since other "dma" APIs like >>>>>>> dma_addr_t and dma_blk_io() deal with guest physical addresses instead >>>>>>> of host addresses. They are about DMA to/from guest RAM. >>>>>>> >>>>>>> Have you considered hiding this cached mapping in block/nvme.c so that >>>>>>> it isn't exposed? block/nvme.c could keep the last buffer mapped and >>>>>>> callers would get the performance benefit without a new blk_dma_map() >>>>>>> API. >>>>>> >>>>>> One buffer is enough for qemu-img bench, but not for more complex cases >>>>>> (e.g. fio). >>>>> >>>>> I don't see any other blk_dma_map() callers. >>>> >>>> Indeed, the fio plugin is not part of this series, but it also used >>>> blk_dma_map. Without it, performance is awful. >>> >>> How many buffers does fio use, typically? If it's not too many, block/nvme.c can >>> cache the last N buffers. I'm with Stefan that hiding the mapping logic from >>> block layer callers makes a nicer API, especially such that qemu-img is much >>> easier to maintain good performance across subcommmands. >> >> It depends on the queue depth. >> >> I think the API addition is necessary, otherwise we wouldn't have added >> the RAMBlockNotifier which is a layering violation that does the same >> thing (create permanent HVA->IOVA mappings). In fact, the >> RAMBlockNotifier could be moved out of nvme.c and made to use >> blk_dma_map/unmap, though I'm not proposing to do it now. >> >> I don't think qemu-img convert and dd are impacted by IOMMU map/unmap as >> heavily as bench, because they operate with queue depth 1. But adding >> map/unmap there would not be hard. > > I'm not against an API existing for this. I would just ask: > > 1. It's documented so the purpose and semantics are clear. > 2. The name cannot be confused with dma-helpers.c APIs. Yes, I agree completely. > Maybe blk_register_buf() or blk_add_buf_hint()? blk_(un)register_buf, or perhaps iobuf or io_buffer, sounds good to me. Paolo [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
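For concreteness, the renamed entry points would presumably keep the signatures introduced by this series (a sketch following the naming just agreed on):

    void blk_register_buf(BlockBackend *blk, void *host, size_t size);
    void blk_unregister_buf(BlockBackend *blk, void *host);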
* [Qemu-devel] [PATCH v3 4/6] block/nvme: Implement .bdrv_dma_map and .bdrv_dma_unmap 2017-07-05 13:36 [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device Fam Zheng ` (2 preceding siblings ...) 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 3/6] block: Introduce bdrv_dma_map and bdrv_dma_unmap Fam Zheng @ 2017-07-05 13:36 ` Fam Zheng 2017-07-10 14:59 ` Stefan Hajnoczi 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 5/6] qemu-img: Map bench buffer Fam Zheng ` (3 subsequent siblings) 7 siblings, 1 reply; 33+ messages in thread From: Fam Zheng @ 2017-07-05 13:36 UTC (permalink / raw) To: qemu-devel Cc: Paolo Bonzini, Keith Busch, qemu-block, Fam Zheng, Kevin Wolf, Max Reitz, Stefan Hajnoczi, Karl Rister Forward these two calls to the IOVA manager. Signed-off-by: Fam Zheng <famz@redhat.com> --- block/nvme.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/block/nvme.c b/block/nvme.c index eb999a1..7913017 100644 --- a/block/nvme.c +++ b/block/nvme.c @@ -1056,6 +1056,20 @@ static void nvme_aio_unplug(BlockDriverState *bs) } } +static void nvme_dma_map(BlockDriverState *bs, void *host, size_t size) +{ + BDRVNVMeState *s = bs->opaque; + + nvme_vfio_dma_map(s->vfio, host, size, false, NULL); +} + +static void nvme_dma_unmap(BlockDriverState *bs, void *host) +{ + BDRVNVMeState *s = bs->opaque; + + nvme_vfio_dma_unmap(s->vfio, host); +} + static BlockDriver bdrv_nvme = { .format_name = "nvme", .protocol_name = "nvme", @@ -1081,6 +1095,9 @@ static BlockDriver bdrv_nvme = { .bdrv_io_plug = nvme_aio_plug, .bdrv_io_unplug = nvme_aio_unplug, + + .bdrv_dma_map = nvme_dma_map, + .bdrv_dma_unmap = nvme_dma_unmap, }; static void bdrv_nvme_init(void) -- 2.9.4 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH v3 4/6] block/nvme: Implement .bdrv_dma_map and .bdrv_dma_unmap 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 4/6] block/nvme: Implement .bdrv_dma_map and .bdrv_dma_unmap Fam Zheng @ 2017-07-10 14:59 ` Stefan Hajnoczi 2017-07-10 15:09 ` Paolo Bonzini 0 siblings, 1 reply; 33+ messages in thread From: Stefan Hajnoczi @ 2017-07-10 14:59 UTC (permalink / raw) To: Fam Zheng Cc: qemu-devel, Paolo Bonzini, Keith Busch, qemu-block, Kevin Wolf, Max Reitz, Karl Rister [-- Attachment #1: Type: text/plain, Size: 348 bytes --] On Wed, Jul 05, 2017 at 09:36:33PM +0800, Fam Zheng wrote: > +static void nvme_dma_map(BlockDriverState *bs, void *host, size_t size) > +{ > + BDRVNVMeState *s = bs->opaque; > + > + nvme_vfio_dma_map(s->vfio, host, size, false, NULL); Since temporary=false, repeated calls to map/unmap will run out of space and stop working after some time? [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH v3 4/6] block/nvme: Implement .bdrv_dma_map and .bdrv_dma_unmap 2017-07-10 14:59 ` Stefan Hajnoczi @ 2017-07-10 15:09 ` Paolo Bonzini 2017-07-11 10:04 ` Stefan Hajnoczi 0 siblings, 1 reply; 33+ messages in thread From: Paolo Bonzini @ 2017-07-10 15:09 UTC (permalink / raw) To: Stefan Hajnoczi, Fam Zheng Cc: qemu-devel, Keith Busch, qemu-block, Kevin Wolf, Max Reitz, Karl Rister [-- Attachment #1: Type: text/plain, Size: 826 bytes --] On 10/07/2017 16:59, Stefan Hajnoczi wrote: >> +static void nvme_dma_map(BlockDriverState *bs, void *host, size_t size) >> +{ >> + BDRVNVMeState *s = bs->opaque; >> + >> + nvme_vfio_dma_map(s->vfio, host, size, false, NULL); > Since temporary=false repeated calls to map/unmap will run out of space > and stop working after some time? Yes, the point of bdrv_dma_map/unmap is to add a permanent mapping. Temporary mappings are only valid inside nvme.c, because the corresponding iova is not recorded anywhere. Instead, bdrv_dma_map/unmap cache the iova just like we do for RAMBlock areas during system emulation. The solution is simply not to do that, just like img_bench only calls map/unmap once. If it happens, things just become slower as the driver falls back to temporary mappings. Paolo [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
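The two modes Paolo describes map onto the last two parameters of nvme_vfio_dma_map() in the series; schematically (call shapes taken from the quoted code, with comments paraphrasing the explanation above):

    /* permanent: the HVA->IOVA pair is recorded in s->mappings[] and the
     * low water mark grows; later I/O reuses the mapping for free */
    nvme_vfio_dma_map(s->vfio, host, size, false, NULL);

    /* temporary: the IOVA is carved from the high end and returned to
     * the caller only; nvme_vfio_dma_reset_temporary() recycles the
     * whole region once in-flight requests have drained */
    nvme_vfio_dma_map(s->vfio, host, size, true, &iova);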
* Re: [Qemu-devel] [PATCH v3 4/6] block/nvme: Implement .bdrv_dma_map and .bdrv_dma_unmap 2017-07-10 15:09 ` Paolo Bonzini @ 2017-07-11 10:04 ` Stefan Hajnoczi 0 siblings, 0 replies; 33+ messages in thread From: Stefan Hajnoczi @ 2017-07-11 10:04 UTC (permalink / raw) To: Paolo Bonzini Cc: Stefan Hajnoczi, Fam Zheng, Kevin Wolf, qemu-block, qemu-devel, Max Reitz, Keith Busch, Karl Rister [-- Attachment #1: Type: text/plain, Size: 1065 bytes --] On Mon, Jul 10, 2017 at 05:09:25PM +0200, Paolo Bonzini wrote: > On 10/07/2017 16:59, Stefan Hajnoczi wrote: > >> +static void nvme_dma_map(BlockDriverState *bs, void *host, size_t size) > >> +{ > >> + BDRVNVMeState *s = bs->opaque; > >> + > >> + nvme_vfio_dma_map(s->vfio, host, size, false, NULL); > > Since temporary=false, repeated calls to map/unmap will run out of space > > and stop working after some time? > > Yes, the point of bdrv_dma_map/unmap is to add a permanent mapping. > Temporary mappings are only valid inside nvme.c, because the > corresponding iova is not recorded anywhere. Instead, > bdrv_dma_map/unmap cache the iova just like we do for RAMBlock areas > during system emulation. > > The solution is simply not to do that, just like img_bench only calls > map/unmap once. If it happens, things just become slower as the driver > falls back to temporary mappings. The constraints need to be documented. Someone might try to use blk_dma_map() and waste time debugging poor performance in the future. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
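As a sketch of the documentation Stefan asks for, a doc comment on the BlockBackend wrapper might look like this (the wording is a suggestion, not part of the series):

    /**
     * blk_dma_map:
     * @blk: block backend whose driver may pre-map the buffer
     * @host: buffer to map; must stay valid until blk_dma_unmap()
     * @size: length of the buffer in bytes
     *
     * Register a long-lived I/O buffer with the underlying driver so that
     * requests using it can skip per-request DMA mapping. Mappings are
     * permanent: IOVA space is only reclaimed by blk_dma_unmap(), so map a
     * few large, long-lived buffers (as qemu-img bench does), not one
     * buffer per request. Unmapped buffers still work, but fall back to
     * slower temporary mappings.
     */
    void blk_dma_map(BlockBackend *blk, void *host, size_t size);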
* [Qemu-devel] [PATCH v3 5/6] qemu-img: Map bench buffer 2017-07-05 13:36 [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device Fam Zheng ` (3 preceding siblings ...) 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 4/6] block/nvme: Implement .bdrv_dma_map and .bdrv_dma_unmap Fam Zheng @ 2017-07-05 13:36 ` Fam Zheng 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 6/6] block: Move NVMe spec definitions to a separate header Fam Zheng ` (2 subsequent siblings) 7 siblings, 0 replies; 33+ messages in thread From: Fam Zheng @ 2017-07-05 13:36 UTC (permalink / raw) To: qemu-devel Cc: Paolo Bonzini, Keith Busch, qemu-block, Fam Zheng, Kevin Wolf, Max Reitz, Stefan Hajnoczi, Karl Rister Signed-off-by: Fam Zheng <famz@redhat.com> --- qemu-img.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/qemu-img.c b/qemu-img.c index 91ad6be..fea156c 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -3875,6 +3875,7 @@ static int img_bench(int argc, char **argv) struct timeval t1, t2; int i; bool force_share = false; + size_t buf_size; for (;;) { static const struct option long_options[] = { @@ -4063,9 +4064,12 @@ static int img_bench(int argc, char **argv) printf("Sending flush every %d requests\n", flush_interval); } - data.buf = blk_blockalign(blk, data.nrreq * data.bufsize); + buf_size = data.nrreq * data.bufsize; + data.buf = blk_blockalign(blk, buf_size); memset(data.buf, pattern, data.nrreq * data.bufsize); + blk_dma_map(blk, data.buf, buf_size); + data.qiov = g_new(QEMUIOVector, data.nrreq); for (i = 0; i < data.nrreq; i++) { qemu_iovec_init(&data.qiov[i], 1); @@ -4086,6 +4090,9 @@ static int img_bench(int argc, char **argv) + ((double)(t2.tv_usec - t1.tv_usec) / 1000000)); out: + if (data.buf) { + blk_dma_unmap(blk, data.buf); + } qemu_vfree(data.buf); blk_unref(blk); -- 2.9.4 ^ permalink raw reply related [flat|nested] 33+ messages in thread
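The change is small but the shape matters: map once before the timed region, unmap once after, so the benchmark measures I/O rather than repeated map/unmap ioctls. Reduced to its core (run_timed_benchmark is a hypothetical stand-in for img_bench's request loop between t1 and t2, and error handling is omitted):

    buf_size = data.nrreq * data.bufsize;
    data.buf = blk_blockalign(blk, buf_size);  /* driver-aligned allocation */
    memset(data.buf, pattern, buf_size);

    blk_dma_map(blk, data.buf, buf_size);      /* permanent mapping, once */

    run_timed_benchmark(&data);                /* hypothetical: the timed
                                                * aio loop in img_bench */

    blk_dma_unmap(blk, data.buf);              /* release the IOVA space */
    qemu_vfree(data.buf);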
* [Qemu-devel] [PATCH v3 6/6] block: Move NVMe spec definitions to a separate header 2017-07-05 13:36 [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device Fam Zheng ` (4 preceding siblings ...) 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 5/6] qemu-img: Map bench buffer Fam Zheng @ 2017-07-05 13:36 ` Fam Zheng 2017-07-05 13:39 ` Paolo Bonzini 2017-07-10 15:01 ` Stefan Hajnoczi 2017-07-05 13:41 ` [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device Paolo Bonzini 2017-07-06 14:06 ` no-reply 7 siblings, 2 replies; 33+ messages in thread From: Fam Zheng @ 2017-07-05 13:36 UTC (permalink / raw) To: qemu-devel Cc: Paolo Bonzini, Keith Busch, qemu-block, Fam Zheng, Kevin Wolf, Max Reitz, Stefan Hajnoczi, Karl Rister Signed-off-by: Fam Zheng <famz@redhat.com> --- block/nvme.c | 7 +- block/nvme.h | 700 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ hw/block/nvme.h | 698 +------------------------------------------------------ 3 files changed, 702 insertions(+), 703 deletions(-) create mode 100644 block/nvme.h diff --git a/block/nvme.c b/block/nvme.c index 7913017..2680b29 100644 --- a/block/nvme.c +++ b/block/nvme.c @@ -22,12 +22,7 @@ #include "block/nvme-vfio.h" #include "trace.h" -/* TODO: Move nvme spec definitions from hw/block/nvme.h into a separate file - * that doesn't depend on dma/pci headers. */ -#include "sysemu/dma.h" -#include "hw/pci/pci.h" -#include "hw/block/block.h" -#include "hw/block/nvme.h" +#include "block/nvme.h" #define NVME_SQ_ENTRY_BYTES 64 #define NVME_CQ_ENTRY_BYTES 16 diff --git a/block/nvme.h b/block/nvme.h new file mode 100644 index 0000000..ed18091 --- /dev/null +++ b/block/nvme.h @@ -0,0 +1,700 @@ +#ifndef BLOCK_NVME_H +#define BLOCK_NVME_H + +typedef struct NvmeBar { + uint64_t cap; + uint32_t vs; + uint32_t intms; + uint32_t intmc; + uint32_t cc; + uint32_t rsvd1; + uint32_t csts; + uint32_t nssrc; + uint32_t aqa; + uint64_t asq; + uint64_t acq; + uint32_t cmbloc; + uint32_t cmbsz; +} NvmeBar; + +enum NvmeCapShift { + CAP_MQES_SHIFT = 0, + CAP_CQR_SHIFT = 16, + CAP_AMS_SHIFT = 17, + CAP_TO_SHIFT = 24, + CAP_DSTRD_SHIFT = 32, + CAP_NSSRS_SHIFT = 33, + CAP_CSS_SHIFT = 37, + CAP_MPSMIN_SHIFT = 48, + CAP_MPSMAX_SHIFT = 52, +}; + +enum NvmeCapMask { + CAP_MQES_MASK = 0xffff, + CAP_CQR_MASK = 0x1, + CAP_AMS_MASK = 0x3, + CAP_TO_MASK = 0xff, + CAP_DSTRD_MASK = 0xf, + CAP_NSSRS_MASK = 0x1, + CAP_CSS_MASK = 0xff, + CAP_MPSMIN_MASK = 0xf, + CAP_MPSMAX_MASK = 0xf, +}; + +#define NVME_CAP_MQES(cap) (((cap) >> CAP_MQES_SHIFT) & CAP_MQES_MASK) +#define NVME_CAP_CQR(cap) (((cap) >> CAP_CQR_SHIFT) & CAP_CQR_MASK) +#define NVME_CAP_AMS(cap) (((cap) >> CAP_AMS_SHIFT) & CAP_AMS_MASK) +#define NVME_CAP_TO(cap) (((cap) >> CAP_TO_SHIFT) & CAP_TO_MASK) +#define NVME_CAP_DSTRD(cap) (((cap) >> CAP_DSTRD_SHIFT) & CAP_DSTRD_MASK) +#define NVME_CAP_NSSRS(cap) (((cap) >> CAP_NSSRS_SHIFT) & CAP_NSSRS_MASK) +#define NVME_CAP_CSS(cap) (((cap) >> CAP_CSS_SHIFT) & CAP_CSS_MASK) +#define NVME_CAP_MPSMIN(cap)(((cap) >> CAP_MPSMIN_SHIFT) & CAP_MPSMIN_MASK) +#define NVME_CAP_MPSMAX(cap)(((cap) >> CAP_MPSMAX_SHIFT) & CAP_MPSMAX_MASK) + +#define NVME_CAP_SET_MQES(cap, val) (cap |= (uint64_t)(val & CAP_MQES_MASK) \ + << CAP_MQES_SHIFT) +#define NVME_CAP_SET_CQR(cap, val) (cap |= (uint64_t)(val & CAP_CQR_MASK) \ + << CAP_CQR_SHIFT) +#define NVME_CAP_SET_AMS(cap, val) (cap |= (uint64_t)(val & CAP_AMS_MASK) \ + << CAP_AMS_SHIFT) +#define NVME_CAP_SET_TO(cap, val) (cap |= (uint64_t)(val & CAP_TO_MASK) \ + << CAP_TO_SHIFT) +#define NVME_CAP_SET_DSTRD(cap, val) (cap 
|= (uint64_t)(val & CAP_DSTRD_MASK) \ + << CAP_DSTRD_SHIFT) +#define NVME_CAP_SET_NSSRS(cap, val) (cap |= (uint64_t)(val & CAP_NSSRS_MASK) \ + << CAP_NSSRS_SHIFT) +#define NVME_CAP_SET_CSS(cap, val) (cap |= (uint64_t)(val & CAP_CSS_MASK) \ + << CAP_CSS_SHIFT) +#define NVME_CAP_SET_MPSMIN(cap, val) (cap |= (uint64_t)(val & CAP_MPSMIN_MASK)\ + << CAP_MPSMIN_SHIFT) +#define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & CAP_MPSMAX_MASK)\ + << CAP_MPSMAX_SHIFT) + +enum NvmeCcShift { + CC_EN_SHIFT = 0, + CC_CSS_SHIFT = 4, + CC_MPS_SHIFT = 7, + CC_AMS_SHIFT = 11, + CC_SHN_SHIFT = 14, + CC_IOSQES_SHIFT = 16, + CC_IOCQES_SHIFT = 20, +}; + +enum NvmeCcMask { + CC_EN_MASK = 0x1, + CC_CSS_MASK = 0x7, + CC_MPS_MASK = 0xf, + CC_AMS_MASK = 0x7, + CC_SHN_MASK = 0x3, + CC_IOSQES_MASK = 0xf, + CC_IOCQES_MASK = 0xf, +}; + +#define NVME_CC_EN(cc) ((cc >> CC_EN_SHIFT) & CC_EN_MASK) +#define NVME_CC_CSS(cc) ((cc >> CC_CSS_SHIFT) & CC_CSS_MASK) +#define NVME_CC_MPS(cc) ((cc >> CC_MPS_SHIFT) & CC_MPS_MASK) +#define NVME_CC_AMS(cc) ((cc >> CC_AMS_SHIFT) & CC_AMS_MASK) +#define NVME_CC_SHN(cc) ((cc >> CC_SHN_SHIFT) & CC_SHN_MASK) +#define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK) +#define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK) + +enum NvmeCstsShift { + CSTS_RDY_SHIFT = 0, + CSTS_CFS_SHIFT = 1, + CSTS_SHST_SHIFT = 2, + CSTS_NSSRO_SHIFT = 4, +}; + +enum NvmeCstsMask { + CSTS_RDY_MASK = 0x1, + CSTS_CFS_MASK = 0x1, + CSTS_SHST_MASK = 0x3, + CSTS_NSSRO_MASK = 0x1, +}; + +enum NvmeCsts { + NVME_CSTS_READY = 1 << CSTS_RDY_SHIFT, + NVME_CSTS_FAILED = 1 << CSTS_CFS_SHIFT, + NVME_CSTS_SHST_NORMAL = 0 << CSTS_SHST_SHIFT, + NVME_CSTS_SHST_PROGRESS = 1 << CSTS_SHST_SHIFT, + NVME_CSTS_SHST_COMPLETE = 2 << CSTS_SHST_SHIFT, + NVME_CSTS_NSSRO = 1 << CSTS_NSSRO_SHIFT, +}; + +#define NVME_CSTS_RDY(csts) ((csts >> CSTS_RDY_SHIFT) & CSTS_RDY_MASK) +#define NVME_CSTS_CFS(csts) ((csts >> CSTS_CFS_SHIFT) & CSTS_CFS_MASK) +#define NVME_CSTS_SHST(csts) ((csts >> CSTS_SHST_SHIFT) & CSTS_SHST_MASK) +#define NVME_CSTS_NSSRO(csts) ((csts >> CSTS_NSSRO_SHIFT) & CSTS_NSSRO_MASK) + +enum NvmeAqaShift { + AQA_ASQS_SHIFT = 0, + AQA_ACQS_SHIFT = 16, +}; + +enum NvmeAqaMask { + AQA_ASQS_MASK = 0xfff, + AQA_ACQS_MASK = 0xfff, +}; + +#define NVME_AQA_ASQS(aqa) ((aqa >> AQA_ASQS_SHIFT) & AQA_ASQS_MASK) +#define NVME_AQA_ACQS(aqa) ((aqa >> AQA_ACQS_SHIFT) & AQA_ACQS_MASK) + +enum NvmeCmblocShift { + CMBLOC_BIR_SHIFT = 0, + CMBLOC_OFST_SHIFT = 12, +}; + +enum NvmeCmblocMask { + CMBLOC_BIR_MASK = 0x7, + CMBLOC_OFST_MASK = 0xfffff, +}; + +#define NVME_CMBLOC_BIR(cmbloc) ((cmbloc >> CMBLOC_BIR_SHIFT) & \ + CMBLOC_BIR_MASK) +#define NVME_CMBLOC_OFST(cmbloc)((cmbloc >> CMBLOC_OFST_SHIFT) & \ + CMBLOC_OFST_MASK) + +#define NVME_CMBLOC_SET_BIR(cmbloc, val) \ + (cmbloc |= (uint64_t)(val & CMBLOC_BIR_MASK) << CMBLOC_BIR_SHIFT) +#define NVME_CMBLOC_SET_OFST(cmbloc, val) \ + (cmbloc |= (uint64_t)(val & CMBLOC_OFST_MASK) << CMBLOC_OFST_SHIFT) + +enum NvmeCmbszShift { + CMBSZ_SQS_SHIFT = 0, + CMBSZ_CQS_SHIFT = 1, + CMBSZ_LISTS_SHIFT = 2, + CMBSZ_RDS_SHIFT = 3, + CMBSZ_WDS_SHIFT = 4, + CMBSZ_SZU_SHIFT = 8, + CMBSZ_SZ_SHIFT = 12, +}; + +enum NvmeCmbszMask { + CMBSZ_SQS_MASK = 0x1, + CMBSZ_CQS_MASK = 0x1, + CMBSZ_LISTS_MASK = 0x1, + CMBSZ_RDS_MASK = 0x1, + CMBSZ_WDS_MASK = 0x1, + CMBSZ_SZU_MASK = 0xf, + CMBSZ_SZ_MASK = 0xfffff, +}; + +#define NVME_CMBSZ_SQS(cmbsz) ((cmbsz >> CMBSZ_SQS_SHIFT) & CMBSZ_SQS_MASK) +#define NVME_CMBSZ_CQS(cmbsz) ((cmbsz >> CMBSZ_CQS_SHIFT) & CMBSZ_CQS_MASK) +#define 
NVME_CMBSZ_LISTS(cmbsz)((cmbsz >> CMBSZ_LISTS_SHIFT) & CMBSZ_LISTS_MASK) +#define NVME_CMBSZ_RDS(cmbsz) ((cmbsz >> CMBSZ_RDS_SHIFT) & CMBSZ_RDS_MASK) +#define NVME_CMBSZ_WDS(cmbsz) ((cmbsz >> CMBSZ_WDS_SHIFT) & CMBSZ_WDS_MASK) +#define NVME_CMBSZ_SZU(cmbsz) ((cmbsz >> CMBSZ_SZU_SHIFT) & CMBSZ_SZU_MASK) +#define NVME_CMBSZ_SZ(cmbsz) ((cmbsz >> CMBSZ_SZ_SHIFT) & CMBSZ_SZ_MASK) + +#define NVME_CMBSZ_SET_SQS(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_SQS_MASK) << CMBSZ_SQS_SHIFT) +#define NVME_CMBSZ_SET_CQS(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_CQS_MASK) << CMBSZ_CQS_SHIFT) +#define NVME_CMBSZ_SET_LISTS(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_LISTS_MASK) << CMBSZ_LISTS_SHIFT) +#define NVME_CMBSZ_SET_RDS(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_RDS_MASK) << CMBSZ_RDS_SHIFT) +#define NVME_CMBSZ_SET_WDS(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_WDS_MASK) << CMBSZ_WDS_SHIFT) +#define NVME_CMBSZ_SET_SZU(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_SZU_MASK) << CMBSZ_SZU_SHIFT) +#define NVME_CMBSZ_SET_SZ(cmbsz, val) \ + (cmbsz |= (uint64_t)(val & CMBSZ_SZ_MASK) << CMBSZ_SZ_SHIFT) + +#define NVME_CMBSZ_GETSIZE(cmbsz) \ + (NVME_CMBSZ_SZ(cmbsz) * (1 << (12 + 4 * NVME_CMBSZ_SZU(cmbsz)))) + +typedef struct NvmeCmd { + uint8_t opcode; + uint8_t fuse; + uint16_t cid; + uint32_t nsid; + uint64_t res1; + uint64_t mptr; + uint64_t prp1; + uint64_t prp2; + uint32_t cdw10; + uint32_t cdw11; + uint32_t cdw12; + uint32_t cdw13; + uint32_t cdw14; + uint32_t cdw15; +} NvmeCmd; + +enum NvmeAdminCommands { + NVME_ADM_CMD_DELETE_SQ = 0x00, + NVME_ADM_CMD_CREATE_SQ = 0x01, + NVME_ADM_CMD_GET_LOG_PAGE = 0x02, + NVME_ADM_CMD_DELETE_CQ = 0x04, + NVME_ADM_CMD_CREATE_CQ = 0x05, + NVME_ADM_CMD_IDENTIFY = 0x06, + NVME_ADM_CMD_ABORT = 0x08, + NVME_ADM_CMD_SET_FEATURES = 0x09, + NVME_ADM_CMD_GET_FEATURES = 0x0a, + NVME_ADM_CMD_ASYNC_EV_REQ = 0x0c, + NVME_ADM_CMD_ACTIVATE_FW = 0x10, + NVME_ADM_CMD_DOWNLOAD_FW = 0x11, + NVME_ADM_CMD_FORMAT_NVM = 0x80, + NVME_ADM_CMD_SECURITY_SEND = 0x81, + NVME_ADM_CMD_SECURITY_RECV = 0x82, +}; + +enum NvmeIoCommands { + NVME_CMD_FLUSH = 0x00, + NVME_CMD_WRITE = 0x01, + NVME_CMD_READ = 0x02, + NVME_CMD_WRITE_UNCOR = 0x04, + NVME_CMD_COMPARE = 0x05, + NVME_CMD_WRITE_ZEROS = 0x08, + NVME_CMD_DSM = 0x09, +}; + +typedef struct NvmeDeleteQ { + uint8_t opcode; + uint8_t flags; + uint16_t cid; + uint32_t rsvd1[9]; + uint16_t qid; + uint16_t rsvd10; + uint32_t rsvd11[5]; +} NvmeDeleteQ; + +typedef struct NvmeCreateCq { + uint8_t opcode; + uint8_t flags; + uint16_t cid; + uint32_t rsvd1[5]; + uint64_t prp1; + uint64_t rsvd8; + uint16_t cqid; + uint16_t qsize; + uint16_t cq_flags; + uint16_t irq_vector; + uint32_t rsvd12[4]; +} NvmeCreateCq; + +#define NVME_CQ_FLAGS_PC(cq_flags) (cq_flags & 0x1) +#define NVME_CQ_FLAGS_IEN(cq_flags) ((cq_flags >> 1) & 0x1) + +typedef struct NvmeCreateSq { + uint8_t opcode; + uint8_t flags; + uint16_t cid; + uint32_t rsvd1[5]; + uint64_t prp1; + uint64_t rsvd8; + uint16_t sqid; + uint16_t qsize; + uint16_t sq_flags; + uint16_t cqid; + uint32_t rsvd12[4]; +} NvmeCreateSq; + +#define NVME_SQ_FLAGS_PC(sq_flags) (sq_flags & 0x1) +#define NVME_SQ_FLAGS_QPRIO(sq_flags) ((sq_flags >> 1) & 0x3) + +enum NvmeQueueFlags { + NVME_Q_PC = 1, + NVME_Q_PRIO_URGENT = 0, + NVME_Q_PRIO_HIGH = 1, + NVME_Q_PRIO_NORMAL = 2, + NVME_Q_PRIO_LOW = 3, +}; + +typedef struct NvmeIdentify { + uint8_t opcode; + uint8_t flags; + uint16_t cid; + uint32_t nsid; + uint64_t rsvd2[2]; + uint64_t prp1; + uint64_t prp2; + uint32_t cns; + uint32_t rsvd11[5]; +} 
NvmeIdentify; + +typedef struct NvmeRwCmd { + uint8_t opcode; + uint8_t flags; + uint16_t cid; + uint32_t nsid; + uint64_t rsvd2; + uint64_t mptr; + uint64_t prp1; + uint64_t prp2; + uint64_t slba; + uint16_t nlb; + uint16_t control; + uint32_t dsmgmt; + uint32_t reftag; + uint16_t apptag; + uint16_t appmask; +} NvmeRwCmd; + +enum { + NVME_RW_LR = 1 << 15, + NVME_RW_FUA = 1 << 14, + NVME_RW_DSM_FREQ_UNSPEC = 0, + NVME_RW_DSM_FREQ_TYPICAL = 1, + NVME_RW_DSM_FREQ_RARE = 2, + NVME_RW_DSM_FREQ_READS = 3, + NVME_RW_DSM_FREQ_WRITES = 4, + NVME_RW_DSM_FREQ_RW = 5, + NVME_RW_DSM_FREQ_ONCE = 6, + NVME_RW_DSM_FREQ_PREFETCH = 7, + NVME_RW_DSM_FREQ_TEMP = 8, + NVME_RW_DSM_LATENCY_NONE = 0 << 4, + NVME_RW_DSM_LATENCY_IDLE = 1 << 4, + NVME_RW_DSM_LATENCY_NORM = 2 << 4, + NVME_RW_DSM_LATENCY_LOW = 3 << 4, + NVME_RW_DSM_SEQ_REQ = 1 << 6, + NVME_RW_DSM_COMPRESSED = 1 << 7, + NVME_RW_PRINFO_PRACT = 1 << 13, + NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12, + NVME_RW_PRINFO_PRCHK_APP = 1 << 11, + NVME_RW_PRINFO_PRCHK_REF = 1 << 10, +}; + +typedef struct NvmeDsmCmd { + uint8_t opcode; + uint8_t flags; + uint16_t cid; + uint32_t nsid; + uint64_t rsvd2[2]; + uint64_t prp1; + uint64_t prp2; + uint32_t nr; + uint32_t attributes; + uint32_t rsvd12[4]; +} NvmeDsmCmd; + +enum { + NVME_DSMGMT_IDR = 1 << 0, + NVME_DSMGMT_IDW = 1 << 1, + NVME_DSMGMT_AD = 1 << 2, +}; + +typedef struct NvmeDsmRange { + uint32_t cattr; + uint32_t nlb; + uint64_t slba; +} NvmeDsmRange; + +enum NvmeAsyncEventRequest { + NVME_AER_TYPE_ERROR = 0, + NVME_AER_TYPE_SMART = 1, + NVME_AER_TYPE_IO_SPECIFIC = 6, + NVME_AER_TYPE_VENDOR_SPECIFIC = 7, + NVME_AER_INFO_ERR_INVALID_SQ = 0, + NVME_AER_INFO_ERR_INVALID_DB = 1, + NVME_AER_INFO_ERR_DIAG_FAIL = 2, + NVME_AER_INFO_ERR_PERS_INTERNAL_ERR = 3, + NVME_AER_INFO_ERR_TRANS_INTERNAL_ERR = 4, + NVME_AER_INFO_ERR_FW_IMG_LOAD_ERR = 5, + NVME_AER_INFO_SMART_RELIABILITY = 0, + NVME_AER_INFO_SMART_TEMP_THRESH = 1, + NVME_AER_INFO_SMART_SPARE_THRESH = 2, +}; + +typedef struct NvmeAerResult { + uint8_t event_type; + uint8_t event_info; + uint8_t log_page; + uint8_t resv; +} NvmeAerResult; + +typedef struct NvmeCqe { + uint32_t result; + uint32_t rsvd; + uint16_t sq_head; + uint16_t sq_id; + uint16_t cid; + uint16_t status; +} NvmeCqe; + +enum NvmeStatusCodes { + NVME_SUCCESS = 0x0000, + NVME_INVALID_OPCODE = 0x0001, + NVME_INVALID_FIELD = 0x0002, + NVME_CID_CONFLICT = 0x0003, + NVME_DATA_TRAS_ERROR = 0x0004, + NVME_POWER_LOSS_ABORT = 0x0005, + NVME_INTERNAL_DEV_ERROR = 0x0006, + NVME_CMD_ABORT_REQ = 0x0007, + NVME_CMD_ABORT_SQ_DEL = 0x0008, + NVME_CMD_ABORT_FAILED_FUSE = 0x0009, + NVME_CMD_ABORT_MISSING_FUSE = 0x000a, + NVME_INVALID_NSID = 0x000b, + NVME_CMD_SEQ_ERROR = 0x000c, + NVME_LBA_RANGE = 0x0080, + NVME_CAP_EXCEEDED = 0x0081, + NVME_NS_NOT_READY = 0x0082, + NVME_NS_RESV_CONFLICT = 0x0083, + NVME_INVALID_CQID = 0x0100, + NVME_INVALID_QID = 0x0101, + NVME_MAX_QSIZE_EXCEEDED = 0x0102, + NVME_ACL_EXCEEDED = 0x0103, + NVME_RESERVED = 0x0104, + NVME_AER_LIMIT_EXCEEDED = 0x0105, + NVME_INVALID_FW_SLOT = 0x0106, + NVME_INVALID_FW_IMAGE = 0x0107, + NVME_INVALID_IRQ_VECTOR = 0x0108, + NVME_INVALID_LOG_ID = 0x0109, + NVME_INVALID_FORMAT = 0x010a, + NVME_FW_REQ_RESET = 0x010b, + NVME_INVALID_QUEUE_DEL = 0x010c, + NVME_FID_NOT_SAVEABLE = 0x010d, + NVME_FID_NOT_NSID_SPEC = 0x010f, + NVME_FW_REQ_SUSYSTEM_RESET = 0x0110, + NVME_CONFLICTING_ATTRS = 0x0180, + NVME_INVALID_PROT_INFO = 0x0181, + NVME_WRITE_TO_RO = 0x0182, + NVME_WRITE_FAULT = 0x0280, + NVME_UNRECOVERED_READ = 0x0281, + NVME_E2E_GUARD_ERROR = 0x0282, + 
NVME_E2E_APP_ERROR = 0x0283, + NVME_E2E_REF_ERROR = 0x0284, + NVME_CMP_FAILURE = 0x0285, + NVME_ACCESS_DENIED = 0x0286, + NVME_MORE = 0x2000, + NVME_DNR = 0x4000, + NVME_NO_COMPLETE = 0xffff, +}; + +typedef struct NvmeFwSlotInfoLog { + uint8_t afi; + uint8_t reserved1[7]; + uint8_t frs1[8]; + uint8_t frs2[8]; + uint8_t frs3[8]; + uint8_t frs4[8]; + uint8_t frs5[8]; + uint8_t frs6[8]; + uint8_t frs7[8]; + uint8_t reserved2[448]; +} NvmeFwSlotInfoLog; + +typedef struct NvmeErrorLog { + uint64_t error_count; + uint16_t sqid; + uint16_t cid; + uint16_t status_field; + uint16_t param_error_location; + uint64_t lba; + uint32_t nsid; + uint8_t vs; + uint8_t resv[35]; +} NvmeErrorLog; + +typedef struct NvmeSmartLog { + uint8_t critical_warning; + uint8_t temperature[2]; + uint8_t available_spare; + uint8_t available_spare_threshold; + uint8_t percentage_used; + uint8_t reserved1[26]; + uint64_t data_units_read[2]; + uint64_t data_units_written[2]; + uint64_t host_read_commands[2]; + uint64_t host_write_commands[2]; + uint64_t controller_busy_time[2]; + uint64_t power_cycles[2]; + uint64_t power_on_hours[2]; + uint64_t unsafe_shutdowns[2]; + uint64_t media_errors[2]; + uint64_t number_of_error_log_entries[2]; + uint8_t reserved2[320]; +} NvmeSmartLog; + +enum NvmeSmartWarn { + NVME_SMART_SPARE = 1 << 0, + NVME_SMART_TEMPERATURE = 1 << 1, + NVME_SMART_RELIABILITY = 1 << 2, + NVME_SMART_MEDIA_READ_ONLY = 1 << 3, + NVME_SMART_FAILED_VOLATILE_MEDIA = 1 << 4, +}; + +enum LogIdentifier { + NVME_LOG_ERROR_INFO = 0x01, + NVME_LOG_SMART_INFO = 0x02, + NVME_LOG_FW_SLOT_INFO = 0x03, +}; + +typedef struct NvmePSD { + uint16_t mp; + uint16_t reserved; + uint32_t enlat; + uint32_t exlat; + uint8_t rrt; + uint8_t rrl; + uint8_t rwt; + uint8_t rwl; + uint8_t resv[16]; +} NvmePSD; + +typedef struct NvmeIdCtrl { + uint16_t vid; + uint16_t ssvid; + uint8_t sn[20]; + uint8_t mn[40]; + uint8_t fr[8]; + uint8_t rab; + uint8_t ieee[3]; + uint8_t cmic; + uint8_t mdts; + uint8_t rsvd255[178]; + uint16_t oacs; + uint8_t acl; + uint8_t aerl; + uint8_t frmw; + uint8_t lpa; + uint8_t elpe; + uint8_t npss; + uint8_t rsvd511[248]; + uint8_t sqes; + uint8_t cqes; + uint16_t rsvd515; + uint32_t nn; + uint16_t oncs; + uint16_t fuses; + uint8_t fna; + uint8_t vwc; + uint16_t awun; + uint16_t awupf; + uint8_t rsvd703[174]; + uint8_t rsvd2047[1344]; + NvmePSD psd[32]; + uint8_t vs[1024]; +} NvmeIdCtrl; + +enum NvmeIdCtrlOacs { + NVME_OACS_SECURITY = 1 << 0, + NVME_OACS_FORMAT = 1 << 1, + NVME_OACS_FW = 1 << 2, +}; + +enum NvmeIdCtrlOncs { + NVME_ONCS_COMPARE = 1 << 0, + NVME_ONCS_WRITE_UNCORR = 1 << 1, + NVME_ONCS_DSM = 1 << 2, + NVME_ONCS_WRITE_ZEROS = 1 << 3, + NVME_ONCS_FEATURES = 1 << 4, + NVME_ONCS_RESRVATIONS = 1 << 5, +}; + +#define NVME_CTRL_SQES_MIN(sqes) ((sqes) & 0xf) +#define NVME_CTRL_SQES_MAX(sqes) (((sqes) >> 4) & 0xf) +#define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf) +#define NVME_CTRL_CQES_MAX(cqes) (((cqes) >> 4) & 0xf) + +typedef struct NvmeFeatureVal { + uint32_t arbitration; + uint32_t power_mgmt; + uint32_t temp_thresh; + uint32_t err_rec; + uint32_t volatile_wc; + uint32_t num_queues; + uint32_t int_coalescing; + uint32_t *int_vector_config; + uint32_t write_atomicity; + uint32_t async_config; + uint32_t sw_prog_marker; +} NvmeFeatureVal; + +#define NVME_ARB_AB(arb) (arb & 0x7) +#define NVME_ARB_LPW(arb) ((arb >> 8) & 0xff) +#define NVME_ARB_MPW(arb) ((arb >> 16) & 0xff) +#define NVME_ARB_HPW(arb) ((arb >> 24) & 0xff) + +#define NVME_INTC_THR(intc) (intc & 0xff) +#define NVME_INTC_TIME(intc) ((intc >> 8) & 0xff) 
+ +enum NvmeFeatureIds { + NVME_ARBITRATION = 0x1, + NVME_POWER_MANAGEMENT = 0x2, + NVME_LBA_RANGE_TYPE = 0x3, + NVME_TEMPERATURE_THRESHOLD = 0x4, + NVME_ERROR_RECOVERY = 0x5, + NVME_VOLATILE_WRITE_CACHE = 0x6, + NVME_NUMBER_OF_QUEUES = 0x7, + NVME_INTERRUPT_COALESCING = 0x8, + NVME_INTERRUPT_VECTOR_CONF = 0x9, + NVME_WRITE_ATOMICITY = 0xa, + NVME_ASYNCHRONOUS_EVENT_CONF = 0xb, + NVME_SOFTWARE_PROGRESS_MARKER = 0x80 +}; + +typedef struct NvmeRangeType { + uint8_t type; + uint8_t attributes; + uint8_t rsvd2[14]; + uint64_t slba; + uint64_t nlb; + uint8_t guid[16]; + uint8_t rsvd48[16]; +} NvmeRangeType; + +typedef struct NvmeLBAF { + uint16_t ms; + uint8_t ds; + uint8_t rp; +} NvmeLBAF; + +typedef struct NvmeIdNs { + uint64_t nsze; + uint64_t ncap; + uint64_t nuse; + uint8_t nsfeat; + uint8_t nlbaf; + uint8_t flbas; + uint8_t mc; + uint8_t dpc; + uint8_t dps; + uint8_t res30[98]; + NvmeLBAF lbaf[16]; + uint8_t res192[192]; + uint8_t vs[3712]; +} NvmeIdNs; + +#define NVME_ID_NS_NSFEAT_THIN(nsfeat) ((nsfeat & 0x1)) +#define NVME_ID_NS_FLBAS_EXTENDED(flbas) ((flbas >> 4) & 0x1) +#define NVME_ID_NS_FLBAS_INDEX(flbas) ((flbas & 0xf)) +#define NVME_ID_NS_MC_SEPARATE(mc) ((mc >> 1) & 0x1) +#define NVME_ID_NS_MC_EXTENDED(mc) ((mc & 0x1)) +#define NVME_ID_NS_DPC_LAST_EIGHT(dpc) ((dpc >> 4) & 0x1) +#define NVME_ID_NS_DPC_FIRST_EIGHT(dpc) ((dpc >> 3) & 0x1) +#define NVME_ID_NS_DPC_TYPE_3(dpc) ((dpc >> 2) & 0x1) +#define NVME_ID_NS_DPC_TYPE_2(dpc) ((dpc >> 1) & 0x1) +#define NVME_ID_NS_DPC_TYPE_1(dpc) ((dpc & 0x1)) +#define NVME_ID_NS_DPC_TYPE_MASK 0x7 + +enum NvmeIdNsDps { + DPS_TYPE_NONE = 0, + DPS_TYPE_1 = 1, + DPS_TYPE_2 = 2, + DPS_TYPE_3 = 3, + DPS_TYPE_MASK = 0x7, + DPS_FIRST_EIGHT = 8, +}; + +static inline void _nvme_check_size(void) +{ + QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) != 4); + QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16); + QEMU_BUILD_BUG_ON(sizeof(NvmeDsmRange) != 16); + QEMU_BUILD_BUG_ON(sizeof(NvmeCmd) != 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeDeleteQ) != 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeCreateCq) != 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeCreateSq) != 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeIdentify) != 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeRwCmd) != 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeDsmCmd) != 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeRangeType) != 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64); + QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512); + QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512); + QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096); + QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096); +} +#endif diff --git a/hw/block/nvme.h b/hw/block/nvme.h index 6aab338..59a1504 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -1,703 +1,7 @@ #ifndef HW_NVME_H #define HW_NVME_H #include "qemu/cutils.h" - -typedef struct NvmeBar { - uint64_t cap; - uint32_t vs; - uint32_t intms; - uint32_t intmc; - uint32_t cc; - uint32_t rsvd1; - uint32_t csts; - uint32_t nssrc; - uint32_t aqa; - uint64_t asq; - uint64_t acq; - uint32_t cmbloc; - uint32_t cmbsz; -} NvmeBar; - -enum NvmeCapShift { - CAP_MQES_SHIFT = 0, - CAP_CQR_SHIFT = 16, - CAP_AMS_SHIFT = 17, - CAP_TO_SHIFT = 24, - CAP_DSTRD_SHIFT = 32, - CAP_NSSRS_SHIFT = 33, - CAP_CSS_SHIFT = 37, - CAP_MPSMIN_SHIFT = 48, - CAP_MPSMAX_SHIFT = 52, -}; - -enum NvmeCapMask { - CAP_MQES_MASK = 0xffff, - CAP_CQR_MASK = 0x1, - CAP_AMS_MASK = 0x3, - CAP_TO_MASK = 0xff, - CAP_DSTRD_MASK = 0xf, - CAP_NSSRS_MASK = 0x1, - CAP_CSS_MASK = 0xff, - CAP_MPSMIN_MASK = 0xf, - CAP_MPSMAX_MASK = 0xf, -}; - -#define NVME_CAP_MQES(cap) (((cap) >> CAP_MQES_SHIFT) & CAP_MQES_MASK) 
-#define NVME_CAP_CQR(cap) (((cap) >> CAP_CQR_SHIFT) & CAP_CQR_MASK) -#define NVME_CAP_AMS(cap) (((cap) >> CAP_AMS_SHIFT) & CAP_AMS_MASK) -#define NVME_CAP_TO(cap) (((cap) >> CAP_TO_SHIFT) & CAP_TO_MASK) -#define NVME_CAP_DSTRD(cap) (((cap) >> CAP_DSTRD_SHIFT) & CAP_DSTRD_MASK) -#define NVME_CAP_NSSRS(cap) (((cap) >> CAP_NSSRS_SHIFT) & CAP_NSSRS_MASK) -#define NVME_CAP_CSS(cap) (((cap) >> CAP_CSS_SHIFT) & CAP_CSS_MASK) -#define NVME_CAP_MPSMIN(cap)(((cap) >> CAP_MPSMIN_SHIFT) & CAP_MPSMIN_MASK) -#define NVME_CAP_MPSMAX(cap)(((cap) >> CAP_MPSMAX_SHIFT) & CAP_MPSMAX_MASK) - -#define NVME_CAP_SET_MQES(cap, val) (cap |= (uint64_t)(val & CAP_MQES_MASK) \ - << CAP_MQES_SHIFT) -#define NVME_CAP_SET_CQR(cap, val) (cap |= (uint64_t)(val & CAP_CQR_MASK) \ - << CAP_CQR_SHIFT) -#define NVME_CAP_SET_AMS(cap, val) (cap |= (uint64_t)(val & CAP_AMS_MASK) \ - << CAP_AMS_SHIFT) -#define NVME_CAP_SET_TO(cap, val) (cap |= (uint64_t)(val & CAP_TO_MASK) \ - << CAP_TO_SHIFT) -#define NVME_CAP_SET_DSTRD(cap, val) (cap |= (uint64_t)(val & CAP_DSTRD_MASK) \ - << CAP_DSTRD_SHIFT) -#define NVME_CAP_SET_NSSRS(cap, val) (cap |= (uint64_t)(val & CAP_NSSRS_MASK) \ - << CAP_NSSRS_SHIFT) -#define NVME_CAP_SET_CSS(cap, val) (cap |= (uint64_t)(val & CAP_CSS_MASK) \ - << CAP_CSS_SHIFT) -#define NVME_CAP_SET_MPSMIN(cap, val) (cap |= (uint64_t)(val & CAP_MPSMIN_MASK)\ - << CAP_MPSMIN_SHIFT) -#define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & CAP_MPSMAX_MASK)\ - << CAP_MPSMAX_SHIFT) - -enum NvmeCcShift { - CC_EN_SHIFT = 0, - CC_CSS_SHIFT = 4, - CC_MPS_SHIFT = 7, - CC_AMS_SHIFT = 11, - CC_SHN_SHIFT = 14, - CC_IOSQES_SHIFT = 16, - CC_IOCQES_SHIFT = 20, -}; - -enum NvmeCcMask { - CC_EN_MASK = 0x1, - CC_CSS_MASK = 0x7, - CC_MPS_MASK = 0xf, - CC_AMS_MASK = 0x7, - CC_SHN_MASK = 0x3, - CC_IOSQES_MASK = 0xf, - CC_IOCQES_MASK = 0xf, -}; - -#define NVME_CC_EN(cc) ((cc >> CC_EN_SHIFT) & CC_EN_MASK) -#define NVME_CC_CSS(cc) ((cc >> CC_CSS_SHIFT) & CC_CSS_MASK) -#define NVME_CC_MPS(cc) ((cc >> CC_MPS_SHIFT) & CC_MPS_MASK) -#define NVME_CC_AMS(cc) ((cc >> CC_AMS_SHIFT) & CC_AMS_MASK) -#define NVME_CC_SHN(cc) ((cc >> CC_SHN_SHIFT) & CC_SHN_MASK) -#define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK) -#define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK) - -enum NvmeCstsShift { - CSTS_RDY_SHIFT = 0, - CSTS_CFS_SHIFT = 1, - CSTS_SHST_SHIFT = 2, - CSTS_NSSRO_SHIFT = 4, -}; - -enum NvmeCstsMask { - CSTS_RDY_MASK = 0x1, - CSTS_CFS_MASK = 0x1, - CSTS_SHST_MASK = 0x3, - CSTS_NSSRO_MASK = 0x1, -}; - -enum NvmeCsts { - NVME_CSTS_READY = 1 << CSTS_RDY_SHIFT, - NVME_CSTS_FAILED = 1 << CSTS_CFS_SHIFT, - NVME_CSTS_SHST_NORMAL = 0 << CSTS_SHST_SHIFT, - NVME_CSTS_SHST_PROGRESS = 1 << CSTS_SHST_SHIFT, - NVME_CSTS_SHST_COMPLETE = 2 << CSTS_SHST_SHIFT, - NVME_CSTS_NSSRO = 1 << CSTS_NSSRO_SHIFT, -}; - -#define NVME_CSTS_RDY(csts) ((csts >> CSTS_RDY_SHIFT) & CSTS_RDY_MASK) -#define NVME_CSTS_CFS(csts) ((csts >> CSTS_CFS_SHIFT) & CSTS_CFS_MASK) -#define NVME_CSTS_SHST(csts) ((csts >> CSTS_SHST_SHIFT) & CSTS_SHST_MASK) -#define NVME_CSTS_NSSRO(csts) ((csts >> CSTS_NSSRO_SHIFT) & CSTS_NSSRO_MASK) - -enum NvmeAqaShift { - AQA_ASQS_SHIFT = 0, - AQA_ACQS_SHIFT = 16, -}; - -enum NvmeAqaMask { - AQA_ASQS_MASK = 0xfff, - AQA_ACQS_MASK = 0xfff, -}; - -#define NVME_AQA_ASQS(aqa) ((aqa >> AQA_ASQS_SHIFT) & AQA_ASQS_MASK) -#define NVME_AQA_ACQS(aqa) ((aqa >> AQA_ACQS_SHIFT) & AQA_ACQS_MASK) - -enum NvmeCmblocShift { - CMBLOC_BIR_SHIFT = 0, - CMBLOC_OFST_SHIFT = 12, -}; - -enum NvmeCmblocMask { - CMBLOC_BIR_MASK = 0x7, - 
CMBLOC_OFST_MASK = 0xfffff, -}; - -#define NVME_CMBLOC_BIR(cmbloc) ((cmbloc >> CMBLOC_BIR_SHIFT) & \ - CMBLOC_BIR_MASK) -#define NVME_CMBLOC_OFST(cmbloc)((cmbloc >> CMBLOC_OFST_SHIFT) & \ - CMBLOC_OFST_MASK) - -#define NVME_CMBLOC_SET_BIR(cmbloc, val) \ - (cmbloc |= (uint64_t)(val & CMBLOC_BIR_MASK) << CMBLOC_BIR_SHIFT) -#define NVME_CMBLOC_SET_OFST(cmbloc, val) \ - (cmbloc |= (uint64_t)(val & CMBLOC_OFST_MASK) << CMBLOC_OFST_SHIFT) - -enum NvmeCmbszShift { - CMBSZ_SQS_SHIFT = 0, - CMBSZ_CQS_SHIFT = 1, - CMBSZ_LISTS_SHIFT = 2, - CMBSZ_RDS_SHIFT = 3, - CMBSZ_WDS_SHIFT = 4, - CMBSZ_SZU_SHIFT = 8, - CMBSZ_SZ_SHIFT = 12, -}; - -enum NvmeCmbszMask { - CMBSZ_SQS_MASK = 0x1, - CMBSZ_CQS_MASK = 0x1, - CMBSZ_LISTS_MASK = 0x1, - CMBSZ_RDS_MASK = 0x1, - CMBSZ_WDS_MASK = 0x1, - CMBSZ_SZU_MASK = 0xf, - CMBSZ_SZ_MASK = 0xfffff, -}; - -#define NVME_CMBSZ_SQS(cmbsz) ((cmbsz >> CMBSZ_SQS_SHIFT) & CMBSZ_SQS_MASK) -#define NVME_CMBSZ_CQS(cmbsz) ((cmbsz >> CMBSZ_CQS_SHIFT) & CMBSZ_CQS_MASK) -#define NVME_CMBSZ_LISTS(cmbsz)((cmbsz >> CMBSZ_LISTS_SHIFT) & CMBSZ_LISTS_MASK) -#define NVME_CMBSZ_RDS(cmbsz) ((cmbsz >> CMBSZ_RDS_SHIFT) & CMBSZ_RDS_MASK) -#define NVME_CMBSZ_WDS(cmbsz) ((cmbsz >> CMBSZ_WDS_SHIFT) & CMBSZ_WDS_MASK) -#define NVME_CMBSZ_SZU(cmbsz) ((cmbsz >> CMBSZ_SZU_SHIFT) & CMBSZ_SZU_MASK) -#define NVME_CMBSZ_SZ(cmbsz) ((cmbsz >> CMBSZ_SZ_SHIFT) & CMBSZ_SZ_MASK) - -#define NVME_CMBSZ_SET_SQS(cmbsz, val) \ - (cmbsz |= (uint64_t)(val & CMBSZ_SQS_MASK) << CMBSZ_SQS_SHIFT) -#define NVME_CMBSZ_SET_CQS(cmbsz, val) \ - (cmbsz |= (uint64_t)(val & CMBSZ_CQS_MASK) << CMBSZ_CQS_SHIFT) -#define NVME_CMBSZ_SET_LISTS(cmbsz, val) \ - (cmbsz |= (uint64_t)(val & CMBSZ_LISTS_MASK) << CMBSZ_LISTS_SHIFT) -#define NVME_CMBSZ_SET_RDS(cmbsz, val) \ - (cmbsz |= (uint64_t)(val & CMBSZ_RDS_MASK) << CMBSZ_RDS_SHIFT) -#define NVME_CMBSZ_SET_WDS(cmbsz, val) \ - (cmbsz |= (uint64_t)(val & CMBSZ_WDS_MASK) << CMBSZ_WDS_SHIFT) -#define NVME_CMBSZ_SET_SZU(cmbsz, val) \ - (cmbsz |= (uint64_t)(val & CMBSZ_SZU_MASK) << CMBSZ_SZU_SHIFT) -#define NVME_CMBSZ_SET_SZ(cmbsz, val) \ - (cmbsz |= (uint64_t)(val & CMBSZ_SZ_MASK) << CMBSZ_SZ_SHIFT) - -#define NVME_CMBSZ_GETSIZE(cmbsz) \ - (NVME_CMBSZ_SZ(cmbsz) * (1 << (12 + 4 * NVME_CMBSZ_SZU(cmbsz)))) - -typedef struct NvmeCmd { - uint8_t opcode; - uint8_t fuse; - uint16_t cid; - uint32_t nsid; - uint64_t res1; - uint64_t mptr; - uint64_t prp1; - uint64_t prp2; - uint32_t cdw10; - uint32_t cdw11; - uint32_t cdw12; - uint32_t cdw13; - uint32_t cdw14; - uint32_t cdw15; -} NvmeCmd; - -enum NvmeAdminCommands { - NVME_ADM_CMD_DELETE_SQ = 0x00, - NVME_ADM_CMD_CREATE_SQ = 0x01, - NVME_ADM_CMD_GET_LOG_PAGE = 0x02, - NVME_ADM_CMD_DELETE_CQ = 0x04, - NVME_ADM_CMD_CREATE_CQ = 0x05, - NVME_ADM_CMD_IDENTIFY = 0x06, - NVME_ADM_CMD_ABORT = 0x08, - NVME_ADM_CMD_SET_FEATURES = 0x09, - NVME_ADM_CMD_GET_FEATURES = 0x0a, - NVME_ADM_CMD_ASYNC_EV_REQ = 0x0c, - NVME_ADM_CMD_ACTIVATE_FW = 0x10, - NVME_ADM_CMD_DOWNLOAD_FW = 0x11, - NVME_ADM_CMD_FORMAT_NVM = 0x80, - NVME_ADM_CMD_SECURITY_SEND = 0x81, - NVME_ADM_CMD_SECURITY_RECV = 0x82, -}; - -enum NvmeIoCommands { - NVME_CMD_FLUSH = 0x00, - NVME_CMD_WRITE = 0x01, - NVME_CMD_READ = 0x02, - NVME_CMD_WRITE_UNCOR = 0x04, - NVME_CMD_COMPARE = 0x05, - NVME_CMD_WRITE_ZEROS = 0x08, - NVME_CMD_DSM = 0x09, -}; - -typedef struct NvmeDeleteQ { - uint8_t opcode; - uint8_t flags; - uint16_t cid; - uint32_t rsvd1[9]; - uint16_t qid; - uint16_t rsvd10; - uint32_t rsvd11[5]; -} NvmeDeleteQ; - -typedef struct NvmeCreateCq { - uint8_t opcode; - uint8_t flags; - uint16_t cid; - uint32_t 
rsvd1[5]; - uint64_t prp1; - uint64_t rsvd8; - uint16_t cqid; - uint16_t qsize; - uint16_t cq_flags; - uint16_t irq_vector; - uint32_t rsvd12[4]; -} NvmeCreateCq; - -#define NVME_CQ_FLAGS_PC(cq_flags) (cq_flags & 0x1) -#define NVME_CQ_FLAGS_IEN(cq_flags) ((cq_flags >> 1) & 0x1) - -typedef struct NvmeCreateSq { - uint8_t opcode; - uint8_t flags; - uint16_t cid; - uint32_t rsvd1[5]; - uint64_t prp1; - uint64_t rsvd8; - uint16_t sqid; - uint16_t qsize; - uint16_t sq_flags; - uint16_t cqid; - uint32_t rsvd12[4]; -} NvmeCreateSq; - -#define NVME_SQ_FLAGS_PC(sq_flags) (sq_flags & 0x1) -#define NVME_SQ_FLAGS_QPRIO(sq_flags) ((sq_flags >> 1) & 0x3) - -enum NvmeQueueFlags { - NVME_Q_PC = 1, - NVME_Q_PRIO_URGENT = 0, - NVME_Q_PRIO_HIGH = 1, - NVME_Q_PRIO_NORMAL = 2, - NVME_Q_PRIO_LOW = 3, -}; - -typedef struct NvmeIdentify { - uint8_t opcode; - uint8_t flags; - uint16_t cid; - uint32_t nsid; - uint64_t rsvd2[2]; - uint64_t prp1; - uint64_t prp2; - uint32_t cns; - uint32_t rsvd11[5]; -} NvmeIdentify; - -typedef struct NvmeRwCmd { - uint8_t opcode; - uint8_t flags; - uint16_t cid; - uint32_t nsid; - uint64_t rsvd2; - uint64_t mptr; - uint64_t prp1; - uint64_t prp2; - uint64_t slba; - uint16_t nlb; - uint16_t control; - uint32_t dsmgmt; - uint32_t reftag; - uint16_t apptag; - uint16_t appmask; -} NvmeRwCmd; - -enum { - NVME_RW_LR = 1 << 15, - NVME_RW_FUA = 1 << 14, - NVME_RW_DSM_FREQ_UNSPEC = 0, - NVME_RW_DSM_FREQ_TYPICAL = 1, - NVME_RW_DSM_FREQ_RARE = 2, - NVME_RW_DSM_FREQ_READS = 3, - NVME_RW_DSM_FREQ_WRITES = 4, - NVME_RW_DSM_FREQ_RW = 5, - NVME_RW_DSM_FREQ_ONCE = 6, - NVME_RW_DSM_FREQ_PREFETCH = 7, - NVME_RW_DSM_FREQ_TEMP = 8, - NVME_RW_DSM_LATENCY_NONE = 0 << 4, - NVME_RW_DSM_LATENCY_IDLE = 1 << 4, - NVME_RW_DSM_LATENCY_NORM = 2 << 4, - NVME_RW_DSM_LATENCY_LOW = 3 << 4, - NVME_RW_DSM_SEQ_REQ = 1 << 6, - NVME_RW_DSM_COMPRESSED = 1 << 7, - NVME_RW_PRINFO_PRACT = 1 << 13, - NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12, - NVME_RW_PRINFO_PRCHK_APP = 1 << 11, - NVME_RW_PRINFO_PRCHK_REF = 1 << 10, -}; - -typedef struct NvmeDsmCmd { - uint8_t opcode; - uint8_t flags; - uint16_t cid; - uint32_t nsid; - uint64_t rsvd2[2]; - uint64_t prp1; - uint64_t prp2; - uint32_t nr; - uint32_t attributes; - uint32_t rsvd12[4]; -} NvmeDsmCmd; - -enum { - NVME_DSMGMT_IDR = 1 << 0, - NVME_DSMGMT_IDW = 1 << 1, - NVME_DSMGMT_AD = 1 << 2, -}; - -typedef struct NvmeDsmRange { - uint32_t cattr; - uint32_t nlb; - uint64_t slba; -} NvmeDsmRange; - -enum NvmeAsyncEventRequest { - NVME_AER_TYPE_ERROR = 0, - NVME_AER_TYPE_SMART = 1, - NVME_AER_TYPE_IO_SPECIFIC = 6, - NVME_AER_TYPE_VENDOR_SPECIFIC = 7, - NVME_AER_INFO_ERR_INVALID_SQ = 0, - NVME_AER_INFO_ERR_INVALID_DB = 1, - NVME_AER_INFO_ERR_DIAG_FAIL = 2, - NVME_AER_INFO_ERR_PERS_INTERNAL_ERR = 3, - NVME_AER_INFO_ERR_TRANS_INTERNAL_ERR = 4, - NVME_AER_INFO_ERR_FW_IMG_LOAD_ERR = 5, - NVME_AER_INFO_SMART_RELIABILITY = 0, - NVME_AER_INFO_SMART_TEMP_THRESH = 1, - NVME_AER_INFO_SMART_SPARE_THRESH = 2, -}; - -typedef struct NvmeAerResult { - uint8_t event_type; - uint8_t event_info; - uint8_t log_page; - uint8_t resv; -} NvmeAerResult; - -typedef struct NvmeCqe { - uint32_t result; - uint32_t rsvd; - uint16_t sq_head; - uint16_t sq_id; - uint16_t cid; - uint16_t status; -} NvmeCqe; - -enum NvmeStatusCodes { - NVME_SUCCESS = 0x0000, - NVME_INVALID_OPCODE = 0x0001, - NVME_INVALID_FIELD = 0x0002, - NVME_CID_CONFLICT = 0x0003, - NVME_DATA_TRAS_ERROR = 0x0004, - NVME_POWER_LOSS_ABORT = 0x0005, - NVME_INTERNAL_DEV_ERROR = 0x0006, - NVME_CMD_ABORT_REQ = 0x0007, - NVME_CMD_ABORT_SQ_DEL = 0x0008, - 
NVME_CMD_ABORT_FAILED_FUSE = 0x0009, - NVME_CMD_ABORT_MISSING_FUSE = 0x000a, - NVME_INVALID_NSID = 0x000b, - NVME_CMD_SEQ_ERROR = 0x000c, - NVME_LBA_RANGE = 0x0080, - NVME_CAP_EXCEEDED = 0x0081, - NVME_NS_NOT_READY = 0x0082, - NVME_NS_RESV_CONFLICT = 0x0083, - NVME_INVALID_CQID = 0x0100, - NVME_INVALID_QID = 0x0101, - NVME_MAX_QSIZE_EXCEEDED = 0x0102, - NVME_ACL_EXCEEDED = 0x0103, - NVME_RESERVED = 0x0104, - NVME_AER_LIMIT_EXCEEDED = 0x0105, - NVME_INVALID_FW_SLOT = 0x0106, - NVME_INVALID_FW_IMAGE = 0x0107, - NVME_INVALID_IRQ_VECTOR = 0x0108, - NVME_INVALID_LOG_ID = 0x0109, - NVME_INVALID_FORMAT = 0x010a, - NVME_FW_REQ_RESET = 0x010b, - NVME_INVALID_QUEUE_DEL = 0x010c, - NVME_FID_NOT_SAVEABLE = 0x010d, - NVME_FID_NOT_NSID_SPEC = 0x010f, - NVME_FW_REQ_SUSYSTEM_RESET = 0x0110, - NVME_CONFLICTING_ATTRS = 0x0180, - NVME_INVALID_PROT_INFO = 0x0181, - NVME_WRITE_TO_RO = 0x0182, - NVME_WRITE_FAULT = 0x0280, - NVME_UNRECOVERED_READ = 0x0281, - NVME_E2E_GUARD_ERROR = 0x0282, - NVME_E2E_APP_ERROR = 0x0283, - NVME_E2E_REF_ERROR = 0x0284, - NVME_CMP_FAILURE = 0x0285, - NVME_ACCESS_DENIED = 0x0286, - NVME_MORE = 0x2000, - NVME_DNR = 0x4000, - NVME_NO_COMPLETE = 0xffff, -}; - -typedef struct NvmeFwSlotInfoLog { - uint8_t afi; - uint8_t reserved1[7]; - uint8_t frs1[8]; - uint8_t frs2[8]; - uint8_t frs3[8]; - uint8_t frs4[8]; - uint8_t frs5[8]; - uint8_t frs6[8]; - uint8_t frs7[8]; - uint8_t reserved2[448]; -} NvmeFwSlotInfoLog; - -typedef struct NvmeErrorLog { - uint64_t error_count; - uint16_t sqid; - uint16_t cid; - uint16_t status_field; - uint16_t param_error_location; - uint64_t lba; - uint32_t nsid; - uint8_t vs; - uint8_t resv[35]; -} NvmeErrorLog; - -typedef struct NvmeSmartLog { - uint8_t critical_warning; - uint8_t temperature[2]; - uint8_t available_spare; - uint8_t available_spare_threshold; - uint8_t percentage_used; - uint8_t reserved1[26]; - uint64_t data_units_read[2]; - uint64_t data_units_written[2]; - uint64_t host_read_commands[2]; - uint64_t host_write_commands[2]; - uint64_t controller_busy_time[2]; - uint64_t power_cycles[2]; - uint64_t power_on_hours[2]; - uint64_t unsafe_shutdowns[2]; - uint64_t media_errors[2]; - uint64_t number_of_error_log_entries[2]; - uint8_t reserved2[320]; -} NvmeSmartLog; - -enum NvmeSmartWarn { - NVME_SMART_SPARE = 1 << 0, - NVME_SMART_TEMPERATURE = 1 << 1, - NVME_SMART_RELIABILITY = 1 << 2, - NVME_SMART_MEDIA_READ_ONLY = 1 << 3, - NVME_SMART_FAILED_VOLATILE_MEDIA = 1 << 4, -}; - -enum LogIdentifier { - NVME_LOG_ERROR_INFO = 0x01, - NVME_LOG_SMART_INFO = 0x02, - NVME_LOG_FW_SLOT_INFO = 0x03, -}; - -typedef struct NvmePSD { - uint16_t mp; - uint16_t reserved; - uint32_t enlat; - uint32_t exlat; - uint8_t rrt; - uint8_t rrl; - uint8_t rwt; - uint8_t rwl; - uint8_t resv[16]; -} NvmePSD; - -typedef struct NvmeIdCtrl { - uint16_t vid; - uint16_t ssvid; - uint8_t sn[20]; - uint8_t mn[40]; - uint8_t fr[8]; - uint8_t rab; - uint8_t ieee[3]; - uint8_t cmic; - uint8_t mdts; - uint8_t rsvd255[178]; - uint16_t oacs; - uint8_t acl; - uint8_t aerl; - uint8_t frmw; - uint8_t lpa; - uint8_t elpe; - uint8_t npss; - uint8_t rsvd511[248]; - uint8_t sqes; - uint8_t cqes; - uint16_t rsvd515; - uint32_t nn; - uint16_t oncs; - uint16_t fuses; - uint8_t fna; - uint8_t vwc; - uint16_t awun; - uint16_t awupf; - uint8_t rsvd703[174]; - uint8_t rsvd2047[1344]; - NvmePSD psd[32]; - uint8_t vs[1024]; -} NvmeIdCtrl; - -enum NvmeIdCtrlOacs { - NVME_OACS_SECURITY = 1 << 0, - NVME_OACS_FORMAT = 1 << 1, - NVME_OACS_FW = 1 << 2, -}; - -enum NvmeIdCtrlOncs { - NVME_ONCS_COMPARE = 1 << 0, - 
NVME_ONCS_WRITE_UNCORR = 1 << 1, - NVME_ONCS_DSM = 1 << 2, - NVME_ONCS_WRITE_ZEROS = 1 << 3, - NVME_ONCS_FEATURES = 1 << 4, - NVME_ONCS_RESRVATIONS = 1 << 5, -}; - -#define NVME_CTRL_SQES_MIN(sqes) ((sqes) & 0xf) -#define NVME_CTRL_SQES_MAX(sqes) (((sqes) >> 4) & 0xf) -#define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf) -#define NVME_CTRL_CQES_MAX(cqes) (((cqes) >> 4) & 0xf) - -typedef struct NvmeFeatureVal { - uint32_t arbitration; - uint32_t power_mgmt; - uint32_t temp_thresh; - uint32_t err_rec; - uint32_t volatile_wc; - uint32_t num_queues; - uint32_t int_coalescing; - uint32_t *int_vector_config; - uint32_t write_atomicity; - uint32_t async_config; - uint32_t sw_prog_marker; -} NvmeFeatureVal; - -#define NVME_ARB_AB(arb) (arb & 0x7) -#define NVME_ARB_LPW(arb) ((arb >> 8) & 0xff) -#define NVME_ARB_MPW(arb) ((arb >> 16) & 0xff) -#define NVME_ARB_HPW(arb) ((arb >> 24) & 0xff) - -#define NVME_INTC_THR(intc) (intc & 0xff) -#define NVME_INTC_TIME(intc) ((intc >> 8) & 0xff) - -enum NvmeFeatureIds { - NVME_ARBITRATION = 0x1, - NVME_POWER_MANAGEMENT = 0x2, - NVME_LBA_RANGE_TYPE = 0x3, - NVME_TEMPERATURE_THRESHOLD = 0x4, - NVME_ERROR_RECOVERY = 0x5, - NVME_VOLATILE_WRITE_CACHE = 0x6, - NVME_NUMBER_OF_QUEUES = 0x7, - NVME_INTERRUPT_COALESCING = 0x8, - NVME_INTERRUPT_VECTOR_CONF = 0x9, - NVME_WRITE_ATOMICITY = 0xa, - NVME_ASYNCHRONOUS_EVENT_CONF = 0xb, - NVME_SOFTWARE_PROGRESS_MARKER = 0x80 -}; - -typedef struct NvmeRangeType { - uint8_t type; - uint8_t attributes; - uint8_t rsvd2[14]; - uint64_t slba; - uint64_t nlb; - uint8_t guid[16]; - uint8_t rsvd48[16]; -} NvmeRangeType; - -typedef struct NvmeLBAF { - uint16_t ms; - uint8_t ds; - uint8_t rp; -} NvmeLBAF; - -typedef struct NvmeIdNs { - uint64_t nsze; - uint64_t ncap; - uint64_t nuse; - uint8_t nsfeat; - uint8_t nlbaf; - uint8_t flbas; - uint8_t mc; - uint8_t dpc; - uint8_t dps; - uint8_t res30[98]; - NvmeLBAF lbaf[16]; - uint8_t res192[192]; - uint8_t vs[3712]; -} NvmeIdNs; - -#define NVME_ID_NS_NSFEAT_THIN(nsfeat) ((nsfeat & 0x1)) -#define NVME_ID_NS_FLBAS_EXTENDED(flbas) ((flbas >> 4) & 0x1) -#define NVME_ID_NS_FLBAS_INDEX(flbas) ((flbas & 0xf)) -#define NVME_ID_NS_MC_SEPARATE(mc) ((mc >> 1) & 0x1) -#define NVME_ID_NS_MC_EXTENDED(mc) ((mc & 0x1)) -#define NVME_ID_NS_DPC_LAST_EIGHT(dpc) ((dpc >> 4) & 0x1) -#define NVME_ID_NS_DPC_FIRST_EIGHT(dpc) ((dpc >> 3) & 0x1) -#define NVME_ID_NS_DPC_TYPE_3(dpc) ((dpc >> 2) & 0x1) -#define NVME_ID_NS_DPC_TYPE_2(dpc) ((dpc >> 1) & 0x1) -#define NVME_ID_NS_DPC_TYPE_1(dpc) ((dpc & 0x1)) -#define NVME_ID_NS_DPC_TYPE_MASK 0x7 - -enum NvmeIdNsDps { - DPS_TYPE_NONE = 0, - DPS_TYPE_1 = 1, - DPS_TYPE_2 = 2, - DPS_TYPE_3 = 3, - DPS_TYPE_MASK = 0x7, - DPS_FIRST_EIGHT = 8, -}; - -static inline void _nvme_check_size(void) -{ - QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) != 4); - QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16); - QEMU_BUILD_BUG_ON(sizeof(NvmeDsmRange) != 16); - QEMU_BUILD_BUG_ON(sizeof(NvmeCmd) != 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeDeleteQ) != 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeCreateCq) != 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeCreateSq) != 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeIdentify) != 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeRwCmd) != 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeDsmCmd) != 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeRangeType) != 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64); - QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512); - QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512); - QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096); - QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096); -} +#include "block/nvme.h" typedef 
struct NvmeAsyncEvent { QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry; -- 2.9.4 ^ permalink raw reply related [flat|nested] 33+ messages in thread
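The moved header models every register field as a shift/mask pair plus an accessor macro. A standalone example of decoding the CAP register with three of those macros (the shifts, masks, and macro bodies are copied from the patch; the sample register value is fabricated, and the unit interpretations come from the NVMe spec rather than from this thread):

    #include <stdint.h>
    #include <stdio.h>

    enum { CAP_MQES_SHIFT = 0, CAP_TO_SHIFT = 24, CAP_DSTRD_SHIFT = 32 };
    enum { CAP_MQES_MASK = 0xffff, CAP_TO_MASK = 0xff, CAP_DSTRD_MASK = 0xf };

    #define NVME_CAP_MQES(cap) (((cap) >> CAP_MQES_SHIFT) & CAP_MQES_MASK)
    #define NVME_CAP_TO(cap)   (((cap) >> CAP_TO_SHIFT) & CAP_TO_MASK)
    #define NVME_CAP_DSTRD(cap)(((cap) >> CAP_DSTRD_SHIFT) & CAP_DSTRD_MASK)

    int main(void)
    {
        uint64_t cap = 0x0000002028010fffULL;   /* fabricated sample value */

        /* MQES is 0's based, TO counts 500ms units, stride is 2^(2+DSTRD). */
        printf("max queue entries: %u\n", (unsigned)NVME_CAP_MQES(cap) + 1);
        printf("timeout: %u x 500ms\n", (unsigned)NVME_CAP_TO(cap));
        printf("doorbell stride: %u bytes\n", 4U << NVME_CAP_DSTRD(cap));
        return 0;
    }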
* Re: [Qemu-devel] [PATCH v3 6/6] block: Move NVMe spec definitions to a separate header 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 6/6] block: Move NVMe spec definitions to a separate header Fam Zheng @ 2017-07-05 13:39 ` Paolo Bonzini 2017-07-10 15:01 ` Stefan Hajnoczi 1 sibling, 0 replies; 33+ messages in thread From: Paolo Bonzini @ 2017-07-05 13:39 UTC (permalink / raw) To: Fam Zheng, qemu-devel Cc: Keith Busch, qemu-block, Kevin Wolf, Max Reitz, Stefan Hajnoczi, Karl Rister On 05/07/2017 15:36, Fam Zheng wrote: > Signed-off-by: Fam Zheng <famz@redhat.com> > --- > block/nvme.c | 7 +- > block/nvme.h | 700 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ This should be include/block/nvme.h. Can be fixed by maintainer, I suppose. Paolo > hw/block/nvme.h | 698 +------------------------------------------------------ > 3 files changed, 702 insertions(+), 703 deletions(-) > create mode 100644 block/nvme.h > > diff --git a/block/nvme.c b/block/nvme.c > index 7913017..2680b29 100644 > --- a/block/nvme.c > +++ b/block/nvme.c > @@ -22,12 +22,7 @@ > #include "block/nvme-vfio.h" > #include "trace.h" > > -/* TODO: Move nvme spec definitions from hw/block/nvme.h into a separate file > - * that doesn't depend on dma/pci headers. */ > -#include "sysemu/dma.h" > -#include "hw/pci/pci.h" > -#include "hw/block/block.h" > -#include "hw/block/nvme.h" > +#include "block/nvme.h" > > #define NVME_SQ_ENTRY_BYTES 64 > #define NVME_CQ_ENTRY_BYTES 16 > diff --git a/block/nvme.h b/block/nvme.h > new file mode 100644 > index 0000000..ed18091 > --- /dev/null > +++ b/block/nvme.h > @@ -0,0 +1,700 @@ > +#ifndef BLOCK_NVME_H > +#define BLOCK_NVME_H > + > +typedef struct NvmeBar { > + uint64_t cap; > + uint32_t vs; > + uint32_t intms; > + uint32_t intmc; > + uint32_t cc; > + uint32_t rsvd1; > + uint32_t csts; > + uint32_t nssrc; > + uint32_t aqa; > + uint64_t asq; > + uint64_t acq; > + uint32_t cmbloc; > + uint32_t cmbsz; > +} NvmeBar; > + > +enum NvmeCapShift { > + CAP_MQES_SHIFT = 0, > + CAP_CQR_SHIFT = 16, > + CAP_AMS_SHIFT = 17, > + CAP_TO_SHIFT = 24, > + CAP_DSTRD_SHIFT = 32, > + CAP_NSSRS_SHIFT = 33, > + CAP_CSS_SHIFT = 37, > + CAP_MPSMIN_SHIFT = 48, > + CAP_MPSMAX_SHIFT = 52, > +}; > + > +enum NvmeCapMask { > + CAP_MQES_MASK = 0xffff, > + CAP_CQR_MASK = 0x1, > + CAP_AMS_MASK = 0x3, > + CAP_TO_MASK = 0xff, > + CAP_DSTRD_MASK = 0xf, > + CAP_NSSRS_MASK = 0x1, > + CAP_CSS_MASK = 0xff, > + CAP_MPSMIN_MASK = 0xf, > + CAP_MPSMAX_MASK = 0xf, > +}; > + > +#define NVME_CAP_MQES(cap) (((cap) >> CAP_MQES_SHIFT) & CAP_MQES_MASK) > +#define NVME_CAP_CQR(cap) (((cap) >> CAP_CQR_SHIFT) & CAP_CQR_MASK) > +#define NVME_CAP_AMS(cap) (((cap) >> CAP_AMS_SHIFT) & CAP_AMS_MASK) > +#define NVME_CAP_TO(cap) (((cap) >> CAP_TO_SHIFT) & CAP_TO_MASK) > +#define NVME_CAP_DSTRD(cap) (((cap) >> CAP_DSTRD_SHIFT) & CAP_DSTRD_MASK) > +#define NVME_CAP_NSSRS(cap) (((cap) >> CAP_NSSRS_SHIFT) & CAP_NSSRS_MASK) > +#define NVME_CAP_CSS(cap) (((cap) >> CAP_CSS_SHIFT) & CAP_CSS_MASK) > +#define NVME_CAP_MPSMIN(cap)(((cap) >> CAP_MPSMIN_SHIFT) & CAP_MPSMIN_MASK) > +#define NVME_CAP_MPSMAX(cap)(((cap) >> CAP_MPSMAX_SHIFT) & CAP_MPSMAX_MASK) > + > +#define NVME_CAP_SET_MQES(cap, val) (cap |= (uint64_t)(val & CAP_MQES_MASK) \ > + << CAP_MQES_SHIFT) > +#define NVME_CAP_SET_CQR(cap, val) (cap |= (uint64_t)(val & CAP_CQR_MASK) \ > + << CAP_CQR_SHIFT) > +#define NVME_CAP_SET_AMS(cap, val) (cap |= (uint64_t)(val & CAP_AMS_MASK) \ > + << CAP_AMS_SHIFT) > +#define NVME_CAP_SET_TO(cap, val) (cap |= (uint64_t)(val & CAP_TO_MASK) \ > + << CAP_TO_SHIFT) > 
+#define NVME_CAP_SET_DSTRD(cap, val) (cap |= (uint64_t)(val & CAP_DSTRD_MASK) \ > + << CAP_DSTRD_SHIFT) > +#define NVME_CAP_SET_NSSRS(cap, val) (cap |= (uint64_t)(val & CAP_NSSRS_MASK) \ > + << CAP_NSSRS_SHIFT) > +#define NVME_CAP_SET_CSS(cap, val) (cap |= (uint64_t)(val & CAP_CSS_MASK) \ > + << CAP_CSS_SHIFT) > +#define NVME_CAP_SET_MPSMIN(cap, val) (cap |= (uint64_t)(val & CAP_MPSMIN_MASK)\ > + << CAP_MPSMIN_SHIFT) > +#define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & CAP_MPSMAX_MASK)\ > + << CAP_MPSMAX_SHIFT) > + > +enum NvmeCcShift { > + CC_EN_SHIFT = 0, > + CC_CSS_SHIFT = 4, > + CC_MPS_SHIFT = 7, > + CC_AMS_SHIFT = 11, > + CC_SHN_SHIFT = 14, > + CC_IOSQES_SHIFT = 16, > + CC_IOCQES_SHIFT = 20, > +}; > + > +enum NvmeCcMask { > + CC_EN_MASK = 0x1, > + CC_CSS_MASK = 0x7, > + CC_MPS_MASK = 0xf, > + CC_AMS_MASK = 0x7, > + CC_SHN_MASK = 0x3, > + CC_IOSQES_MASK = 0xf, > + CC_IOCQES_MASK = 0xf, > +}; > + > +#define NVME_CC_EN(cc) ((cc >> CC_EN_SHIFT) & CC_EN_MASK) > +#define NVME_CC_CSS(cc) ((cc >> CC_CSS_SHIFT) & CC_CSS_MASK) > +#define NVME_CC_MPS(cc) ((cc >> CC_MPS_SHIFT) & CC_MPS_MASK) > +#define NVME_CC_AMS(cc) ((cc >> CC_AMS_SHIFT) & CC_AMS_MASK) > +#define NVME_CC_SHN(cc) ((cc >> CC_SHN_SHIFT) & CC_SHN_MASK) > +#define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK) > +#define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK) > + > +enum NvmeCstsShift { > + CSTS_RDY_SHIFT = 0, > + CSTS_CFS_SHIFT = 1, > + CSTS_SHST_SHIFT = 2, > + CSTS_NSSRO_SHIFT = 4, > +}; > + > +enum NvmeCstsMask { > + CSTS_RDY_MASK = 0x1, > + CSTS_CFS_MASK = 0x1, > + CSTS_SHST_MASK = 0x3, > + CSTS_NSSRO_MASK = 0x1, > +}; > + > +enum NvmeCsts { > + NVME_CSTS_READY = 1 << CSTS_RDY_SHIFT, > + NVME_CSTS_FAILED = 1 << CSTS_CFS_SHIFT, > + NVME_CSTS_SHST_NORMAL = 0 << CSTS_SHST_SHIFT, > + NVME_CSTS_SHST_PROGRESS = 1 << CSTS_SHST_SHIFT, > + NVME_CSTS_SHST_COMPLETE = 2 << CSTS_SHST_SHIFT, > + NVME_CSTS_NSSRO = 1 << CSTS_NSSRO_SHIFT, > +}; > + > +#define NVME_CSTS_RDY(csts) ((csts >> CSTS_RDY_SHIFT) & CSTS_RDY_MASK) > +#define NVME_CSTS_CFS(csts) ((csts >> CSTS_CFS_SHIFT) & CSTS_CFS_MASK) > +#define NVME_CSTS_SHST(csts) ((csts >> CSTS_SHST_SHIFT) & CSTS_SHST_MASK) > +#define NVME_CSTS_NSSRO(csts) ((csts >> CSTS_NSSRO_SHIFT) & CSTS_NSSRO_MASK) > + > +enum NvmeAqaShift { > + AQA_ASQS_SHIFT = 0, > + AQA_ACQS_SHIFT = 16, > +}; > + > +enum NvmeAqaMask { > + AQA_ASQS_MASK = 0xfff, > + AQA_ACQS_MASK = 0xfff, > +}; > + > +#define NVME_AQA_ASQS(aqa) ((aqa >> AQA_ASQS_SHIFT) & AQA_ASQS_MASK) > +#define NVME_AQA_ACQS(aqa) ((aqa >> AQA_ACQS_SHIFT) & AQA_ACQS_MASK) > + > +enum NvmeCmblocShift { > + CMBLOC_BIR_SHIFT = 0, > + CMBLOC_OFST_SHIFT = 12, > +}; > + > +enum NvmeCmblocMask { > + CMBLOC_BIR_MASK = 0x7, > + CMBLOC_OFST_MASK = 0xfffff, > +}; > + > +#define NVME_CMBLOC_BIR(cmbloc) ((cmbloc >> CMBLOC_BIR_SHIFT) & \ > + CMBLOC_BIR_MASK) > +#define NVME_CMBLOC_OFST(cmbloc)((cmbloc >> CMBLOC_OFST_SHIFT) & \ > + CMBLOC_OFST_MASK) > + > +#define NVME_CMBLOC_SET_BIR(cmbloc, val) \ > + (cmbloc |= (uint64_t)(val & CMBLOC_BIR_MASK) << CMBLOC_BIR_SHIFT) > +#define NVME_CMBLOC_SET_OFST(cmbloc, val) \ > + (cmbloc |= (uint64_t)(val & CMBLOC_OFST_MASK) << CMBLOC_OFST_SHIFT) > + > +enum NvmeCmbszShift { > + CMBSZ_SQS_SHIFT = 0, > + CMBSZ_CQS_SHIFT = 1, > + CMBSZ_LISTS_SHIFT = 2, > + CMBSZ_RDS_SHIFT = 3, > + CMBSZ_WDS_SHIFT = 4, > + CMBSZ_SZU_SHIFT = 8, > + CMBSZ_SZ_SHIFT = 12, > +}; > + > +enum NvmeCmbszMask { > + CMBSZ_SQS_MASK = 0x1, > + CMBSZ_CQS_MASK = 0x1, > + CMBSZ_LISTS_MASK = 0x1, > + CMBSZ_RDS_MASK = 0x1, 
> + CMBSZ_WDS_MASK = 0x1, > + CMBSZ_SZU_MASK = 0xf, > + CMBSZ_SZ_MASK = 0xfffff, > +}; > + > +#define NVME_CMBSZ_SQS(cmbsz) ((cmbsz >> CMBSZ_SQS_SHIFT) & CMBSZ_SQS_MASK) > +#define NVME_CMBSZ_CQS(cmbsz) ((cmbsz >> CMBSZ_CQS_SHIFT) & CMBSZ_CQS_MASK) > +#define NVME_CMBSZ_LISTS(cmbsz)((cmbsz >> CMBSZ_LISTS_SHIFT) & CMBSZ_LISTS_MASK) > +#define NVME_CMBSZ_RDS(cmbsz) ((cmbsz >> CMBSZ_RDS_SHIFT) & CMBSZ_RDS_MASK) > +#define NVME_CMBSZ_WDS(cmbsz) ((cmbsz >> CMBSZ_WDS_SHIFT) & CMBSZ_WDS_MASK) > +#define NVME_CMBSZ_SZU(cmbsz) ((cmbsz >> CMBSZ_SZU_SHIFT) & CMBSZ_SZU_MASK) > +#define NVME_CMBSZ_SZ(cmbsz) ((cmbsz >> CMBSZ_SZ_SHIFT) & CMBSZ_SZ_MASK) > + > +#define NVME_CMBSZ_SET_SQS(cmbsz, val) \ > + (cmbsz |= (uint64_t)(val & CMBSZ_SQS_MASK) << CMBSZ_SQS_SHIFT) > +#define NVME_CMBSZ_SET_CQS(cmbsz, val) \ > + (cmbsz |= (uint64_t)(val & CMBSZ_CQS_MASK) << CMBSZ_CQS_SHIFT) > +#define NVME_CMBSZ_SET_LISTS(cmbsz, val) \ > + (cmbsz |= (uint64_t)(val & CMBSZ_LISTS_MASK) << CMBSZ_LISTS_SHIFT) > +#define NVME_CMBSZ_SET_RDS(cmbsz, val) \ > + (cmbsz |= (uint64_t)(val & CMBSZ_RDS_MASK) << CMBSZ_RDS_SHIFT) > +#define NVME_CMBSZ_SET_WDS(cmbsz, val) \ > + (cmbsz |= (uint64_t)(val & CMBSZ_WDS_MASK) << CMBSZ_WDS_SHIFT) > +#define NVME_CMBSZ_SET_SZU(cmbsz, val) \ > + (cmbsz |= (uint64_t)(val & CMBSZ_SZU_MASK) << CMBSZ_SZU_SHIFT) > +#define NVME_CMBSZ_SET_SZ(cmbsz, val) \ > + (cmbsz |= (uint64_t)(val & CMBSZ_SZ_MASK) << CMBSZ_SZ_SHIFT) > + > +#define NVME_CMBSZ_GETSIZE(cmbsz) \ > + (NVME_CMBSZ_SZ(cmbsz) * (1 << (12 + 4 * NVME_CMBSZ_SZU(cmbsz)))) > + > +typedef struct NvmeCmd { > + uint8_t opcode; > + uint8_t fuse; > + uint16_t cid; > + uint32_t nsid; > + uint64_t res1; > + uint64_t mptr; > + uint64_t prp1; > + uint64_t prp2; > + uint32_t cdw10; > + uint32_t cdw11; > + uint32_t cdw12; > + uint32_t cdw13; > + uint32_t cdw14; > + uint32_t cdw15; > +} NvmeCmd; > + > +enum NvmeAdminCommands { > + NVME_ADM_CMD_DELETE_SQ = 0x00, > + NVME_ADM_CMD_CREATE_SQ = 0x01, > + NVME_ADM_CMD_GET_LOG_PAGE = 0x02, > + NVME_ADM_CMD_DELETE_CQ = 0x04, > + NVME_ADM_CMD_CREATE_CQ = 0x05, > + NVME_ADM_CMD_IDENTIFY = 0x06, > + NVME_ADM_CMD_ABORT = 0x08, > + NVME_ADM_CMD_SET_FEATURES = 0x09, > + NVME_ADM_CMD_GET_FEATURES = 0x0a, > + NVME_ADM_CMD_ASYNC_EV_REQ = 0x0c, > + NVME_ADM_CMD_ACTIVATE_FW = 0x10, > + NVME_ADM_CMD_DOWNLOAD_FW = 0x11, > + NVME_ADM_CMD_FORMAT_NVM = 0x80, > + NVME_ADM_CMD_SECURITY_SEND = 0x81, > + NVME_ADM_CMD_SECURITY_RECV = 0x82, > +}; > + > +enum NvmeIoCommands { > + NVME_CMD_FLUSH = 0x00, > + NVME_CMD_WRITE = 0x01, > + NVME_CMD_READ = 0x02, > + NVME_CMD_WRITE_UNCOR = 0x04, > + NVME_CMD_COMPARE = 0x05, > + NVME_CMD_WRITE_ZEROS = 0x08, > + NVME_CMD_DSM = 0x09, > +}; > + > +typedef struct NvmeDeleteQ { > + uint8_t opcode; > + uint8_t flags; > + uint16_t cid; > + uint32_t rsvd1[9]; > + uint16_t qid; > + uint16_t rsvd10; > + uint32_t rsvd11[5]; > +} NvmeDeleteQ; > + > +typedef struct NvmeCreateCq { > + uint8_t opcode; > + uint8_t flags; > + uint16_t cid; > + uint32_t rsvd1[5]; > + uint64_t prp1; > + uint64_t rsvd8; > + uint16_t cqid; > + uint16_t qsize; > + uint16_t cq_flags; > + uint16_t irq_vector; > + uint32_t rsvd12[4]; > +} NvmeCreateCq; > + > +#define NVME_CQ_FLAGS_PC(cq_flags) (cq_flags & 0x1) > +#define NVME_CQ_FLAGS_IEN(cq_flags) ((cq_flags >> 1) & 0x1) > + > +typedef struct NvmeCreateSq { > + uint8_t opcode; > + uint8_t flags; > + uint16_t cid; > + uint32_t rsvd1[5]; > + uint64_t prp1; > + uint64_t rsvd8; > + uint16_t sqid; > + uint16_t qsize; > + uint16_t sq_flags; > + uint16_t cqid; > + uint32_t rsvd12[4]; > +} 
NvmeCreateSq; > + > +#define NVME_SQ_FLAGS_PC(sq_flags) (sq_flags & 0x1) > +#define NVME_SQ_FLAGS_QPRIO(sq_flags) ((sq_flags >> 1) & 0x3) > + > +enum NvmeQueueFlags { > + NVME_Q_PC = 1, > + NVME_Q_PRIO_URGENT = 0, > + NVME_Q_PRIO_HIGH = 1, > + NVME_Q_PRIO_NORMAL = 2, > + NVME_Q_PRIO_LOW = 3, > +}; > + > +typedef struct NvmeIdentify { > + uint8_t opcode; > + uint8_t flags; > + uint16_t cid; > + uint32_t nsid; > + uint64_t rsvd2[2]; > + uint64_t prp1; > + uint64_t prp2; > + uint32_t cns; > + uint32_t rsvd11[5]; > +} NvmeIdentify; > + > +typedef struct NvmeRwCmd { > + uint8_t opcode; > + uint8_t flags; > + uint16_t cid; > + uint32_t nsid; > + uint64_t rsvd2; > + uint64_t mptr; > + uint64_t prp1; > + uint64_t prp2; > + uint64_t slba; > + uint16_t nlb; > + uint16_t control; > + uint32_t dsmgmt; > + uint32_t reftag; > + uint16_t apptag; > + uint16_t appmask; > +} NvmeRwCmd; > + > +enum { > + NVME_RW_LR = 1 << 15, > + NVME_RW_FUA = 1 << 14, > + NVME_RW_DSM_FREQ_UNSPEC = 0, > + NVME_RW_DSM_FREQ_TYPICAL = 1, > + NVME_RW_DSM_FREQ_RARE = 2, > + NVME_RW_DSM_FREQ_READS = 3, > + NVME_RW_DSM_FREQ_WRITES = 4, > + NVME_RW_DSM_FREQ_RW = 5, > + NVME_RW_DSM_FREQ_ONCE = 6, > + NVME_RW_DSM_FREQ_PREFETCH = 7, > + NVME_RW_DSM_FREQ_TEMP = 8, > + NVME_RW_DSM_LATENCY_NONE = 0 << 4, > + NVME_RW_DSM_LATENCY_IDLE = 1 << 4, > + NVME_RW_DSM_LATENCY_NORM = 2 << 4, > + NVME_RW_DSM_LATENCY_LOW = 3 << 4, > + NVME_RW_DSM_SEQ_REQ = 1 << 6, > + NVME_RW_DSM_COMPRESSED = 1 << 7, > + NVME_RW_PRINFO_PRACT = 1 << 13, > + NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12, > + NVME_RW_PRINFO_PRCHK_APP = 1 << 11, > + NVME_RW_PRINFO_PRCHK_REF = 1 << 10, > +}; > + > +typedef struct NvmeDsmCmd { > + uint8_t opcode; > + uint8_t flags; > + uint16_t cid; > + uint32_t nsid; > + uint64_t rsvd2[2]; > + uint64_t prp1; > + uint64_t prp2; > + uint32_t nr; > + uint32_t attributes; > + uint32_t rsvd12[4]; > +} NvmeDsmCmd; > + > +enum { > + NVME_DSMGMT_IDR = 1 << 0, > + NVME_DSMGMT_IDW = 1 << 1, > + NVME_DSMGMT_AD = 1 << 2, > +}; > + > +typedef struct NvmeDsmRange { > + uint32_t cattr; > + uint32_t nlb; > + uint64_t slba; > +} NvmeDsmRange; > + > +enum NvmeAsyncEventRequest { > + NVME_AER_TYPE_ERROR = 0, > + NVME_AER_TYPE_SMART = 1, > + NVME_AER_TYPE_IO_SPECIFIC = 6, > + NVME_AER_TYPE_VENDOR_SPECIFIC = 7, > + NVME_AER_INFO_ERR_INVALID_SQ = 0, > + NVME_AER_INFO_ERR_INVALID_DB = 1, > + NVME_AER_INFO_ERR_DIAG_FAIL = 2, > + NVME_AER_INFO_ERR_PERS_INTERNAL_ERR = 3, > + NVME_AER_INFO_ERR_TRANS_INTERNAL_ERR = 4, > + NVME_AER_INFO_ERR_FW_IMG_LOAD_ERR = 5, > + NVME_AER_INFO_SMART_RELIABILITY = 0, > + NVME_AER_INFO_SMART_TEMP_THRESH = 1, > + NVME_AER_INFO_SMART_SPARE_THRESH = 2, > +}; > + > +typedef struct NvmeAerResult { > + uint8_t event_type; > + uint8_t event_info; > + uint8_t log_page; > + uint8_t resv; > +} NvmeAerResult; > + > +typedef struct NvmeCqe { > + uint32_t result; > + uint32_t rsvd; > + uint16_t sq_head; > + uint16_t sq_id; > + uint16_t cid; > + uint16_t status; > +} NvmeCqe; > + > +enum NvmeStatusCodes { > + NVME_SUCCESS = 0x0000, > + NVME_INVALID_OPCODE = 0x0001, > + NVME_INVALID_FIELD = 0x0002, > + NVME_CID_CONFLICT = 0x0003, > + NVME_DATA_TRAS_ERROR = 0x0004, > + NVME_POWER_LOSS_ABORT = 0x0005, > + NVME_INTERNAL_DEV_ERROR = 0x0006, > + NVME_CMD_ABORT_REQ = 0x0007, > + NVME_CMD_ABORT_SQ_DEL = 0x0008, > + NVME_CMD_ABORT_FAILED_FUSE = 0x0009, > + NVME_CMD_ABORT_MISSING_FUSE = 0x000a, > + NVME_INVALID_NSID = 0x000b, > + NVME_CMD_SEQ_ERROR = 0x000c, > + NVME_LBA_RANGE = 0x0080, > + NVME_CAP_EXCEEDED = 0x0081, > + NVME_NS_NOT_READY = 0x0082, > + 
NVME_NS_RESV_CONFLICT = 0x0083, > + NVME_INVALID_CQID = 0x0100, > + NVME_INVALID_QID = 0x0101, > + NVME_MAX_QSIZE_EXCEEDED = 0x0102, > + NVME_ACL_EXCEEDED = 0x0103, > + NVME_RESERVED = 0x0104, > + NVME_AER_LIMIT_EXCEEDED = 0x0105, > + NVME_INVALID_FW_SLOT = 0x0106, > + NVME_INVALID_FW_IMAGE = 0x0107, > + NVME_INVALID_IRQ_VECTOR = 0x0108, > + NVME_INVALID_LOG_ID = 0x0109, > + NVME_INVALID_FORMAT = 0x010a, > + NVME_FW_REQ_RESET = 0x010b, > + NVME_INVALID_QUEUE_DEL = 0x010c, > + NVME_FID_NOT_SAVEABLE = 0x010d, > + NVME_FID_NOT_NSID_SPEC = 0x010f, > + NVME_FW_REQ_SUSYSTEM_RESET = 0x0110, > + NVME_CONFLICTING_ATTRS = 0x0180, > + NVME_INVALID_PROT_INFO = 0x0181, > + NVME_WRITE_TO_RO = 0x0182, > + NVME_WRITE_FAULT = 0x0280, > + NVME_UNRECOVERED_READ = 0x0281, > + NVME_E2E_GUARD_ERROR = 0x0282, > + NVME_E2E_APP_ERROR = 0x0283, > + NVME_E2E_REF_ERROR = 0x0284, > + NVME_CMP_FAILURE = 0x0285, > + NVME_ACCESS_DENIED = 0x0286, > + NVME_MORE = 0x2000, > + NVME_DNR = 0x4000, > + NVME_NO_COMPLETE = 0xffff, > +}; > + > +typedef struct NvmeFwSlotInfoLog { > + uint8_t afi; > + uint8_t reserved1[7]; > + uint8_t frs1[8]; > + uint8_t frs2[8]; > + uint8_t frs3[8]; > + uint8_t frs4[8]; > + uint8_t frs5[8]; > + uint8_t frs6[8]; > + uint8_t frs7[8]; > + uint8_t reserved2[448]; > +} NvmeFwSlotInfoLog; > + > +typedef struct NvmeErrorLog { > + uint64_t error_count; > + uint16_t sqid; > + uint16_t cid; > + uint16_t status_field; > + uint16_t param_error_location; > + uint64_t lba; > + uint32_t nsid; > + uint8_t vs; > + uint8_t resv[35]; > +} NvmeErrorLog; > + > +typedef struct NvmeSmartLog { > + uint8_t critical_warning; > + uint8_t temperature[2]; > + uint8_t available_spare; > + uint8_t available_spare_threshold; > + uint8_t percentage_used; > + uint8_t reserved1[26]; > + uint64_t data_units_read[2]; > + uint64_t data_units_written[2]; > + uint64_t host_read_commands[2]; > + uint64_t host_write_commands[2]; > + uint64_t controller_busy_time[2]; > + uint64_t power_cycles[2]; > + uint64_t power_on_hours[2]; > + uint64_t unsafe_shutdowns[2]; > + uint64_t media_errors[2]; > + uint64_t number_of_error_log_entries[2]; > + uint8_t reserved2[320]; > +} NvmeSmartLog; > + > +enum NvmeSmartWarn { > + NVME_SMART_SPARE = 1 << 0, > + NVME_SMART_TEMPERATURE = 1 << 1, > + NVME_SMART_RELIABILITY = 1 << 2, > + NVME_SMART_MEDIA_READ_ONLY = 1 << 3, > + NVME_SMART_FAILED_VOLATILE_MEDIA = 1 << 4, > +}; > + > +enum LogIdentifier { > + NVME_LOG_ERROR_INFO = 0x01, > + NVME_LOG_SMART_INFO = 0x02, > + NVME_LOG_FW_SLOT_INFO = 0x03, > +}; > + > +typedef struct NvmePSD { > + uint16_t mp; > + uint16_t reserved; > + uint32_t enlat; > + uint32_t exlat; > + uint8_t rrt; > + uint8_t rrl; > + uint8_t rwt; > + uint8_t rwl; > + uint8_t resv[16]; > +} NvmePSD; > + > +typedef struct NvmeIdCtrl { > + uint16_t vid; > + uint16_t ssvid; > + uint8_t sn[20]; > + uint8_t mn[40]; > + uint8_t fr[8]; > + uint8_t rab; > + uint8_t ieee[3]; > + uint8_t cmic; > + uint8_t mdts; > + uint8_t rsvd255[178]; > + uint16_t oacs; > + uint8_t acl; > + uint8_t aerl; > + uint8_t frmw; > + uint8_t lpa; > + uint8_t elpe; > + uint8_t npss; > + uint8_t rsvd511[248]; > + uint8_t sqes; > + uint8_t cqes; > + uint16_t rsvd515; > + uint32_t nn; > + uint16_t oncs; > + uint16_t fuses; > + uint8_t fna; > + uint8_t vwc; > + uint16_t awun; > + uint16_t awupf; > + uint8_t rsvd703[174]; > + uint8_t rsvd2047[1344]; > + NvmePSD psd[32]; > + uint8_t vs[1024]; > +} NvmeIdCtrl; > + > +enum NvmeIdCtrlOacs { > + NVME_OACS_SECURITY = 1 << 0, > + NVME_OACS_FORMAT = 1 << 1, > + NVME_OACS_FW = 1 << 2, > +}; 
> + > +enum NvmeIdCtrlOncs { > + NVME_ONCS_COMPARE = 1 << 0, > + NVME_ONCS_WRITE_UNCORR = 1 << 1, > + NVME_ONCS_DSM = 1 << 2, > + NVME_ONCS_WRITE_ZEROS = 1 << 3, > + NVME_ONCS_FEATURES = 1 << 4, > + NVME_ONCS_RESRVATIONS = 1 << 5, > +}; > + > +#define NVME_CTRL_SQES_MIN(sqes) ((sqes) & 0xf) > +#define NVME_CTRL_SQES_MAX(sqes) (((sqes) >> 4) & 0xf) > +#define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf) > +#define NVME_CTRL_CQES_MAX(cqes) (((cqes) >> 4) & 0xf) > + > +typedef struct NvmeFeatureVal { > + uint32_t arbitration; > + uint32_t power_mgmt; > + uint32_t temp_thresh; > + uint32_t err_rec; > + uint32_t volatile_wc; > + uint32_t num_queues; > + uint32_t int_coalescing; > + uint32_t *int_vector_config; > + uint32_t write_atomicity; > + uint32_t async_config; > + uint32_t sw_prog_marker; > +} NvmeFeatureVal; > + > +#define NVME_ARB_AB(arb) (arb & 0x7) > +#define NVME_ARB_LPW(arb) ((arb >> 8) & 0xff) > +#define NVME_ARB_MPW(arb) ((arb >> 16) & 0xff) > +#define NVME_ARB_HPW(arb) ((arb >> 24) & 0xff) > + > +#define NVME_INTC_THR(intc) (intc & 0xff) > +#define NVME_INTC_TIME(intc) ((intc >> 8) & 0xff) > + > +enum NvmeFeatureIds { > + NVME_ARBITRATION = 0x1, > + NVME_POWER_MANAGEMENT = 0x2, > + NVME_LBA_RANGE_TYPE = 0x3, > + NVME_TEMPERATURE_THRESHOLD = 0x4, > + NVME_ERROR_RECOVERY = 0x5, > + NVME_VOLATILE_WRITE_CACHE = 0x6, > + NVME_NUMBER_OF_QUEUES = 0x7, > + NVME_INTERRUPT_COALESCING = 0x8, > + NVME_INTERRUPT_VECTOR_CONF = 0x9, > + NVME_WRITE_ATOMICITY = 0xa, > + NVME_ASYNCHRONOUS_EVENT_CONF = 0xb, > + NVME_SOFTWARE_PROGRESS_MARKER = 0x80 > +}; > + > +typedef struct NvmeRangeType { > + uint8_t type; > + uint8_t attributes; > + uint8_t rsvd2[14]; > + uint64_t slba; > + uint64_t nlb; > + uint8_t guid[16]; > + uint8_t rsvd48[16]; > +} NvmeRangeType; > + > +typedef struct NvmeLBAF { > + uint16_t ms; > + uint8_t ds; > + uint8_t rp; > +} NvmeLBAF; > + > +typedef struct NvmeIdNs { > + uint64_t nsze; > + uint64_t ncap; > + uint64_t nuse; > + uint8_t nsfeat; > + uint8_t nlbaf; > + uint8_t flbas; > + uint8_t mc; > + uint8_t dpc; > + uint8_t dps; > + uint8_t res30[98]; > + NvmeLBAF lbaf[16]; > + uint8_t res192[192]; > + uint8_t vs[3712]; > +} NvmeIdNs; > + > +#define NVME_ID_NS_NSFEAT_THIN(nsfeat) ((nsfeat & 0x1)) > +#define NVME_ID_NS_FLBAS_EXTENDED(flbas) ((flbas >> 4) & 0x1) > +#define NVME_ID_NS_FLBAS_INDEX(flbas) ((flbas & 0xf)) > +#define NVME_ID_NS_MC_SEPARATE(mc) ((mc >> 1) & 0x1) > +#define NVME_ID_NS_MC_EXTENDED(mc) ((mc & 0x1)) > +#define NVME_ID_NS_DPC_LAST_EIGHT(dpc) ((dpc >> 4) & 0x1) > +#define NVME_ID_NS_DPC_FIRST_EIGHT(dpc) ((dpc >> 3) & 0x1) > +#define NVME_ID_NS_DPC_TYPE_3(dpc) ((dpc >> 2) & 0x1) > +#define NVME_ID_NS_DPC_TYPE_2(dpc) ((dpc >> 1) & 0x1) > +#define NVME_ID_NS_DPC_TYPE_1(dpc) ((dpc & 0x1)) > +#define NVME_ID_NS_DPC_TYPE_MASK 0x7 > + > +enum NvmeIdNsDps { > + DPS_TYPE_NONE = 0, > + DPS_TYPE_1 = 1, > + DPS_TYPE_2 = 2, > + DPS_TYPE_3 = 3, > + DPS_TYPE_MASK = 0x7, > + DPS_FIRST_EIGHT = 8, > +}; > + > +static inline void _nvme_check_size(void) > +{ > + QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) != 4); > + QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16); > + QEMU_BUILD_BUG_ON(sizeof(NvmeDsmRange) != 16); > + QEMU_BUILD_BUG_ON(sizeof(NvmeCmd) != 64); > + QEMU_BUILD_BUG_ON(sizeof(NvmeDeleteQ) != 64); > + QEMU_BUILD_BUG_ON(sizeof(NvmeCreateCq) != 64); > + QEMU_BUILD_BUG_ON(sizeof(NvmeCreateSq) != 64); > + QEMU_BUILD_BUG_ON(sizeof(NvmeIdentify) != 64); > + QEMU_BUILD_BUG_ON(sizeof(NvmeRwCmd) != 64); > + QEMU_BUILD_BUG_ON(sizeof(NvmeDsmCmd) != 64); > + QEMU_BUILD_BUG_ON(sizeof(NvmeRangeType) 
!= 64); > + QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64); > + QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512); > + QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512); > + QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096); > + QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096); > +} > +#endif > diff --git a/hw/block/nvme.h b/hw/block/nvme.h > index 6aab338..59a1504 100644 > --- a/hw/block/nvme.h > +++ b/hw/block/nvme.h > @@ -1,703 +1,7 @@ > #ifndef HW_NVME_H > #define HW_NVME_H > #include "qemu/cutils.h" > - > -typedef struct NvmeBar { > - uint64_t cap; > - uint32_t vs; > - uint32_t intms; > - uint32_t intmc; > - uint32_t cc; > - uint32_t rsvd1; > - uint32_t csts; > - uint32_t nssrc; > - uint32_t aqa; > - uint64_t asq; > - uint64_t acq; > - uint32_t cmbloc; > - uint32_t cmbsz; > -} NvmeBar; > - > -enum NvmeCapShift { > - CAP_MQES_SHIFT = 0, > - CAP_CQR_SHIFT = 16, > - CAP_AMS_SHIFT = 17, > - CAP_TO_SHIFT = 24, > - CAP_DSTRD_SHIFT = 32, > - CAP_NSSRS_SHIFT = 33, > - CAP_CSS_SHIFT = 37, > - CAP_MPSMIN_SHIFT = 48, > - CAP_MPSMAX_SHIFT = 52, > -}; > - > -enum NvmeCapMask { > - CAP_MQES_MASK = 0xffff, > - CAP_CQR_MASK = 0x1, > - CAP_AMS_MASK = 0x3, > - CAP_TO_MASK = 0xff, > - CAP_DSTRD_MASK = 0xf, > - CAP_NSSRS_MASK = 0x1, > - CAP_CSS_MASK = 0xff, > - CAP_MPSMIN_MASK = 0xf, > - CAP_MPSMAX_MASK = 0xf, > -}; > - > -#define NVME_CAP_MQES(cap) (((cap) >> CAP_MQES_SHIFT) & CAP_MQES_MASK) > -#define NVME_CAP_CQR(cap) (((cap) >> CAP_CQR_SHIFT) & CAP_CQR_MASK) > -#define NVME_CAP_AMS(cap) (((cap) >> CAP_AMS_SHIFT) & CAP_AMS_MASK) > -#define NVME_CAP_TO(cap) (((cap) >> CAP_TO_SHIFT) & CAP_TO_MASK) > -#define NVME_CAP_DSTRD(cap) (((cap) >> CAP_DSTRD_SHIFT) & CAP_DSTRD_MASK) > -#define NVME_CAP_NSSRS(cap) (((cap) >> CAP_NSSRS_SHIFT) & CAP_NSSRS_MASK) > -#define NVME_CAP_CSS(cap) (((cap) >> CAP_CSS_SHIFT) & CAP_CSS_MASK) > -#define NVME_CAP_MPSMIN(cap)(((cap) >> CAP_MPSMIN_SHIFT) & CAP_MPSMIN_MASK) > -#define NVME_CAP_MPSMAX(cap)(((cap) >> CAP_MPSMAX_SHIFT) & CAP_MPSMAX_MASK) > - > -#define NVME_CAP_SET_MQES(cap, val) (cap |= (uint64_t)(val & CAP_MQES_MASK) \ > - << CAP_MQES_SHIFT) > -#define NVME_CAP_SET_CQR(cap, val) (cap |= (uint64_t)(val & CAP_CQR_MASK) \ > - << CAP_CQR_SHIFT) > -#define NVME_CAP_SET_AMS(cap, val) (cap |= (uint64_t)(val & CAP_AMS_MASK) \ > - << CAP_AMS_SHIFT) > -#define NVME_CAP_SET_TO(cap, val) (cap |= (uint64_t)(val & CAP_TO_MASK) \ > - << CAP_TO_SHIFT) > -#define NVME_CAP_SET_DSTRD(cap, val) (cap |= (uint64_t)(val & CAP_DSTRD_MASK) \ > - << CAP_DSTRD_SHIFT) > -#define NVME_CAP_SET_NSSRS(cap, val) (cap |= (uint64_t)(val & CAP_NSSRS_MASK) \ > - << CAP_NSSRS_SHIFT) > -#define NVME_CAP_SET_CSS(cap, val) (cap |= (uint64_t)(val & CAP_CSS_MASK) \ > - << CAP_CSS_SHIFT) > -#define NVME_CAP_SET_MPSMIN(cap, val) (cap |= (uint64_t)(val & CAP_MPSMIN_MASK)\ > - << CAP_MPSMIN_SHIFT) > -#define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & CAP_MPSMAX_MASK)\ > - << CAP_MPSMAX_SHIFT) > - > -enum NvmeCcShift { > - CC_EN_SHIFT = 0, > - CC_CSS_SHIFT = 4, > - CC_MPS_SHIFT = 7, > - CC_AMS_SHIFT = 11, > - CC_SHN_SHIFT = 14, > - CC_IOSQES_SHIFT = 16, > - CC_IOCQES_SHIFT = 20, > -}; > - > -enum NvmeCcMask { > - CC_EN_MASK = 0x1, > - CC_CSS_MASK = 0x7, > - CC_MPS_MASK = 0xf, > - CC_AMS_MASK = 0x7, > - CC_SHN_MASK = 0x3, > - CC_IOSQES_MASK = 0xf, > - CC_IOCQES_MASK = 0xf, > -}; > - > -#define NVME_CC_EN(cc) ((cc >> CC_EN_SHIFT) & CC_EN_MASK) > -#define NVME_CC_CSS(cc) ((cc >> CC_CSS_SHIFT) & CC_CSS_MASK) > -#define NVME_CC_MPS(cc) ((cc >> CC_MPS_SHIFT) & CC_MPS_MASK) > -#define NVME_CC_AMS(cc) ((cc >> 
CC_AMS_SHIFT) & CC_AMS_MASK) > -#define NVME_CC_SHN(cc) ((cc >> CC_SHN_SHIFT) & CC_SHN_MASK) > -#define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK) > -#define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK) > - > -enum NvmeCstsShift { > - CSTS_RDY_SHIFT = 0, > - CSTS_CFS_SHIFT = 1, > - CSTS_SHST_SHIFT = 2, > - CSTS_NSSRO_SHIFT = 4, > -}; > - > -enum NvmeCstsMask { > - CSTS_RDY_MASK = 0x1, > - CSTS_CFS_MASK = 0x1, > - CSTS_SHST_MASK = 0x3, > - CSTS_NSSRO_MASK = 0x1, > -}; > - > -enum NvmeCsts { > - NVME_CSTS_READY = 1 << CSTS_RDY_SHIFT, > - NVME_CSTS_FAILED = 1 << CSTS_CFS_SHIFT, > - NVME_CSTS_SHST_NORMAL = 0 << CSTS_SHST_SHIFT, > - NVME_CSTS_SHST_PROGRESS = 1 << CSTS_SHST_SHIFT, > - NVME_CSTS_SHST_COMPLETE = 2 << CSTS_SHST_SHIFT, > - NVME_CSTS_NSSRO = 1 << CSTS_NSSRO_SHIFT, > -}; > - > -#define NVME_CSTS_RDY(csts) ((csts >> CSTS_RDY_SHIFT) & CSTS_RDY_MASK) > -#define NVME_CSTS_CFS(csts) ((csts >> CSTS_CFS_SHIFT) & CSTS_CFS_MASK) > -#define NVME_CSTS_SHST(csts) ((csts >> CSTS_SHST_SHIFT) & CSTS_SHST_MASK) > -#define NVME_CSTS_NSSRO(csts) ((csts >> CSTS_NSSRO_SHIFT) & CSTS_NSSRO_MASK) > - > -enum NvmeAqaShift { > - AQA_ASQS_SHIFT = 0, > - AQA_ACQS_SHIFT = 16, > -}; > - > -enum NvmeAqaMask { > - AQA_ASQS_MASK = 0xfff, > - AQA_ACQS_MASK = 0xfff, > -}; > - > -#define NVME_AQA_ASQS(aqa) ((aqa >> AQA_ASQS_SHIFT) & AQA_ASQS_MASK) > -#define NVME_AQA_ACQS(aqa) ((aqa >> AQA_ACQS_SHIFT) & AQA_ACQS_MASK) > - > -enum NvmeCmblocShift { > - CMBLOC_BIR_SHIFT = 0, > - CMBLOC_OFST_SHIFT = 12, > -}; > - > -enum NvmeCmblocMask { > - CMBLOC_BIR_MASK = 0x7, > - CMBLOC_OFST_MASK = 0xfffff, > -}; > - > -#define NVME_CMBLOC_BIR(cmbloc) ((cmbloc >> CMBLOC_BIR_SHIFT) & \ > - CMBLOC_BIR_MASK) > -#define NVME_CMBLOC_OFST(cmbloc)((cmbloc >> CMBLOC_OFST_SHIFT) & \ > - CMBLOC_OFST_MASK) > - > -#define NVME_CMBLOC_SET_BIR(cmbloc, val) \ > - (cmbloc |= (uint64_t)(val & CMBLOC_BIR_MASK) << CMBLOC_BIR_SHIFT) > -#define NVME_CMBLOC_SET_OFST(cmbloc, val) \ > - (cmbloc |= (uint64_t)(val & CMBLOC_OFST_MASK) << CMBLOC_OFST_SHIFT) > - > -enum NvmeCmbszShift { > - CMBSZ_SQS_SHIFT = 0, > - CMBSZ_CQS_SHIFT = 1, > - CMBSZ_LISTS_SHIFT = 2, > - CMBSZ_RDS_SHIFT = 3, > - CMBSZ_WDS_SHIFT = 4, > - CMBSZ_SZU_SHIFT = 8, > - CMBSZ_SZ_SHIFT = 12, > -}; > - > -enum NvmeCmbszMask { > - CMBSZ_SQS_MASK = 0x1, > - CMBSZ_CQS_MASK = 0x1, > - CMBSZ_LISTS_MASK = 0x1, > - CMBSZ_RDS_MASK = 0x1, > - CMBSZ_WDS_MASK = 0x1, > - CMBSZ_SZU_MASK = 0xf, > - CMBSZ_SZ_MASK = 0xfffff, > -}; > - > -#define NVME_CMBSZ_SQS(cmbsz) ((cmbsz >> CMBSZ_SQS_SHIFT) & CMBSZ_SQS_MASK) > -#define NVME_CMBSZ_CQS(cmbsz) ((cmbsz >> CMBSZ_CQS_SHIFT) & CMBSZ_CQS_MASK) > -#define NVME_CMBSZ_LISTS(cmbsz)((cmbsz >> CMBSZ_LISTS_SHIFT) & CMBSZ_LISTS_MASK) > -#define NVME_CMBSZ_RDS(cmbsz) ((cmbsz >> CMBSZ_RDS_SHIFT) & CMBSZ_RDS_MASK) > -#define NVME_CMBSZ_WDS(cmbsz) ((cmbsz >> CMBSZ_WDS_SHIFT) & CMBSZ_WDS_MASK) > -#define NVME_CMBSZ_SZU(cmbsz) ((cmbsz >> CMBSZ_SZU_SHIFT) & CMBSZ_SZU_MASK) > -#define NVME_CMBSZ_SZ(cmbsz) ((cmbsz >> CMBSZ_SZ_SHIFT) & CMBSZ_SZ_MASK) > - > -#define NVME_CMBSZ_SET_SQS(cmbsz, val) \ > - (cmbsz |= (uint64_t)(val & CMBSZ_SQS_MASK) << CMBSZ_SQS_SHIFT) > -#define NVME_CMBSZ_SET_CQS(cmbsz, val) \ > - (cmbsz |= (uint64_t)(val & CMBSZ_CQS_MASK) << CMBSZ_CQS_SHIFT) > -#define NVME_CMBSZ_SET_LISTS(cmbsz, val) \ > - (cmbsz |= (uint64_t)(val & CMBSZ_LISTS_MASK) << CMBSZ_LISTS_SHIFT) > -#define NVME_CMBSZ_SET_RDS(cmbsz, val) \ > - (cmbsz |= (uint64_t)(val & CMBSZ_RDS_MASK) << CMBSZ_RDS_SHIFT) > -#define NVME_CMBSZ_SET_WDS(cmbsz, val) \ > - 
(cmbsz |= (uint64_t)(val & CMBSZ_WDS_MASK) << CMBSZ_WDS_SHIFT) > -#define NVME_CMBSZ_SET_SZU(cmbsz, val) \ > - (cmbsz |= (uint64_t)(val & CMBSZ_SZU_MASK) << CMBSZ_SZU_SHIFT) > -#define NVME_CMBSZ_SET_SZ(cmbsz, val) \ > - (cmbsz |= (uint64_t)(val & CMBSZ_SZ_MASK) << CMBSZ_SZ_SHIFT) > - > -#define NVME_CMBSZ_GETSIZE(cmbsz) \ > - (NVME_CMBSZ_SZ(cmbsz) * (1 << (12 + 4 * NVME_CMBSZ_SZU(cmbsz)))) > - > -typedef struct NvmeCmd { > - uint8_t opcode; > - uint8_t fuse; > - uint16_t cid; > - uint32_t nsid; > - uint64_t res1; > - uint64_t mptr; > - uint64_t prp1; > - uint64_t prp2; > - uint32_t cdw10; > - uint32_t cdw11; > - uint32_t cdw12; > - uint32_t cdw13; > - uint32_t cdw14; > - uint32_t cdw15; > -} NvmeCmd; > - > -enum NvmeAdminCommands { > - NVME_ADM_CMD_DELETE_SQ = 0x00, > - NVME_ADM_CMD_CREATE_SQ = 0x01, > - NVME_ADM_CMD_GET_LOG_PAGE = 0x02, > - NVME_ADM_CMD_DELETE_CQ = 0x04, > - NVME_ADM_CMD_CREATE_CQ = 0x05, > - NVME_ADM_CMD_IDENTIFY = 0x06, > - NVME_ADM_CMD_ABORT = 0x08, > - NVME_ADM_CMD_SET_FEATURES = 0x09, > - NVME_ADM_CMD_GET_FEATURES = 0x0a, > - NVME_ADM_CMD_ASYNC_EV_REQ = 0x0c, > - NVME_ADM_CMD_ACTIVATE_FW = 0x10, > - NVME_ADM_CMD_DOWNLOAD_FW = 0x11, > - NVME_ADM_CMD_FORMAT_NVM = 0x80, > - NVME_ADM_CMD_SECURITY_SEND = 0x81, > - NVME_ADM_CMD_SECURITY_RECV = 0x82, > -}; > - > -enum NvmeIoCommands { > - NVME_CMD_FLUSH = 0x00, > - NVME_CMD_WRITE = 0x01, > - NVME_CMD_READ = 0x02, > - NVME_CMD_WRITE_UNCOR = 0x04, > - NVME_CMD_COMPARE = 0x05, > - NVME_CMD_WRITE_ZEROS = 0x08, > - NVME_CMD_DSM = 0x09, > -}; > - > -typedef struct NvmeDeleteQ { > - uint8_t opcode; > - uint8_t flags; > - uint16_t cid; > - uint32_t rsvd1[9]; > - uint16_t qid; > - uint16_t rsvd10; > - uint32_t rsvd11[5]; > -} NvmeDeleteQ; > - > -typedef struct NvmeCreateCq { > - uint8_t opcode; > - uint8_t flags; > - uint16_t cid; > - uint32_t rsvd1[5]; > - uint64_t prp1; > - uint64_t rsvd8; > - uint16_t cqid; > - uint16_t qsize; > - uint16_t cq_flags; > - uint16_t irq_vector; > - uint32_t rsvd12[4]; > -} NvmeCreateCq; > - > -#define NVME_CQ_FLAGS_PC(cq_flags) (cq_flags & 0x1) > -#define NVME_CQ_FLAGS_IEN(cq_flags) ((cq_flags >> 1) & 0x1) > - > -typedef struct NvmeCreateSq { > - uint8_t opcode; > - uint8_t flags; > - uint16_t cid; > - uint32_t rsvd1[5]; > - uint64_t prp1; > - uint64_t rsvd8; > - uint16_t sqid; > - uint16_t qsize; > - uint16_t sq_flags; > - uint16_t cqid; > - uint32_t rsvd12[4]; > -} NvmeCreateSq; > - > -#define NVME_SQ_FLAGS_PC(sq_flags) (sq_flags & 0x1) > -#define NVME_SQ_FLAGS_QPRIO(sq_flags) ((sq_flags >> 1) & 0x3) > - > -enum NvmeQueueFlags { > - NVME_Q_PC = 1, > - NVME_Q_PRIO_URGENT = 0, > - NVME_Q_PRIO_HIGH = 1, > - NVME_Q_PRIO_NORMAL = 2, > - NVME_Q_PRIO_LOW = 3, > -}; > - > -typedef struct NvmeIdentify { > - uint8_t opcode; > - uint8_t flags; > - uint16_t cid; > - uint32_t nsid; > - uint64_t rsvd2[2]; > - uint64_t prp1; > - uint64_t prp2; > - uint32_t cns; > - uint32_t rsvd11[5]; > -} NvmeIdentify; > - > -typedef struct NvmeRwCmd { > - uint8_t opcode; > - uint8_t flags; > - uint16_t cid; > - uint32_t nsid; > - uint64_t rsvd2; > - uint64_t mptr; > - uint64_t prp1; > - uint64_t prp2; > - uint64_t slba; > - uint16_t nlb; > - uint16_t control; > - uint32_t dsmgmt; > - uint32_t reftag; > - uint16_t apptag; > - uint16_t appmask; > -} NvmeRwCmd; > - > -enum { > - NVME_RW_LR = 1 << 15, > - NVME_RW_FUA = 1 << 14, > - NVME_RW_DSM_FREQ_UNSPEC = 0, > - NVME_RW_DSM_FREQ_TYPICAL = 1, > - NVME_RW_DSM_FREQ_RARE = 2, > - NVME_RW_DSM_FREQ_READS = 3, > - NVME_RW_DSM_FREQ_WRITES = 4, > - NVME_RW_DSM_FREQ_RW = 5, > - 
NVME_RW_DSM_FREQ_ONCE = 6, > - NVME_RW_DSM_FREQ_PREFETCH = 7, > - NVME_RW_DSM_FREQ_TEMP = 8, > - NVME_RW_DSM_LATENCY_NONE = 0 << 4, > - NVME_RW_DSM_LATENCY_IDLE = 1 << 4, > - NVME_RW_DSM_LATENCY_NORM = 2 << 4, > - NVME_RW_DSM_LATENCY_LOW = 3 << 4, > - NVME_RW_DSM_SEQ_REQ = 1 << 6, > - NVME_RW_DSM_COMPRESSED = 1 << 7, > - NVME_RW_PRINFO_PRACT = 1 << 13, > - NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12, > - NVME_RW_PRINFO_PRCHK_APP = 1 << 11, > - NVME_RW_PRINFO_PRCHK_REF = 1 << 10, > -}; > - > -typedef struct NvmeDsmCmd { > - uint8_t opcode; > - uint8_t flags; > - uint16_t cid; > - uint32_t nsid; > - uint64_t rsvd2[2]; > - uint64_t prp1; > - uint64_t prp2; > - uint32_t nr; > - uint32_t attributes; > - uint32_t rsvd12[4]; > -} NvmeDsmCmd; > - > -enum { > - NVME_DSMGMT_IDR = 1 << 0, > - NVME_DSMGMT_IDW = 1 << 1, > - NVME_DSMGMT_AD = 1 << 2, > -}; > - > -typedef struct NvmeDsmRange { > - uint32_t cattr; > - uint32_t nlb; > - uint64_t slba; > -} NvmeDsmRange; > - > -enum NvmeAsyncEventRequest { > - NVME_AER_TYPE_ERROR = 0, > - NVME_AER_TYPE_SMART = 1, > - NVME_AER_TYPE_IO_SPECIFIC = 6, > - NVME_AER_TYPE_VENDOR_SPECIFIC = 7, > - NVME_AER_INFO_ERR_INVALID_SQ = 0, > - NVME_AER_INFO_ERR_INVALID_DB = 1, > - NVME_AER_INFO_ERR_DIAG_FAIL = 2, > - NVME_AER_INFO_ERR_PERS_INTERNAL_ERR = 3, > - NVME_AER_INFO_ERR_TRANS_INTERNAL_ERR = 4, > - NVME_AER_INFO_ERR_FW_IMG_LOAD_ERR = 5, > - NVME_AER_INFO_SMART_RELIABILITY = 0, > - NVME_AER_INFO_SMART_TEMP_THRESH = 1, > - NVME_AER_INFO_SMART_SPARE_THRESH = 2, > -}; > - > -typedef struct NvmeAerResult { > - uint8_t event_type; > - uint8_t event_info; > - uint8_t log_page; > - uint8_t resv; > -} NvmeAerResult; > - > -typedef struct NvmeCqe { > - uint32_t result; > - uint32_t rsvd; > - uint16_t sq_head; > - uint16_t sq_id; > - uint16_t cid; > - uint16_t status; > -} NvmeCqe; > - > -enum NvmeStatusCodes { > - NVME_SUCCESS = 0x0000, > - NVME_INVALID_OPCODE = 0x0001, > - NVME_INVALID_FIELD = 0x0002, > - NVME_CID_CONFLICT = 0x0003, > - NVME_DATA_TRAS_ERROR = 0x0004, > - NVME_POWER_LOSS_ABORT = 0x0005, > - NVME_INTERNAL_DEV_ERROR = 0x0006, > - NVME_CMD_ABORT_REQ = 0x0007, > - NVME_CMD_ABORT_SQ_DEL = 0x0008, > - NVME_CMD_ABORT_FAILED_FUSE = 0x0009, > - NVME_CMD_ABORT_MISSING_FUSE = 0x000a, > - NVME_INVALID_NSID = 0x000b, > - NVME_CMD_SEQ_ERROR = 0x000c, > - NVME_LBA_RANGE = 0x0080, > - NVME_CAP_EXCEEDED = 0x0081, > - NVME_NS_NOT_READY = 0x0082, > - NVME_NS_RESV_CONFLICT = 0x0083, > - NVME_INVALID_CQID = 0x0100, > - NVME_INVALID_QID = 0x0101, > - NVME_MAX_QSIZE_EXCEEDED = 0x0102, > - NVME_ACL_EXCEEDED = 0x0103, > - NVME_RESERVED = 0x0104, > - NVME_AER_LIMIT_EXCEEDED = 0x0105, > - NVME_INVALID_FW_SLOT = 0x0106, > - NVME_INVALID_FW_IMAGE = 0x0107, > - NVME_INVALID_IRQ_VECTOR = 0x0108, > - NVME_INVALID_LOG_ID = 0x0109, > - NVME_INVALID_FORMAT = 0x010a, > - NVME_FW_REQ_RESET = 0x010b, > - NVME_INVALID_QUEUE_DEL = 0x010c, > - NVME_FID_NOT_SAVEABLE = 0x010d, > - NVME_FID_NOT_NSID_SPEC = 0x010f, > - NVME_FW_REQ_SUSYSTEM_RESET = 0x0110, > - NVME_CONFLICTING_ATTRS = 0x0180, > - NVME_INVALID_PROT_INFO = 0x0181, > - NVME_WRITE_TO_RO = 0x0182, > - NVME_WRITE_FAULT = 0x0280, > - NVME_UNRECOVERED_READ = 0x0281, > - NVME_E2E_GUARD_ERROR = 0x0282, > - NVME_E2E_APP_ERROR = 0x0283, > - NVME_E2E_REF_ERROR = 0x0284, > - NVME_CMP_FAILURE = 0x0285, > - NVME_ACCESS_DENIED = 0x0286, > - NVME_MORE = 0x2000, > - NVME_DNR = 0x4000, > - NVME_NO_COMPLETE = 0xffff, > -}; > - > -typedef struct NvmeFwSlotInfoLog { > - uint8_t afi; > - uint8_t reserved1[7]; > - uint8_t frs1[8]; > - uint8_t frs2[8]; > - uint8_t 
frs3[8]; > - uint8_t frs4[8]; > - uint8_t frs5[8]; > - uint8_t frs6[8]; > - uint8_t frs7[8]; > - uint8_t reserved2[448]; > -} NvmeFwSlotInfoLog; > - > -typedef struct NvmeErrorLog { > - uint64_t error_count; > - uint16_t sqid; > - uint16_t cid; > - uint16_t status_field; > - uint16_t param_error_location; > - uint64_t lba; > - uint32_t nsid; > - uint8_t vs; > - uint8_t resv[35]; > -} NvmeErrorLog; > - > -typedef struct NvmeSmartLog { > - uint8_t critical_warning; > - uint8_t temperature[2]; > - uint8_t available_spare; > - uint8_t available_spare_threshold; > - uint8_t percentage_used; > - uint8_t reserved1[26]; > - uint64_t data_units_read[2]; > - uint64_t data_units_written[2]; > - uint64_t host_read_commands[2]; > - uint64_t host_write_commands[2]; > - uint64_t controller_busy_time[2]; > - uint64_t power_cycles[2]; > - uint64_t power_on_hours[2]; > - uint64_t unsafe_shutdowns[2]; > - uint64_t media_errors[2]; > - uint64_t number_of_error_log_entries[2]; > - uint8_t reserved2[320]; > -} NvmeSmartLog; > - > -enum NvmeSmartWarn { > - NVME_SMART_SPARE = 1 << 0, > - NVME_SMART_TEMPERATURE = 1 << 1, > - NVME_SMART_RELIABILITY = 1 << 2, > - NVME_SMART_MEDIA_READ_ONLY = 1 << 3, > - NVME_SMART_FAILED_VOLATILE_MEDIA = 1 << 4, > -}; > - > -enum LogIdentifier { > - NVME_LOG_ERROR_INFO = 0x01, > - NVME_LOG_SMART_INFO = 0x02, > - NVME_LOG_FW_SLOT_INFO = 0x03, > -}; > - > -typedef struct NvmePSD { > - uint16_t mp; > - uint16_t reserved; > - uint32_t enlat; > - uint32_t exlat; > - uint8_t rrt; > - uint8_t rrl; > - uint8_t rwt; > - uint8_t rwl; > - uint8_t resv[16]; > -} NvmePSD; > - > -typedef struct NvmeIdCtrl { > - uint16_t vid; > - uint16_t ssvid; > - uint8_t sn[20]; > - uint8_t mn[40]; > - uint8_t fr[8]; > - uint8_t rab; > - uint8_t ieee[3]; > - uint8_t cmic; > - uint8_t mdts; > - uint8_t rsvd255[178]; > - uint16_t oacs; > - uint8_t acl; > - uint8_t aerl; > - uint8_t frmw; > - uint8_t lpa; > - uint8_t elpe; > - uint8_t npss; > - uint8_t rsvd511[248]; > - uint8_t sqes; > - uint8_t cqes; > - uint16_t rsvd515; > - uint32_t nn; > - uint16_t oncs; > - uint16_t fuses; > - uint8_t fna; > - uint8_t vwc; > - uint16_t awun; > - uint16_t awupf; > - uint8_t rsvd703[174]; > - uint8_t rsvd2047[1344]; > - NvmePSD psd[32]; > - uint8_t vs[1024]; > -} NvmeIdCtrl; > - > -enum NvmeIdCtrlOacs { > - NVME_OACS_SECURITY = 1 << 0, > - NVME_OACS_FORMAT = 1 << 1, > - NVME_OACS_FW = 1 << 2, > -}; > - > -enum NvmeIdCtrlOncs { > - NVME_ONCS_COMPARE = 1 << 0, > - NVME_ONCS_WRITE_UNCORR = 1 << 1, > - NVME_ONCS_DSM = 1 << 2, > - NVME_ONCS_WRITE_ZEROS = 1 << 3, > - NVME_ONCS_FEATURES = 1 << 4, > - NVME_ONCS_RESRVATIONS = 1 << 5, > -}; > - > -#define NVME_CTRL_SQES_MIN(sqes) ((sqes) & 0xf) > -#define NVME_CTRL_SQES_MAX(sqes) (((sqes) >> 4) & 0xf) > -#define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf) > -#define NVME_CTRL_CQES_MAX(cqes) (((cqes) >> 4) & 0xf) > - > -typedef struct NvmeFeatureVal { > - uint32_t arbitration; > - uint32_t power_mgmt; > - uint32_t temp_thresh; > - uint32_t err_rec; > - uint32_t volatile_wc; > - uint32_t num_queues; > - uint32_t int_coalescing; > - uint32_t *int_vector_config; > - uint32_t write_atomicity; > - uint32_t async_config; > - uint32_t sw_prog_marker; > -} NvmeFeatureVal; > - > -#define NVME_ARB_AB(arb) (arb & 0x7) > -#define NVME_ARB_LPW(arb) ((arb >> 8) & 0xff) > -#define NVME_ARB_MPW(arb) ((arb >> 16) & 0xff) > -#define NVME_ARB_HPW(arb) ((arb >> 24) & 0xff) > - > -#define NVME_INTC_THR(intc) (intc & 0xff) > -#define NVME_INTC_TIME(intc) ((intc >> 8) & 0xff) > - > -enum NvmeFeatureIds { > - 
NVME_ARBITRATION = 0x1, > - NVME_POWER_MANAGEMENT = 0x2, > - NVME_LBA_RANGE_TYPE = 0x3, > - NVME_TEMPERATURE_THRESHOLD = 0x4, > - NVME_ERROR_RECOVERY = 0x5, > - NVME_VOLATILE_WRITE_CACHE = 0x6, > - NVME_NUMBER_OF_QUEUES = 0x7, > - NVME_INTERRUPT_COALESCING = 0x8, > - NVME_INTERRUPT_VECTOR_CONF = 0x9, > - NVME_WRITE_ATOMICITY = 0xa, > - NVME_ASYNCHRONOUS_EVENT_CONF = 0xb, > - NVME_SOFTWARE_PROGRESS_MARKER = 0x80 > -}; > - > -typedef struct NvmeRangeType { > - uint8_t type; > - uint8_t attributes; > - uint8_t rsvd2[14]; > - uint64_t slba; > - uint64_t nlb; > - uint8_t guid[16]; > - uint8_t rsvd48[16]; > -} NvmeRangeType; > - > -typedef struct NvmeLBAF { > - uint16_t ms; > - uint8_t ds; > - uint8_t rp; > -} NvmeLBAF; > - > -typedef struct NvmeIdNs { > - uint64_t nsze; > - uint64_t ncap; > - uint64_t nuse; > - uint8_t nsfeat; > - uint8_t nlbaf; > - uint8_t flbas; > - uint8_t mc; > - uint8_t dpc; > - uint8_t dps; > - uint8_t res30[98]; > - NvmeLBAF lbaf[16]; > - uint8_t res192[192]; > - uint8_t vs[3712]; > -} NvmeIdNs; > - > -#define NVME_ID_NS_NSFEAT_THIN(nsfeat) ((nsfeat & 0x1)) > -#define NVME_ID_NS_FLBAS_EXTENDED(flbas) ((flbas >> 4) & 0x1) > -#define NVME_ID_NS_FLBAS_INDEX(flbas) ((flbas & 0xf)) > -#define NVME_ID_NS_MC_SEPARATE(mc) ((mc >> 1) & 0x1) > -#define NVME_ID_NS_MC_EXTENDED(mc) ((mc & 0x1)) > -#define NVME_ID_NS_DPC_LAST_EIGHT(dpc) ((dpc >> 4) & 0x1) > -#define NVME_ID_NS_DPC_FIRST_EIGHT(dpc) ((dpc >> 3) & 0x1) > -#define NVME_ID_NS_DPC_TYPE_3(dpc) ((dpc >> 2) & 0x1) > -#define NVME_ID_NS_DPC_TYPE_2(dpc) ((dpc >> 1) & 0x1) > -#define NVME_ID_NS_DPC_TYPE_1(dpc) ((dpc & 0x1)) > -#define NVME_ID_NS_DPC_TYPE_MASK 0x7 > - > -enum NvmeIdNsDps { > - DPS_TYPE_NONE = 0, > - DPS_TYPE_1 = 1, > - DPS_TYPE_2 = 2, > - DPS_TYPE_3 = 3, > - DPS_TYPE_MASK = 0x7, > - DPS_FIRST_EIGHT = 8, > -}; > - > -static inline void _nvme_check_size(void) > -{ > - QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) != 4); > - QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16); > - QEMU_BUILD_BUG_ON(sizeof(NvmeDsmRange) != 16); > - QEMU_BUILD_BUG_ON(sizeof(NvmeCmd) != 64); > - QEMU_BUILD_BUG_ON(sizeof(NvmeDeleteQ) != 64); > - QEMU_BUILD_BUG_ON(sizeof(NvmeCreateCq) != 64); > - QEMU_BUILD_BUG_ON(sizeof(NvmeCreateSq) != 64); > - QEMU_BUILD_BUG_ON(sizeof(NvmeIdentify) != 64); > - QEMU_BUILD_BUG_ON(sizeof(NvmeRwCmd) != 64); > - QEMU_BUILD_BUG_ON(sizeof(NvmeDsmCmd) != 64); > - QEMU_BUILD_BUG_ON(sizeof(NvmeRangeType) != 64); > - QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64); > - QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512); > - QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512); > - QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096); > - QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096); > -} > +#include "block/nvme.h" > > typedef struct NvmeAsyncEvent { > QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry; > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH v3 6/6] block: Move NVMe spec definitions to a separate header 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 6/6] block: Move NVMe spec definitions to a separate header Fam Zheng 2017-07-05 13:39 ` Paolo Bonzini @ 2017-07-10 15:01 ` Stefan Hajnoczi 1 sibling, 0 replies; 33+ messages in thread From: Stefan Hajnoczi @ 2017-07-10 15:01 UTC (permalink / raw) To: Fam Zheng Cc: qemu-devel, Paolo Bonzini, Keith Busch, qemu-block, Kevin Wolf, Max Reitz, Karl Rister On Wed, Jul 05, 2017 at 09:36:35PM +0800, Fam Zheng wrote: > diff --git a/block/nvme.h b/block/nvme.h > new file mode 100644 > index 0000000..ed18091 > --- /dev/null > +++ b/block/nvme.h > @@ -0,0 +1,700 @@ > +#ifndef BLOCK_NVME_H > +#define BLOC_NVMEK_H s/BLOC_NVMEK_H/BLOCK_NVME_H/ ^ permalink raw reply [flat|nested] 33+ messages in thread
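For reference, the fix Stefan asks for is a one-character transposition. The corrected guard pair would read as follows (a minimal sketch; it assumes only that the header keeps the name block/nvme.h from the diff above):

    #ifndef BLOCK_NVME_H
    #define BLOCK_NVME_H

    /* NVMe spec structures and constants, shared by the hw/block
     * emulation and the new block/nvme.c driver. */

    #endif /* BLOCK_NVME_H */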
* Re: [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device 2017-07-05 13:36 [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device Fam Zheng ` (5 preceding siblings ...) 2017-07-05 13:36 ` [Qemu-devel] [PATCH v3 6/6] block: Move NVMe spec definitions to a separate header Fam Zheng @ 2017-07-05 13:41 ` Paolo Bonzini 2017-07-06 14:06 ` no-reply 7 siblings, 0 replies; 33+ messages in thread From: Paolo Bonzini @ 2017-07-05 13:41 UTC (permalink / raw) To: Fam Zheng, qemu-devel Cc: Keith Busch, qemu-block, Kevin Wolf, Max Reitz, Stefan Hajnoczi, Karl Rister On 05/07/2017 15:36, Fam Zheng wrote: > v3: Rebase, small tweaks/fixes and add locks to provide basic thread safety > (basic because it is not really tested). Sounds good, it can be converted to CoMutex later. Paolo ^ permalink raw reply [flat|nested] 33+ messages in thread
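A rough sketch of the CoMutex conversion Paolo has in mind, for illustration only: CoMutex, qemu_co_mutex_init() and qemu_co_mutex_lock()/unlock() are QEMU's coroutine mutex primitives, but the structure and function names below are hypothetical stand-ins, not code from this series.

    #include "qemu/osdep.h"
    #include "qemu/coroutine.h"

    typedef struct {
        CoMutex lock;   /* would replace the plain pthread-backed mutex */
        /* ... submission/completion queue state ... */
    } SketchQueuePair;

    /* Called once when the queue pair is created. */
    static void sketch_queue_init(SketchQueuePair *q)
    {
        qemu_co_mutex_init(&q->lock);
    }

    static void coroutine_fn sketch_submit(SketchQueuePair *q)
    {
        /* Suspends only this coroutine on contention; the AioContext
         * thread keeps polling other requests instead of blocking. */
        qemu_co_mutex_lock(&q->lock);
        /* fill in the SQ entry and ring the doorbell here */
        qemu_co_mutex_unlock(&q->lock);
    }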
* Re: [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device 2017-07-05 13:36 [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device Fam Zheng ` (6 preceding siblings ...) 2017-07-05 13:41 ` [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device Paolo Bonzini @ 2017-07-06 14:06 ` no-reply 2017-07-06 14:22 ` Paolo Bonzini 7 siblings, 1 reply; 33+ messages in thread From: no-reply @ 2017-07-06 14:06 UTC (permalink / raw) To: famz Cc: qemu-devel, kwolf, qemu-block, mreitz, keith.busch, stefanha, pbonzini, krister Hi, This series seems to have some coding style problems. See output below for more information: Message-id: 20170705133635.11850-1-famz@redhat.com Type: series Subject: [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device === TEST SCRIPT BEGIN === #!/bin/bash BASE=base n=1 total=$(git log --oneline $BASE.. | wc -l) failed=0 git config --local diff.renamelimit 0 git config --local diff.renames True commits="$(git log --format=%H --reverse $BASE..)" for c in $commits; do echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..." if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then failed=1 echo fi n=$((n+1)) done exit $failed === TEST SCRIPT END === Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384 Switched to a new branch 'test' 079ea59 block: Move NVMe spec definitions to a separate header 28b0ea5 qemu-img: Map bench buffer 0c5fe62 block/nvme: Implement .bdrv_dma_map and .bdrv_dma_unmap bb59e7c block: Introduce bdrv_dma_map and bdrv_dma_unmap be4dab1 block: Add VFIO based NVMe driver 74c0158 stubs: Add stubs for ram block API === OUTPUT BEGIN === Checking PATCH 1/6: stubs: Add stubs for ram block API... Checking PATCH 2/6: block: Add VFIO based NVMe driver... WARNING: line over 80 characters #191: FILE: block/nvme-vfio.c:133: + if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, &s->bar_region_info[index])) { WARNING: line over 80 characters #279: FILE: block/nvme-vfio.c:221: +static int nvme_vfio_pci_write_config(NVMeVFIOState *s, void *buf, int size, int ofs) ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt #843: FILE: block/nvme.c:40: + volatile uint32_t *doorbell; ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt #869: FILE: block/nvme.c:66: +typedef volatile struct { WARNING: line over 80 characters #1381: FILE: block/nvme.c:578: + error_setg(errp, "Timeout while waiting for device to reset (%ld ms)", WARNING: line over 80 characters #1410: FILE: block/nvme.c:607: + error_setg(errp, "Timeout while waiting for device to start (%ld ms)", total: 2 errors, 4 warnings, 1878 lines checked Your patch has style problems, please review. If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS. Checking PATCH 3/6: block: Introduce bdrv_dma_map and bdrv_dma_unmap... Checking PATCH 4/6: block/nvme: Implement .bdrv_dma_map and .bdrv_dma_unmap... Checking PATCH 5/6: qemu-img: Map bench buffer... Checking PATCH 6/6: block: Move NVMe spec definitions to a separate header... === OUTPUT END === Test command exited with code: 1 --- Email generated automatically by Patchew [http://patchew.org/]. Please send your feedback to patchew-devel@freelists.org ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device 2017-07-06 14:06 ` no-reply @ 2017-07-06 14:22 ` Paolo Bonzini 2017-07-06 14:36 ` Fam Zheng 0 siblings, 1 reply; 33+ messages in thread From: Paolo Bonzini @ 2017-07-06 14:22 UTC (permalink / raw) To: qemu-devel, famz Cc: kwolf, qemu-block, mreitz, keith.busch, stefanha, krister On 06/07/2017 16:06, no-reply@patchew.org wrote: > ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt > #843: FILE: block/nvme.c:40: > + volatile uint32_t *doorbell; > > ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt > #869: FILE: block/nvme.c:66: > +typedef volatile struct { Indeed volatile should not be necessary, since we use memory barriers appropriately. But these are hardware registers (like, host hardware) so I guess it's okay for this special case. Paolo ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device 2017-07-06 14:22 ` Paolo Bonzini @ 2017-07-06 14:36 ` Fam Zheng 2017-07-06 14:44 ` Paolo Bonzini 0 siblings, 1 reply; 33+ messages in thread From: Fam Zheng @ 2017-07-06 14:36 UTC (permalink / raw) To: Paolo Bonzini Cc: qemu-devel, kwolf, qemu-block, mreitz, keith.busch, stefanha, krister On Thu, 07/06 16:22, Paolo Bonzini wrote: > > > On 06/07/2017 16:06, no-reply@patchew.org wrote: > > ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt > > #843: FILE: block/nvme.c:40: > > + volatile uint32_t *doorbell; > > > > ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt > > #869: FILE: block/nvme.c:66: > > +typedef volatile struct { > > Indeed volatile should not be necessary, since we use memory barriers > appropriately. But these are hardware registers (like, host hardware) > so I guess it's okay for this special case. I think I used it because we don't have ACCESS_ONCE (maybe we should?). Fam ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH v3 0/6] block: Add VFIO based driver for NVMe device 2017-07-06 14:36 ` Fam Zheng @ 2017-07-06 14:44 ` Paolo Bonzini 0 siblings, 0 replies; 33+ messages in thread From: Paolo Bonzini @ 2017-07-06 14:44 UTC (permalink / raw) To: Fam Zheng Cc: qemu-devel, kwolf, qemu-block, mreitz, keith.busch, stefanha, krister On 06/07/2017 16:36, Fam Zheng wrote: > On Thu, 07/06 16:22, Paolo Bonzini wrote: >> >> >> On 06/07/2017 16:06, no-reply@patchew.org wrote: >>> ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt >>> #843: FILE: block/nvme.c:40: >>> + volatile uint32_t *doorbell; >>> >>> ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt >>> #869: FILE: block/nvme.c:66: >>> +typedef volatile struct { >> >> Indeed volatile should not be necessary, since we use memory barriers >> appropriately. But these are hardware registers (like, host hardware) >> so I guess it's okay for this special case. > > I think I used it because we don't have ACCESS_ONCE (maybe we should?). We have atomic_read and atomic_set (and Linux in fact tries not to use ACCESS_ONCE anymore, it's been replaced by READ_ONCE and WRITE_ONCE so it's really 1:1 with QEMU). Paolo ^ permalink raw reply [flat|nested] 33+ messages in thread
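To make that concrete, here is a minimal sketch of dropping volatile in favor of the atomic helpers. atomic_set() and smp_wmb() are from qemu/atomic.h as of this thread's vintage; the function and its arguments are illustrative, and the sketch glosses over the doorbell being a mapped device register rather than ordinary RAM.

    #include "qemu/osdep.h"
    #include "qemu/atomic.h"

    /* Make the new SQ entry globally visible before the tail store
     * that hands it to the device. */
    static void sketch_ring_sq_doorbell(uint32_t *doorbell, uint16_t tail)
    {
        smp_wmb();                  /* order SQE writes before the ring */
        atomic_set(doorbell, tail); /* plain pointer, no volatile */
    }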