[Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane
@ 2012-11-22 15:16 Stefan Hajnoczi
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 01/11] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane Stefan Hajnoczi
                   ` (11 more replies)
  0 siblings, 12 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

This series adds the -device virtio-blk-pci,x-data-plane=on property that
enables a high performance I/O codepath.  A dedicated thread is used to process
virtio-blk requests outside the global mutex and without going through the QEMU
block layer.

Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
IOPS for a single VM using virtio-blk-data-plane in July:

  http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580

The virtio-blk-data-plane approach was originally presented at Linux Plumbers
Conference 2010.  The following slides contain a brief overview:

  http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf

The basic approach is:
1. Each virtio-blk device has a thread dedicated to handling ioeventfd
   signalling when the guest kicks the virtqueue.
2. Requests are processed without going through the QEMU block layer using
   Linux AIO directly.
3. Completion interrupts are injected via irqfd from the dedicated thread.

To try it out:

  qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
       -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on

Limitations:
 * Only format=raw is supported
 * Live migration is not supported
 * Block jobs, hot unplug, and other operations fail with -EBUSY
 * I/O throttling limits are ignored
 * Only Linux hosts are supported due to Linux AIO usage

The code has reached a stage where I feel it is ready to merge.  Users have
been playing with it for some time and want the significant performance boost.

We are refactoring QEMU to get rid of the global mutex.  I believe that
virtio-blk-data-plane can eventually become the default mode of operation.

Instead of waiting for global mutex removal efforts to finish, I want to use
virtio-blk-data-plane as an example device for AioContext and threaded hw
dispatch refactoring.  This means:

1. When the block layer can bind to an AioContext and execute I/O outside the
   global mutex, virtio-blk-data-plane can use this (and gain image format
   support).

2. When hw dispatch no longer needs the global mutex we can use hw/virtio.c
   again and perhaps run a pool of iothreads instead of dedicated data plane
   threads.

But in the meantime, I have cleaned up the virtio-blk-data-plane code so that
it can be merged as an experimental feature.

v4:
 * Add qemu_iovec_concat_iov() [Paolo]
 * Use QEMUIOVector to copy out virtio_blk_inhdr [Michael, Paolo]

v3:
 * Don't assume iovec layout [Michael]
 * Better naming for hostmem.c MemoryListener callbacks [Don]
 * More vring quarantining if commands are bogus instead of exiting [Blue]

v2:
 * Use MemoryListener for thread-safe memory mapping [Paolo, Anthony, and everyone else pointed this out ;-)]
 * Quarantine invalid vring instead of exiting [Blue]
 * Replace __u16 kernel types with uint16_t [Blue]

Changes from the RFC v9:
 * Add x-data-plane=on|off option and coexist with regular virtio-blk code
 * Create thread from BH so it inherits iothread cpusets
 * Drain requests on vm_stop() so stopped guest does not access image file
 * Add migration blocker
 * Add bdrv_in_use() to prevent block jobs and other operations that can interfere
 * Drop IOQueue request merging for simplicity
 * Drop ioctl interrupt injection and always use irqfd for simplicity
 * Major cleanup to split up source files
 * Rebase from qemu-kvm.git onto qemu.git
 * Address Michael Tsirkin's review comments

Stefan Hajnoczi (11):
  raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane
  configure: add CONFIG_VIRTIO_BLK_DATA_PLANE
  dataplane: add host memory mapping code
  dataplane: add virtqueue vring code
  dataplane: add event loop
  dataplane: add Linux AIO request queue
  iov: add iov_discard() to remove data
  test-iov: add iov_discard() testcase
  iov: add qemu_iovec_concat_iov()
  dataplane: add virtio-blk data plane code
  virtio-blk: add x-data-plane=on|off performance feature

 block.h                    |   9 +
 block/raw-posix.c          |  34 ++++
 configure                  |  21 +++
 hw/Makefile.objs           |   2 +-
 hw/dataplane/Makefile.objs |   3 +
 hw/dataplane/event-poll.c  | 109 ++++++++++++
 hw/dataplane/event-poll.h  |  40 +++++
 hw/dataplane/hostmem.c     | 165 ++++++++++++++++++
 hw/dataplane/hostmem.h     |  52 ++++++
 hw/dataplane/ioq.c         | 118 +++++++++++++
 hw/dataplane/ioq.h         |  57 ++++++
 hw/dataplane/virtio-blk.c  | 427 +++++++++++++++++++++++++++++++++++++++++++++
 hw/dataplane/virtio-blk.h  |  41 +++++
 hw/dataplane/vring.c       | 344 ++++++++++++++++++++++++++++++++++++
 hw/dataplane/vring.h       |  62 +++++++
 hw/virtio-blk.c            |  59 ++++++-
 hw/virtio-blk.h            |   1 +
 hw/virtio-pci.c            |   3 +
 iov.c                      |  80 +++++++--
 iov.h                      |  13 ++
 qemu-common.h              |   3 +
 tests/test-iov.c           | 129 ++++++++++++++
 trace-events               |   9 +
 23 files changed, 1767 insertions(+), 14 deletions(-)
 create mode 100644 hw/dataplane/Makefile.objs
 create mode 100644 hw/dataplane/event-poll.c
 create mode 100644 hw/dataplane/event-poll.h
 create mode 100644 hw/dataplane/hostmem.c
 create mode 100644 hw/dataplane/hostmem.h
 create mode 100644 hw/dataplane/ioq.c
 create mode 100644 hw/dataplane/ioq.h
 create mode 100644 hw/dataplane/virtio-blk.c
 create mode 100644 hw/dataplane/virtio-blk.h
 create mode 100644 hw/dataplane/vring.c
 create mode 100644 hw/dataplane/vring.h

-- 
1.8.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v4 01/11] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane
  2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
@ 2012-11-22 15:16 ` Stefan Hajnoczi
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 02/11] configure: add CONFIG_VIRTIO_BLK_DATA_PLANE Stefan Hajnoczi
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The raw_get_aio_fd() function allows virtio-blk-data-plane to get the
file descriptor of a raw image file with Linux AIO enabled.  This
interface is really a layering violation that can be resolved once the
block layer is able to run outside the global mutex - at that point
virtio-blk-data-plane will switch from custom Linux AIO code to using
the block layer.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.h           |  9 +++++++++
 block/raw-posix.c | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/block.h b/block.h
index 722c620..2dc6aaf 100644
--- a/block.h
+++ b/block.h
@@ -365,6 +365,15 @@ void bdrv_disable_copy_on_read(BlockDriverState *bs);
 void bdrv_set_in_use(BlockDriverState *bs, int in_use);
 int bdrv_in_use(BlockDriverState *bs);
 
+#ifdef CONFIG_LINUX_AIO
+int raw_get_aio_fd(BlockDriverState *bs);
+#else
+static inline int raw_get_aio_fd(BlockDriverState *bs)
+{
+    return -ENOTSUP;
+}
+#endif
+
 enum BlockAcctType {
     BDRV_ACCT_READ,
     BDRV_ACCT_WRITE,
diff --git a/block/raw-posix.c b/block/raw-posix.c
index f2f0404..fc04981 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -1768,6 +1768,40 @@ static BlockDriver bdrv_host_cdrom = {
 };
 #endif /* __FreeBSD__ */
 
+#ifdef CONFIG_LINUX_AIO
+/**
+ * Return the file descriptor for Linux AIO
+ *
+ * This function is a layering violation and should be removed when it becomes
+ * possible to call the block layer outside the global mutex.  It allows the
+ * caller to hijack the file descriptor so I/O can be performed outside the
+ * block layer.
+ */
+int raw_get_aio_fd(BlockDriverState *bs)
+{
+    BDRVRawState *s;
+
+    if (!bs->drv) {
+        return -ENOMEDIUM;
+    }
+
+    if (bs->drv == bdrv_find_format("raw")) {
+        bs = bs->file;
+    }
+
+    /* raw-posix has several protocols so just check for raw_aio_readv */
+    if (bs->drv->bdrv_aio_readv != raw_aio_readv) {
+        return -ENOTSUP;
+    }
+
+    s = bs->opaque;
+    if (!s->use_aio) {
+        return -ENOTSUP;
+    }
+    return s->fd;
+}
+#endif /* CONFIG_LINUX_AIO */
+
 static void bdrv_file_init(void)
 {
     /*
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v4 02/11] configure: add CONFIG_VIRTIO_BLK_DATA_PLANE
  2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 01/11] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane Stefan Hajnoczi
@ 2012-11-22 15:16 ` Stefan Hajnoczi
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code Stefan Hajnoczi
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The virtio-blk-data-plane feature only works with Linux AIO.  Therefore
add a ./configure option and necessary checks to implement this
dependency.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 configure | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/configure b/configure
index 780b19a..633ba6d 100755
--- a/configure
+++ b/configure
@@ -223,6 +223,7 @@ libiscsi=""
 coroutine=""
 seccomp=""
 glusterfs=""
+virtio_blk_data_plane=""
 
 # parse CC options first
 for opt do
@@ -871,6 +872,10 @@ for opt do
   ;;
   --enable-glusterfs) glusterfs="yes"
   ;;
+  --disable-virtio-blk-data-plane) virtio_blk_data_plane="no"
+  ;;
+  --enable-virtio-blk-data-plane) virtio_blk_data_plane="yes"
+  ;;
   *) echo "ERROR: unknown option $opt"; show_help="yes"
   ;;
   esac
@@ -2233,6 +2238,17 @@ EOF
 fi
 
 ##########################################
+# adjust virtio-blk-data-plane based on linux-aio
+
+if test "$virtio_blk_data_plane" = "yes" -a \
+	"$linux_aio" != "yes" ; then
+  echo "Error: virtio-blk-data-plane requires Linux AIO, please try --enable-linux-aio"
+  exit 1
+elif test -z "$virtio_blk_data_plane" ; then
+  virtio_blk_data_plane=$linux_aio
+fi
+
+##########################################
 # attr probe
 
 if test "$attr" != "no" ; then
@@ -3235,6 +3251,7 @@ echo "build guest agent $guest_agent"
 echo "seccomp support   $seccomp"
 echo "coroutine backend $coroutine_backend"
 echo "GlusterFS support $glusterfs"
+echo "virtio-blk-data-plane $virtio_blk_data_plane"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -3581,6 +3598,10 @@ if test "$glusterfs" = "yes" ; then
   echo "CONFIG_GLUSTERFS=y" >> $config_host_mak
 fi
 
+if test "$virtio_blk_data_plane" = "yes" ; then
+  echo "CONFIG_VIRTIO_BLK_DATA_PLANE=y" >> $config_host_mak
+fi
+
 # USB host support
 case "$usb" in
 linux)
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code
  2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 01/11] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane Stefan Hajnoczi
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 02/11] configure: add CONFIG_VIRTIO_BLK_DATA_PLANE Stefan Hajnoczi
@ 2012-11-22 15:16 ` Stefan Hajnoczi
  2012-11-29 12:33   ` Michael S. Tsirkin
  2012-11-29 13:54   ` Michael S. Tsirkin
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 04/11] dataplane: add virtqueue vring code Stefan Hajnoczi
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The data plane thread needs to map guest physical addresses to host
pointers.  Normally this is done with cpu_physical_memory_map() but the
function assumes the global mutex is held.  The data plane thread does
not touch the global mutex and therefore needs a thread-safe memory
mapping mechanism.

Hostmem registers a MemoryListener similar to how vhost collects and
pushes memory region information into the kernel.  There is a
fine-grained lock on the regions list which is held during lookup and
when installing a new regions list.

When the physical memory map changes the MemoryListener callbacks are
invoked.  They build up a new list of memory regions which is finally
installed when the list has been completed.

Note that this approach is not safe across memory hotplug because mapped
pointers may still be in used across memory unplug.  However, this is
currently a problem for QEMU in general and needs to be addressed in the
future.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/dataplane/Makefile.objs |   3 +
 hw/dataplane/hostmem.c     | 165 +++++++++++++++++++++++++++++++++++++++++++++
 hw/dataplane/hostmem.h     |  52 ++++++++++++++
 3 files changed, 220 insertions(+)
 create mode 100644 hw/dataplane/Makefile.objs
 create mode 100644 hw/dataplane/hostmem.c
 create mode 100644 hw/dataplane/hostmem.h

diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
new file mode 100644
index 0000000..8c8dea1
--- /dev/null
+++ b/hw/dataplane/Makefile.objs
@@ -0,0 +1,3 @@
+ifeq ($(CONFIG_VIRTIO), y)
+common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o
+endif
diff --git a/hw/dataplane/hostmem.c b/hw/dataplane/hostmem.c
new file mode 100644
index 0000000..48aabf0
--- /dev/null
+++ b/hw/dataplane/hostmem.c
@@ -0,0 +1,165 @@
+/*
+ * Thread-safe guest to host memory mapping
+ *
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "exec-memory.h"
+#include "hostmem.h"
+
+static int hostmem_lookup_cmp(const void *phys_, const void *region_)
+{
+    hwaddr phys = *(const hwaddr *)phys_;
+    const HostmemRegion *region = region_;
+
+    if (phys < region->guest_addr) {
+        return -1;
+    } else if (phys >= region->guest_addr + region->size) {
+        return 1;
+    } else {
+        return 0;
+    }
+}
+
+/**
+ * Map guest physical address to host pointer
+ */
+void *hostmem_lookup(Hostmem *hostmem, hwaddr phys, hwaddr len, bool is_write)
+{
+    HostmemRegion *region;
+    void *host_addr = NULL;
+    hwaddr offset_within_region;
+
+    qemu_mutex_lock(&hostmem->current_regions_lock);
+    region = bsearch(&phys, hostmem->current_regions,
+                     hostmem->num_current_regions,
+                     sizeof(hostmem->current_regions[0]),
+                     hostmem_lookup_cmp);
+    if (!region) {
+        goto out;
+    }
+    if (is_write && region->readonly) {
+        goto out;
+    }
+    offset_within_region = phys - region->guest_addr;
+    if (offset_within_region + len <= region->size) {
+        host_addr = region->host_addr + offset_within_region;
+    }
+out:
+    qemu_mutex_unlock(&hostmem->current_regions_lock);
+
+    return host_addr;
+}
+
+/**
+ * Install new regions list
+ */
+static void hostmem_listener_commit(MemoryListener *listener)
+{
+    Hostmem *hostmem = container_of(listener, Hostmem, listener);
+
+    qemu_mutex_lock(&hostmem->current_regions_lock);
+    g_free(hostmem->current_regions);
+    hostmem->current_regions = hostmem->new_regions;
+    hostmem->num_current_regions = hostmem->num_new_regions;
+    qemu_mutex_unlock(&hostmem->current_regions_lock);
+
+    /* Reset new regions list */
+    hostmem->new_regions = NULL;
+    hostmem->num_new_regions = 0;
+}
+
+/**
+ * Add a MemoryRegionSection to the new regions list
+ */
+static void hostmem_append_new_region(Hostmem *hostmem,
+                                      MemoryRegionSection *section)
+{
+    void *ram_ptr = memory_region_get_ram_ptr(section->mr);
+    size_t num = hostmem->num_new_regions;
+    size_t new_size = (num + 1) * sizeof(hostmem->new_regions[0]);
+
+    hostmem->new_regions = g_realloc(hostmem->new_regions, new_size);
+    hostmem->new_regions[num] = (HostmemRegion){
+        .host_addr = ram_ptr + section->offset_within_region,
+        .guest_addr = section->offset_within_address_space,
+        .size = section->size,
+        .readonly = section->readonly,
+    };
+    hostmem->num_new_regions++;
+}
+
+static void hostmem_listener_append_region(MemoryListener *listener,
+                                           MemoryRegionSection *section)
+{
+    Hostmem *hostmem = container_of(listener, Hostmem, listener);
+
+    if (memory_region_is_ram(section->mr)) {
+        hostmem_append_new_region(hostmem, section);
+    }
+}
+
+/* We don't implement most MemoryListener callbacks, use these nop stubs */
+static void hostmem_listener_dummy(MemoryListener *listener)
+{
+}
+
+static void hostmem_listener_section_dummy(MemoryListener *listener,
+                                           MemoryRegionSection *section)
+{
+}
+
+static void hostmem_listener_eventfd_dummy(MemoryListener *listener,
+                                           MemoryRegionSection *section,
+                                           bool match_data, uint64_t data,
+                                           EventNotifier *e)
+{
+}
+
+static void hostmem_listener_coalesced_mmio_dummy(MemoryListener *listener,
+                                                  MemoryRegionSection *section,
+                                                  hwaddr addr, hwaddr len)
+{
+}
+
+void hostmem_init(Hostmem *hostmem)
+{
+    memset(hostmem, 0, sizeof(*hostmem));
+
+    hostmem->listener = (MemoryListener){
+        .begin = hostmem_listener_dummy,
+        .commit = hostmem_listener_commit,
+        .region_add = hostmem_listener_append_region,
+        .region_del = hostmem_listener_section_dummy,
+        .region_nop = hostmem_listener_append_region,
+        .log_start = hostmem_listener_section_dummy,
+        .log_stop = hostmem_listener_section_dummy,
+        .log_sync = hostmem_listener_section_dummy,
+        .log_global_start = hostmem_listener_dummy,
+        .log_global_stop = hostmem_listener_dummy,
+        .eventfd_add = hostmem_listener_eventfd_dummy,
+        .eventfd_del = hostmem_listener_eventfd_dummy,
+        .coalesced_mmio_add = hostmem_listener_coalesced_mmio_dummy,
+        .coalesced_mmio_del = hostmem_listener_coalesced_mmio_dummy,
+        .priority = 10,
+    };
+
+    memory_listener_register(&hostmem->listener, &address_space_memory);
+    if (hostmem->num_new_regions > 0) {
+        hostmem_listener_commit(&hostmem->listener);
+    }
+}
+
+void hostmem_finalize(Hostmem *hostmem)
+{
+    memory_listener_unregister(&hostmem->listener);
+    g_free(hostmem->new_regions);
+    g_free(hostmem->current_regions);
+}
diff --git a/hw/dataplane/hostmem.h b/hw/dataplane/hostmem.h
new file mode 100644
index 0000000..a833b74
--- /dev/null
+++ b/hw/dataplane/hostmem.h
@@ -0,0 +1,52 @@
+/*
+ * Thread-safe guest to host memory mapping
+ *
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef HOSTMEM_H
+#define HOSTMEM_H
+
+#include "memory.h"
+#include "qemu-thread.h"
+
+typedef struct {
+    void *host_addr;
+    hwaddr guest_addr;
+    uint64_t size;
+    bool readonly;
+} HostmemRegion;
+
+typedef struct {
+    /* The listener is invoked when regions change and a new list of regions is
+     * built up completely before they are installed.
+     */
+    MemoryListener listener;
+    HostmemRegion *new_regions;
+    size_t num_new_regions;
+
+    /* Current regions are accessed from multiple threads either to lookup
+     * addresses or to install a new list of regions.  The lock protects the
+     * pointer and the regions.
+     */
+    QemuMutex current_regions_lock;
+    HostmemRegion *current_regions;
+    size_t num_current_regions;
+} Hostmem;
+
+void hostmem_init(Hostmem *hostmem);
+void hostmem_finalize(Hostmem *hostmem);
+
+/**
+ * Map a guest physical address to a pointer
+ */
+void *hostmem_lookup(Hostmem *hostmem, hwaddr phys, hwaddr len, bool is_write);
+
+#endif /* HOSTMEM_H */
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code Stefan Hajnoczi
@ 2012-11-29 12:33   ` Michael S. Tsirkin
  2012-11-29 12:45     ` Stefan Hajnoczi
  2012-11-29 13:54   ` Michael S. Tsirkin
  1 sibling, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 12:33 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Paolo Bonzini, Asias He

On Thu, Nov 22, 2012 at 04:16:44PM +0100, Stefan Hajnoczi wrote:
> The data plane thread needs to map guest physical addresses to host
> pointers.  Normally this is done with cpu_physical_memory_map() but the
> function assumes the global mutex is held.  The data plane thread does
> not touch the global mutex and therefore needs a thread-safe memory
> mapping mechanism.
> 
> Hostmem registers a MemoryListener similar to how vhost collects and
> pushes memory region information into the kernel.  There is a
> fine-grained lock on the regions list which is held during lookup and
> when installing a new regions list.
> 
> When the physical memory map changes the MemoryListener callbacks are
> invoked.  They build up a new list of memory regions which is finally
> installed when the list has been completed.
> 
> Note that this approach is not safe across memory hotplug because mapped
> pointers may still be in used across memory unplug.  However, this is
> currently a problem for QEMU in general and needs to be addressed in the
> future.

Sounds like a serious problem.
I'm not sure I understand - do you say this currently a problem for QEMU
virtio? Coul you give an example please?

> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

> ---
>  hw/dataplane/Makefile.objs |   3 +
>  hw/dataplane/hostmem.c     | 165 +++++++++++++++++++++++++++++++++++++++++++++
>  hw/dataplane/hostmem.h     |  52 ++++++++++++++
>  3 files changed, 220 insertions(+)
>  create mode 100644 hw/dataplane/Makefile.objs
>  create mode 100644 hw/dataplane/hostmem.c
>  create mode 100644 hw/dataplane/hostmem.h
> 
> diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
> new file mode 100644
> index 0000000..8c8dea1
> --- /dev/null
> +++ b/hw/dataplane/Makefile.objs
> @@ -0,0 +1,3 @@
> +ifeq ($(CONFIG_VIRTIO), y)
> +common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o
> +endif
> diff --git a/hw/dataplane/hostmem.c b/hw/dataplane/hostmem.c
> new file mode 100644
> index 0000000..48aabf0
> --- /dev/null
> +++ b/hw/dataplane/hostmem.c
> @@ -0,0 +1,165 @@
> +/*
> + * Thread-safe guest to host memory mapping
> + *
> + * Copyright 2012 Red Hat, Inc. and/or its affiliates
> + *
> + * Authors:
> + *   Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "exec-memory.h"
> +#include "hostmem.h"
> +
> +static int hostmem_lookup_cmp(const void *phys_, const void *region_)
> +{
> +    hwaddr phys = *(const hwaddr *)phys_;
> +    const HostmemRegion *region = region_;
> +
> +    if (phys < region->guest_addr) {
> +        return -1;
> +    } else if (phys >= region->guest_addr + region->size) {
> +        return 1;
> +    } else {
> +        return 0;
> +    }
> +}
> +
> +/**
> + * Map guest physical address to host pointer
> + */
> +void *hostmem_lookup(Hostmem *hostmem, hwaddr phys, hwaddr len, bool is_write)
> +{
> +    HostmemRegion *region;
> +    void *host_addr = NULL;
> +    hwaddr offset_within_region;
> +
> +    qemu_mutex_lock(&hostmem->current_regions_lock);
> +    region = bsearch(&phys, hostmem->current_regions,
> +                     hostmem->num_current_regions,
> +                     sizeof(hostmem->current_regions[0]),
> +                     hostmem_lookup_cmp);
> +    if (!region) {
> +        goto out;
> +    }
> +    if (is_write && region->readonly) {
> +        goto out;
> +    }
> +    offset_within_region = phys - region->guest_addr;
> +    if (offset_within_region + len <= region->size) {
> +        host_addr = region->host_addr + offset_within_region;
> +    }
> +out:
> +    qemu_mutex_unlock(&hostmem->current_regions_lock);
> +
> +    return host_addr;
> +}
> +
> +/**
> + * Install new regions list
> + */
> +static void hostmem_listener_commit(MemoryListener *listener)
> +{
> +    Hostmem *hostmem = container_of(listener, Hostmem, listener);
> +
> +    qemu_mutex_lock(&hostmem->current_regions_lock);
> +    g_free(hostmem->current_regions);
> +    hostmem->current_regions = hostmem->new_regions;
> +    hostmem->num_current_regions = hostmem->num_new_regions;
> +    qemu_mutex_unlock(&hostmem->current_regions_lock);
> +
> +    /* Reset new regions list */
> +    hostmem->new_regions = NULL;
> +    hostmem->num_new_regions = 0;
> +}
> +
> +/**
> + * Add a MemoryRegionSection to the new regions list
> + */
> +static void hostmem_append_new_region(Hostmem *hostmem,
> +                                      MemoryRegionSection *section)
> +{
> +    void *ram_ptr = memory_region_get_ram_ptr(section->mr);
> +    size_t num = hostmem->num_new_regions;
> +    size_t new_size = (num + 1) * sizeof(hostmem->new_regions[0]);
> +
> +    hostmem->new_regions = g_realloc(hostmem->new_regions, new_size);
> +    hostmem->new_regions[num] = (HostmemRegion){
> +        .host_addr = ram_ptr + section->offset_within_region,
> +        .guest_addr = section->offset_within_address_space,
> +        .size = section->size,
> +        .readonly = section->readonly,
> +    };
> +    hostmem->num_new_regions++;
> +}
> +
> +static void hostmem_listener_append_region(MemoryListener *listener,
> +                                           MemoryRegionSection *section)
> +{
> +    Hostmem *hostmem = container_of(listener, Hostmem, listener);
> +
> +    if (memory_region_is_ram(section->mr)) {
> +        hostmem_append_new_region(hostmem, section);
> +    }
> +}
> +
> +/* We don't implement most MemoryListener callbacks, use these nop stubs */
> +static void hostmem_listener_dummy(MemoryListener *listener)
> +{
> +}
> +
> +static void hostmem_listener_section_dummy(MemoryListener *listener,
> +                                           MemoryRegionSection *section)
> +{
> +}
> +
> +static void hostmem_listener_eventfd_dummy(MemoryListener *listener,
> +                                           MemoryRegionSection *section,
> +                                           bool match_data, uint64_t data,
> +                                           EventNotifier *e)
> +{
> +}
> +
> +static void hostmem_listener_coalesced_mmio_dummy(MemoryListener *listener,
> +                                                  MemoryRegionSection *section,
> +                                                  hwaddr addr, hwaddr len)
> +{
> +}
> +
> +void hostmem_init(Hostmem *hostmem)
> +{
> +    memset(hostmem, 0, sizeof(*hostmem));
> +
> +    hostmem->listener = (MemoryListener){
> +        .begin = hostmem_listener_dummy,
> +        .commit = hostmem_listener_commit,
> +        .region_add = hostmem_listener_append_region,
> +        .region_del = hostmem_listener_section_dummy,
> +        .region_nop = hostmem_listener_append_region,
> +        .log_start = hostmem_listener_section_dummy,
> +        .log_stop = hostmem_listener_section_dummy,
> +        .log_sync = hostmem_listener_section_dummy,
> +        .log_global_start = hostmem_listener_dummy,
> +        .log_global_stop = hostmem_listener_dummy,
> +        .eventfd_add = hostmem_listener_eventfd_dummy,
> +        .eventfd_del = hostmem_listener_eventfd_dummy,
> +        .coalesced_mmio_add = hostmem_listener_coalesced_mmio_dummy,
> +        .coalesced_mmio_del = hostmem_listener_coalesced_mmio_dummy,
> +        .priority = 10,
> +    };
> +
> +    memory_listener_register(&hostmem->listener, &address_space_memory);
> +    if (hostmem->num_new_regions > 0) {
> +        hostmem_listener_commit(&hostmem->listener);
> +    }
> +}
> +
> +void hostmem_finalize(Hostmem *hostmem)
> +{
> +    memory_listener_unregister(&hostmem->listener);
> +    g_free(hostmem->new_regions);
> +    g_free(hostmem->current_regions);
> +}
> diff --git a/hw/dataplane/hostmem.h b/hw/dataplane/hostmem.h
> new file mode 100644
> index 0000000..a833b74
> --- /dev/null
> +++ b/hw/dataplane/hostmem.h
> @@ -0,0 +1,52 @@
> +/*
> + * Thread-safe guest to host memory mapping
> + *
> + * Copyright 2012 Red Hat, Inc. and/or its affiliates
> + *
> + * Authors:
> + *   Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef HOSTMEM_H
> +#define HOSTMEM_H
> +
> +#include "memory.h"
> +#include "qemu-thread.h"
> +
> +typedef struct {
> +    void *host_addr;
> +    hwaddr guest_addr;
> +    uint64_t size;
> +    bool readonly;
> +} HostmemRegion;
> +
> +typedef struct {
> +    /* The listener is invoked when regions change and a new list of regions is
> +     * built up completely before they are installed.
> +     */
> +    MemoryListener listener;
> +    HostmemRegion *new_regions;
> +    size_t num_new_regions;
> +
> +    /* Current regions are accessed from multiple threads either to lookup
> +     * addresses or to install a new list of regions.  The lock protects the
> +     * pointer and the regions.
> +     */
> +    QemuMutex current_regions_lock;
> +    HostmemRegion *current_regions;
> +    size_t num_current_regions;
> +} Hostmem;
> +
> +void hostmem_init(Hostmem *hostmem);
> +void hostmem_finalize(Hostmem *hostmem);
> +
> +/**
> + * Map a guest physical address to a pointer
> + */
> +void *hostmem_lookup(Hostmem *hostmem, hwaddr phys, hwaddr len, bool is_write);
> +
> +#endif /* HOSTMEM_H */
> -- 
> 1.8.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code
  2012-11-29 12:33   ` Michael S. Tsirkin
@ 2012-11-29 12:45     ` Stefan Hajnoczi
  2012-11-29 12:54       ` Michael S. Tsirkin
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-29 12:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 02:33:11PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 22, 2012 at 04:16:44PM +0100, Stefan Hajnoczi wrote:
> > The data plane thread needs to map guest physical addresses to host
> > pointers.  Normally this is done with cpu_physical_memory_map() but the
> > function assumes the global mutex is held.  The data plane thread does
> > not touch the global mutex and therefore needs a thread-safe memory
> > mapping mechanism.
> > 
> > Hostmem registers a MemoryListener similar to how vhost collects and
> > pushes memory region information into the kernel.  There is a
> > fine-grained lock on the regions list which is held during lookup and
> > when installing a new regions list.
> > 
> > When the physical memory map changes the MemoryListener callbacks are
> > invoked.  They build up a new list of memory regions which is finally
> > installed when the list has been completed.
> > 
> > Note that this approach is not safe across memory hotplug because mapped
> > pointers may still be in used across memory unplug.  However, this is
> > currently a problem for QEMU in general and needs to be addressed in the
> > future.
> 
> Sounds like a serious problem.
> I'm not sure I understand - do you say this currently a problem for QEMU
> virtio? Coul you give an example please?

This is a limitation of the memory API but cannot be triggered by users
today since we don't support memory hot unplug.  I'm just explaining
that virtio-blk-data-plane has the same issue as hw/virtio-blk.c or any
other device emulation code here.

Some more detail:

The issue is that hw/virtio-blk.c submits an asynchronous I/O request on
the host with the guest buffer.  Then virtio-blk emulation returns back
to the caller and continues QEMU execution.

It is unsafe to unplug memory while the I/O request is pending since
there's no mechanism (e.g. refcount) to wait until the guest memory is
no longer in use.

This is a known issue.  There's no way to trigger a problem today but we
need to eventually enhance QEMU's memory API to handle this case.

Stefan

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code
  2012-11-29 12:45     ` Stefan Hajnoczi
@ 2012-11-29 12:54       ` Michael S. Tsirkin
  2012-11-29 12:57         ` Michael S. Tsirkin
  0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 12:54 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 01:45:19PM +0100, Stefan Hajnoczi wrote:
> On Thu, Nov 29, 2012 at 02:33:11PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Nov 22, 2012 at 04:16:44PM +0100, Stefan Hajnoczi wrote:
> > > The data plane thread needs to map guest physical addresses to host
> > > pointers.  Normally this is done with cpu_physical_memory_map() but the
> > > function assumes the global mutex is held.  The data plane thread does
> > > not touch the global mutex and therefore needs a thread-safe memory
> > > mapping mechanism.
> > > 
> > > Hostmem registers a MemoryListener similar to how vhost collects and
> > > pushes memory region information into the kernel.  There is a
> > > fine-grained lock on the regions list which is held during lookup and
> > > when installing a new regions list.
> > > 
> > > When the physical memory map changes the MemoryListener callbacks are
> > > invoked.  They build up a new list of memory regions which is finally
> > > installed when the list has been completed.
> > > 
> > > Note that this approach is not safe across memory hotplug because mapped
> > > pointers may still be in used across memory unplug.  However, this is
> > > currently a problem for QEMU in general and needs to be addressed in the
> > > future.
> > 
> > Sounds like a serious problem.
> > I'm not sure I understand - do you say this currently a problem for QEMU
> > virtio? Coul you give an example please?
> 
> This is a limitation of the memory API but cannot be triggered by users
> today since we don't support memory hot unplug.  I'm just explaining
> that virtio-blk-data-plane has the same issue as hw/virtio-blk.c or any
> other device emulation code here.
> 
> Some more detail:
> 
> The issue is that hw/virtio-blk.c submits an asynchronous I/O request on
> the host with the guest buffer.  Then virtio-blk emulation returns back
> to the caller and continues QEMU execution.
> 
> It is unsafe to unplug memory while the I/O request is pending since
> there's no mechanism (e.g. refcount) to wait until the guest memory is
> no longer in use.
> 
> This is a known issue.  There's no way to trigger a problem today but we
> need to eventually enhance QEMU's memory API to handle this case.
> 
> Stefan

For this problem we would simply need to flush outstanding aio
before freeing memory for unplug, no refcount necessary.

This patch however introduces the issue in the frontend
and it looks like there won't be any way to solve
it without changing the API.

-- 
MST

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code
  2012-11-29 12:54       ` Michael S. Tsirkin
@ 2012-11-29 12:57         ` Michael S. Tsirkin
  2012-12-05  8:13           ` Stefan Hajnoczi
  0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 12:57 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 02:54:26PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 29, 2012 at 01:45:19PM +0100, Stefan Hajnoczi wrote:
> > On Thu, Nov 29, 2012 at 02:33:11PM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Nov 22, 2012 at 04:16:44PM +0100, Stefan Hajnoczi wrote:
> > > > The data plane thread needs to map guest physical addresses to host
> > > > pointers.  Normally this is done with cpu_physical_memory_map() but the
> > > > function assumes the global mutex is held.  The data plane thread does
> > > > not touch the global mutex and therefore needs a thread-safe memory
> > > > mapping mechanism.
> > > > 
> > > > Hostmem registers a MemoryListener similar to how vhost collects and
> > > > pushes memory region information into the kernel.  There is a
> > > > fine-grained lock on the regions list which is held during lookup and
> > > > when installing a new regions list.
> > > > 
> > > > When the physical memory map changes the MemoryListener callbacks are
> > > > invoked.  They build up a new list of memory regions which is finally
> > > > installed when the list has been completed.
> > > > 
> > > > Note that this approach is not safe across memory hotplug because mapped
> > > > pointers may still be in used across memory unplug.  However, this is
> > > > currently a problem for QEMU in general and needs to be addressed in the
> > > > future.
> > > 
> > > Sounds like a serious problem.
> > > I'm not sure I understand - do you say this currently a problem for QEMU
> > > virtio? Coul you give an example please?
> > 
> > This is a limitation of the memory API but cannot be triggered by users
> > today since we don't support memory hot unplug.  I'm just explaining
> > that virtio-blk-data-plane has the same issue as hw/virtio-blk.c or any
> > other device emulation code here.
> > 
> > Some more detail:
> > 
> > The issue is that hw/virtio-blk.c submits an asynchronous I/O request on
> > the host with the guest buffer.  Then virtio-blk emulation returns back
> > to the caller and continues QEMU execution.
> > 
> > It is unsafe to unplug memory while the I/O request is pending since
> > there's no mechanism (e.g. refcount) to wait until the guest memory is
> > no longer in use.
> > 
> > This is a known issue.  There's no way to trigger a problem today but we
> > need to eventually enhance QEMU's memory API to handle this case.
> > 
> > Stefan
> 
> For this problem we would simply need to flush outstanding aio
> before freeing memory for unplug, no refcount necessary.
> 
> This patch however introduces the issue in the frontend
> and it looks like there won't be any way to solve
> it without changing the API.

To clarify, as you say it is not triggerable
so I don't think this is strictly required to address
this at this point though it should not be too hard:
just register callback that flushes the frontend processing.

But if you can't code it at this point, please add
a TODO comment in code.

> -- 
> MST

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code
  2012-11-29 12:57         ` Michael S. Tsirkin
@ 2012-12-05  8:13           ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-12-05  8:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 02:57:05PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 29, 2012 at 02:54:26PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Nov 29, 2012 at 01:45:19PM +0100, Stefan Hajnoczi wrote:
> > > On Thu, Nov 29, 2012 at 02:33:11PM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Nov 22, 2012 at 04:16:44PM +0100, Stefan Hajnoczi wrote:
> > > > > The data plane thread needs to map guest physical addresses to host
> > > > > pointers.  Normally this is done with cpu_physical_memory_map() but the
> > > > > function assumes the global mutex is held.  The data plane thread does
> > > > > not touch the global mutex and therefore needs a thread-safe memory
> > > > > mapping mechanism.
> > > > > 
> > > > > Hostmem registers a MemoryListener similar to how vhost collects and
> > > > > pushes memory region information into the kernel.  There is a
> > > > > fine-grained lock on the regions list which is held during lookup and
> > > > > when installing a new regions list.
> > > > > 
> > > > > When the physical memory map changes the MemoryListener callbacks are
> > > > > invoked.  They build up a new list of memory regions which is finally
> > > > > installed when the list has been completed.
> > > > > 
> > > > > Note that this approach is not safe across memory hotplug because mapped
> > > > > pointers may still be in used across memory unplug.  However, this is
> > > > > currently a problem for QEMU in general and needs to be addressed in the
> > > > > future.
> > > > 
> > > > Sounds like a serious problem.
> > > > I'm not sure I understand - do you say this currently a problem for QEMU
> > > > virtio? Coul you give an example please?
> > > 
> > > This is a limitation of the memory API but cannot be triggered by users
> > > today since we don't support memory hot unplug.  I'm just explaining
> > > that virtio-blk-data-plane has the same issue as hw/virtio-blk.c or any
> > > other device emulation code here.
> > > 
> > > Some more detail:
> > > 
> > > The issue is that hw/virtio-blk.c submits an asynchronous I/O request on
> > > the host with the guest buffer.  Then virtio-blk emulation returns back
> > > to the caller and continues QEMU execution.
> > > 
> > > It is unsafe to unplug memory while the I/O request is pending since
> > > there's no mechanism (e.g. refcount) to wait until the guest memory is
> > > no longer in use.
> > > 
> > > This is a known issue.  There's no way to trigger a problem today but we
> > > need to eventually enhance QEMU's memory API to handle this case.
> > > 
> > > Stefan
> > 
> > For this problem we would simply need to flush outstanding aio
> > before freeing memory for unplug, no refcount necessary.
> > 
> > This patch however introduces the issue in the frontend
> > and it looks like there won't be any way to solve
> > it without changing the API.
> 
> To clarify, as you say it is not triggerable
> so I don't think this is strictly required to address
> this at this point though it should not be too hard:
> just register callback that flushes the frontend processing.
> 
> But if you can't code it at this point, please add
> a TODO comment in code.

Yes, I'm adding a TODO and your suggestion to flush the frontend sounds
like a simple solution - we already quiesce at other critical points
like live migration.

Stefan

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code Stefan Hajnoczi
  2012-11-29 12:33   ` Michael S. Tsirkin
@ 2012-11-29 13:54   ` Michael S. Tsirkin
  2012-11-29 14:26     ` Stefan Hajnoczi
  1 sibling, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 13:54 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Paolo Bonzini, Asias He

On Thu, Nov 22, 2012 at 04:16:44PM +0100, Stefan Hajnoczi wrote:
> The data plane thread needs to map guest physical addresses to host
> pointers.  Normally this is done with cpu_physical_memory_map() but the
> function assumes the global mutex is held.  The data plane thread does
> not touch the global mutex and therefore needs a thread-safe memory
> mapping mechanism.
> 
> Hostmem registers a MemoryListener similar to how vhost collects and
> pushes memory region information into the kernel.  There is a
> fine-grained lock on the regions list which is held during lookup and
> when installing a new regions list.
> 
> When the physical memory map changes the MemoryListener callbacks are
> invoked.  They build up a new list of memory regions which is finally
> installed when the list has been completed.
> 
> Note that this approach is not safe across memory hotplug because mapped
> pointers may still be in used across memory unplug.  However, this is
> currently a problem for QEMU in general and needs to be addressed in the
> future.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Worth bothering with binary search?
vhost does a linear search over regions because
the number of ram regions is very small.

> ---
>  hw/dataplane/Makefile.objs |   3 +
>  hw/dataplane/hostmem.c     | 165 +++++++++++++++++++++++++++++++++++++++++++++
>  hw/dataplane/hostmem.h     |  52 ++++++++++++++
>  3 files changed, 220 insertions(+)
>  create mode 100644 hw/dataplane/Makefile.objs
>  create mode 100644 hw/dataplane/hostmem.c
>  create mode 100644 hw/dataplane/hostmem.h
> 
> diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
> new file mode 100644
> index 0000000..8c8dea1
> --- /dev/null
> +++ b/hw/dataplane/Makefile.objs
> @@ -0,0 +1,3 @@
> +ifeq ($(CONFIG_VIRTIO), y)
> +common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o
> +endif
> diff --git a/hw/dataplane/hostmem.c b/hw/dataplane/hostmem.c
> new file mode 100644
> index 0000000..48aabf0
> --- /dev/null
> +++ b/hw/dataplane/hostmem.c
> @@ -0,0 +1,165 @@
> +/*
> + * Thread-safe guest to host memory mapping
> + *
> + * Copyright 2012 Red Hat, Inc. and/or its affiliates
> + *
> + * Authors:
> + *   Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "exec-memory.h"
> +#include "hostmem.h"
> +
> +static int hostmem_lookup_cmp(const void *phys_, const void *region_)
> +{
> +    hwaddr phys = *(const hwaddr *)phys_;
> +    const HostmemRegion *region = region_;
> +
> +    if (phys < region->guest_addr) {
> +        return -1;
> +    } else if (phys >= region->guest_addr + region->size) {
> +        return 1;
> +    } else {
> +        return 0;
> +    }
> +}
> +
> +/**
> + * Map guest physical address to host pointer
> + */
> +void *hostmem_lookup(Hostmem *hostmem, hwaddr phys, hwaddr len, bool is_write)
> +{
> +    HostmemRegion *region;
> +    void *host_addr = NULL;
> +    hwaddr offset_within_region;
> +
> +    qemu_mutex_lock(&hostmem->current_regions_lock);
> +    region = bsearch(&phys, hostmem->current_regions,
> +                     hostmem->num_current_regions,
> +                     sizeof(hostmem->current_regions[0]),
> +                     hostmem_lookup_cmp);
> +    if (!region) {
> +        goto out;
> +    }
> +    if (is_write && region->readonly) {
> +        goto out;
> +    }
> +    offset_within_region = phys - region->guest_addr;
> +    if (offset_within_region + len <= region->size) {
> +        host_addr = region->host_addr + offset_within_region;
> +    }
> +out:
> +    qemu_mutex_unlock(&hostmem->current_regions_lock);
> +
> +    return host_addr;
> +}
> +
> +/**
> + * Install new regions list
> + */
> +static void hostmem_listener_commit(MemoryListener *listener)
> +{
> +    Hostmem *hostmem = container_of(listener, Hostmem, listener);
> +
> +    qemu_mutex_lock(&hostmem->current_regions_lock);
> +    g_free(hostmem->current_regions);
> +    hostmem->current_regions = hostmem->new_regions;
> +    hostmem->num_current_regions = hostmem->num_new_regions;
> +    qemu_mutex_unlock(&hostmem->current_regions_lock);
> +
> +    /* Reset new regions list */
> +    hostmem->new_regions = NULL;
> +    hostmem->num_new_regions = 0;
> +}
> +
> +/**
> + * Add a MemoryRegionSection to the new regions list
> + */
> +static void hostmem_append_new_region(Hostmem *hostmem,
> +                                      MemoryRegionSection *section)
> +{
> +    void *ram_ptr = memory_region_get_ram_ptr(section->mr);
> +    size_t num = hostmem->num_new_regions;
> +    size_t new_size = (num + 1) * sizeof(hostmem->new_regions[0]);
> +
> +    hostmem->new_regions = g_realloc(hostmem->new_regions, new_size);
> +    hostmem->new_regions[num] = (HostmemRegion){
> +        .host_addr = ram_ptr + section->offset_within_region,
> +        .guest_addr = section->offset_within_address_space,
> +        .size = section->size,
> +        .readonly = section->readonly,
> +    };
> +    hostmem->num_new_regions++;
> +}
> +
> +static void hostmem_listener_append_region(MemoryListener *listener,
> +                                           MemoryRegionSection *section)
> +{
> +    Hostmem *hostmem = container_of(listener, Hostmem, listener);
> +
> +    if (memory_region_is_ram(section->mr)) {
> +        hostmem_append_new_region(hostmem, section);
> +    }

I think you also need to remove VGA region since you
don't mark these pages as dirty so access there won't work.

> +}
> +
> +/* We don't implement most MemoryListener callbacks, use these nop stubs */
> +static void hostmem_listener_dummy(MemoryListener *listener)
> +{
> +}
> +
> +static void hostmem_listener_section_dummy(MemoryListener *listener,
> +                                           MemoryRegionSection *section)
> +{
> +}
> +
> +static void hostmem_listener_eventfd_dummy(MemoryListener *listener,
> +                                           MemoryRegionSection *section,
> +                                           bool match_data, uint64_t data,
> +                                           EventNotifier *e)
> +{
> +}
> +
> +static void hostmem_listener_coalesced_mmio_dummy(MemoryListener *listener,
> +                                                  MemoryRegionSection *section,
> +                                                  hwaddr addr, hwaddr len)
> +{
> +}
> +
> +void hostmem_init(Hostmem *hostmem)
> +{
> +    memset(hostmem, 0, sizeof(*hostmem));
> +
> +    hostmem->listener = (MemoryListener){
> +        .begin = hostmem_listener_dummy,
> +        .commit = hostmem_listener_commit,
> +        .region_add = hostmem_listener_append_region,
> +        .region_del = hostmem_listener_section_dummy,
> +        .region_nop = hostmem_listener_append_region,
> +        .log_start = hostmem_listener_section_dummy,
> +        .log_stop = hostmem_listener_section_dummy,
> +        .log_sync = hostmem_listener_section_dummy,
> +        .log_global_start = hostmem_listener_dummy,
> +        .log_global_stop = hostmem_listener_dummy,
> +        .eventfd_add = hostmem_listener_eventfd_dummy,
> +        .eventfd_del = hostmem_listener_eventfd_dummy,
> +        .coalesced_mmio_add = hostmem_listener_coalesced_mmio_dummy,
> +        .coalesced_mmio_del = hostmem_listener_coalesced_mmio_dummy,
> +        .priority = 10,
> +    };
> +
> +    memory_listener_register(&hostmem->listener, &address_space_memory);
> +    if (hostmem->num_new_regions > 0) {
> +        hostmem_listener_commit(&hostmem->listener);
> +    }
> +}
> +
> +void hostmem_finalize(Hostmem *hostmem)
> +{
> +    memory_listener_unregister(&hostmem->listener);
> +    g_free(hostmem->new_regions);
> +    g_free(hostmem->current_regions);
> +}
> diff --git a/hw/dataplane/hostmem.h b/hw/dataplane/hostmem.h
> new file mode 100644
> index 0000000..a833b74
> --- /dev/null
> +++ b/hw/dataplane/hostmem.h
> @@ -0,0 +1,52 @@
> +/*
> + * Thread-safe guest to host memory mapping
> + *
> + * Copyright 2012 Red Hat, Inc. and/or its affiliates
> + *
> + * Authors:
> + *   Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef HOSTMEM_H
> +#define HOSTMEM_H
> +
> +#include "memory.h"
> +#include "qemu-thread.h"
> +
> +typedef struct {
> +    void *host_addr;
> +    hwaddr guest_addr;
> +    uint64_t size;
> +    bool readonly;
> +} HostmemRegion;
> +
> +typedef struct {
> +    /* The listener is invoked when regions change and a new list of regions is
> +     * built up completely before they are installed.
> +     */
> +    MemoryListener listener;
> +    HostmemRegion *new_regions;
> +    size_t num_new_regions;
> +
> +    /* Current regions are accessed from multiple threads either to lookup
> +     * addresses or to install a new list of regions.  The lock protects the
> +     * pointer and the regions.
> +     */
> +    QemuMutex current_regions_lock;
> +    HostmemRegion *current_regions;
> +    size_t num_current_regions;
> +} Hostmem;
> +
> +void hostmem_init(Hostmem *hostmem);
> +void hostmem_finalize(Hostmem *hostmem);
> +
> +/**
> + * Map a guest physical address to a pointer
> + */
> +void *hostmem_lookup(Hostmem *hostmem, hwaddr phys, hwaddr len, bool is_write);
> +
> +#endif /* HOSTMEM_H */
> -- 
> 1.8.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code
  2012-11-29 13:54   ` Michael S. Tsirkin
@ 2012-11-29 14:26     ` Stefan Hajnoczi
  2012-11-29 14:36       ` Michael S. Tsirkin
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-29 14:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 03:54:25PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 22, 2012 at 04:16:44PM +0100, Stefan Hajnoczi wrote:
> > The data plane thread needs to map guest physical addresses to host
> > pointers.  Normally this is done with cpu_physical_memory_map() but the
> > function assumes the global mutex is held.  The data plane thread does
> > not touch the global mutex and therefore needs a thread-safe memory
> > mapping mechanism.
> > 
> > Hostmem registers a MemoryListener similar to how vhost collects and
> > pushes memory region information into the kernel.  There is a
> > fine-grained lock on the regions list which is held during lookup and
> > when installing a new regions list.
> > 
> > When the physical memory map changes the MemoryListener callbacks are
> > invoked.  They build up a new list of memory regions which is finally
> > installed when the list has been completed.
> > 
> > Note that this approach is not safe across memory hotplug because mapped
> > pointers may still be in used across memory unplug.  However, this is
> > currently a problem for QEMU in general and needs to be addressed in the
> > future.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Worth bothering with binary search?
> vhost does a linear search over regions because
> the number of ram regions is very small.

memory.c does binary search.  I did the same but in practice there are
<20 regions for a simple VM.  It's probably not worth it but without
performance results this is speculation.

I think there's no harm in using binary search to start with.

> > +static void hostmem_listener_append_region(MemoryListener *listener,
> > +                                           MemoryRegionSection *section)
> > +{
> > +    Hostmem *hostmem = container_of(listener, Hostmem, listener);
> > +
> > +    if (memory_region_is_ram(section->mr)) {
> > +        hostmem_append_new_region(hostmem, section);
> > +    }
> 
> I think you also need to remove VGA region since you
> don't mark these pages as dirty so access there won't work.

I don't understand.  If memory in the VGA region returns true from
memory_region_is_ram(), why would there be a problem?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code
  2012-11-29 14:26     ` Stefan Hajnoczi
@ 2012-11-29 14:36       ` Michael S. Tsirkin
  2012-11-29 15:26         ` Paolo Bonzini
  2012-12-05  8:31         ` Stefan Hajnoczi
  0 siblings, 2 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 14:36 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 03:26:56PM +0100, Stefan Hajnoczi wrote:
> On Thu, Nov 29, 2012 at 03:54:25PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Nov 22, 2012 at 04:16:44PM +0100, Stefan Hajnoczi wrote:
> > > The data plane thread needs to map guest physical addresses to host
> > > pointers.  Normally this is done with cpu_physical_memory_map() but the
> > > function assumes the global mutex is held.  The data plane thread does
> > > not touch the global mutex and therefore needs a thread-safe memory
> > > mapping mechanism.
> > > 
> > > Hostmem registers a MemoryListener similar to how vhost collects and
> > > pushes memory region information into the kernel.  There is a
> > > fine-grained lock on the regions list which is held during lookup and
> > > when installing a new regions list.
> > > 
> > > When the physical memory map changes the MemoryListener callbacks are
> > > invoked.  They build up a new list of memory regions which is finally
> > > installed when the list has been completed.
> > > 
> > > Note that this approach is not safe across memory hotplug because mapped
> > > pointers may still be in used across memory unplug.  However, this is
> > > currently a problem for QEMU in general and needs to be addressed in the
> > > future.
> > > 
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > 
> > Worth bothering with binary search?
> > vhost does a linear search over regions because
> > the number of ram regions is very small.
> 
> memory.c does binary search.  I did the same but in practice there are
> <20 regions for a simple VM.  It's probably not worth it but without
> performance results this is speculation.
> 
> I think there's no harm in using binary search to start with.
> 
> > > +static void hostmem_listener_append_region(MemoryListener *listener,
> > > +                                           MemoryRegionSection *section)
> > > +{
> > > +    Hostmem *hostmem = container_of(listener, Hostmem, listener);
> > > +
> > > +    if (memory_region_is_ram(section->mr)) {
> > > +        hostmem_append_new_region(hostmem, section);
> > > +    }
> > 
> > I think you also need to remove VGA region since you
> > don't mark these pages as dirty so access there won't work.
> 
> I don't understand.  If memory in the VGA region returns true from
> memory_region_is_ram(), why would there be a problem?

If you change this memory but you don't update the display.
Never happens with non buggy guests but we should catch and fail if it does.

-- 
MST

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code
  2012-11-29 14:36       ` Michael S. Tsirkin
@ 2012-11-29 15:26         ` Paolo Bonzini
  2012-12-05  8:31         ` Stefan Hajnoczi
  1 sibling, 0 replies; 43+ messages in thread
From: Paolo Bonzini @ 2012-11-29 15:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, qemu-devel,
	Blue Swirl, khoa, Stefan Hajnoczi, Asias He

> > I don't understand.  If memory in the VGA region returns true from
> > memory_region_is_ram(), why would there be a problem?
> 
> If you change this memory but you don't update the display.
> Never happens with non buggy guests but we should catch and fail if
> it does.

Actually it _could_ happen with an MS-DOS guest that does the oh-so-retro

    DEF SEG=&HB800
    BLOAD "file.pic", 0

(I think).

Paolo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code
  2012-11-29 14:36       ` Michael S. Tsirkin
  2012-11-29 15:26         ` Paolo Bonzini
@ 2012-12-05  8:31         ` Stefan Hajnoczi
  2012-12-05 11:22           ` Michael S. Tsirkin
  1 sibling, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-12-05  8:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 04:36:08PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 29, 2012 at 03:26:56PM +0100, Stefan Hajnoczi wrote:
> > On Thu, Nov 29, 2012 at 03:54:25PM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Nov 22, 2012 at 04:16:44PM +0100, Stefan Hajnoczi wrote:
> > > > The data plane thread needs to map guest physical addresses to host
> > > > pointers.  Normally this is done with cpu_physical_memory_map() but the
> > > > function assumes the global mutex is held.  The data plane thread does
> > > > not touch the global mutex and therefore needs a thread-safe memory
> > > > mapping mechanism.
> > > > 
> > > > Hostmem registers a MemoryListener similar to how vhost collects and
> > > > pushes memory region information into the kernel.  There is a
> > > > fine-grained lock on the regions list which is held during lookup and
> > > > when installing a new regions list.
> > > > 
> > > > When the physical memory map changes the MemoryListener callbacks are
> > > > invoked.  They build up a new list of memory regions which is finally
> > > > installed when the list has been completed.
> > > > 
> > > > Note that this approach is not safe across memory hotplug because mapped
> > > > pointers may still be in used across memory unplug.  However, this is
> > > > currently a problem for QEMU in general and needs to be addressed in the
> > > > future.
> > > > 
> > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > 
> > > Worth bothering with binary search?
> > > vhost does a linear search over regions because
> > > the number of ram regions is very small.
> > 
> > memory.c does binary search.  I did the same but in practice there are
> > <20 regions for a simple VM.  It's probably not worth it but without
> > performance results this is speculation.
> > 
> > I think there's no harm in using binary search to start with.
> > 
> > > > +static void hostmem_listener_append_region(MemoryListener *listener,
> > > > +                                           MemoryRegionSection *section)
> > > > +{
> > > > +    Hostmem *hostmem = container_of(listener, Hostmem, listener);
> > > > +
> > > > +    if (memory_region_is_ram(section->mr)) {
> > > > +        hostmem_append_new_region(hostmem, section);
> > > > +    }
> > > 
> > > I think you also need to remove VGA region since you
> > > don't mark these pages as dirty so access there won't work.
> > 
> > I don't understand.  If memory in the VGA region returns true from
> > memory_region_is_ram(), why would there be a problem?
> 
> If you change this memory but you don't update the display.
> Never happens with non buggy guests but we should catch and fail if it does.

Okay, I took a look at the VGA code and I think it makes sense now.  We
have VRAM as a regular RAM region so that writes to it are cheap.  To
avoid scanning or redrawing VRAM on every update we use dirty logging.

Since virtio-blk-data-plane does not mark pages dirty an I/O buffer in
VRAM would fail to update the display correctly.

I will try to put in a check to omit the VGA region.  It can be dropped
in the future when we use the memory API with dirty logging from the
data plane thread.

Stefan

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code
  2012-12-05  8:31         ` Stefan Hajnoczi
@ 2012-12-05 11:22           ` Michael S. Tsirkin
  0 siblings, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-12-05 11:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Wed, Dec 05, 2012 at 09:31:56AM +0100, Stefan Hajnoczi wrote:
> On Thu, Nov 29, 2012 at 04:36:08PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Nov 29, 2012 at 03:26:56PM +0100, Stefan Hajnoczi wrote:
> > > On Thu, Nov 29, 2012 at 03:54:25PM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Nov 22, 2012 at 04:16:44PM +0100, Stefan Hajnoczi wrote:
> > > > > The data plane thread needs to map guest physical addresses to host
> > > > > pointers.  Normally this is done with cpu_physical_memory_map() but the
> > > > > function assumes the global mutex is held.  The data plane thread does
> > > > > not touch the global mutex and therefore needs a thread-safe memory
> > > > > mapping mechanism.
> > > > > 
> > > > > Hostmem registers a MemoryListener similar to how vhost collects and
> > > > > pushes memory region information into the kernel.  There is a
> > > > > fine-grained lock on the regions list which is held during lookup and
> > > > > when installing a new regions list.
> > > > > 
> > > > > When the physical memory map changes the MemoryListener callbacks are
> > > > > invoked.  They build up a new list of memory regions which is finally
> > > > > installed when the list has been completed.
> > > > > 
> > > > > Note that this approach is not safe across memory hotplug because mapped
> > > > > pointers may still be in used across memory unplug.  However, this is
> > > > > currently a problem for QEMU in general and needs to be addressed in the
> > > > > future.
> > > > > 
> > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > 
> > > > Worth bothering with binary search?
> > > > vhost does a linear search over regions because
> > > > the number of ram regions is very small.
> > > 
> > > memory.c does binary search.  I did the same but in practice there are
> > > <20 regions for a simple VM.  It's probably not worth it but without
> > > performance results this is speculation.
> > > 
> > > I think there's no harm in using binary search to start with.
> > > 
> > > > > +static void hostmem_listener_append_region(MemoryListener *listener,
> > > > > +                                           MemoryRegionSection *section)
> > > > > +{
> > > > > +    Hostmem *hostmem = container_of(listener, Hostmem, listener);
> > > > > +
> > > > > +    if (memory_region_is_ram(section->mr)) {
> > > > > +        hostmem_append_new_region(hostmem, section);
> > > > > +    }
> > > > 
> > > > I think you also need to remove VGA region since you
> > > > don't mark these pages as dirty so access there won't work.
> > > 
> > > I don't understand.  If memory in the VGA region returns true from
> > > memory_region_is_ram(), why would there be a problem?
> > 
> > If you change this memory but you don't update the display.
> > Never happens with non buggy guests but we should catch and fail if it does.
> 
> Okay, I took a look at the VGA code and I think it makes sense now.  We
> have VRAM as a regular RAM region so that writes to it are cheap.  To
> avoid scanning or redrawing VRAM on every update we use dirty logging.
> 
> Since virtio-blk-data-plane does not mark pages dirty an I/O buffer in
> VRAM would fail to update the display correctly.
> 
> I will try to put in a check to omit the VGA region.

There are many ways to do this but I guess the simplest
is to detect dirty logging and skip that region.

>  It can be dropped
> in the future when we use the memory API with dirty logging from the
> data plane thread.
> 
> Stefan

Or not - there's also the issue that e.g. cirrus doing tricks
with memory mapping at data path. So skipping
that region might help performance.

-- 
MST

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v4 04/11] dataplane: add virtqueue vring code
  2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code Stefan Hajnoczi
@ 2012-11-22 15:16 ` Stefan Hajnoczi
  2012-11-29 12:50   ` Michael S. Tsirkin
  2012-11-29 13:48   ` Michael S. Tsirkin
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 05/11] dataplane: add event loop Stefan Hajnoczi
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The virtio-blk-data-plane cannot access memory using the usual QEMU
functions since it executes outside the global mutex and the memory APIs
are this time are not thread-safe.

This patch introduces a virtqueue module based on the kernel's vhost
vring code.  The trick is that we map guest memory ahead of time and
access it cheaply outside the global mutex.

Once the hardware emulation code can execute outside the global mutex it
will be possible to drop this code.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/Makefile.objs           |   2 +-
 hw/dataplane/Makefile.objs |   2 +-
 hw/dataplane/vring.c       | 344 +++++++++++++++++++++++++++++++++++++++++++++
 hw/dataplane/vring.h       |  62 ++++++++
 trace-events               |   3 +
 5 files changed, 411 insertions(+), 2 deletions(-)
 create mode 100644 hw/dataplane/vring.c
 create mode 100644 hw/dataplane/vring.h

diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index ea46f81..db87fbf 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -1,4 +1,4 @@
-common-obj-y = usb/ ide/
+common-obj-y = usb/ ide/ dataplane/
 common-obj-y += loader.o
 common-obj-$(CONFIG_VIRTIO) += virtio-console.o
 common-obj-$(CONFIG_VIRTIO) += virtio-rng.o
diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
index 8c8dea1..34e6d57 100644
--- a/hw/dataplane/Makefile.objs
+++ b/hw/dataplane/Makefile.objs
@@ -1,3 +1,3 @@
 ifeq ($(CONFIG_VIRTIO), y)
-common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o
+common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o
 endif
diff --git a/hw/dataplane/vring.c b/hw/dataplane/vring.c
new file mode 100644
index 0000000..2632fbd
--- /dev/null
+++ b/hw/dataplane/vring.c
@@ -0,0 +1,344 @@
+/* Copyright 2012 Red Hat, Inc.
+ * Copyright IBM, Corp. 2012
+ *
+ * Based on Linux vhost code:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Copyright (C) 2006 Rusty Russell IBM Corporation
+ *
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ *         Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * Inspiration, some code, and most witty comments come from
+ * Documentation/virtual/lguest/lguest.c, by Rusty Russell
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+
+#include "trace.h"
+#include "hw/dataplane/vring.h"
+
+/* Map the guest's vring to host memory */
+bool vring_setup(Vring *vring, VirtIODevice *vdev, int n)
+{
+    hwaddr vring_addr = virtio_queue_get_ring_addr(vdev, n);
+    hwaddr vring_size = virtio_queue_get_ring_size(vdev, n);
+    void *vring_ptr;
+
+    vring->broken = false;
+
+    hostmem_init(&vring->hostmem);
+    vring_ptr = hostmem_lookup(&vring->hostmem, vring_addr, vring_size, true);
+    if (!vring_ptr) {
+        error_report("Failed to map vring "
+                     "addr %#" HWADDR_PRIx " size %" HWADDR_PRIu,
+                     vring_addr, vring_size);
+        vring->broken = true;
+        return false;
+    }
+
+    vring_init(&vring->vr, virtio_queue_get_num(vdev, n), vring_ptr, 4096);
+
+    vring->last_avail_idx = 0;
+    vring->last_used_idx = 0;
+    vring->signalled_used = 0;
+    vring->signalled_used_valid = false;
+
+    trace_vring_setup(virtio_queue_get_ring_addr(vdev, n),
+                      vring->vr.desc, vring->vr.avail, vring->vr.used);
+    return true;
+}
+
+void vring_teardown(Vring *vring)
+{
+    hostmem_finalize(&vring->hostmem);
+}
+
+/* Toggle guest->host notifies */
+void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable)
+{
+    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
+        if (enable) {
+            vring_avail_event(&vring->vr) = vring->vr.avail->idx;
+        }
+    } else if (enable) {
+        vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY;
+    } else {
+        vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY;
+    }
+}
+
+/* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */
+bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
+{
+    uint16_t old, new;
+    bool v;
+    /* Flush out used index updates. This is paired
+     * with the barrier that the Guest executes when enabling
+     * interrupts. */
+    smp_mb();
+
+    if ((vdev->guest_features & VIRTIO_F_NOTIFY_ON_EMPTY) &&
+        unlikely(vring->vr.avail->idx == vring->last_avail_idx)) {
+        return true;
+    }
+
+    if (!(vdev->guest_features & VIRTIO_RING_F_EVENT_IDX)) {
+        return !(vring->vr.avail->flags & VRING_AVAIL_F_NO_INTERRUPT);
+    }
+    old = vring->signalled_used;
+    v = vring->signalled_used_valid;
+    new = vring->signalled_used = vring->last_used_idx;
+    vring->signalled_used_valid = true;
+
+    if (unlikely(!v)) {
+        return true;
+    }
+
+    return vring_need_event(vring_used_event(&vring->vr), new, old);
+}
+
+/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */
+static int get_indirect(Vring *vring,
+                        struct iovec iov[], struct iovec *iov_end,
+                        unsigned int *out_num, unsigned int *in_num,
+                        struct vring_desc *indirect)
+{
+    struct vring_desc desc;
+    unsigned int i = 0, count, found = 0;
+
+    /* Sanity check */
+    if (unlikely(indirect->len % sizeof desc)) {
+        error_report("Invalid length in indirect descriptor: "
+                     "len %#x not multiple of %#zx",
+                     indirect->len, sizeof desc);
+        vring->broken = true;
+        return -EFAULT;
+    }
+
+    count = indirect->len / sizeof desc;
+    /* Buffers are chained via a 16 bit next field, so
+     * we can have at most 2^16 of these. */
+    if (unlikely(count > USHRT_MAX + 1)) {
+        error_report("Indirect buffer length too big: %d", indirect->len);
+        vring->broken = true;
+        return -EFAULT;
+    }
+
+    /* Point to translate indirect desc chain */
+    indirect = hostmem_lookup(&vring->hostmem, indirect->addr, indirect->len,
+                              false);
+    if (!indirect) {
+        error_report("Failed to map indirect desc chain "
+                     "addr %#" PRIx64 " len %u",
+                     (uint64_t)indirect->addr, indirect->len);
+        vring->broken = true;
+        return -EFAULT;
+    }
+
+    /* We will use the result as an address to read from, so most
+     * architectures only need a compiler barrier here. */
+    barrier(); /* read_barrier_depends(); */
+
+    do {
+        if (unlikely(++found > count)) {
+            error_report("Loop detected: last one at %u "
+                         "indirect size %u", i, count);
+            vring->broken = true;
+            return -EFAULT;
+        }
+
+        desc = *indirect++;
+        if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
+            error_report("Nested indirect descriptor");
+            vring->broken = true;
+            return -EFAULT;
+        }
+
+        /* Stop for now if there are not enough iovecs available. */
+        if (iov >= iov_end) {
+            return -ENOBUFS;
+        }
+
+        iov->iov_base = hostmem_lookup(&vring->hostmem, desc.addr, desc.len,
+                                       desc.flags & VRING_DESC_F_WRITE);
+        if (!iov->iov_base) {
+            error_report("Failed to map indirect descriptor"
+                         "addr %#" PRIx64 " len %u",
+                         (uint64_t)desc.addr, desc.len);
+            vring->broken = true;
+            return -EFAULT;
+        }
+        iov->iov_len = desc.len;
+        iov++;
+
+        /* If this is an input descriptor, increment that count. */
+        if (desc.flags & VRING_DESC_F_WRITE) {
+            *in_num += 1;
+        } else {
+            /* If it's an output descriptor, they're all supposed
+             * to come before any input descriptors. */
+            if (unlikely(*in_num)) {
+                error_report("Indirect descriptor "
+                             "has out after in: idx %d", i);
+                vring->broken = true;
+                return -EFAULT;
+            }
+            *out_num += 1;
+        }
+        i = desc.next;
+    } while (desc.flags & VRING_DESC_F_NEXT);
+    return 0;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access.  Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which is
+ * never a valid descriptor number) if none was found.  A negative code is
+ * returned on error.
+ *
+ * Stolen from linux-2.6/drivers/vhost/vhost.c.
+ */
+int vring_pop(VirtIODevice *vdev, Vring *vring,
+              struct iovec iov[], struct iovec *iov_end,
+              unsigned int *out_num, unsigned int *in_num)
+{
+    struct vring_desc desc;
+    unsigned int i, head, found = 0, num = vring->vr.num;
+    uint16_t avail_idx, last_avail_idx;
+
+    /* If there was a fatal error then refuse operation */
+    if (vring->broken) {
+        return -EFAULT;
+    }
+
+    /* Check it isn't doing very strange things with descriptor numbers. */
+    last_avail_idx = vring->last_avail_idx;
+    avail_idx = vring->vr.avail->idx;
+
+    if (unlikely((uint16_t)(avail_idx - last_avail_idx) > num)) {
+        error_report("Guest moved used index from %u to %u",
+                     last_avail_idx, avail_idx);
+        vring->broken = true;
+        return -EFAULT;
+    }
+
+    /* If there's nothing new since last we looked. */
+    if (avail_idx == last_avail_idx) {
+        return -EAGAIN;
+    }
+
+    /* Only get avail ring entries after they have been exposed by guest. */
+    smp_rmb();
+
+    /* Grab the next descriptor number they're advertising, and increment
+     * the index we've seen. */
+    head = vring->vr.avail->ring[last_avail_idx % num];
+
+    /* If their number is silly, that's an error. */
+    if (unlikely(head >= num)) {
+        error_report("Guest says index %u > %u is available", head, num);
+        vring->broken = true;
+        return -EFAULT;
+    }
+
+    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
+        vring_avail_event(&vring->vr) = vring->vr.avail->idx;
+    }
+
+    /* When we start there are none of either input nor output. */
+    *out_num = *in_num = 0;
+
+    i = head;
+    do {
+        if (unlikely(i >= num)) {
+            error_report("Desc index is %u > %u, head = %u", i, num, head);
+            vring->broken = true;
+            return -EFAULT;
+        }
+        if (unlikely(++found > num)) {
+            error_report("Loop detected: last one at %u vq size %u head %u",
+                         i, num, head);
+            vring->broken = true;
+            return -EFAULT;
+        }
+        desc = vring->vr.desc[i];
+        if (desc.flags & VRING_DESC_F_INDIRECT) {
+            int ret = get_indirect(vring, iov, iov_end, out_num, in_num, &desc);
+            if (ret < 0) {
+                return ret;
+            }
+            continue;
+        }
+
+        /* If there are not enough iovecs left, stop for now.  The caller
+         * should check if there are more descs available once they have dealt
+         * with the current set.
+         */
+        if (iov >= iov_end) {
+            return -ENOBUFS;
+        }
+
+        iov->iov_base = hostmem_lookup(&vring->hostmem, desc.addr, desc.len,
+                                       desc.flags & VRING_DESC_F_WRITE);
+        if (!iov->iov_base) {
+            error_report("Failed to map vring desc addr %#" PRIx64 " len %u",
+                         (uint64_t)desc.addr, desc.len);
+            vring->broken = true;
+            return -EFAULT;
+        }
+        iov->iov_len  = desc.len;
+        iov++;
+
+        if (desc.flags & VRING_DESC_F_WRITE) {
+            /* If this is an input descriptor,
+             * increment that count. */
+            *in_num += 1;
+        } else {
+            /* If it's an output descriptor, they're all supposed
+             * to come before any input descriptors. */
+            if (unlikely(*in_num)) {
+                error_report("Descriptor has out after in: idx %d", i);
+                vring->broken = true;
+                return -EFAULT;
+            }
+            *out_num += 1;
+        }
+        i = desc.next;
+    } while (desc.flags & VRING_DESC_F_NEXT);
+
+    /* On success, increment avail index. */
+    vring->last_avail_idx++;
+    return head;
+}
+
+/* After we've used one of their buffers, we tell them about it.
+ *
+ * Stolen from linux-2.6/drivers/vhost/vhost.c.
+ */
+void vring_push(Vring *vring, unsigned int head, int len)
+{
+    struct vring_used_elem *used;
+    uint16_t new;
+
+    /* Don't touch vring if a fatal error occurred */
+    if (vring->broken) {
+        return;
+    }
+
+    /* The virtqueue contains a ring of used buffers.  Get a pointer to the
+     * next entry in that used ring. */
+    used = &vring->vr.used->ring[vring->last_used_idx % vring->vr.num];
+    used->id = head;
+    used->len = len;
+
+    /* Make sure buffer is written before we update index. */
+    smp_wmb();
+
+    new = vring->vr.used->idx = ++vring->last_used_idx;
+    if (unlikely((int16_t)(new - vring->signalled_used) < (uint16_t)1)) {
+        vring->signalled_used_valid = false;
+    }
+}
diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
new file mode 100644
index 0000000..7245d99
--- /dev/null
+++ b/hw/dataplane/vring.h
@@ -0,0 +1,62 @@
+/* Copyright 2012 Red Hat, Inc. and/or its affiliates
+ * Copyright IBM, Corp. 2012
+ *
+ * Based on Linux vhost code:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Copyright (C) 2006 Rusty Russell IBM Corporation
+ *
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ *         Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * Inspiration, some code, and most witty comments come from
+ * Documentation/virtual/lguest/lguest.c, by Rusty Russell
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+
+#ifndef VRING_H
+#define VRING_H
+
+#include <linux/virtio_ring.h>
+#include "qemu-common.h"
+#include "qemu-barrier.h"
+#include "hw/dataplane/hostmem.h"
+#include "hw/virtio.h"
+
+typedef struct {
+    Hostmem hostmem;                /* guest memory mapper */
+    struct vring vr;                /* virtqueue vring mapped to host memory */
+    uint16_t last_avail_idx;        /* last processed avail ring index */
+    uint16_t last_used_idx;         /* last processed used ring index */
+    uint16_t signalled_used;        /* EVENT_IDX state */
+    bool signalled_used_valid;
+    bool broken;                    /* was there a fatal error? */
+} Vring;
+
+static inline unsigned int vring_get_num(Vring *vring)
+{
+    return vring->vr.num;
+}
+
+/* Are there more descriptors available? */
+static inline bool vring_more_avail(Vring *vring)
+{
+    return vring->vr.avail->idx != vring->last_avail_idx;
+}
+
+/* Fail future vring_pop() and vring_push() calls until reset */
+static inline void vring_set_broken(Vring *vring)
+{
+    vring->broken = true;
+}
+
+bool vring_setup(Vring *vring, VirtIODevice *vdev, int n);
+void vring_teardown(Vring *vring);
+void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable);
+bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
+int vring_pop(VirtIODevice *vdev, Vring *vring,
+              struct iovec iov[], struct iovec *iov_end,
+              unsigned int *out_num, unsigned int *in_num);
+void vring_push(Vring *vring, unsigned int head, int len);
+
+#endif /* VRING_H */
diff --git a/trace-events b/trace-events
index 6c6cbf1..a9a791b 100644
--- a/trace-events
+++ b/trace-events
@@ -98,6 +98,9 @@ virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
 virtio_blk_handle_write(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_handle_read(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
 
+# hw/dataplane/vring.c
+vring_setup(uint64_t physical, void *desc, void *avail, void *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
+
 # thread-pool.c
 thread_pool_submit(void *req, void *opaque) "req %p opaque %p"
 thread_pool_complete(void *req, void *opaque, int ret) "req %p opaque %p ret %d"
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 04/11] dataplane: add virtqueue vring code
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 04/11] dataplane: add virtqueue vring code Stefan Hajnoczi
@ 2012-11-29 12:50   ` Michael S. Tsirkin
  2012-11-29 15:17     ` Paolo Bonzini
  2012-12-05 12:57     ` Stefan Hajnoczi
  2012-11-29 13:48   ` Michael S. Tsirkin
  1 sibling, 2 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 12:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Paolo Bonzini, Asias He

On Thu, Nov 22, 2012 at 04:16:45PM +0100, Stefan Hajnoczi wrote:
> The virtio-blk-data-plane cannot access memory using the usual QEMU
> functions since it executes outside the global mutex and the memory APIs
> are this time are not thread-safe.
> 
> This patch introduces a virtqueue module based on the kernel's vhost
> vring code.  The trick is that we map guest memory ahead of time and
> access it cheaply outside the global mutex.
> 
> Once the hardware emulation code can execute outside the global mutex it
> will be possible to drop this code.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Is there no way to factor out ommon code and share it with virtio.c?

> ---
>  hw/Makefile.objs           |   2 +-
>  hw/dataplane/Makefile.objs |   2 +-
>  hw/dataplane/vring.c       | 344 +++++++++++++++++++++++++++++++++++++++++++++
>  hw/dataplane/vring.h       |  62 ++++++++
>  trace-events               |   3 +
>  5 files changed, 411 insertions(+), 2 deletions(-)
>  create mode 100644 hw/dataplane/vring.c
>  create mode 100644 hw/dataplane/vring.h
> 
> diff --git a/hw/Makefile.objs b/hw/Makefile.objs
> index ea46f81..db87fbf 100644
> --- a/hw/Makefile.objs
> +++ b/hw/Makefile.objs
> @@ -1,4 +1,4 @@
> -common-obj-y = usb/ ide/
> +common-obj-y = usb/ ide/ dataplane/
>  common-obj-y += loader.o
>  common-obj-$(CONFIG_VIRTIO) += virtio-console.o
>  common-obj-$(CONFIG_VIRTIO) += virtio-rng.o
> diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
> index 8c8dea1..34e6d57 100644
> --- a/hw/dataplane/Makefile.objs
> +++ b/hw/dataplane/Makefile.objs
> @@ -1,3 +1,3 @@
>  ifeq ($(CONFIG_VIRTIO), y)
> -common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o
> +common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o
>  endif
> diff --git a/hw/dataplane/vring.c b/hw/dataplane/vring.c
> new file mode 100644
> index 0000000..2632fbd
> --- /dev/null
> +++ b/hw/dataplane/vring.c
> @@ -0,0 +1,344 @@
> +/* Copyright 2012 Red Hat, Inc.
> + * Copyright IBM, Corp. 2012
> + *
> + * Based on Linux vhost code:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2006 Rusty Russell IBM Corporation
> + *
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *         Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * Inspiration, some code, and most witty comments come from
> + * Documentation/virtual/lguest/lguest.c, by Rusty Russell
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + */
> +
> +#include "trace.h"
> +#include "hw/dataplane/vring.h"
> +
> +/* Map the guest's vring to host memory */
> +bool vring_setup(Vring *vring, VirtIODevice *vdev, int n)
> +{
> +    hwaddr vring_addr = virtio_queue_get_ring_addr(vdev, n);
> +    hwaddr vring_size = virtio_queue_get_ring_size(vdev, n);
> +    void *vring_ptr;
> +
> +    vring->broken = false;
> +
> +    hostmem_init(&vring->hostmem);
> +    vring_ptr = hostmem_lookup(&vring->hostmem, vring_addr, vring_size, true);
> +    if (!vring_ptr) {
> +        error_report("Failed to map vring "
> +                     "addr %#" HWADDR_PRIx " size %" HWADDR_PRIu,
> +                     vring_addr, vring_size);
> +        vring->broken = true;
> +        return false;
> +    }
> +
> +    vring_init(&vring->vr, virtio_queue_get_num(vdev, n), vring_ptr, 4096);
> +
> +    vring->last_avail_idx = 0;
> +    vring->last_used_idx = 0;
> +    vring->signalled_used = 0;
> +    vring->signalled_used_valid = false;
> +
> +    trace_vring_setup(virtio_queue_get_ring_addr(vdev, n),
> +                      vring->vr.desc, vring->vr.avail, vring->vr.used);
> +    return true;
> +}
> +
> +void vring_teardown(Vring *vring)
> +{
> +    hostmem_finalize(&vring->hostmem);
> +}
> +
> +/* Toggle guest->host notifies */
> +void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable)
> +{
> +    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
> +        if (enable) {
> +            vring_avail_event(&vring->vr) = vring->vr.avail->idx;
> +        }
> +    } else if (enable) {
> +        vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> +    } else {
> +        vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY;
> +    }
> +}
> +
> +/* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */
> +bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
> +{
> +    uint16_t old, new;
> +    bool v;
> +    /* Flush out used index updates. This is paired
> +     * with the barrier that the Guest executes when enabling
> +     * interrupts. */
> +    smp_mb();
> +
> +    if ((vdev->guest_features & VIRTIO_F_NOTIFY_ON_EMPTY) &&
> +        unlikely(vring->vr.avail->idx == vring->last_avail_idx)) {
> +        return true;
> +    }
> +
> +    if (!(vdev->guest_features & VIRTIO_RING_F_EVENT_IDX)) {
> +        return !(vring->vr.avail->flags & VRING_AVAIL_F_NO_INTERRUPT);
> +    }
> +    old = vring->signalled_used;
> +    v = vring->signalled_used_valid;
> +    new = vring->signalled_used = vring->last_used_idx;
> +    vring->signalled_used_valid = true;
> +
> +    if (unlikely(!v)) {
> +        return true;
> +    }
> +
> +    return vring_need_event(vring_used_event(&vring->vr), new, old);
> +}
> +
> +/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */

Probably should document the version you based this on.
Surely not really 2.6?

> +static int get_indirect(Vring *vring,
> +                        struct iovec iov[], struct iovec *iov_end,
> +                        unsigned int *out_num, unsigned int *in_num,
> +                        struct vring_desc *indirect)
> +{
> +    struct vring_desc desc;
> +    unsigned int i = 0, count, found = 0;
> +
> +    /* Sanity check */
> +    if (unlikely(indirect->len % sizeof desc)) {
> +        error_report("Invalid length in indirect descriptor: "
> +                     "len %#x not multiple of %#zx",
> +                     indirect->len, sizeof desc);
> +        vring->broken = true;
> +        return -EFAULT;
> +    }
> +
> +    count = indirect->len / sizeof desc;
> +    /* Buffers are chained via a 16 bit next field, so
> +     * we can have at most 2^16 of these. */
> +    if (unlikely(count > USHRT_MAX + 1)) {
> +        error_report("Indirect buffer length too big: %d", indirect->len);
> +        vring->broken = true;
> +        return -EFAULT;
> +    }
> +
> +    /* Point to translate indirect desc chain */
> +    indirect = hostmem_lookup(&vring->hostmem, indirect->addr, indirect->len,
> +                              false);

This assumes an indirect buffer is contigious in qemu memory
which seems wrong since unlike vring itself
there are no alignment requirements.

Overriding indirect here also seems unnecessarily tricky.

> +    if (!indirect) {
> +        error_report("Failed to map indirect desc chain "
> +                     "addr %#" PRIx64 " len %u",
> +                     (uint64_t)indirect->addr, indirect->len);
> +        vring->broken = true;
> +        return -EFAULT;
> +    }
> +
> +    /* We will use the result as an address to read from, so most
> +     * architectures only need a compiler barrier here. */
> +    barrier(); /* read_barrier_depends(); */
> +
> +    do {
> +        if (unlikely(++found > count)) {
> +            error_report("Loop detected: last one at %u "
> +                         "indirect size %u", i, count);
> +            vring->broken = true;
> +            return -EFAULT;
> +        }
> +
> +        desc = *indirect++;
> +        if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
> +            error_report("Nested indirect descriptor");
> +            vring->broken = true;
> +            return -EFAULT;
> +        }
> +
> +        /* Stop for now if there are not enough iovecs available. */
> +        if (iov >= iov_end) {
> +            return -ENOBUFS;
> +        }
> +
> +        iov->iov_base = hostmem_lookup(&vring->hostmem, desc.addr, desc.len,
> +                                       desc.flags & VRING_DESC_F_WRITE);
> +        if (!iov->iov_base) {
> +            error_report("Failed to map indirect descriptor"
> +                         "addr %#" PRIx64 " len %u",
> +                         (uint64_t)desc.addr, desc.len);
> +            vring->broken = true;
> +            return -EFAULT;
> +        }
> +        iov->iov_len = desc.len;
> +        iov++;
> +
> +        /* If this is an input descriptor, increment that count. */
> +        if (desc.flags & VRING_DESC_F_WRITE) {
> +            *in_num += 1;
> +        } else {
> +            /* If it's an output descriptor, they're all supposed
> +             * to come before any input descriptors. */
> +            if (unlikely(*in_num)) {
> +                error_report("Indirect descriptor "
> +                             "has out after in: idx %d", i);
> +                vring->broken = true;
> +                return -EFAULT;
> +            }
> +            *out_num += 1;
> +        }
> +        i = desc.next;
> +    } while (desc.flags & VRING_DESC_F_NEXT);
> +    return 0;
> +}
> +
> +/* This looks in the virtqueue and for the first available buffer, and converts
> + * it to an iovec for convenient access.  Since descriptors consist of some
> + * number of output then some number of input descriptors, it's actually two
> + * iovecs, but we pack them into one and note how many of each there were.
> + *
> + * This function returns the descriptor number found, or vq->num (which is
> + * never a valid descriptor number) if none was found.  A negative code is
> + * returned on error.
> + *
> + * Stolen from linux-2.6/drivers/vhost/vhost.c.
> + */
> +int vring_pop(VirtIODevice *vdev, Vring *vring,
> +              struct iovec iov[], struct iovec *iov_end,
> +              unsigned int *out_num, unsigned int *in_num)
> +{
> +    struct vring_desc desc;
> +    unsigned int i, head, found = 0, num = vring->vr.num;
> +    uint16_t avail_idx, last_avail_idx;
> +
> +    /* If there was a fatal error then refuse operation */
> +    if (vring->broken) {
> +        return -EFAULT;
> +    }
> +
> +    /* Check it isn't doing very strange things with descriptor numbers. */
> +    last_avail_idx = vring->last_avail_idx;
> +    avail_idx = vring->vr.avail->idx;

I think something needs to be done here to force
a read otherwise two accesses to avail_idx
below can cause two reads from the ring and
could return inconsistent results.

> +
> +    if (unlikely((uint16_t)(avail_idx - last_avail_idx) > num)) {
> +        error_report("Guest moved used index from %u to %u",
> +                     last_avail_idx, avail_idx);
> +        vring->broken = true;
> +        return -EFAULT;
> +    }
> +
> +    /* If there's nothing new since last we looked. */
> +    if (avail_idx == last_avail_idx) {
> +        return -EAGAIN;
> +    }
> +
> +    /* Only get avail ring entries after they have been exposed by guest. */
> +    smp_rmb();
> +
> +    /* Grab the next descriptor number they're advertising, and increment
> +     * the index we've seen. */
> +    head = vring->vr.avail->ring[last_avail_idx % num];
> +
> +    /* If their number is silly, that's an error. */
> +    if (unlikely(head >= num)) {
> +        error_report("Guest says index %u > %u is available", head, num);
> +        vring->broken = true;
> +        return -EFAULT;
> +    }
> +
> +    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
> +        vring_avail_event(&vring->vr) = vring->vr.avail->idx;

No barrier here?
I also don't see similar code in vhost - why is it a good idea?

> +    }
> +
> +    /* When we start there are none of either input nor output. */
> +    *out_num = *in_num = 0;
> +
> +    i = head;
> +    do {
> +        if (unlikely(i >= num)) {
> +            error_report("Desc index is %u > %u, head = %u", i, num, head);
> +            vring->broken = true;
> +            return -EFAULT;
> +        }
> +        if (unlikely(++found > num)) {
> +            error_report("Loop detected: last one at %u vq size %u head %u",
> +                         i, num, head);
> +            vring->broken = true;
> +            return -EFAULT;
> +        }
> +        desc = vring->vr.desc[i];
> +        if (desc.flags & VRING_DESC_F_INDIRECT) {
> +            int ret = get_indirect(vring, iov, iov_end, out_num, in_num, &desc);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +            continue;
> +        }
> +
> +        /* If there are not enough iovecs left, stop for now.  The caller
> +         * should check if there are more descs available once they have dealt
> +         * with the current set.
> +         */
> +        if (iov >= iov_end) {
> +            return -ENOBUFS;
> +        }
> +
> +        iov->iov_base = hostmem_lookup(&vring->hostmem, desc.addr, desc.len,
> +                                       desc.flags & VRING_DESC_F_WRITE);
> +        if (!iov->iov_base) {
> +            error_report("Failed to map vring desc addr %#" PRIx64 " len %u",
> +                         (uint64_t)desc.addr, desc.len);
> +            vring->broken = true;
> +            return -EFAULT;
> +        }
> +        iov->iov_len  = desc.len;
> +        iov++;
> +
> +        if (desc.flags & VRING_DESC_F_WRITE) {
> +            /* If this is an input descriptor,
> +             * increment that count. */
> +            *in_num += 1;
> +        } else {
> +            /* If it's an output descriptor, they're all supposed
> +             * to come before any input descriptors. */
> +            if (unlikely(*in_num)) {
> +                error_report("Descriptor has out after in: idx %d", i);
> +                vring->broken = true;
> +                return -EFAULT;
> +            }
> +            *out_num += 1;
> +        }
> +        i = desc.next;
> +    } while (desc.flags & VRING_DESC_F_NEXT);
> +
> +    /* On success, increment avail index. */
> +    vring->last_avail_idx++;
> +    return head;
> +}
> +
> +/* After we've used one of their buffers, we tell them about it.
> + *
> + * Stolen from linux-2.6/drivers/vhost/vhost.c.
> + */
> +void vring_push(Vring *vring, unsigned int head, int len)
> +{
> +    struct vring_used_elem *used;
> +    uint16_t new;
> +
> +    /* Don't touch vring if a fatal error occurred */
> +    if (vring->broken) {
> +        return;
> +    }
> +
> +    /* The virtqueue contains a ring of used buffers.  Get a pointer to the
> +     * next entry in that used ring. */
> +    used = &vring->vr.used->ring[vring->last_used_idx % vring->vr.num];
> +    used->id = head;
> +    used->len = len;
> +
> +    /* Make sure buffer is written before we update index. */
> +    smp_wmb();
> +
> +    new = vring->vr.used->idx = ++vring->last_used_idx;
> +    if (unlikely((int16_t)(new - vring->signalled_used) < (uint16_t)1)) {
> +        vring->signalled_used_valid = false;
> +    }
> +}
> diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
> new file mode 100644
> index 0000000..7245d99
> --- /dev/null
> +++ b/hw/dataplane/vring.h
> @@ -0,0 +1,62 @@
> +/* Copyright 2012 Red Hat, Inc. and/or its affiliates
> + * Copyright IBM, Corp. 2012
> + *
> + * Based on Linux vhost code:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2006 Rusty Russell IBM Corporation
> + *
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *         Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * Inspiration, some code, and most witty comments come from
> + * Documentation/virtual/lguest/lguest.c, by Rusty Russell
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + */
> +
> +#ifndef VRING_H
> +#define VRING_H
> +
> +#include <linux/virtio_ring.h>
> +#include "qemu-common.h"
> +#include "qemu-barrier.h"
> +#include "hw/dataplane/hostmem.h"
> +#include "hw/virtio.h"
> +
> +typedef struct {
> +    Hostmem hostmem;                /* guest memory mapper */
> +    struct vring vr;                /* virtqueue vring mapped to host memory */
> +    uint16_t last_avail_idx;        /* last processed avail ring index */
> +    uint16_t last_used_idx;         /* last processed used ring index */
> +    uint16_t signalled_used;        /* EVENT_IDX state */
> +    bool signalled_used_valid;
> +    bool broken;                    /* was there a fatal error? */
> +} Vring;
> +
> +static inline unsigned int vring_get_num(Vring *vring)
> +{
> +    return vring->vr.num;
> +}
> +
> +/* Are there more descriptors available? */
> +static inline bool vring_more_avail(Vring *vring)
> +{
> +    return vring->vr.avail->idx != vring->last_avail_idx;
> +}
> +
> +/* Fail future vring_pop() and vring_push() calls until reset */
> +static inline void vring_set_broken(Vring *vring)
> +{
> +    vring->broken = true;
> +}
> +
> +bool vring_setup(Vring *vring, VirtIODevice *vdev, int n);
> +void vring_teardown(Vring *vring);
> +void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable);
> +bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
> +int vring_pop(VirtIODevice *vdev, Vring *vring,
> +              struct iovec iov[], struct iovec *iov_end,
> +              unsigned int *out_num, unsigned int *in_num);
> +void vring_push(Vring *vring, unsigned int head, int len);
> +
> +#endif /* VRING_H */
> diff --git a/trace-events b/trace-events
> index 6c6cbf1..a9a791b 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -98,6 +98,9 @@ virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
>  virtio_blk_handle_write(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
>  virtio_blk_handle_read(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
>  
> +# hw/dataplane/vring.c
> +vring_setup(uint64_t physical, void *desc, void *avail, void *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
> +
>  # thread-pool.c
>  thread_pool_submit(void *req, void *opaque) "req %p opaque %p"
>  thread_pool_complete(void *req, void *opaque, int ret) "req %p opaque %p ret %d"
> -- 
> 1.8.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 04/11] dataplane: add virtqueue vring code
  2012-11-29 12:50   ` Michael S. Tsirkin
@ 2012-11-29 15:17     ` Paolo Bonzini
  2012-12-05 12:57     ` Stefan Hajnoczi
  1 sibling, 0 replies; 43+ messages in thread
From: Paolo Bonzini @ 2012-11-29 15:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Asias He


> > +/* Toggle guest->host notifies */
> > +void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool
> > enable)
> > +{
> > +    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
> > +        if (enable) {
> > +            vring_avail_event(&vring->vr) = vring->vr.avail->idx;
> > +        }
> > +    } else if (enable) {
> > +        vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> > +    } else {
> > +        vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY;
> > +    }
> > +}

This is similar to the (guest-side) virtqueue_disable_cb/virtqueue_enable_cb.

Perhaps having two functions will be easier to use, because from your
other code it looks like you'd benefit from a return value when
enable == true (again similar to virtqueue_enable_cb).

Paolo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 04/11] dataplane: add virtqueue vring code
  2012-11-29 12:50   ` Michael S. Tsirkin
  2012-11-29 15:17     ` Paolo Bonzini
@ 2012-12-05 12:57     ` Stefan Hajnoczi
  1 sibling, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-12-05 12:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 02:50:01PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 22, 2012 at 04:16:45PM +0100, Stefan Hajnoczi wrote:
> > The virtio-blk-data-plane cannot access memory using the usual QEMU
> > functions since it executes outside the global mutex and the memory APIs
> > are this time are not thread-safe.
> > 
> > This patch introduces a virtqueue module based on the kernel's vhost
> > vring code.  The trick is that we map guest memory ahead of time and
> > access it cheaply outside the global mutex.
> > 
> > Once the hardware emulation code can execute outside the global mutex it
> > will be possible to drop this code.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Is there no way to factor out ommon code and share it with virtio.c?

I think we have touched on this in other sub-threads but for reference:
this code implements vring access outside the global mutex, which means
no QEMU memory API functions.  Therefore it's hard to share the virtio.c
code which uses QEMU memory API functions.

The current work that Ping Fan Liu is doing will lead to thread-safe
memory accesses from device emulation code.  At that point we can ditch
this and unify with virtio.c.

> > +/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */
> 
> Probably should document the version you based this on.
> Surely not really 2.6?

linux-2.6.git is still mirrored from linux.git :).  I'll try to dig up
the specific Linux version that this code is based on.

> > +static int get_indirect(Vring *vring,
> > +                        struct iovec iov[], struct iovec *iov_end,
> > +                        unsigned int *out_num, unsigned int *in_num,
> > +                        struct vring_desc *indirect)
> > +{
> > +    struct vring_desc desc;
> > +    unsigned int i = 0, count, found = 0;
> > +
> > +    /* Sanity check */
> > +    if (unlikely(indirect->len % sizeof desc)) {
> > +        error_report("Invalid length in indirect descriptor: "
> > +                     "len %#x not multiple of %#zx",
> > +                     indirect->len, sizeof desc);
> > +        vring->broken = true;
> > +        return -EFAULT;
> > +    }
> > +
> > +    count = indirect->len / sizeof desc;
> > +    /* Buffers are chained via a 16 bit next field, so
> > +     * we can have at most 2^16 of these. */
> > +    if (unlikely(count > USHRT_MAX + 1)) {
> > +        error_report("Indirect buffer length too big: %d", indirect->len);
> > +        vring->broken = true;
> > +        return -EFAULT;
> > +    }
> > +
> > +    /* Point to translate indirect desc chain */
> > +    indirect = hostmem_lookup(&vring->hostmem, indirect->addr, indirect->len,
> > +                              false);
> 
> This assumes an indirect buffer is contigious in qemu memory
> which seems wrong since unlike vring itself
> there are no alignment requirements.

Let's break this up into one hostmem_lookup() per descriptor.  In other
words, don't try to lookup the entire indirect buffer but copy-in one
descriptor at a time.

> Overriding indirect here also seems unnecessarily tricky.

You are right, let's use a separate local variable to make the code
clearer.

> > +int vring_pop(VirtIODevice *vdev, Vring *vring,
> > +              struct iovec iov[], struct iovec *iov_end,
> > +              unsigned int *out_num, unsigned int *in_num)
> > +{
> > +    struct vring_desc desc;
> > +    unsigned int i, head, found = 0, num = vring->vr.num;
> > +    uint16_t avail_idx, last_avail_idx;
> > +
> > +    /* If there was a fatal error then refuse operation */
> > +    if (vring->broken) {
> > +        return -EFAULT;
> > +    }
> > +
> > +    /* Check it isn't doing very strange things with descriptor numbers. */
> > +    last_avail_idx = vring->last_avail_idx;
> > +    avail_idx = vring->vr.avail->idx;
> 
> I think something needs to be done here to force
> a read otherwise two accesses to avail_idx
> below can cause two reads from the ring and
> could return inconsistent results.

There is no function call or anything in between that forces the
compiler to load the value of avail_idx and reuse it.

So I think you're right.  I'm not 100% sure a read barrier forces the
compiler to load here since the following code just manipulates
last_avail_idx and avail_idx.

Any ideas?

> > +    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
> > +        vring_avail_event(&vring->vr) = vring->vr.avail->idx;
> 
> No barrier here?
> I also don't see similar code in vhost - why is it a good idea?

This is from hw/virtio.c:virtqueue_pop().  We know there is at least one
request available, we're hinting to the guest to not to bother
notifying any requests up to the latest.

However, setting avail_event to the current vring avail_idx only helps
if we get lucky and process the vring *before* the guest decides to
notify a whole bunch of requests it has just enqueued.

So this doesn't seem incorrect but the performance benefit is
questionable.

Do you remember why you wrote this code?  The commit is:

commit bcbabae8ff7f7ec114da9fe2aa7f25f420f35306
Author: Michael S. Tsirkin <mst@redhat.com>
Date:   Sun Jun 12 16:21:57 2011 +0300

    virtio: event index support

    Add support for event_idx feature, and utilize it to
    reduce the number of interrupts and exits for the guest.

Stefan

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 04/11] dataplane: add virtqueue vring code
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 04/11] dataplane: add virtqueue vring code Stefan Hajnoczi
  2012-11-29 12:50   ` Michael S. Tsirkin
@ 2012-11-29 13:48   ` Michael S. Tsirkin
  1 sibling, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 13:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Paolo Bonzini, Asias He

On Thu, Nov 22, 2012 at 04:16:45PM +0100, Stefan Hajnoczi wrote:
> The virtio-blk-data-plane cannot access memory using the usual QEMU
> functions since it executes outside the global mutex and the memory APIs
> are this time are not thread-safe.
> 
> This patch introduces a virtqueue module based on the kernel's vhost
> vring code.  The trick is that we map guest memory ahead of time and
> access it cheaply outside the global mutex.
> 
> Once the hardware emulation code can execute outside the global mutex it
> will be possible to drop this code.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  hw/Makefile.objs           |   2 +-
>  hw/dataplane/Makefile.objs |   2 +-
>  hw/dataplane/vring.c       | 344 +++++++++++++++++++++++++++++++++++++++++++++
>  hw/dataplane/vring.h       |  62 ++++++++
>  trace-events               |   3 +
>  5 files changed, 411 insertions(+), 2 deletions(-)
>  create mode 100644 hw/dataplane/vring.c
>  create mode 100644 hw/dataplane/vring.h
> 
> diff --git a/hw/Makefile.objs b/hw/Makefile.objs
> index ea46f81..db87fbf 100644
> --- a/hw/Makefile.objs
> +++ b/hw/Makefile.objs
> @@ -1,4 +1,4 @@
> -common-obj-y = usb/ ide/
> +common-obj-y = usb/ ide/ dataplane/
>  common-obj-y += loader.o
>  common-obj-$(CONFIG_VIRTIO) += virtio-console.o
>  common-obj-$(CONFIG_VIRTIO) += virtio-rng.o
> diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
> index 8c8dea1..34e6d57 100644
> --- a/hw/dataplane/Makefile.objs
> +++ b/hw/dataplane/Makefile.objs
> @@ -1,3 +1,3 @@
>  ifeq ($(CONFIG_VIRTIO), y)
> -common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o
> +common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o
>  endif
> diff --git a/hw/dataplane/vring.c b/hw/dataplane/vring.c
> new file mode 100644
> index 0000000..2632fbd
> --- /dev/null
> +++ b/hw/dataplane/vring.c
> @@ -0,0 +1,344 @@
> +/* Copyright 2012 Red Hat, Inc.
> + * Copyright IBM, Corp. 2012
> + *
> + * Based on Linux vhost code:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2006 Rusty Russell IBM Corporation
> + *
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *         Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * Inspiration, some code, and most witty comments come from
> + * Documentation/virtual/lguest/lguest.c, by Rusty Russell
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + */
> +
> +#include "trace.h"
> +#include "hw/dataplane/vring.h"
> +
> +/* Map the guest's vring to host memory */
> +bool vring_setup(Vring *vring, VirtIODevice *vdev, int n)
> +{
> +    hwaddr vring_addr = virtio_queue_get_ring_addr(vdev, n);
> +    hwaddr vring_size = virtio_queue_get_ring_size(vdev, n);
> +    void *vring_ptr;
> +
> +    vring->broken = false;
> +
> +    hostmem_init(&vring->hostmem);
> +    vring_ptr = hostmem_lookup(&vring->hostmem, vring_addr, vring_size, true);
> +    if (!vring_ptr) {
> +        error_report("Failed to map vring "
> +                     "addr %#" HWADDR_PRIx " size %" HWADDR_PRIu,
> +                     vring_addr, vring_size);
> +        vring->broken = true;
> +        return false;
> +    }
> +
> +    vring_init(&vring->vr, virtio_queue_get_num(vdev, n), vring_ptr, 4096);
> +
> +    vring->last_avail_idx = 0;
> +    vring->last_used_idx = 0;
> +    vring->signalled_used = 0;
> +    vring->signalled_used_valid = false;
> +
> +    trace_vring_setup(virtio_queue_get_ring_addr(vdev, n),
> +                      vring->vr.desc, vring->vr.avail, vring->vr.used);
> +    return true;
> +}
> +
> +void vring_teardown(Vring *vring)
> +{
> +    hostmem_finalize(&vring->hostmem);
> +}
> +
> +/* Toggle guest->host notifies */
> +void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable)
> +{
> +    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
> +        if (enable) {
> +            vring_avail_event(&vring->vr) = vring->vr.avail->idx;
> +        }
> +    } else if (enable) {
> +        vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> +    } else {
> +        vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY;
> +    }
> +}
> +
> +/* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */
> +bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
> +{
> +    uint16_t old, new;
> +    bool v;
> +    /* Flush out used index updates. This is paired
> +     * with the barrier that the Guest executes when enabling
> +     * interrupts. */
> +    smp_mb();
> +
> +    if ((vdev->guest_features & VIRTIO_F_NOTIFY_ON_EMPTY) &&
> +        unlikely(vring->vr.avail->idx == vring->last_avail_idx)) {
> +        return true;
> +    }
> +
> +    if (!(vdev->guest_features & VIRTIO_RING_F_EVENT_IDX)) {
> +        return !(vring->vr.avail->flags & VRING_AVAIL_F_NO_INTERRUPT);
> +    }
> +    old = vring->signalled_used;
> +    v = vring->signalled_used_valid;
> +    new = vring->signalled_used = vring->last_used_idx;
> +    vring->signalled_used_valid = true;
> +
> +    if (unlikely(!v)) {
> +        return true;
> +    }
> +
> +    return vring_need_event(vring_used_event(&vring->vr), new, old);
> +}
> +
> +/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */
> +static int get_indirect(Vring *vring,
> +                        struct iovec iov[], struct iovec *iov_end,
> +                        unsigned int *out_num, unsigned int *in_num,
> +                        struct vring_desc *indirect)
> +{
> +    struct vring_desc desc;
> +    unsigned int i = 0, count, found = 0;
> +
> +    /* Sanity check */
> +    if (unlikely(indirect->len % sizeof desc)) {
> +        error_report("Invalid length in indirect descriptor: "
> +                     "len %#x not multiple of %#zx",
> +                     indirect->len, sizeof desc);
> +        vring->broken = true;
> +        return -EFAULT;
> +    }
> +
> +    count = indirect->len / sizeof desc;
> +    /* Buffers are chained via a 16 bit next field, so
> +     * we can have at most 2^16 of these. */
> +    if (unlikely(count > USHRT_MAX + 1)) {
> +        error_report("Indirect buffer length too big: %d", indirect->len);
> +        vring->broken = true;
> +        return -EFAULT;
> +    }
> +
> +    /* Point to translate indirect desc chain */
> +    indirect = hostmem_lookup(&vring->hostmem, indirect->addr, indirect->len,
> +                              false);
> +    if (!indirect) {
> +        error_report("Failed to map indirect desc chain "
> +                     "addr %#" PRIx64 " len %u",
> +                     (uint64_t)indirect->addr, indirect->len);
> +        vring->broken = true;
> +        return -EFAULT;
> +    }
> +
> +    /* We will use the result as an address to read from, so most
> +     * architectures only need a compiler barrier here. */
> +    barrier(); /* read_barrier_depends(); */
> +
> +    do {
> +        if (unlikely(++found > count)) {
> +            error_report("Loop detected: last one at %u "
> +                         "indirect size %u", i, count);
> +            vring->broken = true;
> +            return -EFAULT;
> +        }
> +
> +        desc = *indirect++;

One other thing - I am guessing that in practice this and other accesses
get compiled to a memcpy call so you operate on stack from this point
on so it's fine.
But to really guarantee this, you have to use something like ACCESS_ONCE
macro in linux for all memory accesses or even (gasp) volatile.

Otherwise, e.g. desc.len could get modified after it's
validated with bad results.

> +        if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
> +            error_report("Nested indirect descriptor");
> +            vring->broken = true;
> +            return -EFAULT;
> +        }
> +
> +        /* Stop for now if there are not enough iovecs available. */
> +        if (iov >= iov_end) {
> +            return -ENOBUFS;
> +        }
> +
> +        iov->iov_base = hostmem_lookup(&vring->hostmem, desc.addr, desc.len,
> +                                       desc.flags & VRING_DESC_F_WRITE);
> +        if (!iov->iov_base) {
> +            error_report("Failed to map indirect descriptor"
> +                         "addr %#" PRIx64 " len %u",
> +                         (uint64_t)desc.addr, desc.len);
> +            vring->broken = true;
> +            return -EFAULT;
> +        }
> +        iov->iov_len = desc.len;
> +        iov++;
> +
> +        /* If this is an input descriptor, increment that count. */
> +        if (desc.flags & VRING_DESC_F_WRITE) {
> +            *in_num += 1;
> +        } else {
> +            /* If it's an output descriptor, they're all supposed
> +             * to come before any input descriptors. */
> +            if (unlikely(*in_num)) {
> +                error_report("Indirect descriptor "
> +                             "has out after in: idx %d", i);
> +                vring->broken = true;
> +                return -EFAULT;
> +            }
> +            *out_num += 1;
> +        }
> +        i = desc.next;
> +    } while (desc.flags & VRING_DESC_F_NEXT);
> +    return 0;
> +}
> +
> +/* This looks in the virtqueue and for the first available buffer, and converts
> + * it to an iovec for convenient access.  Since descriptors consist of some
> + * number of output then some number of input descriptors, it's actually two
> + * iovecs, but we pack them into one and note how many of each there were.
> + *
> + * This function returns the descriptor number found, or vq->num (which is
> + * never a valid descriptor number) if none was found.  A negative code is
> + * returned on error.
> + *
> + * Stolen from linux-2.6/drivers/vhost/vhost.c.
> + */
> +int vring_pop(VirtIODevice *vdev, Vring *vring,
> +              struct iovec iov[], struct iovec *iov_end,
> +              unsigned int *out_num, unsigned int *in_num)
> +{
> +    struct vring_desc desc;
> +    unsigned int i, head, found = 0, num = vring->vr.num;
> +    uint16_t avail_idx, last_avail_idx;
> +
> +    /* If there was a fatal error then refuse operation */
> +    if (vring->broken) {
> +        return -EFAULT;
> +    }
> +
> +    /* Check it isn't doing very strange things with descriptor numbers. */
> +    last_avail_idx = vring->last_avail_idx;
> +    avail_idx = vring->vr.avail->idx;
> +
> +    if (unlikely((uint16_t)(avail_idx - last_avail_idx) > num)) {
> +        error_report("Guest moved used index from %u to %u",
> +                     last_avail_idx, avail_idx);
> +        vring->broken = true;
> +        return -EFAULT;
> +    }
> +
> +    /* If there's nothing new since last we looked. */
> +    if (avail_idx == last_avail_idx) {
> +        return -EAGAIN;
> +    }
> +
> +    /* Only get avail ring entries after they have been exposed by guest. */
> +    smp_rmb();
> +
> +    /* Grab the next descriptor number they're advertising, and increment
> +     * the index we've seen. */
> +    head = vring->vr.avail->ring[last_avail_idx % num];
> +
> +    /* If their number is silly, that's an error. */
> +    if (unlikely(head >= num)) {
> +        error_report("Guest says index %u > %u is available", head, num);
> +        vring->broken = true;
> +        return -EFAULT;
> +    }
> +
> +    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
> +        vring_avail_event(&vring->vr) = vring->vr.avail->idx;
> +    }
> +
> +    /* When we start there are none of either input nor output. */
> +    *out_num = *in_num = 0;
> +
> +    i = head;
> +    do {
> +        if (unlikely(i >= num)) {
> +            error_report("Desc index is %u > %u, head = %u", i, num, head);
> +            vring->broken = true;
> +            return -EFAULT;
> +        }
> +        if (unlikely(++found > num)) {
> +            error_report("Loop detected: last one at %u vq size %u head %u",
> +                         i, num, head);
> +            vring->broken = true;
> +            return -EFAULT;
> +        }
> +        desc = vring->vr.desc[i];
> +        if (desc.flags & VRING_DESC_F_INDIRECT) {
> +            int ret = get_indirect(vring, iov, iov_end, out_num, in_num, &desc);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +            continue;
> +        }
> +
> +        /* If there are not enough iovecs left, stop for now.  The caller
> +         * should check if there are more descs available once they have dealt
> +         * with the current set.
> +         */
> +        if (iov >= iov_end) {
> +            return -ENOBUFS;
> +        }
> +
> +        iov->iov_base = hostmem_lookup(&vring->hostmem, desc.addr, desc.len,
> +                                       desc.flags & VRING_DESC_F_WRITE);
> +        if (!iov->iov_base) {
> +            error_report("Failed to map vring desc addr %#" PRIx64 " len %u",
> +                         (uint64_t)desc.addr, desc.len);
> +            vring->broken = true;
> +            return -EFAULT;
> +        }
> +        iov->iov_len  = desc.len;
> +        iov++;
> +
> +        if (desc.flags & VRING_DESC_F_WRITE) {
> +            /* If this is an input descriptor,
> +             * increment that count. */
> +            *in_num += 1;
> +        } else {
> +            /* If it's an output descriptor, they're all supposed
> +             * to come before any input descriptors. */
> +            if (unlikely(*in_num)) {
> +                error_report("Descriptor has out after in: idx %d", i);
> +                vring->broken = true;
> +                return -EFAULT;
> +            }
> +            *out_num += 1;
> +        }
> +        i = desc.next;
> +    } while (desc.flags & VRING_DESC_F_NEXT);
> +
> +    /* On success, increment avail index. */
> +    vring->last_avail_idx++;
> +    return head;
> +}
> +
> +/* After we've used one of their buffers, we tell them about it.
> + *
> + * Stolen from linux-2.6/drivers/vhost/vhost.c.
> + */
> +void vring_push(Vring *vring, unsigned int head, int len)
> +{
> +    struct vring_used_elem *used;
> +    uint16_t new;
> +
> +    /* Don't touch vring if a fatal error occurred */
> +    if (vring->broken) {
> +        return;
> +    }
> +
> +    /* The virtqueue contains a ring of used buffers.  Get a pointer to the
> +     * next entry in that used ring. */
> +    used = &vring->vr.used->ring[vring->last_used_idx % vring->vr.num];
> +    used->id = head;
> +    used->len = len;
> +
> +    /* Make sure buffer is written before we update index. */
> +    smp_wmb();
> +
> +    new = vring->vr.used->idx = ++vring->last_used_idx;
> +    if (unlikely((int16_t)(new - vring->signalled_used) < (uint16_t)1)) {
> +        vring->signalled_used_valid = false;
> +    }
> +}
> diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
> new file mode 100644
> index 0000000..7245d99
> --- /dev/null
> +++ b/hw/dataplane/vring.h
> @@ -0,0 +1,62 @@
> +/* Copyright 2012 Red Hat, Inc. and/or its affiliates
> + * Copyright IBM, Corp. 2012
> + *
> + * Based on Linux vhost code:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2006 Rusty Russell IBM Corporation
> + *
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *         Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * Inspiration, some code, and most witty comments come from
> + * Documentation/virtual/lguest/lguest.c, by Rusty Russell
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + */
> +
> +#ifndef VRING_H
> +#define VRING_H
> +
> +#include <linux/virtio_ring.h>
> +#include "qemu-common.h"
> +#include "qemu-barrier.h"
> +#include "hw/dataplane/hostmem.h"
> +#include "hw/virtio.h"
> +
> +typedef struct {
> +    Hostmem hostmem;                /* guest memory mapper */
> +    struct vring vr;                /* virtqueue vring mapped to host memory */
> +    uint16_t last_avail_idx;        /* last processed avail ring index */
> +    uint16_t last_used_idx;         /* last processed used ring index */
> +    uint16_t signalled_used;        /* EVENT_IDX state */
> +    bool signalled_used_valid;
> +    bool broken;                    /* was there a fatal error? */
> +} Vring;
> +
> +static inline unsigned int vring_get_num(Vring *vring)
> +{
> +    return vring->vr.num;
> +}
> +
> +/* Are there more descriptors available? */
> +static inline bool vring_more_avail(Vring *vring)
> +{
> +    return vring->vr.avail->idx != vring->last_avail_idx;
> +}
> +
> +/* Fail future vring_pop() and vring_push() calls until reset */
> +static inline void vring_set_broken(Vring *vring)
> +{
> +    vring->broken = true;
> +}
> +
> +bool vring_setup(Vring *vring, VirtIODevice *vdev, int n);
> +void vring_teardown(Vring *vring);
> +void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable);
> +bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
> +int vring_pop(VirtIODevice *vdev, Vring *vring,
> +              struct iovec iov[], struct iovec *iov_end,
> +              unsigned int *out_num, unsigned int *in_num);
> +void vring_push(Vring *vring, unsigned int head, int len);
> +
> +#endif /* VRING_H */
> diff --git a/trace-events b/trace-events
> index 6c6cbf1..a9a791b 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -98,6 +98,9 @@ virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
>  virtio_blk_handle_write(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
>  virtio_blk_handle_read(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
>  
> +# hw/dataplane/vring.c
> +vring_setup(uint64_t physical, void *desc, void *avail, void *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
> +
>  # thread-pool.c
>  thread_pool_submit(void *req, void *opaque) "req %p opaque %p"
>  thread_pool_complete(void *req, void *opaque, int ret) "req %p opaque %p ret %d"
> -- 
> 1.8.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v4 05/11] dataplane: add event loop
  2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 04/11] dataplane: add virtqueue vring code Stefan Hajnoczi
@ 2012-11-22 15:16 ` Stefan Hajnoczi
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 06/11] dataplane: add Linux AIO request queue Stefan Hajnoczi
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

Outside the safety of the global mutex we need to poll on file
descriptors.  I found epoll(2) is a convenient way to do that, although
other options could replace this module in the future (such as an
AioContext-based loop or glib's GMainLoop).

One important feature of this small event loop implementation is that
the loop can be terminated in a thread-safe way.  This allows QEMU to
stop the data plane thread cleanly.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/dataplane/Makefile.objs |   2 +-
 hw/dataplane/event-poll.c  | 109 +++++++++++++++++++++++++++++++++++++++++++++
 hw/dataplane/event-poll.h  |  40 +++++++++++++++++
 3 files changed, 150 insertions(+), 1 deletion(-)
 create mode 100644 hw/dataplane/event-poll.c
 create mode 100644 hw/dataplane/event-poll.h

diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
index 34e6d57..e26bd7d 100644
--- a/hw/dataplane/Makefile.objs
+++ b/hw/dataplane/Makefile.objs
@@ -1,3 +1,3 @@
 ifeq ($(CONFIG_VIRTIO), y)
-common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o
+common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o event-poll.o
 endif
diff --git a/hw/dataplane/event-poll.c b/hw/dataplane/event-poll.c
new file mode 100644
index 0000000..4a53d48
--- /dev/null
+++ b/hw/dataplane/event-poll.c
@@ -0,0 +1,109 @@
+/*
+ * Event loop with file descriptor polling
+ *
+ * Copyright 2012 IBM, Corp.
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <sys/epoll.h>
+#include "hw/dataplane/event-poll.h"
+
+/* Add an event notifier and its callback for polling */
+void event_poll_add(EventPoll *poll, EventHandler *handler,
+                    EventNotifier *notifier, EventCallback *callback)
+{
+    struct epoll_event event = {
+        .events = EPOLLIN,
+        .data.ptr = handler,
+    };
+    handler->notifier = notifier;
+    handler->callback = callback;
+    if (epoll_ctl(poll->epoll_fd, EPOLL_CTL_ADD,
+                  event_notifier_get_fd(notifier), &event) != 0) {
+        fprintf(stderr, "failed to add event handler to epoll: %m\n");
+        exit(1);
+    }
+}
+
+/* Event callback for stopping the event_poll_run() loop */
+static bool handle_stop(EventHandler *handler)
+{
+    return false; /* stop event loop */
+}
+
+void event_poll_init(EventPoll *poll)
+{
+    /* Create epoll file descriptor */
+    poll->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
+    if (poll->epoll_fd < 0) {
+        fprintf(stderr, "epoll_create1 failed: %m\n");
+        exit(1);
+    }
+
+    /* Set up stop notifier */
+    if (event_notifier_init(&poll->stop_notifier, 0) < 0) {
+        fprintf(stderr, "failed to init stop notifier\n");
+        exit(1);
+    }
+    event_poll_add(poll, &poll->stop_handler,
+                   &poll->stop_notifier, handle_stop);
+}
+
+void event_poll_cleanup(EventPoll *poll)
+{
+    event_notifier_cleanup(&poll->stop_notifier);
+    close(poll->epoll_fd);
+    poll->epoll_fd = -1;
+}
+
+/* Block until the next event and invoke its callback
+ *
+ * Signals must be masked, EINTR should never happen.  This is true for QEMU
+ * threads.
+ */
+static bool event_poll(EventPoll *poll)
+{
+    EventHandler *handler;
+    struct epoll_event event;
+    int nevents;
+
+    /* Wait for the next event.  Only do one event per call to keep the
+     * function simple, this could be changed later. */
+    nevents = epoll_wait(poll->epoll_fd, &event, 1, -1);
+    if (unlikely(nevents != 1)) {
+        fprintf(stderr, "epoll_wait failed: %m\n");
+        exit(1); /* should never happen */
+    }
+
+    /* Find out which event handler has become active */
+    handler = event.data.ptr;
+
+    /* Clear the eventfd */
+    event_notifier_test_and_clear(handler->notifier);
+
+    /* Handle the event */
+    return handler->callback(handler);
+}
+
+void event_poll_run(EventPoll *poll)
+{
+    while (event_poll(poll)) {
+        /* do nothing */
+    }
+}
+
+/* Stop the event_poll_run() loop
+ *
+ * This function can be used from another thread.
+ */
+void event_poll_stop(EventPoll *poll)
+{
+    event_notifier_set(&poll->stop_notifier);
+}
diff --git a/hw/dataplane/event-poll.h b/hw/dataplane/event-poll.h
new file mode 100644
index 0000000..5e1771f
--- /dev/null
+++ b/hw/dataplane/event-poll.h
@@ -0,0 +1,40 @@
+/*
+ * Event loop with file descriptor polling
+ *
+ * Copyright 2012 IBM, Corp.
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef EVENT_POLL_H
+#define EVENT_POLL_H
+
+#include "event_notifier.h"
+
+typedef struct EventHandler EventHandler;
+typedef bool EventCallback(EventHandler *handler);
+struct EventHandler {
+    EventNotifier *notifier;        /* eventfd */
+    EventCallback *callback;        /* callback function */
+};
+
+typedef struct {
+    int epoll_fd;                   /* epoll(2) file descriptor */
+    EventNotifier stop_notifier;    /* stop poll notifier */
+    EventHandler stop_handler;      /* stop poll handler */
+} EventPoll;
+
+void event_poll_add(EventPoll *poll, EventHandler *handler,
+                    EventNotifier *notifier, EventCallback *callback);
+void event_poll_init(EventPoll *poll);
+void event_poll_cleanup(EventPoll *poll);
+void event_poll_run(EventPoll *poll);
+void event_poll_stop(EventPoll *poll);
+
+#endif /* EVENT_POLL_H */
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v4 06/11] dataplane: add Linux AIO request queue
  2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 05/11] dataplane: add event loop Stefan Hajnoczi
@ 2012-11-22 15:16 ` Stefan Hajnoczi
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 07/11] iov: add iov_discard() to remove data Stefan Hajnoczi
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The IOQueue has a pool of iocb structs and a function to add new
read/write requests.  Multiple requests can be added before calling the
submit function to actually tell the host kernel to begin I/O.  This
allows callers to batch requests and submit them in one go.

The actual I/O is performed using Linux AIO.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/dataplane/Makefile.objs |   2 +-
 hw/dataplane/ioq.c         | 118 +++++++++++++++++++++++++++++++++++++++++++++
 hw/dataplane/ioq.h         |  57 ++++++++++++++++++++++
 3 files changed, 176 insertions(+), 1 deletion(-)
 create mode 100644 hw/dataplane/ioq.c
 create mode 100644 hw/dataplane/ioq.h

diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
index e26bd7d..abd408f 100644
--- a/hw/dataplane/Makefile.objs
+++ b/hw/dataplane/Makefile.objs
@@ -1,3 +1,3 @@
 ifeq ($(CONFIG_VIRTIO), y)
-common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o event-poll.o
+common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o event-poll.o ioq.o
 endif
diff --git a/hw/dataplane/ioq.c b/hw/dataplane/ioq.c
new file mode 100644
index 0000000..7adeb5d
--- /dev/null
+++ b/hw/dataplane/ioq.c
@@ -0,0 +1,118 @@
+/*
+ * Linux AIO request queue
+ *
+ * Copyright 2012 IBM, Corp.
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "hw/dataplane/ioq.h"
+
+void ioq_init(IOQueue *ioq, int fd, unsigned int max_reqs)
+{
+    int rc;
+
+    ioq->fd = fd;
+    ioq->max_reqs = max_reqs;
+
+    memset(&ioq->io_ctx, 0, sizeof ioq->io_ctx);
+    rc = io_setup(max_reqs, &ioq->io_ctx);
+    if (rc != 0) {
+        fprintf(stderr, "ioq io_setup failed %d\n", rc);
+        exit(1);
+    }
+
+    rc = event_notifier_init(&ioq->io_notifier, 0);
+    if (rc != 0) {
+        fprintf(stderr, "ioq io event notifier creation failed %d\n", rc);
+        exit(1);
+    }
+
+    ioq->freelist = g_malloc0(sizeof ioq->freelist[0] * max_reqs);
+    ioq->freelist_idx = 0;
+
+    ioq->queue = g_malloc0(sizeof ioq->queue[0] * max_reqs);
+    ioq->queue_idx = 0;
+}
+
+void ioq_cleanup(IOQueue *ioq)
+{
+    g_free(ioq->freelist);
+    g_free(ioq->queue);
+
+    event_notifier_cleanup(&ioq->io_notifier);
+    io_destroy(ioq->io_ctx);
+}
+
+EventNotifier *ioq_get_notifier(IOQueue *ioq)
+{
+    return &ioq->io_notifier;
+}
+
+struct iocb *ioq_get_iocb(IOQueue *ioq)
+{
+    if (unlikely(ioq->freelist_idx == 0)) {
+        fprintf(stderr, "ioq underflow\n");
+        exit(1);
+    }
+    struct iocb *iocb = ioq->freelist[--ioq->freelist_idx];
+    ioq->queue[ioq->queue_idx++] = iocb;
+    return iocb;
+}
+
+void ioq_put_iocb(IOQueue *ioq, struct iocb *iocb)
+{
+    if (unlikely(ioq->freelist_idx == ioq->max_reqs)) {
+        fprintf(stderr, "ioq overflow\n");
+        exit(1);
+    }
+    ioq->freelist[ioq->freelist_idx++] = iocb;
+}
+
+struct iocb *ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov,
+                      unsigned int count, long long offset)
+{
+    struct iocb *iocb = ioq_get_iocb(ioq);
+
+    if (read) {
+        io_prep_preadv(iocb, ioq->fd, iov, count, offset);
+    } else {
+        io_prep_pwritev(iocb, ioq->fd, iov, count, offset);
+    }
+    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->io_notifier));
+    return iocb;
+}
+
+int ioq_submit(IOQueue *ioq)
+{
+    int rc = io_submit(ioq->io_ctx, ioq->queue_idx, ioq->queue);
+    ioq->queue_idx = 0; /* reset */
+    return rc;
+}
+
+int ioq_run_completion(IOQueue *ioq, IOQueueCompletion *completion,
+                       void *opaque)
+{
+    struct io_event events[ioq->max_reqs];
+    int nevents, i;
+
+    nevents = io_getevents(ioq->io_ctx, 0, ioq->max_reqs, events, NULL);
+    if (unlikely(nevents < 0)) {
+        fprintf(stderr, "io_getevents failed %d\n", nevents);
+        exit(1);
+    }
+
+    for (i = 0; i < nevents; i++) {
+        ssize_t ret = ((uint64_t)events[i].res2 << 32) | events[i].res;
+
+        completion(events[i].obj, ret, opaque);
+        ioq_put_iocb(ioq, events[i].obj);
+    }
+    return nevents;
+}
diff --git a/hw/dataplane/ioq.h b/hw/dataplane/ioq.h
new file mode 100644
index 0000000..890db22
--- /dev/null
+++ b/hw/dataplane/ioq.h
@@ -0,0 +1,57 @@
+/*
+ * Linux AIO request queue
+ *
+ * Copyright 2012 IBM, Corp.
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef IOQ_H
+#define IOQ_H
+
+#include <libaio.h>
+#include "event_notifier.h"
+
+typedef struct {
+    int fd;                         /* file descriptor */
+    unsigned int max_reqs;          /* max length of freelist and queue */
+
+    io_context_t io_ctx;            /* Linux AIO context */
+    EventNotifier io_notifier;      /* Linux AIO eventfd */
+
+    /* Requests can complete in any order so a free list is necessary to manage
+     * available iocbs.
+     */
+    struct iocb **freelist;         /* free iocbs */
+    unsigned int freelist_idx;
+
+    /* Multiple requests are queued up before submitting them all in one go */
+    struct iocb **queue;            /* queued iocbs */
+    unsigned int queue_idx;
+} IOQueue;
+
+void ioq_init(IOQueue *ioq, int fd, unsigned int max_reqs);
+void ioq_cleanup(IOQueue *ioq);
+EventNotifier *ioq_get_notifier(IOQueue *ioq);
+struct iocb *ioq_get_iocb(IOQueue *ioq);
+void ioq_put_iocb(IOQueue *ioq, struct iocb *iocb);
+struct iocb *ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov,
+                      unsigned int count, long long offset);
+int ioq_submit(IOQueue *ioq);
+
+static inline unsigned int ioq_num_queued(IOQueue *ioq)
+{
+    return ioq->queue_idx;
+}
+
+typedef void IOQueueCompletion(struct iocb *iocb, ssize_t ret, void *opaque);
+int ioq_run_completion(IOQueue *ioq, IOQueueCompletion *completion,
+                       void *opaque);
+
+#endif /* IOQ_H */
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v4 07/11] iov: add iov_discard() to remove data
  2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (5 preceding siblings ...)
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 06/11] dataplane: add Linux AIO request queue Stefan Hajnoczi
@ 2012-11-22 15:16 ` Stefan Hajnoczi
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 08/11] test-iov: add iov_discard() testcase Stefan Hajnoczi
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The iov_discard() function removes data from the front or back of the
vector.  This is useful when peeling off header/footer structs.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 iov.c | 41 +++++++++++++++++++++++++++++++++++++++++
 iov.h | 13 +++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/iov.c b/iov.c
index a81eedc..6eed089 100644
--- a/iov.c
+++ b/iov.c
@@ -354,3 +354,44 @@ size_t qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
 {
     return iov_memset(qiov->iov, qiov->niov, offset, fillc, bytes);
 }
+
+size_t iov_discard(struct iovec **iov, unsigned int *iov_cnt, ssize_t bytes)
+{
+    size_t total = 0;
+    struct iovec *cur;
+    int direction;
+
+    if (*iov_cnt == 0) {
+        return 0;
+    }
+
+    if (bytes < 0) {
+        bytes = -bytes;
+        direction = -1;
+        cur = *iov + (*iov_cnt - 1);
+    } else {
+        direction = 1;
+        cur = *iov;
+    }
+
+    while (*iov_cnt > 0) {
+        if (cur->iov_len > bytes) {
+            if (direction > 0) {
+                cur->iov_base += bytes;
+            }
+            cur->iov_len -= bytes;
+            total += bytes;
+            break;
+        }
+
+        bytes -= cur->iov_len;
+        total += cur->iov_len;
+        cur += direction;
+        *iov_cnt -= 1;
+    }
+
+    if (direction > 0) {
+        *iov = cur;
+    }
+    return total;
+}
diff --git a/iov.h b/iov.h
index 34c8ec9..d6d1fa6 100644
--- a/iov.h
+++ b/iov.h
@@ -95,3 +95,16 @@ void iov_hexdump(const struct iovec *iov, const unsigned int iov_cnt,
 unsigned iov_copy(struct iovec *dst_iov, unsigned int dst_iov_cnt,
                  const struct iovec *iov, unsigned int iov_cnt,
                  size_t offset, size_t bytes);
+
+/*
+ * Remove a given number of bytes from the front or back of a vector.
+ * This may update iov and/or iov_cnt to exclude iovec elements that are
+ * no longer required.
+ *
+ * Data is discarded from the front of the vector if bytes is positive and
+ * from the back of the vector if bytes is negative.
+ *
+ * The number of bytes actually discarded is returned.  This number may be
+ * smaller than requested if the vector is too small.
+ */
+size_t iov_discard(struct iovec **iov, unsigned int *iov_cnt, ssize_t bytes);
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v4 08/11] test-iov: add iov_discard() testcase
  2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (6 preceding siblings ...)
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 07/11] iov: add iov_discard() to remove data Stefan Hajnoczi
@ 2012-11-22 15:16 ` Stefan Hajnoczi
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 09/11] iov: add qemu_iovec_concat_iov() Stefan Hajnoczi
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/test-iov.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)

diff --git a/tests/test-iov.c b/tests/test-iov.c
index cbe7a89..7997fb5 100644
--- a/tests/test-iov.c
+++ b/tests/test-iov.c
@@ -250,11 +250,140 @@ static void test_io(void)
 #endif
 }
 
+static void test_discard(void)
+{
+    struct iovec *iov;
+    struct iovec *iov_tmp;
+    unsigned int iov_cnt;
+    unsigned int iov_cnt_tmp;
+    void *old_base;
+    size_t size;
+    size_t ret;
+
+    /* Discard zero bytes */
+    iov_random(&iov, &iov_cnt);
+    iov_tmp = iov;
+    iov_cnt_tmp = iov_cnt;
+    ret = iov_discard(&iov_tmp, &iov_cnt_tmp, 0);
+    g_assert(ret == 0);
+    g_assert(iov_tmp == iov);
+    g_assert(iov_cnt_tmp == iov_cnt);
+    iov_free(iov, iov_cnt);
+
+    /* Discard more bytes than vector size */
+    iov_random(&iov, &iov_cnt);
+    iov_tmp = iov;
+    iov_cnt_tmp = iov_cnt;
+    size = iov_size(iov, iov_cnt);
+    ret = iov_discard(&iov_tmp, &iov_cnt_tmp, size + 1);
+    g_assert(ret == size);
+    g_assert(iov_cnt_tmp == 0);
+    iov_free(iov, iov_cnt);
+
+    /* Discard more bytes than vector size (negative) */
+    iov_random(&iov, &iov_cnt);
+    iov_tmp = iov;
+    iov_cnt_tmp = iov_cnt;
+    size = iov_size(iov, iov_cnt);
+    ret = iov_discard(&iov_tmp, &iov_cnt_tmp, -(size + 1));
+    g_assert(ret == size);
+    g_assert(iov_cnt_tmp == 0);
+    iov_free(iov, iov_cnt);
+
+    /* Discard entire vector */
+    iov_random(&iov, &iov_cnt);
+    iov_tmp = iov;
+    iov_cnt_tmp = iov_cnt;
+    size = iov_size(iov, iov_cnt);
+    ret = iov_discard(&iov_tmp, &iov_cnt_tmp, size);
+    g_assert(ret == size);
+    g_assert(iov_cnt_tmp == 0);
+    iov_free(iov, iov_cnt);
+
+    /* Discard within first element */
+    iov_random(&iov, &iov_cnt);
+    iov_tmp = iov;
+    iov_cnt_tmp = iov_cnt;
+    old_base = iov->iov_base;
+    size = g_test_rand_int_range(1, iov->iov_len);
+    ret = iov_discard(&iov_tmp, &iov_cnt_tmp, size);
+    g_assert(ret == size);
+    g_assert(iov_tmp == iov);
+    g_assert(iov_cnt_tmp == iov_cnt);
+    g_assert(iov_tmp->iov_base == old_base + size);
+    iov_tmp->iov_base = old_base; /* undo before g_free() */
+    iov_free(iov, iov_cnt);
+
+    /* Discard entire first element */
+    iov_random(&iov, &iov_cnt);
+    iov_tmp = iov;
+    iov_cnt_tmp = iov_cnt;
+    ret = iov_discard(&iov_tmp, &iov_cnt_tmp, iov->iov_len);
+    g_assert(ret == iov->iov_len);
+    g_assert(iov_tmp == iov + 1);
+    g_assert(iov_cnt_tmp == iov_cnt - 1);
+    iov_free(iov, iov_cnt);
+
+    /* Discard within second element */
+    iov_random(&iov, &iov_cnt);
+    iov_tmp = iov;
+    iov_cnt_tmp = iov_cnt;
+    old_base = iov[1].iov_base;
+    size = iov->iov_len + g_test_rand_int_range(1, iov[1].iov_len);
+    ret = iov_discard(&iov_tmp, &iov_cnt_tmp, size);
+    g_assert(ret == size);
+    g_assert(iov_tmp == iov + 1);
+    g_assert(iov_cnt_tmp == iov_cnt - 1);
+    g_assert(iov_tmp->iov_base == old_base + (size - iov->iov_len));
+    iov_tmp->iov_base = old_base; /* undo before g_free() */
+    iov_free(iov, iov_cnt);
+
+    /* Discard within last element */
+    iov_random(&iov, &iov_cnt);
+    iov_tmp = iov;
+    iov_cnt_tmp = iov_cnt;
+    old_base = iov[iov_cnt - 1].iov_base;
+    size = g_test_rand_int_range(1, iov[iov_cnt - 1].iov_len);
+    ret = iov_discard(&iov_tmp, &iov_cnt_tmp, -size);
+    g_assert(ret == size);
+    g_assert(iov_tmp == iov);
+    g_assert(iov_cnt_tmp == iov_cnt);
+    g_assert(iov[iov_cnt - 1].iov_base == old_base);
+    iov_free(iov, iov_cnt);
+
+    /* Discard entire last element */
+    iov_random(&iov, &iov_cnt);
+    iov_tmp = iov;
+    iov_cnt_tmp = iov_cnt;
+    old_base = iov[iov_cnt - 1].iov_base;
+    size = iov[iov_cnt - 1].iov_len;
+    ret = iov_discard(&iov_tmp, &iov_cnt_tmp, -size);
+    g_assert(ret == size);
+    g_assert(iov_tmp == iov);
+    g_assert(iov_cnt_tmp == iov_cnt - 1);
+    iov_free(iov, iov_cnt);
+
+    /* Discard within second-to-last element */
+    iov_random(&iov, &iov_cnt);
+    iov_tmp = iov;
+    iov_cnt_tmp = iov_cnt;
+    old_base = iov[iov_cnt - 2].iov_base;
+    size = iov[iov_cnt - 1].iov_len +
+           g_test_rand_int_range(1, iov[iov_cnt - 2].iov_len);
+    ret = iov_discard(&iov_tmp, &iov_cnt_tmp, -size);
+    g_assert(ret == size);
+    g_assert(iov_tmp == iov);
+    g_assert(iov_cnt_tmp == iov_cnt - 1);
+    g_assert(iov[iov_cnt - 2].iov_base == old_base);
+    iov_free(iov, iov_cnt);
+}
+
 int main(int argc, char **argv)
 {
     g_test_init(&argc, &argv, NULL);
     g_test_rand_int();
     g_test_add_func("/basic/iov/from-to-buf", test_to_from_buf);
     g_test_add_func("/basic/iov/io", test_io);
+    g_test_add_func("/basic/iov/discard", test_discard);
     return g_test_run();
 }
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v4 09/11] iov: add qemu_iovec_concat_iov()
  2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (7 preceding siblings ...)
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 08/11] test-iov: add iov_discard() testcase Stefan Hajnoczi
@ 2012-11-22 15:16 ` Stefan Hajnoczi
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 10/11] dataplane: add virtio-blk data plane code Stefan Hajnoczi
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The qemu_iovec_concat() function copies a subset of a QEMUIOVector.  The
new qemu_iovec_concat_iov() function does the same for a iov/cnt pair.

It is easy to define qemu_iovec_concat() in terms of
qemu_iovec_concat_iov().  The existing code is mostly unchanged, except
for the assertion src->size >= soffset, which cannot be efficiently
checked upfront on a iov/cnt pair.  Instead we assert upon hitting the
end of src with an unsatisfied soffset.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 iov.c         | 39 +++++++++++++++++++++++++++------------
 qemu-common.h |  3 +++
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/iov.c b/iov.c
index 6eed089..1d4c5fe 100644
--- a/iov.c
+++ b/iov.c
@@ -289,34 +289,49 @@ void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len)
 }
 
 /*
- * Concatenates (partial) iovecs from src to the end of dst.
+ * Concatenates (partial) iovecs from src_iov to the end of dst.
  * It starts copying after skipping `soffset' bytes at the
  * beginning of src and adds individual vectors from src to
  * dst copies up to `sbytes' bytes total, or up to the end
- * of src if it comes first.  This way, it is okay to specify
+ * of src_iov if it comes first.  This way, it is okay to specify
  * very large value for `sbytes' to indicate "up to the end
  * of src".
  * Only vector pointers are processed, not the actual data buffers.
  */
-void qemu_iovec_concat(QEMUIOVector *dst,
-                       QEMUIOVector *src, size_t soffset, size_t sbytes)
+void qemu_iovec_concat_iov(QEMUIOVector *dst,
+                           struct iovec *src_iov, unsigned int src_cnt,
+                           size_t soffset, size_t sbytes)
 {
     int i;
     size_t done;
-    struct iovec *siov = src->iov;
     assert(dst->nalloc != -1);
-    assert(src->size >= soffset);
-    for (i = 0, done = 0; done < sbytes && i < src->niov; i++) {
-        if (soffset < siov[i].iov_len) {
-            size_t len = MIN(siov[i].iov_len - soffset, sbytes - done);
-            qemu_iovec_add(dst, siov[i].iov_base + soffset, len);
+    for (i = 0, done = 0; done < sbytes && i < src_cnt; i++) {
+        if (soffset < src_iov[i].iov_len) {
+            size_t len = MIN(src_iov[i].iov_len - soffset, sbytes - done);
+            qemu_iovec_add(dst, src_iov[i].iov_base + soffset, len);
             done += len;
             soffset = 0;
         } else {
-            soffset -= siov[i].iov_len;
+            soffset -= src_iov[i].iov_len;
         }
     }
-    /* return done; */
+    assert(soffset == 0); /* offset beyond end of src */
+}
+
+/*
+ * Concatenates (partial) iovecs from src to the end of dst.
+ * It starts copying after skipping `soffset' bytes at the
+ * beginning of src and adds individual vectors from src to
+ * dst copies up to `sbytes' bytes total, or up to the end
+ * of src if it comes first.  This way, it is okay to specify
+ * very large value for `sbytes' to indicate "up to the end
+ * of src".
+ * Only vector pointers are processed, not the actual data buffers.
+ */
+void qemu_iovec_concat(QEMUIOVector *dst,
+                       QEMUIOVector *src, size_t soffset, size_t sbytes)
+{
+    qemu_iovec_concat_iov(dst, src->iov, src->niov, soffset, sbytes);
 }
 
 void qemu_iovec_destroy(QEMUIOVector *qiov)
diff --git a/qemu-common.h b/qemu-common.h
index cef264c..4cc63e1 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -379,6 +379,9 @@ void qemu_iovec_init_external(QEMUIOVector *qiov, struct iovec *iov, int niov);
 void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len);
 void qemu_iovec_concat(QEMUIOVector *dst,
                        QEMUIOVector *src, size_t soffset, size_t sbytes);
+void qemu_iovec_concat_iov(QEMUIOVector *dst,
+                           struct iovec *src_iov, unsigned int src_cnt,
+                           size_t soffset, size_t sbytes);
 void qemu_iovec_destroy(QEMUIOVector *qiov);
 void qemu_iovec_reset(QEMUIOVector *qiov);
 size_t qemu_iovec_to_buf(QEMUIOVector *qiov, size_t offset,
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v4 10/11] dataplane: add virtio-blk data plane code
  2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (8 preceding siblings ...)
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 09/11] iov: add qemu_iovec_concat_iov() Stefan Hajnoczi
@ 2012-11-22 15:16 ` Stefan Hajnoczi
  2012-11-29 13:41   ` Michael S. Tsirkin
  2012-11-29 14:02   ` Michael S. Tsirkin
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 11/11] virtio-blk: add x-data-plane=on|off performance feature Stefan Hajnoczi
  2012-11-29  9:18 ` [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
  11 siblings, 2 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

virtio-blk-data-plane is a subset implementation of virtio-blk.  It only
handles read, write, and flush requests.  It does this using a dedicated
thread that executes an epoll(2)-based event loop and processes I/O
using Linux AIO.

This approach performs very well but can be used for raw image files
only.  The number of IOPS achieved has been reported to be several times
higher than the existing virtio-blk implementation.

Eventually it should be possible to unify virtio-blk-data-plane with the
main body of QEMU code once the block layer and hardware emulation is
able to run outside the global mutex.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/dataplane/Makefile.objs |   2 +-
 hw/dataplane/virtio-blk.c  | 427 +++++++++++++++++++++++++++++++++++++++++++++
 hw/dataplane/virtio-blk.h  |  41 +++++
 trace-events               |   6 +
 4 files changed, 475 insertions(+), 1 deletion(-)
 create mode 100644 hw/dataplane/virtio-blk.c
 create mode 100644 hw/dataplane/virtio-blk.h

diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
index abd408f..682aa9e 100644
--- a/hw/dataplane/Makefile.objs
+++ b/hw/dataplane/Makefile.objs
@@ -1,3 +1,3 @@
 ifeq ($(CONFIG_VIRTIO), y)
-common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o event-poll.o ioq.o
+common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o event-poll.o ioq.o virtio-blk.o
 endif
diff --git a/hw/dataplane/virtio-blk.c b/hw/dataplane/virtio-blk.c
new file mode 100644
index 0000000..9b29969
--- /dev/null
+++ b/hw/dataplane/virtio-blk.c
@@ -0,0 +1,427 @@
+/*
+ * Dedicated thread for virtio-blk I/O processing
+ *
+ * Copyright 2012 IBM, Corp.
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "trace.h"
+#include "iov.h"
+#include "event-poll.h"
+#include "qemu-thread.h"
+#include "vring.h"
+#include "ioq.h"
+#include "hw/virtio-blk.h"
+#include "hw/dataplane/virtio-blk.h"
+
+enum {
+    SEG_MAX = 126,                  /* maximum number of I/O segments */
+    VRING_MAX = SEG_MAX + 2,        /* maximum number of vring descriptors */
+    REQ_MAX = VRING_MAX,            /* maximum number of requests in the vring,
+                                     * is VRING_MAX / 2 with traditional and
+                                     * VRING_MAX with indirect descriptors */
+};
+
+typedef struct {
+    struct iocb iocb;               /* Linux AIO control block */
+    QEMUIOVector *inhdr;            /* iovecs for virtio_blk_inhdr */
+    unsigned int head;              /* vring descriptor index */
+} VirtIOBlockRequest;
+
+struct VirtIOBlockDataPlane {
+    bool started;
+    QEMUBH *start_bh;
+    QemuThread thread;
+
+    int fd;                         /* image file descriptor */
+
+    VirtIODevice *vdev;
+    Vring vring;                    /* virtqueue vring */
+    EventNotifier *guest_notifier;  /* irq */
+
+    EventPoll event_poll;           /* event poller */
+    EventHandler io_handler;        /* Linux AIO completion handler */
+    EventHandler notify_handler;    /* virtqueue notify handler */
+
+    IOQueue ioqueue;                /* Linux AIO queue (should really be per
+                                       dataplane thread) */
+    VirtIOBlockRequest requests[REQ_MAX]; /* pool of requests, managed by the
+                                             queue */
+
+    unsigned int num_reqs;
+    QemuMutex num_reqs_lock;
+    QemuCond no_reqs_cond;
+};
+
+/* Raise an interrupt to signal guest, if necessary */
+static void notify_guest(VirtIOBlockDataPlane *s)
+{
+    if (!vring_should_notify(s->vdev, &s->vring)) {
+        return;
+    }
+
+    event_notifier_set(s->guest_notifier);
+}
+
+static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
+{
+    VirtIOBlockDataPlane *s = opaque;
+    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
+    struct virtio_blk_inhdr hdr;
+    int len;
+
+    if (likely(ret >= 0)) {
+        hdr.status = VIRTIO_BLK_S_OK;
+        len = ret;
+    } else {
+        hdr.status = VIRTIO_BLK_S_IOERR;
+        len = 0;
+    }
+
+    trace_virtio_blk_data_plane_complete_request(s, req->head, ret);
+
+    qemu_iovec_from_buf(req->inhdr, 0, &hdr, sizeof(hdr));
+    qemu_iovec_destroy(req->inhdr);
+    g_slice_free(QEMUIOVector, req->inhdr);
+
+    /* According to the virtio specification len should be the number of bytes
+     * written to, but for virtio-blk it seems to be the number of bytes
+     * transferred plus the status bytes.
+     */
+    vring_push(&s->vring, req->head, len + sizeof(hdr));
+
+    qemu_mutex_lock(&s->num_reqs_lock);
+    if (--s->num_reqs == 0) {
+        qemu_cond_broadcast(&s->no_reqs_cond);
+    }
+    qemu_mutex_unlock(&s->num_reqs_lock);
+}
+
+static void fail_request_early(VirtIOBlockDataPlane *s, unsigned int head,
+                               QEMUIOVector *inhdr, unsigned char status)
+{
+    struct virtio_blk_inhdr hdr = {
+        .status = status,
+    };
+
+    qemu_iovec_from_buf(inhdr, 0, &hdr, sizeof(hdr));
+    qemu_iovec_destroy(inhdr);
+    g_slice_free(QEMUIOVector, inhdr);
+
+    vring_push(&s->vring, head, sizeof(hdr));
+    notify_guest(s);
+}
+
+static int process_request(IOQueue *ioq, struct iovec iov[],
+                           unsigned int out_num, unsigned int in_num,
+                           unsigned int head)
+{
+    VirtIOBlockDataPlane *s = container_of(ioq, VirtIOBlockDataPlane, ioqueue);
+    struct iovec *in_iov = &iov[out_num];
+    struct virtio_blk_outhdr outhdr;
+    QEMUIOVector *inhdr;
+    size_t in_size;
+
+    /* Copy in outhdr */
+    if (unlikely(iov_to_buf(iov, out_num, 0, &outhdr,
+                            sizeof(outhdr)) != sizeof(outhdr))) {
+        error_report("virtio-blk request outhdr too short");
+        return -EFAULT;
+    }
+    iov_discard(&iov, &out_num, sizeof(outhdr));
+
+    /* Grab inhdr for later */
+    in_size = iov_size(in_iov, in_num);
+    if (in_size < sizeof(struct virtio_blk_inhdr)) {
+        error_report("virtio_blk request inhdr too short");
+        return -EFAULT;
+    }
+    inhdr = g_slice_new(QEMUIOVector);
+    qemu_iovec_init(inhdr, 1);
+    qemu_iovec_concat_iov(inhdr, in_iov, in_num,
+            in_size - sizeof(struct virtio_blk_inhdr),
+            sizeof(struct virtio_blk_inhdr));
+    iov_discard(&in_iov, &in_num, -sizeof(struct virtio_blk_inhdr));
+
+    /* TODO Linux sets the barrier bit even when not advertised! */
+    outhdr.type &= ~VIRTIO_BLK_T_BARRIER;
+
+    struct iocb *iocb;
+    switch (outhdr.type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_SCSI_CMD |
+                           VIRTIO_BLK_T_FLUSH)) {
+    case VIRTIO_BLK_T_IN:
+        iocb = ioq_rdwr(ioq, true, in_iov, in_num, outhdr.sector * 512);
+        break;
+
+    case VIRTIO_BLK_T_OUT:
+        iocb = ioq_rdwr(ioq, false, iov, out_num, outhdr.sector * 512);
+        break;
+
+    case VIRTIO_BLK_T_SCSI_CMD:
+        /* TODO support SCSI commands */
+        fail_request_early(s, head, inhdr, VIRTIO_BLK_S_UNSUPP);
+        return 0;
+
+    case VIRTIO_BLK_T_FLUSH:
+        /* TODO fdsync not supported by Linux AIO, do it synchronously here! */
+        fdatasync(s->fd);
+        fail_request_early(s, head, inhdr, VIRTIO_BLK_S_OK);
+        return 0;
+
+    default:
+        error_report("virtio-blk unsupported request type %#x", outhdr.type);
+        qemu_iovec_destroy(inhdr);
+        g_slice_free(QEMUIOVector, inhdr);
+        return -EFAULT;
+    }
+
+    /* Fill in virtio block metadata needed for completion */
+    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
+    req->head = head;
+    req->inhdr = inhdr;
+    return 0;
+}
+
+static bool handle_notify(EventHandler *handler)
+{
+    VirtIOBlockDataPlane *s = container_of(handler, VirtIOBlockDataPlane,
+                                           notify_handler);
+
+    /* There is one array of iovecs into which all new requests are extracted
+     * from the vring.  Requests are read from the vring and the translated
+     * descriptors are written to the iovecs array.  The iovecs do not have to
+     * persist across handle_notify() calls because the kernel copies the
+     * iovecs on io_submit().
+     *
+     * Handling io_submit() EAGAIN may require storing the requests across
+     * handle_notify() calls until the kernel has sufficient resources to
+     * accept more I/O.  This is not implemented yet.
+     */
+    struct iovec iovec[VRING_MAX];
+    struct iovec *end = &iovec[VRING_MAX];
+    struct iovec *iov = iovec;
+
+    /* When a request is read from the vring, the index of the first descriptor
+     * (aka head) is returned so that the completed request can be pushed onto
+     * the vring later.
+     *
+     * The number of hypervisor read-only iovecs is out_num.  The number of
+     * hypervisor write-only iovecs is in_num.
+     */
+    int head;
+    unsigned int out_num = 0, in_num = 0;
+    unsigned int num_queued;
+
+    for (;;) {
+        /* Disable guest->host notifies to avoid unnecessary vmexits */
+        vring_set_notification(s->vdev, &s->vring, false);
+
+        for (;;) {
+            head = vring_pop(s->vdev, &s->vring, iov, end, &out_num, &in_num);
+            if (head < 0) {
+                break; /* no more requests */
+            }
+
+            trace_virtio_blk_data_plane_process_request(s, out_num, in_num,
+                                                        head);
+
+            if (process_request(&s->ioqueue, iov, out_num, in_num, head) < 0) {
+                vring_set_broken(&s->vring);
+                break;
+            }
+            iov += out_num + in_num;
+        }
+
+        if (likely(head == -EAGAIN)) { /* vring emptied */
+            /* Re-enable guest->host notifies and stop processing the vring.
+             * But if the guest has snuck in more descriptors, keep processing.
+             */
+            vring_set_notification(s->vdev, &s->vring, true);
+            smp_mb();
+            if (!vring_more_avail(&s->vring)) {
+                break;
+            }
+        } else { /* head == -ENOBUFS or fatal error, iovecs[] is depleted */
+            /* Since there are no iovecs[] left, stop processing for now.  Do
+             * not re-enable guest->host notifies since the I/O completion
+             * handler knows to check for more vring descriptors anyway.
+             */
+            break;
+        }
+    }
+
+    num_queued = ioq_num_queued(&s->ioqueue);
+    if (num_queued > 0) {
+        qemu_mutex_lock(&s->num_reqs_lock);
+        s->num_reqs += num_queued;
+        qemu_mutex_unlock(&s->num_reqs_lock);
+
+        int rc = ioq_submit(&s->ioqueue);
+        if (unlikely(rc < 0)) {
+            fprintf(stderr, "ioq_submit failed %d\n", rc);
+            exit(1);
+        }
+    }
+    return true;
+}
+
+static bool handle_io(EventHandler *handler)
+{
+    VirtIOBlockDataPlane *s = container_of(handler, VirtIOBlockDataPlane,
+                                           io_handler);
+
+    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
+        notify_guest(s);
+    }
+
+    /* If there were more requests than iovecs, the vring will not be empty yet
+     * so check again.  There should now be enough resources to process more
+     * requests.
+     */
+    if (unlikely(vring_more_avail(&s->vring))) {
+        return handle_notify(&s->notify_handler);
+    }
+
+    return true;
+}
+
+static void *data_plane_thread(void *opaque)
+{
+    VirtIOBlockDataPlane *s = opaque;
+    event_poll_run(&s->event_poll);
+    return NULL;
+}
+
+static void start_data_plane_bh(void *opaque)
+{
+    VirtIOBlockDataPlane *s = opaque;
+
+    qemu_bh_delete(s->start_bh);
+    s->start_bh = NULL;
+    qemu_thread_create(&s->thread, data_plane_thread,
+                       s, QEMU_THREAD_JOINABLE);
+}
+
+VirtIOBlockDataPlane *virtio_blk_data_plane_create(VirtIODevice *vdev, int fd)
+{
+    VirtIOBlockDataPlane *s;
+
+    s = g_new0(VirtIOBlockDataPlane, 1);
+    s->vdev = vdev;
+    s->fd = fd;
+    return s;
+}
+
+void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
+{
+    if (!s) {
+        return;
+    }
+    virtio_blk_data_plane_stop(s);
+    g_free(s);
+}
+
+/* Block until pending requests have completed
+ *
+ * The vring continues to be serviced so ensure no new requests will be added
+ * to avoid races.
+ */
+void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s)
+{
+    qemu_mutex_lock(&s->num_reqs_lock);
+    while (s->num_reqs > 0) {
+        qemu_cond_wait(&s->no_reqs_cond, &s->num_reqs_lock);
+    }
+    qemu_mutex_unlock(&s->num_reqs_lock);
+}
+
+void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
+{
+    VirtQueue *vq;
+    int i;
+
+    if (s->started) {
+        return;
+    }
+
+    vq = virtio_get_queue(s->vdev, 0);
+    if (!vring_setup(&s->vring, s->vdev, 0)) {
+        return;
+    }
+
+    event_poll_init(&s->event_poll);
+
+    /* Set up guest notifier (irq) */
+    if (s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque,
+                                              true) != 0) {
+        fprintf(stderr, "virtio-blk failed to set guest notifier, "
+                "ensure -enable-kvm is set\n");
+        exit(1);
+    }
+    s->guest_notifier = virtio_queue_get_guest_notifier(vq);
+
+    /* Set up virtqueue notify */
+    if (s->vdev->binding->set_host_notifier(s->vdev->binding_opaque,
+                                            0, true) != 0) {
+        fprintf(stderr, "virtio-blk failed to set host notifier\n");
+        exit(1);
+    }
+    event_poll_add(&s->event_poll, &s->notify_handler,
+                   virtio_queue_get_host_notifier(vq),
+                   handle_notify);
+
+    /* Set up ioqueue */
+    ioq_init(&s->ioqueue, s->fd, REQ_MAX);
+    for (i = 0; i < ARRAY_SIZE(s->requests); i++) {
+        ioq_put_iocb(&s->ioqueue, &s->requests[i].iocb);
+    }
+    event_poll_add(&s->event_poll, &s->io_handler,
+                   ioq_get_notifier(&s->ioqueue), handle_io);
+
+    s->started = true;
+    trace_virtio_blk_data_plane_start(s);
+
+    /* Kick right away to begin processing requests already in vring */
+    event_notifier_set(virtio_queue_get_host_notifier(vq));
+
+    /* Spawn thread in BH so it inherits iothread cpusets */
+    s->start_bh = qemu_bh_new(start_data_plane_bh, s);
+    qemu_bh_schedule(s->start_bh);
+}
+
+void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
+{
+    if (!s->started) {
+        return;
+    }
+    s->started = false;
+    trace_virtio_blk_data_plane_stop(s);
+
+    /* Stop thread or cancel pending thread creation BH */
+    if (s->start_bh) {
+        qemu_bh_delete(s->start_bh);
+        s->start_bh = NULL;
+    } else {
+        virtio_blk_data_plane_drain(s);
+        event_poll_stop(&s->event_poll);
+        qemu_thread_join(&s->thread);
+    }
+
+    ioq_cleanup(&s->ioqueue);
+
+    s->vdev->binding->set_host_notifier(s->vdev->binding_opaque, 0, false);
+
+    event_poll_cleanup(&s->event_poll);
+
+    /* Clean up guest notifier (irq) */
+    s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque, false);
+
+    vring_teardown(&s->vring);
+}
diff --git a/hw/dataplane/virtio-blk.h b/hw/dataplane/virtio-blk.h
new file mode 100644
index 0000000..ddf1115
--- /dev/null
+++ b/hw/dataplane/virtio-blk.h
@@ -0,0 +1,41 @@
+/*
+ * Dedicated thread for virtio-blk I/O processing
+ *
+ * Copyright 2012 IBM, Corp.
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef HW_DATAPLANE_VIRTIO_BLK_H
+#define HW_DATAPLANE_VIRTIO_BLK_H
+
+#include "hw/virtio.h"
+
+typedef struct VirtIOBlockDataPlane VirtIOBlockDataPlane;
+
+#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
+VirtIOBlockDataPlane *virtio_blk_data_plane_create(VirtIODevice *vdev, int fd);
+void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s);
+void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s);
+void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s);
+void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s);
+#else
+static inline VirtIOBlockDataPlane *virtio_blk_data_plane_create(
+        VirtIODevice *vdev, int fd)
+{
+    return NULL;
+}
+
+static inline void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) {}
+static inline void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s) {}
+static inline void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s) {}
+static inline void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s) {}
+#endif
+
+#endif /* HW_DATAPLANE_VIRTIO_BLK_H */
diff --git a/trace-events b/trace-events
index a9a791b..1edc2ae 100644
--- a/trace-events
+++ b/trace-events
@@ -98,6 +98,12 @@ virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
 virtio_blk_handle_write(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_handle_read(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
 
+# hw/dataplane/virtio-blk.c
+virtio_blk_data_plane_start(void *s) "dataplane %p"
+virtio_blk_data_plane_stop(void *s) "dataplane %p"
+virtio_blk_data_plane_process_request(void *s, unsigned int out_num, unsigned int in_num, unsigned int head) "dataplane %p out_num %u in_num %u head %u"
+virtio_blk_data_plane_complete_request(void *s, unsigned int head, int ret) "dataplane %p head %u ret %d"
+
 # hw/dataplane/vring.c
 vring_setup(uint64_t physical, void *desc, void *avail, void *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
 
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 10/11] dataplane: add virtio-blk data plane code
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 10/11] dataplane: add virtio-blk data plane code Stefan Hajnoczi
@ 2012-11-29 13:41   ` Michael S. Tsirkin
  2012-11-29 14:02   ` Michael S. Tsirkin
  1 sibling, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 13:41 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Paolo Bonzini, Asias He

On Thu, Nov 22, 2012 at 04:16:51PM +0100, Stefan Hajnoczi wrote:
> virtio-blk-data-plane is a subset implementation of virtio-blk.  It only
> handles read, write, and flush requests.  It does this using a dedicated
> thread that executes an epoll(2)-based event loop and processes I/O
> using Linux AIO.
> 
> This approach performs very well but can be used for raw image files
> only.  The number of IOPS achieved has been reported to be several times
> higher than the existing virtio-blk implementation.
> 
> Eventually it should be possible to unify virtio-blk-data-plane with the
> main body of QEMU code once the block layer and hardware emulation is
> able to run outside the global mutex.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  hw/dataplane/Makefile.objs |   2 +-
>  hw/dataplane/virtio-blk.c  | 427 +++++++++++++++++++++++++++++++++++++++++++++
>  hw/dataplane/virtio-blk.h  |  41 +++++
>  trace-events               |   6 +
>  4 files changed, 475 insertions(+), 1 deletion(-)
>  create mode 100644 hw/dataplane/virtio-blk.c
>  create mode 100644 hw/dataplane/virtio-blk.h
> 
> diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
> index abd408f..682aa9e 100644
> --- a/hw/dataplane/Makefile.objs
> +++ b/hw/dataplane/Makefile.objs
> @@ -1,3 +1,3 @@
>  ifeq ($(CONFIG_VIRTIO), y)
> -common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o event-poll.o ioq.o
> +common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o event-poll.o ioq.o virtio-blk.o
>  endif
> diff --git a/hw/dataplane/virtio-blk.c b/hw/dataplane/virtio-blk.c
> new file mode 100644
> index 0000000..9b29969
> --- /dev/null
> +++ b/hw/dataplane/virtio-blk.c
> @@ -0,0 +1,427 @@
> +/*
> + * Dedicated thread for virtio-blk I/O processing
> + *
> + * Copyright 2012 IBM, Corp.
> + * Copyright 2012 Red Hat, Inc. and/or its affiliates
> + *
> + * Authors:
> + *   Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "trace.h"
> +#include "iov.h"
> +#include "event-poll.h"
> +#include "qemu-thread.h"
> +#include "vring.h"
> +#include "ioq.h"
> +#include "hw/virtio-blk.h"
> +#include "hw/dataplane/virtio-blk.h"
> +
> +enum {
> +    SEG_MAX = 126,                  /* maximum number of I/O segments */
> +    VRING_MAX = SEG_MAX + 2,        /* maximum number of vring descriptors */
> +    REQ_MAX = VRING_MAX,            /* maximum number of requests in the vring,
> +                                     * is VRING_MAX / 2 with traditional and
> +                                     * VRING_MAX with indirect descriptors */
> +};
> +
> +typedef struct {
> +    struct iocb iocb;               /* Linux AIO control block */
> +    QEMUIOVector *inhdr;            /* iovecs for virtio_blk_inhdr */
> +    unsigned int head;              /* vring descriptor index */
> +} VirtIOBlockRequest;
> +
> +struct VirtIOBlockDataPlane {
> +    bool started;
> +    QEMUBH *start_bh;
> +    QemuThread thread;
> +
> +    int fd;                         /* image file descriptor */
> +
> +    VirtIODevice *vdev;
> +    Vring vring;                    /* virtqueue vring */
> +    EventNotifier *guest_notifier;  /* irq */
> +
> +    EventPoll event_poll;           /* event poller */
> +    EventHandler io_handler;        /* Linux AIO completion handler */
> +    EventHandler notify_handler;    /* virtqueue notify handler */
> +
> +    IOQueue ioqueue;                /* Linux AIO queue (should really be per
> +                                       dataplane thread) */
> +    VirtIOBlockRequest requests[REQ_MAX]; /* pool of requests, managed by the
> +                                             queue */
> +
> +    unsigned int num_reqs;
> +    QemuMutex num_reqs_lock;
> +    QemuCond no_reqs_cond;
> +};
> +
> +/* Raise an interrupt to signal guest, if necessary */
> +static void notify_guest(VirtIOBlockDataPlane *s)
> +{
> +    if (!vring_should_notify(s->vdev, &s->vring)) {
> +        return;
> +    }
> +
> +    event_notifier_set(s->guest_notifier);
> +}
> +
> +static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
> +{
> +    VirtIOBlockDataPlane *s = opaque;
> +    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
> +    struct virtio_blk_inhdr hdr;
> +    int len;
> +
> +    if (likely(ret >= 0)) {
> +        hdr.status = VIRTIO_BLK_S_OK;
> +        len = ret;
> +    } else {
> +        hdr.status = VIRTIO_BLK_S_IOERR;
> +        len = 0;
> +    }
> +
> +    trace_virtio_blk_data_plane_complete_request(s, req->head, ret);
> +
> +    qemu_iovec_from_buf(req->inhdr, 0, &hdr, sizeof(hdr));
> +    qemu_iovec_destroy(req->inhdr);
> +    g_slice_free(QEMUIOVector, req->inhdr);
> +
> +    /* According to the virtio specification len should be the number of bytes
> +     * written to, but for virtio-blk it seems to be the number of bytes
> +     * transferred plus the status bytes.
> +     */
> +    vring_push(&s->vring, req->head, len + sizeof(hdr));
> +
> +    qemu_mutex_lock(&s->num_reqs_lock);
> +    if (--s->num_reqs == 0) {
> +        qemu_cond_broadcast(&s->no_reqs_cond);
> +    }
> +    qemu_mutex_unlock(&s->num_reqs_lock);
> +}
> +
> +static void fail_request_early(VirtIOBlockDataPlane *s, unsigned int head,
> +                               QEMUIOVector *inhdr, unsigned char status)
> +{
> +    struct virtio_blk_inhdr hdr = {
> +        .status = status,
> +    };
> +
> +    qemu_iovec_from_buf(inhdr, 0, &hdr, sizeof(hdr));
> +    qemu_iovec_destroy(inhdr);
> +    g_slice_free(QEMUIOVector, inhdr);
> +
> +    vring_push(&s->vring, head, sizeof(hdr));
> +    notify_guest(s);
> +}
> +
> +static int process_request(IOQueue *ioq, struct iovec iov[],
> +                           unsigned int out_num, unsigned int in_num,
> +                           unsigned int head)
> +{
> +    VirtIOBlockDataPlane *s = container_of(ioq, VirtIOBlockDataPlane, ioqueue);
> +    struct iovec *in_iov = &iov[out_num];
> +    struct virtio_blk_outhdr outhdr;
> +    QEMUIOVector *inhdr;
> +    size_t in_size;
> +
> +    /* Copy in outhdr */
> +    if (unlikely(iov_to_buf(iov, out_num, 0, &outhdr,
> +                            sizeof(outhdr)) != sizeof(outhdr))) {
> +        error_report("virtio-blk request outhdr too short");
> +        return -EFAULT;
> +    }
> +    iov_discard(&iov, &out_num, sizeof(outhdr));
> +
> +    /* Grab inhdr for later */
> +    in_size = iov_size(in_iov, in_num);
> +    if (in_size < sizeof(struct virtio_blk_inhdr)) {
> +        error_report("virtio_blk request inhdr too short");
> +        return -EFAULT;
> +    }
> +    inhdr = g_slice_new(QEMUIOVector);
> +    qemu_iovec_init(inhdr, 1);
> +    qemu_iovec_concat_iov(inhdr, in_iov, in_num,
> +            in_size - sizeof(struct virtio_blk_inhdr),
> +            sizeof(struct virtio_blk_inhdr));
> +    iov_discard(&in_iov, &in_num, -sizeof(struct virtio_blk_inhdr));
> +
> +    /* TODO Linux sets the barrier bit even when not advertised! */
> +    outhdr.type &= ~VIRTIO_BLK_T_BARRIER;
> +
> +    struct iocb *iocb;
> +    switch (outhdr.type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_SCSI_CMD |
> +                           VIRTIO_BLK_T_FLUSH)) {
> +    case VIRTIO_BLK_T_IN:
> +        iocb = ioq_rdwr(ioq, true, in_iov, in_num, outhdr.sector * 512);
> +        break;
> +
> +    case VIRTIO_BLK_T_OUT:
> +        iocb = ioq_rdwr(ioq, false, iov, out_num, outhdr.sector * 512);
> +        break;
> +
> +    case VIRTIO_BLK_T_SCSI_CMD:
> +        /* TODO support SCSI commands */
> +        fail_request_early(s, head, inhdr, VIRTIO_BLK_S_UNSUPP);
> +        return 0;
> +
> +    case VIRTIO_BLK_T_FLUSH:
> +        /* TODO fdsync not supported by Linux AIO, do it synchronously here! */
> +        fdatasync(s->fd);
> +        fail_request_early(s, head, inhdr, VIRTIO_BLK_S_OK);
> +        return 0;
> +
> +    default:
> +        error_report("virtio-blk unsupported request type %#x", outhdr.type);
> +        qemu_iovec_destroy(inhdr);
> +        g_slice_free(QEMUIOVector, inhdr);
> +        return -EFAULT;
> +    }
> +
> +    /* Fill in virtio block metadata needed for completion */
> +    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
> +    req->head = head;
> +    req->inhdr = inhdr;
> +    return 0;
> +}
> +
> +static bool handle_notify(EventHandler *handler)
> +{
> +    VirtIOBlockDataPlane *s = container_of(handler, VirtIOBlockDataPlane,
> +                                           notify_handler);
> +
> +    /* There is one array of iovecs into which all new requests are extracted
> +     * from the vring.  Requests are read from the vring and the translated
> +     * descriptors are written to the iovecs array.  The iovecs do not have to
> +     * persist across handle_notify() calls because the kernel copies the
> +     * iovecs on io_submit().
> +     *
> +     * Handling io_submit() EAGAIN may require storing the requests across
> +     * handle_notify() calls until the kernel has sufficient resources to
> +     * accept more I/O.  This is not implemented yet.
> +     */
> +    struct iovec iovec[VRING_MAX];
> +    struct iovec *end = &iovec[VRING_MAX];
> +    struct iovec *iov = iovec;
> +
> +    /* When a request is read from the vring, the index of the first descriptor
> +     * (aka head) is returned so that the completed request can be pushed onto
> +     * the vring later.
> +     *
> +     * The number of hypervisor read-only iovecs is out_num.  The number of
> +     * hypervisor write-only iovecs is in_num.
> +     */
> +    int head;
> +    unsigned int out_num = 0, in_num = 0;
> +    unsigned int num_queued;
> +
> +    for (;;) {
> +        /* Disable guest->host notifies to avoid unnecessary vmexits */
> +        vring_set_notification(s->vdev, &s->vring, false);
> +
> +        for (;;) {
> +            head = vring_pop(s->vdev, &s->vring, iov, end, &out_num, &in_num);
> +            if (head < 0) {
> +                break; /* no more requests */
> +            }
> +
> +            trace_virtio_blk_data_plane_process_request(s, out_num, in_num,
> +                                                        head);
> +
> +            if (process_request(&s->ioqueue, iov, out_num, in_num, head) < 0) {
> +                vring_set_broken(&s->vring);
> +                break;
> +            }
> +            iov += out_num + in_num;
> +        }
> +
> +        if (likely(head == -EAGAIN)) { /* vring emptied */
> +            /* Re-enable guest->host notifies and stop processing the vring.
> +             * But if the guest has snuck in more descriptors, keep processing.
> +             */
> +            vring_set_notification(s->vdev, &s->vring, true);
> +            smp_mb();

A memory barrier at this level looks wrong - barriers should be
part of vring processing.

> +            if (!vring_more_avail(&s->vring)) {
> +                break;
> +            }
> +        } else { /* head == -ENOBUFS or fatal error, iovecs[] is depleted */
> +            /* Since there are no iovecs[] left, stop processing for now.  Do
> +             * not re-enable guest->host notifies since the I/O completion
> +             * handler knows to check for more vring descriptors anyway.
> +             */
> +            break;
> +        }
> +    }
> +
> +    num_queued = ioq_num_queued(&s->ioqueue);
> +    if (num_queued > 0) {
> +        qemu_mutex_lock(&s->num_reqs_lock);
> +        s->num_reqs += num_queued;
> +        qemu_mutex_unlock(&s->num_reqs_lock);
> +
> +        int rc = ioq_submit(&s->ioqueue);
> +        if (unlikely(rc < 0)) {
> +            fprintf(stderr, "ioq_submit failed %d\n", rc);
> +            exit(1);
> +        }
> +    }
> +    return true;
> +}
> +
> +static bool handle_io(EventHandler *handler)
> +{
> +    VirtIOBlockDataPlane *s = container_of(handler, VirtIOBlockDataPlane,
> +                                           io_handler);
> +
> +    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
> +        notify_guest(s);
> +    }
> +
> +    /* If there were more requests than iovecs, the vring will not be empty yet
> +     * so check again.  There should now be enough resources to process more
> +     * requests.
> +     */
> +    if (unlikely(vring_more_avail(&s->vring))) {
> +        return handle_notify(&s->notify_handler);
> +    }
> +
> +    return true;
> +}
> +
> +static void *data_plane_thread(void *opaque)
> +{
> +    VirtIOBlockDataPlane *s = opaque;
> +    event_poll_run(&s->event_poll);
> +    return NULL;
> +}
> +
> +static void start_data_plane_bh(void *opaque)
> +{
> +    VirtIOBlockDataPlane *s = opaque;
> +
> +    qemu_bh_delete(s->start_bh);
> +    s->start_bh = NULL;
> +    qemu_thread_create(&s->thread, data_plane_thread,
> +                       s, QEMU_THREAD_JOINABLE);
> +}
> +
> +VirtIOBlockDataPlane *virtio_blk_data_plane_create(VirtIODevice *vdev, int fd)
> +{
> +    VirtIOBlockDataPlane *s;
> +
> +    s = g_new0(VirtIOBlockDataPlane, 1);
> +    s->vdev = vdev;
> +    s->fd = fd;
> +    return s;
> +}
> +
> +void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
> +{
> +    if (!s) {
> +        return;
> +    }
> +    virtio_blk_data_plane_stop(s);
> +    g_free(s);
> +}
> +
> +/* Block until pending requests have completed
> + *
> + * The vring continues to be serviced so ensure no new requests will be added
> + * to avoid races.
> + */
> +void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s)
> +{
> +    qemu_mutex_lock(&s->num_reqs_lock);
> +    while (s->num_reqs > 0) {
> +        qemu_cond_wait(&s->no_reqs_cond, &s->num_reqs_lock);
> +    }
> +    qemu_mutex_unlock(&s->num_reqs_lock);
> +}
> +
> +void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
> +{
> +    VirtQueue *vq;
> +    int i;
> +
> +    if (s->started) {
> +        return;
> +    }
> +
> +    vq = virtio_get_queue(s->vdev, 0);
> +    if (!vring_setup(&s->vring, s->vdev, 0)) {
> +        return;
> +    }
> +
> +    event_poll_init(&s->event_poll);
> +
> +    /* Set up guest notifier (irq) */
> +    if (s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque,
> +                                              true) != 0) {
> +        fprintf(stderr, "virtio-blk failed to set guest notifier, "
> +                "ensure -enable-kvm is set\n");
> +        exit(1);
> +    }
> +    s->guest_notifier = virtio_queue_get_guest_notifier(vq);
> +
> +    /* Set up virtqueue notify */
> +    if (s->vdev->binding->set_host_notifier(s->vdev->binding_opaque,
> +                                            0, true) != 0) {
> +        fprintf(stderr, "virtio-blk failed to set host notifier\n");
> +        exit(1);
> +    }
> +    event_poll_add(&s->event_poll, &s->notify_handler,
> +                   virtio_queue_get_host_notifier(vq),
> +                   handle_notify);
> +
> +    /* Set up ioqueue */
> +    ioq_init(&s->ioqueue, s->fd, REQ_MAX);
> +    for (i = 0; i < ARRAY_SIZE(s->requests); i++) {
> +        ioq_put_iocb(&s->ioqueue, &s->requests[i].iocb);
> +    }
> +    event_poll_add(&s->event_poll, &s->io_handler,
> +                   ioq_get_notifier(&s->ioqueue), handle_io);
> +
> +    s->started = true;
> +    trace_virtio_blk_data_plane_start(s);
> +
> +    /* Kick right away to begin processing requests already in vring */
> +    event_notifier_set(virtio_queue_get_host_notifier(vq));
> +
> +    /* Spawn thread in BH so it inherits iothread cpusets */
> +    s->start_bh = qemu_bh_new(start_data_plane_bh, s);
> +    qemu_bh_schedule(s->start_bh);
> +}
> +
> +void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
> +{
> +    if (!s->started) {
> +        return;
> +    }
> +    s->started = false;
> +    trace_virtio_blk_data_plane_stop(s);
> +
> +    /* Stop thread or cancel pending thread creation BH */
> +    if (s->start_bh) {
> +        qemu_bh_delete(s->start_bh);
> +        s->start_bh = NULL;
> +    } else {
> +        virtio_blk_data_plane_drain(s);
> +        event_poll_stop(&s->event_poll);
> +        qemu_thread_join(&s->thread);
> +    }
> +
> +    ioq_cleanup(&s->ioqueue);
> +
> +    s->vdev->binding->set_host_notifier(s->vdev->binding_opaque, 0, false);
> +
> +    event_poll_cleanup(&s->event_poll);
> +
> +    /* Clean up guest notifier (irq) */
> +    s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque, false);
> +
> +    vring_teardown(&s->vring);
> +}
> diff --git a/hw/dataplane/virtio-blk.h b/hw/dataplane/virtio-blk.h
> new file mode 100644
> index 0000000..ddf1115
> --- /dev/null
> +++ b/hw/dataplane/virtio-blk.h
> @@ -0,0 +1,41 @@
> +/*
> + * Dedicated thread for virtio-blk I/O processing
> + *
> + * Copyright 2012 IBM, Corp.
> + * Copyright 2012 Red Hat, Inc. and/or its affiliates
> + *
> + * Authors:
> + *   Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef HW_DATAPLANE_VIRTIO_BLK_H
> +#define HW_DATAPLANE_VIRTIO_BLK_H
> +
> +#include "hw/virtio.h"
> +
> +typedef struct VirtIOBlockDataPlane VirtIOBlockDataPlane;
> +
> +#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> +VirtIOBlockDataPlane *virtio_blk_data_plane_create(VirtIODevice *vdev, int fd);
> +void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s);
> +void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s);
> +void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s);
> +void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s);
> +#else
> +static inline VirtIOBlockDataPlane *virtio_blk_data_plane_create(
> +        VirtIODevice *vdev, int fd)
> +{
> +    return NULL;
> +}
> +
> +static inline void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) {}
> +static inline void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s) {}
> +static inline void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s) {}
> +static inline void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s) {}
> +#endif
> +
> +#endif /* HW_DATAPLANE_VIRTIO_BLK_H */
> diff --git a/trace-events b/trace-events
> index a9a791b..1edc2ae 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -98,6 +98,12 @@ virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
>  virtio_blk_handle_write(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
>  virtio_blk_handle_read(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
>  
> +# hw/dataplane/virtio-blk.c
> +virtio_blk_data_plane_start(void *s) "dataplane %p"
> +virtio_blk_data_plane_stop(void *s) "dataplane %p"
> +virtio_blk_data_plane_process_request(void *s, unsigned int out_num, unsigned int in_num, unsigned int head) "dataplane %p out_num %u in_num %u head %u"
> +virtio_blk_data_plane_complete_request(void *s, unsigned int head, int ret) "dataplane %p head %u ret %d"
> +
>  # hw/dataplane/vring.c
>  vring_setup(uint64_t physical, void *desc, void *avail, void *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
>  
> -- 
> 1.8.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 10/11] dataplane: add virtio-blk data plane code
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 10/11] dataplane: add virtio-blk data plane code Stefan Hajnoczi
  2012-11-29 13:41   ` Michael S. Tsirkin
@ 2012-11-29 14:02   ` Michael S. Tsirkin
  2012-11-29 15:21     ` Paolo Bonzini
  1 sibling, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 14:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Paolo Bonzini, Asias He

On Thu, Nov 22, 2012 at 04:16:51PM +0100, Stefan Hajnoczi wrote:
> virtio-blk-data-plane is a subset implementation of virtio-blk.  It only
> handles read, write, and flush requests.  It does this using a dedicated
> thread that executes an epoll(2)-based event loop and processes I/O
> using Linux AIO.
> 
> This approach performs very well but can be used for raw image files
> only.  The number of IOPS achieved has been reported to be several times
> higher than the existing virtio-blk implementation.
> 
> Eventually it should be possible to unify virtio-blk-data-plane with the
> main body of QEMU code once the block layer and hardware emulation is
> able to run outside the global mutex.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  hw/dataplane/Makefile.objs |   2 +-
>  hw/dataplane/virtio-blk.c  | 427 +++++++++++++++++++++++++++++++++++++++++++++
>  hw/dataplane/virtio-blk.h  |  41 +++++
>  trace-events               |   6 +
>  4 files changed, 475 insertions(+), 1 deletion(-)
>  create mode 100644 hw/dataplane/virtio-blk.c
>  create mode 100644 hw/dataplane/virtio-blk.h
> 
> diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
> index abd408f..682aa9e 100644
> --- a/hw/dataplane/Makefile.objs
> +++ b/hw/dataplane/Makefile.objs
> @@ -1,3 +1,3 @@
>  ifeq ($(CONFIG_VIRTIO), y)
> -common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o event-poll.o ioq.o
> +common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += hostmem.o vring.o event-poll.o ioq.o virtio-blk.o
>  endif
> diff --git a/hw/dataplane/virtio-blk.c b/hw/dataplane/virtio-blk.c
> new file mode 100644
> index 0000000..9b29969
> --- /dev/null
> +++ b/hw/dataplane/virtio-blk.c
> @@ -0,0 +1,427 @@
> +/*
> + * Dedicated thread for virtio-blk I/O processing
> + *
> + * Copyright 2012 IBM, Corp.
> + * Copyright 2012 Red Hat, Inc. and/or its affiliates
> + *
> + * Authors:
> + *   Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "trace.h"
> +#include "iov.h"
> +#include "event-poll.h"
> +#include "qemu-thread.h"
> +#include "vring.h"
> +#include "ioq.h"
> +#include "hw/virtio-blk.h"
> +#include "hw/dataplane/virtio-blk.h"
> +
> +enum {
> +    SEG_MAX = 126,                  /* maximum number of I/O segments */
> +    VRING_MAX = SEG_MAX + 2,        /* maximum number of vring descriptors */
> +    REQ_MAX = VRING_MAX,            /* maximum number of requests in the vring,
> +                                     * is VRING_MAX / 2 with traditional and
> +                                     * VRING_MAX with indirect descriptors */
> +};
> +
> +typedef struct {
> +    struct iocb iocb;               /* Linux AIO control block */
> +    QEMUIOVector *inhdr;            /* iovecs for virtio_blk_inhdr */
> +    unsigned int head;              /* vring descriptor index */
> +} VirtIOBlockRequest;
> +
> +struct VirtIOBlockDataPlane {
> +    bool started;
> +    QEMUBH *start_bh;
> +    QemuThread thread;
> +
> +    int fd;                         /* image file descriptor */
> +
> +    VirtIODevice *vdev;
> +    Vring vring;                    /* virtqueue vring */
> +    EventNotifier *guest_notifier;  /* irq */
> +
> +    EventPoll event_poll;           /* event poller */
> +    EventHandler io_handler;        /* Linux AIO completion handler */
> +    EventHandler notify_handler;    /* virtqueue notify handler */
> +
> +    IOQueue ioqueue;                /* Linux AIO queue (should really be per
> +                                       dataplane thread) */
> +    VirtIOBlockRequest requests[REQ_MAX]; /* pool of requests, managed by the
> +                                             queue */
> +
> +    unsigned int num_reqs;
> +    QemuMutex num_reqs_lock;

OK the only reason this lock is needed is because
you want to drain outside the thread.
Won't it be better to queue process the drain request through
the thread?
You won't need any locks then.

> +    QemuCond no_reqs_cond;
> +};
> +
> +/* Raise an interrupt to signal guest, if necessary */
> +static void notify_guest(VirtIOBlockDataPlane *s)
> +{
> +    if (!vring_should_notify(s->vdev, &s->vring)) {
> +        return;
> +    }
> +
> +    event_notifier_set(s->guest_notifier);
> +}
> +
> +static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
> +{
> +    VirtIOBlockDataPlane *s = opaque;
> +    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
> +    struct virtio_blk_inhdr hdr;
> +    int len;
> +
> +    if (likely(ret >= 0)) {
> +        hdr.status = VIRTIO_BLK_S_OK;
> +        len = ret;
> +    } else {
> +        hdr.status = VIRTIO_BLK_S_IOERR;
> +        len = 0;
> +    }
> +
> +    trace_virtio_blk_data_plane_complete_request(s, req->head, ret);
> +
> +    qemu_iovec_from_buf(req->inhdr, 0, &hdr, sizeof(hdr));
> +    qemu_iovec_destroy(req->inhdr);
> +    g_slice_free(QEMUIOVector, req->inhdr);
> +
> +    /* According to the virtio specification len should be the number of bytes
> +     * written to, but for virtio-blk it seems to be the number of bytes
> +     * transferred plus the status bytes.
> +     */
> +    vring_push(&s->vring, req->head, len + sizeof(hdr));
> +
> +    qemu_mutex_lock(&s->num_reqs_lock);
> +    if (--s->num_reqs == 0) {
> +        qemu_cond_broadcast(&s->no_reqs_cond);
> +    }
> +    qemu_mutex_unlock(&s->num_reqs_lock);
> +}
> +
> +static void fail_request_early(VirtIOBlockDataPlane *s, unsigned int head,
> +                               QEMUIOVector *inhdr, unsigned char status)
> +{
> +    struct virtio_blk_inhdr hdr = {
> +        .status = status,
> +    };
> +
> +    qemu_iovec_from_buf(inhdr, 0, &hdr, sizeof(hdr));
> +    qemu_iovec_destroy(inhdr);
> +    g_slice_free(QEMUIOVector, inhdr);
> +
> +    vring_push(&s->vring, head, sizeof(hdr));
> +    notify_guest(s);
> +}
> +
> +static int process_request(IOQueue *ioq, struct iovec iov[],
> +                           unsigned int out_num, unsigned int in_num,
> +                           unsigned int head)
> +{
> +    VirtIOBlockDataPlane *s = container_of(ioq, VirtIOBlockDataPlane, ioqueue);
> +    struct iovec *in_iov = &iov[out_num];
> +    struct virtio_blk_outhdr outhdr;
> +    QEMUIOVector *inhdr;
> +    size_t in_size;
> +
> +    /* Copy in outhdr */
> +    if (unlikely(iov_to_buf(iov, out_num, 0, &outhdr,
> +                            sizeof(outhdr)) != sizeof(outhdr))) {
> +        error_report("virtio-blk request outhdr too short");
> +        return -EFAULT;
> +    }
> +    iov_discard(&iov, &out_num, sizeof(outhdr));
> +
> +    /* Grab inhdr for later */
> +    in_size = iov_size(in_iov, in_num);
> +    if (in_size < sizeof(struct virtio_blk_inhdr)) {
> +        error_report("virtio_blk request inhdr too short");
> +        return -EFAULT;
> +    }
> +    inhdr = g_slice_new(QEMUIOVector);
> +    qemu_iovec_init(inhdr, 1);
> +    qemu_iovec_concat_iov(inhdr, in_iov, in_num,
> +            in_size - sizeof(struct virtio_blk_inhdr),
> +            sizeof(struct virtio_blk_inhdr));
> +    iov_discard(&in_iov, &in_num, -sizeof(struct virtio_blk_inhdr));
> +
> +    /* TODO Linux sets the barrier bit even when not advertised! */
> +    outhdr.type &= ~VIRTIO_BLK_T_BARRIER;
> +
> +    struct iocb *iocb;
> +    switch (outhdr.type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_SCSI_CMD |
> +                           VIRTIO_BLK_T_FLUSH)) {
> +    case VIRTIO_BLK_T_IN:
> +        iocb = ioq_rdwr(ioq, true, in_iov, in_num, outhdr.sector * 512);
> +        break;
> +
> +    case VIRTIO_BLK_T_OUT:
> +        iocb = ioq_rdwr(ioq, false, iov, out_num, outhdr.sector * 512);
> +        break;
> +
> +    case VIRTIO_BLK_T_SCSI_CMD:
> +        /* TODO support SCSI commands */
> +        fail_request_early(s, head, inhdr, VIRTIO_BLK_S_UNSUPP);
> +        return 0;
> +
> +    case VIRTIO_BLK_T_FLUSH:
> +        /* TODO fdsync not supported by Linux AIO, do it synchronously here! */
> +        fdatasync(s->fd);
> +        fail_request_early(s, head, inhdr, VIRTIO_BLK_S_OK);
> +        return 0;
> +
> +    default:
> +        error_report("virtio-blk unsupported request type %#x", outhdr.type);
> +        qemu_iovec_destroy(inhdr);
> +        g_slice_free(QEMUIOVector, inhdr);
> +        return -EFAULT;
> +    }
> +
> +    /* Fill in virtio block metadata needed for completion */
> +    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
> +    req->head = head;
> +    req->inhdr = inhdr;
> +    return 0;
> +}
> +
> +static bool handle_notify(EventHandler *handler)
> +{
> +    VirtIOBlockDataPlane *s = container_of(handler, VirtIOBlockDataPlane,
> +                                           notify_handler);
> +
> +    /* There is one array of iovecs into which all new requests are extracted
> +     * from the vring.  Requests are read from the vring and the translated
> +     * descriptors are written to the iovecs array.  The iovecs do not have to
> +     * persist across handle_notify() calls because the kernel copies the
> +     * iovecs on io_submit().
> +     *
> +     * Handling io_submit() EAGAIN may require storing the requests across
> +     * handle_notify() calls until the kernel has sufficient resources to
> +     * accept more I/O.  This is not implemented yet.
> +     */
> +    struct iovec iovec[VRING_MAX];
> +    struct iovec *end = &iovec[VRING_MAX];
> +    struct iovec *iov = iovec;
> +
> +    /* When a request is read from the vring, the index of the first descriptor
> +     * (aka head) is returned so that the completed request can be pushed onto
> +     * the vring later.
> +     *
> +     * The number of hypervisor read-only iovecs is out_num.  The number of
> +     * hypervisor write-only iovecs is in_num.
> +     */
> +    int head;
> +    unsigned int out_num = 0, in_num = 0;
> +    unsigned int num_queued;
> +
> +    for (;;) {
> +        /* Disable guest->host notifies to avoid unnecessary vmexits */
> +        vring_set_notification(s->vdev, &s->vring, false);
> +
> +        for (;;) {
> +            head = vring_pop(s->vdev, &s->vring, iov, end, &out_num, &in_num);
> +            if (head < 0) {
> +                break; /* no more requests */
> +            }
> +
> +            trace_virtio_blk_data_plane_process_request(s, out_num, in_num,
> +                                                        head);
> +
> +            if (process_request(&s->ioqueue, iov, out_num, in_num, head) < 0) {
> +                vring_set_broken(&s->vring);
> +                break;
> +            }
> +            iov += out_num + in_num;
> +        }
> +
> +        if (likely(head == -EAGAIN)) { /* vring emptied */
> +            /* Re-enable guest->host notifies and stop processing the vring.
> +             * But if the guest has snuck in more descriptors, keep processing.
> +             */
> +            vring_set_notification(s->vdev, &s->vring, true);
> +            smp_mb();
> +            if (!vring_more_avail(&s->vring)) {
> +                break;
> +            }
> +        } else { /* head == -ENOBUFS or fatal error, iovecs[] is depleted */
> +            /* Since there are no iovecs[] left, stop processing for now.  Do
> +             * not re-enable guest->host notifies since the I/O completion
> +             * handler knows to check for more vring descriptors anyway.
> +             */
> +            break;
> +        }
> +    }
> +
> +    num_queued = ioq_num_queued(&s->ioqueue);
> +    if (num_queued > 0) {
> +        qemu_mutex_lock(&s->num_reqs_lock);
> +        s->num_reqs += num_queued;
> +        qemu_mutex_unlock(&s->num_reqs_lock);
> +
> +        int rc = ioq_submit(&s->ioqueue);
> +        if (unlikely(rc < 0)) {
> +            fprintf(stderr, "ioq_submit failed %d\n", rc);
> +            exit(1);
> +        }
> +    }
> +    return true;
> +}
> +
> +static bool handle_io(EventHandler *handler)
> +{
> +    VirtIOBlockDataPlane *s = container_of(handler, VirtIOBlockDataPlane,
> +                                           io_handler);
> +
> +    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
> +        notify_guest(s);
> +    }
> +
> +    /* If there were more requests than iovecs, the vring will not be empty yet
> +     * so check again.  There should now be enough resources to process more
> +     * requests.
> +     */
> +    if (unlikely(vring_more_avail(&s->vring))) {
> +        return handle_notify(&s->notify_handler);
> +    }
> +
> +    return true;
> +}
> +
> +static void *data_plane_thread(void *opaque)
> +{
> +    VirtIOBlockDataPlane *s = opaque;
> +    event_poll_run(&s->event_poll);
> +    return NULL;
> +}
> +
> +static void start_data_plane_bh(void *opaque)
> +{
> +    VirtIOBlockDataPlane *s = opaque;
> +
> +    qemu_bh_delete(s->start_bh);
> +    s->start_bh = NULL;
> +    qemu_thread_create(&s->thread, data_plane_thread,
> +                       s, QEMU_THREAD_JOINABLE);
> +}
> +
> +VirtIOBlockDataPlane *virtio_blk_data_plane_create(VirtIODevice *vdev, int fd)
> +{
> +    VirtIOBlockDataPlane *s;
> +
> +    s = g_new0(VirtIOBlockDataPlane, 1);
> +    s->vdev = vdev;
> +    s->fd = fd;
> +    return s;
> +}
> +
> +void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
> +{
> +    if (!s) {
> +        return;
> +    }
> +    virtio_blk_data_plane_stop(s);
> +    g_free(s);
> +}
> +
> +/* Block until pending requests have completed
> + *
> + * The vring continues to be serviced so ensure no new requests will be added
> + * to avoid races.

This comment confuses me. "avoid races" is a kind of vague
comment that does not really help.

This function does not seem to ensure
no new requests - it simply waits until num requests
gets to 0. But requests could get added right afterwards
and it won't help.

Could be comment be made more clear please?

> + */
> +void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s)
> +{
> +    qemu_mutex_lock(&s->num_reqs_lock);
> +    while (s->num_reqs > 0) {
> +        qemu_cond_wait(&s->no_reqs_cond, &s->num_reqs_lock);
> +    }
> +    qemu_mutex_unlock(&s->num_reqs_lock);
> +}
> +
> +void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
> +{
> +    VirtQueue *vq;
> +    int i;
> +
> +    if (s->started) {
> +        return;
> +    }
> +
> +    vq = virtio_get_queue(s->vdev, 0);
> +    if (!vring_setup(&s->vring, s->vdev, 0)) {
> +        return;
> +    }
> +
> +    event_poll_init(&s->event_poll);
> +
> +    /* Set up guest notifier (irq) */
> +    if (s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque,
> +                                              true) != 0) {
> +        fprintf(stderr, "virtio-blk failed to set guest notifier, "
> +                "ensure -enable-kvm is set\n");
> +        exit(1);
> +    }
> +    s->guest_notifier = virtio_queue_get_guest_notifier(vq);
> +
> +    /* Set up virtqueue notify */
> +    if (s->vdev->binding->set_host_notifier(s->vdev->binding_opaque,
> +                                            0, true) != 0) {
> +        fprintf(stderr, "virtio-blk failed to set host notifier\n");
> +        exit(1);
> +    }
> +    event_poll_add(&s->event_poll, &s->notify_handler,
> +                   virtio_queue_get_host_notifier(vq),
> +                   handle_notify);
> +
> +    /* Set up ioqueue */
> +    ioq_init(&s->ioqueue, s->fd, REQ_MAX);
> +    for (i = 0; i < ARRAY_SIZE(s->requests); i++) {
> +        ioq_put_iocb(&s->ioqueue, &s->requests[i].iocb);
> +    }
> +    event_poll_add(&s->event_poll, &s->io_handler,
> +                   ioq_get_notifier(&s->ioqueue), handle_io);
> +
> +    s->started = true;
> +    trace_virtio_blk_data_plane_start(s);
> +
> +    /* Kick right away to begin processing requests already in vring */
> +    event_notifier_set(virtio_queue_get_host_notifier(vq));
> +
> +    /* Spawn thread in BH so it inherits iothread cpusets */
> +    s->start_bh = qemu_bh_new(start_data_plane_bh, s);
> +    qemu_bh_schedule(s->start_bh);
> +}
> +
> +void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
> +{
> +    if (!s->started) {
> +        return;
> +    }
> +    s->started = false;
> +    trace_virtio_blk_data_plane_stop(s);
> +
> +    /* Stop thread or cancel pending thread creation BH */
> +    if (s->start_bh) {
> +        qemu_bh_delete(s->start_bh);
> +        s->start_bh = NULL;
> +    } else {
> +        virtio_blk_data_plane_drain(s);
> +        event_poll_stop(&s->event_poll);
> +        qemu_thread_join(&s->thread);
> +    }
> +
> +    ioq_cleanup(&s->ioqueue);
> +
> +    s->vdev->binding->set_host_notifier(s->vdev->binding_opaque, 0, false);
> +
> +    event_poll_cleanup(&s->event_poll);
> +
> +    /* Clean up guest notifier (irq) */
> +    s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque, false);
> +
> +    vring_teardown(&s->vring);
> +}
> diff --git a/hw/dataplane/virtio-blk.h b/hw/dataplane/virtio-blk.h
> new file mode 100644
> index 0000000..ddf1115
> --- /dev/null
> +++ b/hw/dataplane/virtio-blk.h
> @@ -0,0 +1,41 @@
> +/*
> + * Dedicated thread for virtio-blk I/O processing
> + *
> + * Copyright 2012 IBM, Corp.
> + * Copyright 2012 Red Hat, Inc. and/or its affiliates
> + *
> + * Authors:
> + *   Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef HW_DATAPLANE_VIRTIO_BLK_H
> +#define HW_DATAPLANE_VIRTIO_BLK_H
> +
> +#include "hw/virtio.h"
> +
> +typedef struct VirtIOBlockDataPlane VirtIOBlockDataPlane;
> +
> +#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> +VirtIOBlockDataPlane *virtio_blk_data_plane_create(VirtIODevice *vdev, int fd);
> +void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s);
> +void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s);
> +void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s);
> +void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s);
> +#else
> +static inline VirtIOBlockDataPlane *virtio_blk_data_plane_create(
> +        VirtIODevice *vdev, int fd)
> +{
> +    return NULL;
> +}
> +
> +static inline void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) {}
> +static inline void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s) {}
> +static inline void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s) {}
> +static inline void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s) {}
> +#endif
> +
> +#endif /* HW_DATAPLANE_VIRTIO_BLK_H */
> diff --git a/trace-events b/trace-events
> index a9a791b..1edc2ae 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -98,6 +98,12 @@ virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
>  virtio_blk_handle_write(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
>  virtio_blk_handle_read(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
>  
> +# hw/dataplane/virtio-blk.c
> +virtio_blk_data_plane_start(void *s) "dataplane %p"
> +virtio_blk_data_plane_stop(void *s) "dataplane %p"
> +virtio_blk_data_plane_process_request(void *s, unsigned int out_num, unsigned int in_num, unsigned int head) "dataplane %p out_num %u in_num %u head %u"
> +virtio_blk_data_plane_complete_request(void *s, unsigned int head, int ret) "dataplane %p head %u ret %d"
> +
>  # hw/dataplane/vring.c
>  vring_setup(uint64_t physical, void *desc, void *avail, void *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
>  
> -- 
> 1.8.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 10/11] dataplane: add virtio-blk data plane code
  2012-11-29 14:02   ` Michael S. Tsirkin
@ 2012-11-29 15:21     ` Paolo Bonzini
  2012-11-29 15:27       ` Michael S. Tsirkin
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2012-11-29 15:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Asias He


> > +    unsigned int num_reqs;
> > +    QemuMutex num_reqs_lock;
> 
> OK the only reason this lock is needed is because
> you want to drain outside the thread.
> Won't it be better to queue process the drain request through
> the thread?
> You won't need any locks then.

Draining is processed in the thread.  This lock is only needed
to use it together with no_reqs_cond, because userspace threads
do not have something like wait_event.

Direct usage of futexes would let you remove the lock, but it's
not portable.

Paolo

> > +    QemuCond no_reqs_cond;
> > +};
> > +
> > +/* Raise an interrupt to signal guest, if necessary */
> > +static void notify_guest(VirtIOBlockDataPlane *s)
> > +{
> > +    if (!vring_should_notify(s->vdev, &s->vring)) {
> > +        return;
> > +    }
> > +
> > +    event_notifier_set(s->guest_notifier);
> > +}
> > +
> > +static void complete_request(struct iocb *iocb, ssize_t ret, void
> > *opaque)
> > +{
> > +    VirtIOBlockDataPlane *s = opaque;
> > +    VirtIOBlockRequest *req = container_of(iocb,
> > VirtIOBlockRequest, iocb);
> > +    struct virtio_blk_inhdr hdr;
> > +    int len;
> > +
> > +    if (likely(ret >= 0)) {
> > +        hdr.status = VIRTIO_BLK_S_OK;
> > +        len = ret;
> > +    } else {
> > +        hdr.status = VIRTIO_BLK_S_IOERR;
> > +        len = 0;
> > +    }
> > +
> > +    trace_virtio_blk_data_plane_complete_request(s, req->head,
> > ret);
> > +
> > +    qemu_iovec_from_buf(req->inhdr, 0, &hdr, sizeof(hdr));
> > +    qemu_iovec_destroy(req->inhdr);
> > +    g_slice_free(QEMUIOVector, req->inhdr);
> > +
> > +    /* According to the virtio specification len should be the
> > number of bytes
> > +     * written to, but for virtio-blk it seems to be the number of
> > bytes
> > +     * transferred plus the status bytes.
> > +     */
> > +    vring_push(&s->vring, req->head, len + sizeof(hdr));
> > +
> > +    qemu_mutex_lock(&s->num_reqs_lock);
> > +    if (--s->num_reqs == 0) {
> > +        qemu_cond_broadcast(&s->no_reqs_cond);
> > +    }
> > +    qemu_mutex_unlock(&s->num_reqs_lock);
> > +}
> > +
> > +static void fail_request_early(VirtIOBlockDataPlane *s, unsigned
> > int head,
> > +                               QEMUIOVector *inhdr, unsigned char
> > status)
> > +{
> > +    struct virtio_blk_inhdr hdr = {
> > +        .status = status,
> > +    };
> > +
> > +    qemu_iovec_from_buf(inhdr, 0, &hdr, sizeof(hdr));
> > +    qemu_iovec_destroy(inhdr);
> > +    g_slice_free(QEMUIOVector, inhdr);
> > +
> > +    vring_push(&s->vring, head, sizeof(hdr));
> > +    notify_guest(s);
> > +}
> > +
> > +static int process_request(IOQueue *ioq, struct iovec iov[],
> > +                           unsigned int out_num, unsigned int
> > in_num,
> > +                           unsigned int head)
> > +{
> > +    VirtIOBlockDataPlane *s = container_of(ioq,
> > VirtIOBlockDataPlane, ioqueue);
> > +    struct iovec *in_iov = &iov[out_num];
> > +    struct virtio_blk_outhdr outhdr;
> > +    QEMUIOVector *inhdr;
> > +    size_t in_size;
> > +
> > +    /* Copy in outhdr */
> > +    if (unlikely(iov_to_buf(iov, out_num, 0, &outhdr,
> > +                            sizeof(outhdr)) != sizeof(outhdr))) {
> > +        error_report("virtio-blk request outhdr too short");
> > +        return -EFAULT;
> > +    }
> > +    iov_discard(&iov, &out_num, sizeof(outhdr));
> > +
> > +    /* Grab inhdr for later */
> > +    in_size = iov_size(in_iov, in_num);
> > +    if (in_size < sizeof(struct virtio_blk_inhdr)) {
> > +        error_report("virtio_blk request inhdr too short");
> > +        return -EFAULT;
> > +    }
> > +    inhdr = g_slice_new(QEMUIOVector);
> > +    qemu_iovec_init(inhdr, 1);
> > +    qemu_iovec_concat_iov(inhdr, in_iov, in_num,
> > +            in_size - sizeof(struct virtio_blk_inhdr),
> > +            sizeof(struct virtio_blk_inhdr));
> > +    iov_discard(&in_iov, &in_num, -sizeof(struct
> > virtio_blk_inhdr));
> > +
> > +    /* TODO Linux sets the barrier bit even when not advertised!
> > */
> > +    outhdr.type &= ~VIRTIO_BLK_T_BARRIER;
> > +
> > +    struct iocb *iocb;
> > +    switch (outhdr.type & (VIRTIO_BLK_T_OUT |
> > VIRTIO_BLK_T_SCSI_CMD |
> > +                           VIRTIO_BLK_T_FLUSH)) {
> > +    case VIRTIO_BLK_T_IN:
> > +        iocb = ioq_rdwr(ioq, true, in_iov, in_num, outhdr.sector *
> > 512);
> > +        break;
> > +
> > +    case VIRTIO_BLK_T_OUT:
> > +        iocb = ioq_rdwr(ioq, false, iov, out_num, outhdr.sector *
> > 512);
> > +        break;
> > +
> > +    case VIRTIO_BLK_T_SCSI_CMD:
> > +        /* TODO support SCSI commands */
> > +        fail_request_early(s, head, inhdr, VIRTIO_BLK_S_UNSUPP);
> > +        return 0;
> > +
> > +    case VIRTIO_BLK_T_FLUSH:
> > +        /* TODO fdsync not supported by Linux AIO, do it
> > synchronously here! */
> > +        fdatasync(s->fd);
> > +        fail_request_early(s, head, inhdr, VIRTIO_BLK_S_OK);
> > +        return 0;
> > +
> > +    default:
> > +        error_report("virtio-blk unsupported request type %#x",
> > outhdr.type);
> > +        qemu_iovec_destroy(inhdr);
> > +        g_slice_free(QEMUIOVector, inhdr);
> > +        return -EFAULT;
> > +    }
> > +
> > +    /* Fill in virtio block metadata needed for completion */
> > +    VirtIOBlockRequest *req = container_of(iocb,
> > VirtIOBlockRequest, iocb);
> > +    req->head = head;
> > +    req->inhdr = inhdr;
> > +    return 0;
> > +}
> > +
> > +static bool handle_notify(EventHandler *handler)
> > +{
> > +    VirtIOBlockDataPlane *s = container_of(handler,
> > VirtIOBlockDataPlane,
> > +                                           notify_handler);
> > +
> > +    /* There is one array of iovecs into which all new requests
> > are extracted
> > +     * from the vring.  Requests are read from the vring and the
> > translated
> > +     * descriptors are written to the iovecs array.  The iovecs do
> > not have to
> > +     * persist across handle_notify() calls because the kernel
> > copies the
> > +     * iovecs on io_submit().
> > +     *
> > +     * Handling io_submit() EAGAIN may require storing the
> > requests across
> > +     * handle_notify() calls until the kernel has sufficient
> > resources to
> > +     * accept more I/O.  This is not implemented yet.
> > +     */
> > +    struct iovec iovec[VRING_MAX];
> > +    struct iovec *end = &iovec[VRING_MAX];
> > +    struct iovec *iov = iovec;
> > +
> > +    /* When a request is read from the vring, the index of the
> > first descriptor
> > +     * (aka head) is returned so that the completed request can be
> > pushed onto
> > +     * the vring later.
> > +     *
> > +     * The number of hypervisor read-only iovecs is out_num.  The
> > number of
> > +     * hypervisor write-only iovecs is in_num.
> > +     */
> > +    int head;
> > +    unsigned int out_num = 0, in_num = 0;
> > +    unsigned int num_queued;
> > +
> > +    for (;;) {
> > +        /* Disable guest->host notifies to avoid unnecessary
> > vmexits */
> > +        vring_set_notification(s->vdev, &s->vring, false);
> > +
> > +        for (;;) {
> > +            head = vring_pop(s->vdev, &s->vring, iov, end,
> > &out_num, &in_num);
> > +            if (head < 0) {
> > +                break; /* no more requests */
> > +            }
> > +
> > +            trace_virtio_blk_data_plane_process_request(s,
> > out_num, in_num,
> > +                                                        head);
> > +
> > +            if (process_request(&s->ioqueue, iov, out_num, in_num,
> > head) < 0) {
> > +                vring_set_broken(&s->vring);
> > +                break;
> > +            }
> > +            iov += out_num + in_num;
> > +        }
> > +
> > +        if (likely(head == -EAGAIN)) { /* vring emptied */
> > +            /* Re-enable guest->host notifies and stop processing
> > the vring.
> > +             * But if the guest has snuck in more descriptors,
> > keep processing.
> > +             */
> > +            vring_set_notification(s->vdev, &s->vring, true);
> > +            smp_mb();
> > +            if (!vring_more_avail(&s->vring)) {
> > +                break;
> > +            }
> > +        } else { /* head == -ENOBUFS or fatal error, iovecs[] is
> > depleted */
> > +            /* Since there are no iovecs[] left, stop processing
> > for now.  Do
> > +             * not re-enable guest->host notifies since the I/O
> > completion
> > +             * handler knows to check for more vring descriptors
> > anyway.
> > +             */
> > +            break;
> > +        }
> > +    }
> > +
> > +    num_queued = ioq_num_queued(&s->ioqueue);
> > +    if (num_queued > 0) {
> > +        qemu_mutex_lock(&s->num_reqs_lock);
> > +        s->num_reqs += num_queued;
> > +        qemu_mutex_unlock(&s->num_reqs_lock);
> > +
> > +        int rc = ioq_submit(&s->ioqueue);
> > +        if (unlikely(rc < 0)) {
> > +            fprintf(stderr, "ioq_submit failed %d\n", rc);
> > +            exit(1);
> > +        }
> > +    }
> > +    return true;
> > +}
> > +
> > +static bool handle_io(EventHandler *handler)
> > +{
> > +    VirtIOBlockDataPlane *s = container_of(handler,
> > VirtIOBlockDataPlane,
> > +                                           io_handler);
> > +
> > +    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0)
> > {
> > +        notify_guest(s);
> > +    }
> > +
> > +    /* If there were more requests than iovecs, the vring will not
> > be empty yet
> > +     * so check again.  There should now be enough resources to
> > process more
> > +     * requests.
> > +     */
> > +    if (unlikely(vring_more_avail(&s->vring))) {
> > +        return handle_notify(&s->notify_handler);
> > +    }
> > +
> > +    return true;
> > +}
> > +
> > +static void *data_plane_thread(void *opaque)
> > +{
> > +    VirtIOBlockDataPlane *s = opaque;
> > +    event_poll_run(&s->event_poll);
> > +    return NULL;
> > +}
> > +
> > +static void start_data_plane_bh(void *opaque)
> > +{
> > +    VirtIOBlockDataPlane *s = opaque;
> > +
> > +    qemu_bh_delete(s->start_bh);
> > +    s->start_bh = NULL;
> > +    qemu_thread_create(&s->thread, data_plane_thread,
> > +                       s, QEMU_THREAD_JOINABLE);
> > +}
> > +
> > +VirtIOBlockDataPlane *virtio_blk_data_plane_create(VirtIODevice
> > *vdev, int fd)
> > +{
> > +    VirtIOBlockDataPlane *s;
> > +
> > +    s = g_new0(VirtIOBlockDataPlane, 1);
> > +    s->vdev = vdev;
> > +    s->fd = fd;
> > +    return s;
> > +}
> > +
> > +void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
> > +{
> > +    if (!s) {
> > +        return;
> > +    }
> > +    virtio_blk_data_plane_stop(s);
> > +    g_free(s);
> > +}
> > +
> > +/* Block until pending requests have completed
> > + *
> > + * The vring continues to be serviced so ensure no new requests
> > will be added
> > + * to avoid races.
> 
> This comment confuses me. "avoid races" is a kind of vague
> comment that does not really help.
> 
> This function does not seem to ensure
> no new requests - it simply waits until num requests
> gets to 0. But requests could get added right afterwards
> and it won't help.
> 
> Could be comment be made more clear please?
> 
> > + */
> > +void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s)
> > +{
> > +    qemu_mutex_lock(&s->num_reqs_lock);
> > +    while (s->num_reqs > 0) {
> > +        qemu_cond_wait(&s->no_reqs_cond, &s->num_reqs_lock);
> > +    }
> > +    qemu_mutex_unlock(&s->num_reqs_lock);
> > +}
> > +
> > +void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
> > +{
> > +    VirtQueue *vq;
> > +    int i;
> > +
> > +    if (s->started) {
> > +        return;
> > +    }
> > +
> > +    vq = virtio_get_queue(s->vdev, 0);
> > +    if (!vring_setup(&s->vring, s->vdev, 0)) {
> > +        return;
> > +    }
> > +
> > +    event_poll_init(&s->event_poll);
> > +
> > +    /* Set up guest notifier (irq) */
> > +    if
> > (s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque,
> > +                                              true) != 0) {
> > +        fprintf(stderr, "virtio-blk failed to set guest notifier,
> > "
> > +                "ensure -enable-kvm is set\n");
> > +        exit(1);
> > +    }
> > +    s->guest_notifier = virtio_queue_get_guest_notifier(vq);
> > +
> > +    /* Set up virtqueue notify */
> > +    if
> > (s->vdev->binding->set_host_notifier(s->vdev->binding_opaque,
> > +                                            0, true) != 0) {
> > +        fprintf(stderr, "virtio-blk failed to set host
> > notifier\n");
> > +        exit(1);
> > +    }
> > +    event_poll_add(&s->event_poll, &s->notify_handler,
> > +                   virtio_queue_get_host_notifier(vq),
> > +                   handle_notify);
> > +
> > +    /* Set up ioqueue */
> > +    ioq_init(&s->ioqueue, s->fd, REQ_MAX);
> > +    for (i = 0; i < ARRAY_SIZE(s->requests); i++) {
> > +        ioq_put_iocb(&s->ioqueue, &s->requests[i].iocb);
> > +    }
> > +    event_poll_add(&s->event_poll, &s->io_handler,
> > +                   ioq_get_notifier(&s->ioqueue), handle_io);
> > +
> > +    s->started = true;
> > +    trace_virtio_blk_data_plane_start(s);
> > +
> > +    /* Kick right away to begin processing requests already in
> > vring */
> > +    event_notifier_set(virtio_queue_get_host_notifier(vq));
> > +
> > +    /* Spawn thread in BH so it inherits iothread cpusets */
> > +    s->start_bh = qemu_bh_new(start_data_plane_bh, s);
> > +    qemu_bh_schedule(s->start_bh);
> > +}
> > +
> > +void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
> > +{
> > +    if (!s->started) {
> > +        return;
> > +    }
> > +    s->started = false;
> > +    trace_virtio_blk_data_plane_stop(s);
> > +
> > +    /* Stop thread or cancel pending thread creation BH */
> > +    if (s->start_bh) {
> > +        qemu_bh_delete(s->start_bh);
> > +        s->start_bh = NULL;
> > +    } else {
> > +        virtio_blk_data_plane_drain(s);
> > +        event_poll_stop(&s->event_poll);
> > +        qemu_thread_join(&s->thread);
> > +    }
> > +
> > +    ioq_cleanup(&s->ioqueue);
> > +
> > +    s->vdev->binding->set_host_notifier(s->vdev->binding_opaque,
> > 0, false);
> > +
> > +    event_poll_cleanup(&s->event_poll);
> > +
> > +    /* Clean up guest notifier (irq) */
> > +    s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque,
> > false);
> > +
> > +    vring_teardown(&s->vring);
> > +}
> > diff --git a/hw/dataplane/virtio-blk.h b/hw/dataplane/virtio-blk.h
> > new file mode 100644
> > index 0000000..ddf1115
> > --- /dev/null
> > +++ b/hw/dataplane/virtio-blk.h
> > @@ -0,0 +1,41 @@
> > +/*
> > + * Dedicated thread for virtio-blk I/O processing
> > + *
> > + * Copyright 2012 IBM, Corp.
> > + * Copyright 2012 Red Hat, Inc. and/or its affiliates
> > + *
> > + * Authors:
> > + *   Stefan Hajnoczi <stefanha@redhat.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2
> > or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#ifndef HW_DATAPLANE_VIRTIO_BLK_H
> > +#define HW_DATAPLANE_VIRTIO_BLK_H
> > +
> > +#include "hw/virtio.h"
> > +
> > +typedef struct VirtIOBlockDataPlane VirtIOBlockDataPlane;
> > +
> > +#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> > +VirtIOBlockDataPlane *virtio_blk_data_plane_create(VirtIODevice
> > *vdev, int fd);
> > +void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s);
> > +void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s);
> > +void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s);
> > +void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s);
> > +#else
> > +static inline VirtIOBlockDataPlane *virtio_blk_data_plane_create(
> > +        VirtIODevice *vdev, int fd)
> > +{
> > +    return NULL;
> > +}
> > +
> > +static inline void
> > virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) {}
> > +static inline void
> > virtio_blk_data_plane_start(VirtIOBlockDataPlane *s) {}
> > +static inline void virtio_blk_data_plane_stop(VirtIOBlockDataPlane
> > *s) {}
> > +static inline void
> > virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s) {}
> > +#endif
> > +
> > +#endif /* HW_DATAPLANE_VIRTIO_BLK_H */
> > diff --git a/trace-events b/trace-events
> > index a9a791b..1edc2ae 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -98,6 +98,12 @@ virtio_blk_rw_complete(void *req, int ret) "req
> > %p ret %d"
> >  virtio_blk_handle_write(void *req, uint64_t sector, size_t
> >  nsectors) "req %p sector %"PRIu64" nsectors %zu"
> >  virtio_blk_handle_read(void *req, uint64_t sector, size_t
> >  nsectors) "req %p sector %"PRIu64" nsectors %zu"
> >  
> > +# hw/dataplane/virtio-blk.c
> > +virtio_blk_data_plane_start(void *s) "dataplane %p"
> > +virtio_blk_data_plane_stop(void *s) "dataplane %p"
> > +virtio_blk_data_plane_process_request(void *s, unsigned int
> > out_num, unsigned int in_num, unsigned int head) "dataplane %p
> > out_num %u in_num %u head %u"
> > +virtio_blk_data_plane_complete_request(void *s, unsigned int head,
> > int ret) "dataplane %p head %u ret %d"
> > +
> >  # hw/dataplane/vring.c
> >  vring_setup(uint64_t physical, void *desc, void *avail, void
> >  *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
> >  
> > --
> > 1.8.0
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 10/11] dataplane: add virtio-blk data plane code
  2012-11-29 15:21     ` Paolo Bonzini
@ 2012-11-29 15:27       ` Michael S. Tsirkin
  2012-11-29 15:47         ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 15:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Asias He

On Thu, Nov 29, 2012 at 10:21:31AM -0500, Paolo Bonzini wrote:
> 
> > > +    unsigned int num_reqs;
> > > +    QemuMutex num_reqs_lock;
> > 
> > OK the only reason this lock is needed is because
> > you want to drain outside the thread.
> > Won't it be better to queue process the drain request through
> > the thread?
> > You won't need any locks then.
> 
> Draining is processed in the thread.

Confused. You say same thread is always taking this lock?
This is what I suggest.

>  This lock is only needed
> to use it together with no_reqs_cond, because userspace threads
> do not have something like wait_event.
> 
> Direct usage of futexes would let you remove the lock, but it's
> not portable.
> 
> Paolo

It's easier: simply signal condition which is local on stack.

> > > +    QemuCond no_reqs_cond;
> > > +};
> > > +
> > > +/* Raise an interrupt to signal guest, if necessary */
> > > +static void notify_guest(VirtIOBlockDataPlane *s)
> > > +{
> > > +    if (!vring_should_notify(s->vdev, &s->vring)) {
> > > +        return;
> > > +    }
> > > +
> > > +    event_notifier_set(s->guest_notifier);
> > > +}
> > > +
> > > +static void complete_request(struct iocb *iocb, ssize_t ret, void
> > > *opaque)
> > > +{
> > > +    VirtIOBlockDataPlane *s = opaque;
> > > +    VirtIOBlockRequest *req = container_of(iocb,
> > > VirtIOBlockRequest, iocb);
> > > +    struct virtio_blk_inhdr hdr;
> > > +    int len;
> > > +
> > > +    if (likely(ret >= 0)) {
> > > +        hdr.status = VIRTIO_BLK_S_OK;
> > > +        len = ret;
> > > +    } else {
> > > +        hdr.status = VIRTIO_BLK_S_IOERR;
> > > +        len = 0;
> > > +    }
> > > +
> > > +    trace_virtio_blk_data_plane_complete_request(s, req->head,
> > > ret);
> > > +
> > > +    qemu_iovec_from_buf(req->inhdr, 0, &hdr, sizeof(hdr));
> > > +    qemu_iovec_destroy(req->inhdr);
> > > +    g_slice_free(QEMUIOVector, req->inhdr);
> > > +
> > > +    /* According to the virtio specification len should be the
> > > number of bytes
> > > +     * written to, but for virtio-blk it seems to be the number of
> > > bytes
> > > +     * transferred plus the status bytes.
> > > +     */
> > > +    vring_push(&s->vring, req->head, len + sizeof(hdr));
> > > +
> > > +    qemu_mutex_lock(&s->num_reqs_lock);
> > > +    if (--s->num_reqs == 0) {
> > > +        qemu_cond_broadcast(&s->no_reqs_cond);
> > > +    }
> > > +    qemu_mutex_unlock(&s->num_reqs_lock);
> > > +}
> > > +
> > > +static void fail_request_early(VirtIOBlockDataPlane *s, unsigned
> > > int head,
> > > +                               QEMUIOVector *inhdr, unsigned char
> > > status)
> > > +{
> > > +    struct virtio_blk_inhdr hdr = {
> > > +        .status = status,
> > > +    };
> > > +
> > > +    qemu_iovec_from_buf(inhdr, 0, &hdr, sizeof(hdr));
> > > +    qemu_iovec_destroy(inhdr);
> > > +    g_slice_free(QEMUIOVector, inhdr);
> > > +
> > > +    vring_push(&s->vring, head, sizeof(hdr));
> > > +    notify_guest(s);
> > > +}
> > > +
> > > +static int process_request(IOQueue *ioq, struct iovec iov[],
> > > +                           unsigned int out_num, unsigned int
> > > in_num,
> > > +                           unsigned int head)
> > > +{
> > > +    VirtIOBlockDataPlane *s = container_of(ioq,
> > > VirtIOBlockDataPlane, ioqueue);
> > > +    struct iovec *in_iov = &iov[out_num];
> > > +    struct virtio_blk_outhdr outhdr;
> > > +    QEMUIOVector *inhdr;
> > > +    size_t in_size;
> > > +
> > > +    /* Copy in outhdr */
> > > +    if (unlikely(iov_to_buf(iov, out_num, 0, &outhdr,
> > > +                            sizeof(outhdr)) != sizeof(outhdr))) {
> > > +        error_report("virtio-blk request outhdr too short");
> > > +        return -EFAULT;
> > > +    }
> > > +    iov_discard(&iov, &out_num, sizeof(outhdr));
> > > +
> > > +    /* Grab inhdr for later */
> > > +    in_size = iov_size(in_iov, in_num);
> > > +    if (in_size < sizeof(struct virtio_blk_inhdr)) {
> > > +        error_report("virtio_blk request inhdr too short");
> > > +        return -EFAULT;
> > > +    }
> > > +    inhdr = g_slice_new(QEMUIOVector);
> > > +    qemu_iovec_init(inhdr, 1);
> > > +    qemu_iovec_concat_iov(inhdr, in_iov, in_num,
> > > +            in_size - sizeof(struct virtio_blk_inhdr),
> > > +            sizeof(struct virtio_blk_inhdr));
> > > +    iov_discard(&in_iov, &in_num, -sizeof(struct
> > > virtio_blk_inhdr));
> > > +
> > > +    /* TODO Linux sets the barrier bit even when not advertised!
> > > */
> > > +    outhdr.type &= ~VIRTIO_BLK_T_BARRIER;
> > > +
> > > +    struct iocb *iocb;
> > > +    switch (outhdr.type & (VIRTIO_BLK_T_OUT |
> > > VIRTIO_BLK_T_SCSI_CMD |
> > > +                           VIRTIO_BLK_T_FLUSH)) {
> > > +    case VIRTIO_BLK_T_IN:
> > > +        iocb = ioq_rdwr(ioq, true, in_iov, in_num, outhdr.sector *
> > > 512);
> > > +        break;
> > > +
> > > +    case VIRTIO_BLK_T_OUT:
> > > +        iocb = ioq_rdwr(ioq, false, iov, out_num, outhdr.sector *
> > > 512);
> > > +        break;
> > > +
> > > +    case VIRTIO_BLK_T_SCSI_CMD:
> > > +        /* TODO support SCSI commands */
> > > +        fail_request_early(s, head, inhdr, VIRTIO_BLK_S_UNSUPP);
> > > +        return 0;
> > > +
> > > +    case VIRTIO_BLK_T_FLUSH:
> > > +        /* TODO fdsync not supported by Linux AIO, do it
> > > synchronously here! */
> > > +        fdatasync(s->fd);
> > > +        fail_request_early(s, head, inhdr, VIRTIO_BLK_S_OK);
> > > +        return 0;
> > > +
> > > +    default:
> > > +        error_report("virtio-blk unsupported request type %#x",
> > > outhdr.type);
> > > +        qemu_iovec_destroy(inhdr);
> > > +        g_slice_free(QEMUIOVector, inhdr);
> > > +        return -EFAULT;
> > > +    }
> > > +
> > > +    /* Fill in virtio block metadata needed for completion */
> > > +    VirtIOBlockRequest *req = container_of(iocb,
> > > VirtIOBlockRequest, iocb);
> > > +    req->head = head;
> > > +    req->inhdr = inhdr;
> > > +    return 0;
> > > +}
> > > +
> > > +static bool handle_notify(EventHandler *handler)
> > > +{
> > > +    VirtIOBlockDataPlane *s = container_of(handler,
> > > VirtIOBlockDataPlane,
> > > +                                           notify_handler);
> > > +
> > > +    /* There is one array of iovecs into which all new requests
> > > are extracted
> > > +     * from the vring.  Requests are read from the vring and the
> > > translated
> > > +     * descriptors are written to the iovecs array.  The iovecs do
> > > not have to
> > > +     * persist across handle_notify() calls because the kernel
> > > copies the
> > > +     * iovecs on io_submit().
> > > +     *
> > > +     * Handling io_submit() EAGAIN may require storing the
> > > requests across
> > > +     * handle_notify() calls until the kernel has sufficient
> > > resources to
> > > +     * accept more I/O.  This is not implemented yet.
> > > +     */
> > > +    struct iovec iovec[VRING_MAX];
> > > +    struct iovec *end = &iovec[VRING_MAX];
> > > +    struct iovec *iov = iovec;
> > > +
> > > +    /* When a request is read from the vring, the index of the
> > > first descriptor
> > > +     * (aka head) is returned so that the completed request can be
> > > pushed onto
> > > +     * the vring later.
> > > +     *
> > > +     * The number of hypervisor read-only iovecs is out_num.  The
> > > number of
> > > +     * hypervisor write-only iovecs is in_num.
> > > +     */
> > > +    int head;
> > > +    unsigned int out_num = 0, in_num = 0;
> > > +    unsigned int num_queued;
> > > +
> > > +    for (;;) {
> > > +        /* Disable guest->host notifies to avoid unnecessary
> > > vmexits */
> > > +        vring_set_notification(s->vdev, &s->vring, false);
> > > +
> > > +        for (;;) {
> > > +            head = vring_pop(s->vdev, &s->vring, iov, end,
> > > &out_num, &in_num);
> > > +            if (head < 0) {
> > > +                break; /* no more requests */
> > > +            }
> > > +
> > > +            trace_virtio_blk_data_plane_process_request(s,
> > > out_num, in_num,
> > > +                                                        head);
> > > +
> > > +            if (process_request(&s->ioqueue, iov, out_num, in_num,
> > > head) < 0) {
> > > +                vring_set_broken(&s->vring);
> > > +                break;
> > > +            }
> > > +            iov += out_num + in_num;
> > > +        }
> > > +
> > > +        if (likely(head == -EAGAIN)) { /* vring emptied */
> > > +            /* Re-enable guest->host notifies and stop processing
> > > the vring.
> > > +             * But if the guest has snuck in more descriptors,
> > > keep processing.
> > > +             */
> > > +            vring_set_notification(s->vdev, &s->vring, true);
> > > +            smp_mb();
> > > +            if (!vring_more_avail(&s->vring)) {
> > > +                break;
> > > +            }
> > > +        } else { /* head == -ENOBUFS or fatal error, iovecs[] is
> > > depleted */
> > > +            /* Since there are no iovecs[] left, stop processing
> > > for now.  Do
> > > +             * not re-enable guest->host notifies since the I/O
> > > completion
> > > +             * handler knows to check for more vring descriptors
> > > anyway.
> > > +             */
> > > +            break;
> > > +        }
> > > +    }
> > > +
> > > +    num_queued = ioq_num_queued(&s->ioqueue);
> > > +    if (num_queued > 0) {
> > > +        qemu_mutex_lock(&s->num_reqs_lock);
> > > +        s->num_reqs += num_queued;
> > > +        qemu_mutex_unlock(&s->num_reqs_lock);
> > > +
> > > +        int rc = ioq_submit(&s->ioqueue);
> > > +        if (unlikely(rc < 0)) {
> > > +            fprintf(stderr, "ioq_submit failed %d\n", rc);
> > > +            exit(1);
> > > +        }
> > > +    }
> > > +    return true;
> > > +}
> > > +
> > > +static bool handle_io(EventHandler *handler)
> > > +{
> > > +    VirtIOBlockDataPlane *s = container_of(handler,
> > > VirtIOBlockDataPlane,
> > > +                                           io_handler);
> > > +
> > > +    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0)
> > > {
> > > +        notify_guest(s);
> > > +    }
> > > +
> > > +    /* If there were more requests than iovecs, the vring will not
> > > be empty yet
> > > +     * so check again.  There should now be enough resources to
> > > process more
> > > +     * requests.
> > > +     */
> > > +    if (unlikely(vring_more_avail(&s->vring))) {
> > > +        return handle_notify(&s->notify_handler);
> > > +    }
> > > +
> > > +    return true;
> > > +}
> > > +
> > > +static void *data_plane_thread(void *opaque)
> > > +{
> > > +    VirtIOBlockDataPlane *s = opaque;
> > > +    event_poll_run(&s->event_poll);
> > > +    return NULL;
> > > +}
> > > +
> > > +static void start_data_plane_bh(void *opaque)
> > > +{
> > > +    VirtIOBlockDataPlane *s = opaque;
> > > +
> > > +    qemu_bh_delete(s->start_bh);
> > > +    s->start_bh = NULL;
> > > +    qemu_thread_create(&s->thread, data_plane_thread,
> > > +                       s, QEMU_THREAD_JOINABLE);
> > > +}
> > > +
> > > +VirtIOBlockDataPlane *virtio_blk_data_plane_create(VirtIODevice
> > > *vdev, int fd)
> > > +{
> > > +    VirtIOBlockDataPlane *s;
> > > +
> > > +    s = g_new0(VirtIOBlockDataPlane, 1);
> > > +    s->vdev = vdev;
> > > +    s->fd = fd;
> > > +    return s;
> > > +}
> > > +
> > > +void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
> > > +{
> > > +    if (!s) {
> > > +        return;
> > > +    }
> > > +    virtio_blk_data_plane_stop(s);
> > > +    g_free(s);
> > > +}
> > > +
> > > +/* Block until pending requests have completed
> > > + *
> > > + * The vring continues to be serviced so ensure no new requests
> > > will be added
> > > + * to avoid races.
> > 
> > This comment confuses me. "avoid races" is a kind of vague
> > comment that does not really help.
> > 
> > This function does not seem to ensure
> > no new requests - it simply waits until num requests
> > gets to 0. But requests could get added right afterwards
> > and it won't help.
> > 
> > Could be comment be made more clear please?
> > 
> > > + */
> > > +void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s)
> > > +{
> > > +    qemu_mutex_lock(&s->num_reqs_lock);
> > > +    while (s->num_reqs > 0) {
> > > +        qemu_cond_wait(&s->no_reqs_cond, &s->num_reqs_lock);
> > > +    }
> > > +    qemu_mutex_unlock(&s->num_reqs_lock);
> > > +}
> > > +
> > > +void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
> > > +{
> > > +    VirtQueue *vq;
> > > +    int i;
> > > +
> > > +    if (s->started) {
> > > +        return;
> > > +    }
> > > +
> > > +    vq = virtio_get_queue(s->vdev, 0);
> > > +    if (!vring_setup(&s->vring, s->vdev, 0)) {
> > > +        return;
> > > +    }
> > > +
> > > +    event_poll_init(&s->event_poll);
> > > +
> > > +    /* Set up guest notifier (irq) */
> > > +    if
> > > (s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque,
> > > +                                              true) != 0) {
> > > +        fprintf(stderr, "virtio-blk failed to set guest notifier,
> > > "
> > > +                "ensure -enable-kvm is set\n");
> > > +        exit(1);
> > > +    }
> > > +    s->guest_notifier = virtio_queue_get_guest_notifier(vq);
> > > +
> > > +    /* Set up virtqueue notify */
> > > +    if
> > > (s->vdev->binding->set_host_notifier(s->vdev->binding_opaque,
> > > +                                            0, true) != 0) {
> > > +        fprintf(stderr, "virtio-blk failed to set host
> > > notifier\n");
> > > +        exit(1);
> > > +    }
> > > +    event_poll_add(&s->event_poll, &s->notify_handler,
> > > +                   virtio_queue_get_host_notifier(vq),
> > > +                   handle_notify);
> > > +
> > > +    /* Set up ioqueue */
> > > +    ioq_init(&s->ioqueue, s->fd, REQ_MAX);
> > > +    for (i = 0; i < ARRAY_SIZE(s->requests); i++) {
> > > +        ioq_put_iocb(&s->ioqueue, &s->requests[i].iocb);
> > > +    }
> > > +    event_poll_add(&s->event_poll, &s->io_handler,
> > > +                   ioq_get_notifier(&s->ioqueue), handle_io);
> > > +
> > > +    s->started = true;
> > > +    trace_virtio_blk_data_plane_start(s);
> > > +
> > > +    /* Kick right away to begin processing requests already in
> > > vring */
> > > +    event_notifier_set(virtio_queue_get_host_notifier(vq));
> > > +
> > > +    /* Spawn thread in BH so it inherits iothread cpusets */
> > > +    s->start_bh = qemu_bh_new(start_data_plane_bh, s);
> > > +    qemu_bh_schedule(s->start_bh);
> > > +}
> > > +
> > > +void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
> > > +{
> > > +    if (!s->started) {
> > > +        return;
> > > +    }
> > > +    s->started = false;
> > > +    trace_virtio_blk_data_plane_stop(s);
> > > +
> > > +    /* Stop thread or cancel pending thread creation BH */
> > > +    if (s->start_bh) {
> > > +        qemu_bh_delete(s->start_bh);
> > > +        s->start_bh = NULL;
> > > +    } else {
> > > +        virtio_blk_data_plane_drain(s);
> > > +        event_poll_stop(&s->event_poll);
> > > +        qemu_thread_join(&s->thread);
> > > +    }
> > > +
> > > +    ioq_cleanup(&s->ioqueue);
> > > +
> > > +    s->vdev->binding->set_host_notifier(s->vdev->binding_opaque,
> > > 0, false);
> > > +
> > > +    event_poll_cleanup(&s->event_poll);
> > > +
> > > +    /* Clean up guest notifier (irq) */
> > > +    s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque,
> > > false);
> > > +
> > > +    vring_teardown(&s->vring);
> > > +}
> > > diff --git a/hw/dataplane/virtio-blk.h b/hw/dataplane/virtio-blk.h
> > > new file mode 100644
> > > index 0000000..ddf1115
> > > --- /dev/null
> > > +++ b/hw/dataplane/virtio-blk.h
> > > @@ -0,0 +1,41 @@
> > > +/*
> > > + * Dedicated thread for virtio-blk I/O processing
> > > + *
> > > + * Copyright 2012 IBM, Corp.
> > > + * Copyright 2012 Red Hat, Inc. and/or its affiliates
> > > + *
> > > + * Authors:
> > > + *   Stefan Hajnoczi <stefanha@redhat.com>
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version 2
> > > or later.
> > > + * See the COPYING file in the top-level directory.
> > > + *
> > > + */
> > > +
> > > +#ifndef HW_DATAPLANE_VIRTIO_BLK_H
> > > +#define HW_DATAPLANE_VIRTIO_BLK_H
> > > +
> > > +#include "hw/virtio.h"
> > > +
> > > +typedef struct VirtIOBlockDataPlane VirtIOBlockDataPlane;
> > > +
> > > +#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> > > +VirtIOBlockDataPlane *virtio_blk_data_plane_create(VirtIODevice
> > > *vdev, int fd);
> > > +void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s);
> > > +void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s);
> > > +void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s);
> > > +void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s);
> > > +#else
> > > +static inline VirtIOBlockDataPlane *virtio_blk_data_plane_create(
> > > +        VirtIODevice *vdev, int fd)
> > > +{
> > > +    return NULL;
> > > +}
> > > +
> > > +static inline void
> > > virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) {}
> > > +static inline void
> > > virtio_blk_data_plane_start(VirtIOBlockDataPlane *s) {}
> > > +static inline void virtio_blk_data_plane_stop(VirtIOBlockDataPlane
> > > *s) {}
> > > +static inline void
> > > virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s) {}
> > > +#endif
> > > +
> > > +#endif /* HW_DATAPLANE_VIRTIO_BLK_H */
> > > diff --git a/trace-events b/trace-events
> > > index a9a791b..1edc2ae 100644
> > > --- a/trace-events
> > > +++ b/trace-events
> > > @@ -98,6 +98,12 @@ virtio_blk_rw_complete(void *req, int ret) "req
> > > %p ret %d"
> > >  virtio_blk_handle_write(void *req, uint64_t sector, size_t
> > >  nsectors) "req %p sector %"PRIu64" nsectors %zu"
> > >  virtio_blk_handle_read(void *req, uint64_t sector, size_t
> > >  nsectors) "req %p sector %"PRIu64" nsectors %zu"
> > >  
> > > +# hw/dataplane/virtio-blk.c
> > > +virtio_blk_data_plane_start(void *s) "dataplane %p"
> > > +virtio_blk_data_plane_stop(void *s) "dataplane %p"
> > > +virtio_blk_data_plane_process_request(void *s, unsigned int
> > > out_num, unsigned int in_num, unsigned int head) "dataplane %p
> > > out_num %u in_num %u head %u"
> > > +virtio_blk_data_plane_complete_request(void *s, unsigned int head,
> > > int ret) "dataplane %p head %u ret %d"
> > > +
> > >  # hw/dataplane/vring.c
> > >  vring_setup(uint64_t physical, void *desc, void *avail, void
> > >  *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
> > >  
> > > --
> > > 1.8.0
> > 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 10/11] dataplane: add virtio-blk data plane code
  2012-11-29 15:27       ` Michael S. Tsirkin
@ 2012-11-29 15:47         ` Paolo Bonzini
  2012-11-30 13:57           ` Stefan Hajnoczi
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2012-11-29 15:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Asias He



----- Messaggio originale -----
> Da: "Michael S. Tsirkin" <mst@redhat.com>
> A: "Paolo Bonzini" <pbonzini@redhat.com>
> Cc: qemu-devel@nongnu.org, "Kevin Wolf" <kwolf@redhat.com>, "Anthony Liguori" <aliguori@us.ibm.com>, "Blue Swirl"
> <blauwirbel@gmail.com>, khoa@us.ibm.com, "Asias He" <asias@redhat.com>, "Stefan Hajnoczi" <stefanha@redhat.com>
> Inviato: Giovedì, 29 novembre 2012 16:27:39
> Oggetto: Re: [PATCH v4 10/11] dataplane: add virtio-blk data plane code
> 
> On Thu, Nov 29, 2012 at 10:21:31AM -0500, Paolo Bonzini wrote:
> > 
> > > > +    unsigned int num_reqs;
> > > > +    QemuMutex num_reqs_lock;
> > > 
> > > OK the only reason this lock is needed is because
> > > you want to drain outside the thread.
> > > Won't it be better to queue process the drain request through
> > > the thread?
> > > You won't need any locks then.
> > 
> > Draining is processed in the thread.
> 
> Confused. You say same thread is always taking this lock?

No, one thread needs to take the lock just for waiting on the condition
variable.

However, indeed you are right about this:

> > > > +/* Block until pending requests have completed
> > > > + *
> > > > + * The vring continues to be serviced so ensure no new
> > > > + * requests will be added to avoid races.
> > > 
> > > This comment confuses me. "avoid races" is a kind of vague
> > > comment that does not really help.
> > > 
> > > This function does not seem to ensure
> > > no new requests - it simply waits until num requests
> > > gets to 0. But requests could get added right afterwards
> > > and it won't help.

and solving this race would also let Stefan remove the lock.

Stefan, perhaps you could replace the stop_notifier mechanism of
event-poll.c with something similar to aio_notify/qemu_notify_event,
and even remove event_poll_run in favor of event_poll (aka aio_wait...).
And also remove the return value from the callback, since your remaining
callbacks always return true.

The main thread can just signal the event loop:

    s->stopping = true;
    event_poll_notify(&s->event_poll);
    qemu_thread_join(&s->thread);

and the dataplane thread will drain the queue before exiting.

Paolo

> This is what I suggest.
> 
> >  This lock is only needed
> > to use it together with no_reqs_cond, because userspace threads
> > do not have something like wait_event.
> > 
> > Direct usage of futexes would let you remove the lock, but it's
> > not portable.
> > 
> > Paolo
> 
> It's easier: simply signal condition which is local on stack.
> 
> > > > +    QemuCond no_reqs_cond;
> > > > +};
> > > > +
> > > > +/* Raise an interrupt to signal guest, if necessary */
> > > > +static void notify_guest(VirtIOBlockDataPlane *s)
> > > > +{
> > > > +    if (!vring_should_notify(s->vdev, &s->vring)) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    event_notifier_set(s->guest_notifier);
> > > > +}
> > > > +
> > > > +static void complete_request(struct iocb *iocb, ssize_t ret,
> > > > void
> > > > *opaque)
> > > > +{
> > > > +    VirtIOBlockDataPlane *s = opaque;
> > > > +    VirtIOBlockRequest *req = container_of(iocb,
> > > > VirtIOBlockRequest, iocb);
> > > > +    struct virtio_blk_inhdr hdr;
> > > > +    int len;
> > > > +
> > > > +    if (likely(ret >= 0)) {
> > > > +        hdr.status = VIRTIO_BLK_S_OK;
> > > > +        len = ret;
> > > > +    } else {
> > > > +        hdr.status = VIRTIO_BLK_S_IOERR;
> > > > +        len = 0;
> > > > +    }
> > > > +
> > > > +    trace_virtio_blk_data_plane_complete_request(s, req->head,
> > > > ret);
> > > > +
> > > > +    qemu_iovec_from_buf(req->inhdr, 0, &hdr, sizeof(hdr));
> > > > +    qemu_iovec_destroy(req->inhdr);
> > > > +    g_slice_free(QEMUIOVector, req->inhdr);
> > > > +
> > > > +    /* According to the virtio specification len should be the
> > > > number of bytes
> > > > +     * written to, but for virtio-blk it seems to be the
> > > > number of
> > > > bytes
> > > > +     * transferred plus the status bytes.
> > > > +     */
> > > > +    vring_push(&s->vring, req->head, len + sizeof(hdr));
> > > > +
> > > > +    qemu_mutex_lock(&s->num_reqs_lock);
> > > > +    if (--s->num_reqs == 0) {
> > > > +        qemu_cond_broadcast(&s->no_reqs_cond);
> > > > +    }
> > > > +    qemu_mutex_unlock(&s->num_reqs_lock);
> > > > +}
> > > > +
> > > > +static void fail_request_early(VirtIOBlockDataPlane *s,
> > > > unsigned
> > > > int head,
> > > > +                               QEMUIOVector *inhdr, unsigned
> > > > char
> > > > status)
> > > > +{
> > > > +    struct virtio_blk_inhdr hdr = {
> > > > +        .status = status,
> > > > +    };
> > > > +
> > > > +    qemu_iovec_from_buf(inhdr, 0, &hdr, sizeof(hdr));
> > > > +    qemu_iovec_destroy(inhdr);
> > > > +    g_slice_free(QEMUIOVector, inhdr);
> > > > +
> > > > +    vring_push(&s->vring, head, sizeof(hdr));
> > > > +    notify_guest(s);
> > > > +}
> > > > +
> > > > +static int process_request(IOQueue *ioq, struct iovec iov[],
> > > > +                           unsigned int out_num, unsigned int
> > > > in_num,
> > > > +                           unsigned int head)
> > > > +{
> > > > +    VirtIOBlockDataPlane *s = container_of(ioq,
> > > > VirtIOBlockDataPlane, ioqueue);
> > > > +    struct iovec *in_iov = &iov[out_num];
> > > > +    struct virtio_blk_outhdr outhdr;
> > > > +    QEMUIOVector *inhdr;
> > > > +    size_t in_size;
> > > > +
> > > > +    /* Copy in outhdr */
> > > > +    if (unlikely(iov_to_buf(iov, out_num, 0, &outhdr,
> > > > +                            sizeof(outhdr)) !=
> > > > sizeof(outhdr))) {
> > > > +        error_report("virtio-blk request outhdr too short");
> > > > +        return -EFAULT;
> > > > +    }
> > > > +    iov_discard(&iov, &out_num, sizeof(outhdr));
> > > > +
> > > > +    /* Grab inhdr for later */
> > > > +    in_size = iov_size(in_iov, in_num);
> > > > +    if (in_size < sizeof(struct virtio_blk_inhdr)) {
> > > > +        error_report("virtio_blk request inhdr too short");
> > > > +        return -EFAULT;
> > > > +    }
> > > > +    inhdr = g_slice_new(QEMUIOVector);
> > > > +    qemu_iovec_init(inhdr, 1);
> > > > +    qemu_iovec_concat_iov(inhdr, in_iov, in_num,
> > > > +            in_size - sizeof(struct virtio_blk_inhdr),
> > > > +            sizeof(struct virtio_blk_inhdr));
> > > > +    iov_discard(&in_iov, &in_num, -sizeof(struct
> > > > virtio_blk_inhdr));
> > > > +
> > > > +    /* TODO Linux sets the barrier bit even when not
> > > > advertised!
> > > > */
> > > > +    outhdr.type &= ~VIRTIO_BLK_T_BARRIER;
> > > > +
> > > > +    struct iocb *iocb;
> > > > +    switch (outhdr.type & (VIRTIO_BLK_T_OUT |
> > > > VIRTIO_BLK_T_SCSI_CMD |
> > > > +                           VIRTIO_BLK_T_FLUSH)) {
> > > > +    case VIRTIO_BLK_T_IN:
> > > > +        iocb = ioq_rdwr(ioq, true, in_iov, in_num,
> > > > outhdr.sector *
> > > > 512);
> > > > +        break;
> > > > +
> > > > +    case VIRTIO_BLK_T_OUT:
> > > > +        iocb = ioq_rdwr(ioq, false, iov, out_num,
> > > > outhdr.sector *
> > > > 512);
> > > > +        break;
> > > > +
> > > > +    case VIRTIO_BLK_T_SCSI_CMD:
> > > > +        /* TODO support SCSI commands */
> > > > +        fail_request_early(s, head, inhdr,
> > > > VIRTIO_BLK_S_UNSUPP);
> > > > +        return 0;
> > > > +
> > > > +    case VIRTIO_BLK_T_FLUSH:
> > > > +        /* TODO fdsync not supported by Linux AIO, do it
> > > > synchronously here! */
> > > > +        fdatasync(s->fd);
> > > > +        fail_request_early(s, head, inhdr, VIRTIO_BLK_S_OK);
> > > > +        return 0;
> > > > +
> > > > +    default:
> > > > +        error_report("virtio-blk unsupported request type
> > > > %#x",
> > > > outhdr.type);
> > > > +        qemu_iovec_destroy(inhdr);
> > > > +        g_slice_free(QEMUIOVector, inhdr);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    /* Fill in virtio block metadata needed for completion */
> > > > +    VirtIOBlockRequest *req = container_of(iocb,
> > > > VirtIOBlockRequest, iocb);
> > > > +    req->head = head;
> > > > +    req->inhdr = inhdr;
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +static bool handle_notify(EventHandler *handler)
> > > > +{
> > > > +    VirtIOBlockDataPlane *s = container_of(handler,
> > > > VirtIOBlockDataPlane,
> > > > +                                           notify_handler);
> > > > +
> > > > +    /* There is one array of iovecs into which all new
> > > > requests
> > > > are extracted
> > > > +     * from the vring.  Requests are read from the vring and
> > > > the
> > > > translated
> > > > +     * descriptors are written to the iovecs array.  The
> > > > iovecs do
> > > > not have to
> > > > +     * persist across handle_notify() calls because the kernel
> > > > copies the
> > > > +     * iovecs on io_submit().
> > > > +     *
> > > > +     * Handling io_submit() EAGAIN may require storing the
> > > > requests across
> > > > +     * handle_notify() calls until the kernel has sufficient
> > > > resources to
> > > > +     * accept more I/O.  This is not implemented yet.
> > > > +     */
> > > > +    struct iovec iovec[VRING_MAX];
> > > > +    struct iovec *end = &iovec[VRING_MAX];
> > > > +    struct iovec *iov = iovec;
> > > > +
> > > > +    /* When a request is read from the vring, the index of the
> > > > first descriptor
> > > > +     * (aka head) is returned so that the completed request
> > > > can be
> > > > pushed onto
> > > > +     * the vring later.
> > > > +     *
> > > > +     * The number of hypervisor read-only iovecs is out_num.
> > > >  The
> > > > number of
> > > > +     * hypervisor write-only iovecs is in_num.
> > > > +     */
> > > > +    int head;
> > > > +    unsigned int out_num = 0, in_num = 0;
> > > > +    unsigned int num_queued;
> > > > +
> > > > +    for (;;) {
> > > > +        /* Disable guest->host notifies to avoid unnecessary
> > > > vmexits */
> > > > +        vring_set_notification(s->vdev, &s->vring, false);
> > > > +
> > > > +        for (;;) {
> > > > +            head = vring_pop(s->vdev, &s->vring, iov, end,
> > > > &out_num, &in_num);
> > > > +            if (head < 0) {
> > > > +                break; /* no more requests */
> > > > +            }
> > > > +
> > > > +            trace_virtio_blk_data_plane_process_request(s,
> > > > out_num, in_num,
> > > > +                                                        head);
> > > > +
> > > > +            if (process_request(&s->ioqueue, iov, out_num,
> > > > in_num,
> > > > head) < 0) {
> > > > +                vring_set_broken(&s->vring);
> > > > +                break;
> > > > +            }
> > > > +            iov += out_num + in_num;
> > > > +        }
> > > > +
> > > > +        if (likely(head == -EAGAIN)) { /* vring emptied */
> > > > +            /* Re-enable guest->host notifies and stop
> > > > processing
> > > > the vring.
> > > > +             * But if the guest has snuck in more descriptors,
> > > > keep processing.
> > > > +             */
> > > > +            vring_set_notification(s->vdev, &s->vring, true);
> > > > +            smp_mb();
> > > > +            if (!vring_more_avail(&s->vring)) {
> > > > +                break;
> > > > +            }
> > > > +        } else { /* head == -ENOBUFS or fatal error, iovecs[]
> > > > is
> > > > depleted */
> > > > +            /* Since there are no iovecs[] left, stop
> > > > processing
> > > > for now.  Do
> > > > +             * not re-enable guest->host notifies since the
> > > > I/O
> > > > completion
> > > > +             * handler knows to check for more vring
> > > > descriptors
> > > > anyway.
> > > > +             */
> > > > +            break;
> > > > +        }
> > > > +    }
> > > > +
> > > > +    num_queued = ioq_num_queued(&s->ioqueue);
> > > > +    if (num_queued > 0) {
> > > > +        qemu_mutex_lock(&s->num_reqs_lock);
> > > > +        s->num_reqs += num_queued;
> > > > +        qemu_mutex_unlock(&s->num_reqs_lock);
> > > > +
> > > > +        int rc = ioq_submit(&s->ioqueue);
> > > > +        if (unlikely(rc < 0)) {
> > > > +            fprintf(stderr, "ioq_submit failed %d\n", rc);
> > > > +            exit(1);
> > > > +        }
> > > > +    }
> > > > +    return true;
> > > > +}
> > > > +
> > > > +static bool handle_io(EventHandler *handler)
> > > > +{
> > > > +    VirtIOBlockDataPlane *s = container_of(handler,
> > > > VirtIOBlockDataPlane,
> > > > +                                           io_handler);
> > > > +
> > > > +    if (ioq_run_completion(&s->ioqueue, complete_request, s) >
> > > > 0)
> > > > {
> > > > +        notify_guest(s);
> > > > +    }
> > > > +
> > > > +    /* If there were more requests than iovecs, the vring will
> > > > not
> > > > be empty yet
> > > > +     * so check again.  There should now be enough resources
> > > > to
> > > > process more
> > > > +     * requests.
> > > > +     */
> > > > +    if (unlikely(vring_more_avail(&s->vring))) {
> > > > +        return handle_notify(&s->notify_handler);
> > > > +    }
> > > > +
> > > > +    return true;
> > > > +}
> > > > +
> > > > +static void *data_plane_thread(void *opaque)
> > > > +{
> > > > +    VirtIOBlockDataPlane *s = opaque;
> > > > +    event_poll_run(&s->event_poll);
> > > > +    return NULL;
> > > > +}
> > > > +
> > > > +static void start_data_plane_bh(void *opaque)
> > > > +{
> > > > +    VirtIOBlockDataPlane *s = opaque;
> > > > +
> > > > +    qemu_bh_delete(s->start_bh);
> > > > +    s->start_bh = NULL;
> > > > +    qemu_thread_create(&s->thread, data_plane_thread,
> > > > +                       s, QEMU_THREAD_JOINABLE);
> > > > +}
> > > > +
> > > > +VirtIOBlockDataPlane
> > > > *virtio_blk_data_plane_create(VirtIODevice
> > > > *vdev, int fd)
> > > > +{
> > > > +    VirtIOBlockDataPlane *s;
> > > > +
> > > > +    s = g_new0(VirtIOBlockDataPlane, 1);
> > > > +    s->vdev = vdev;
> > > > +    s->fd = fd;
> > > > +    return s;
> > > > +}
> > > > +
> > > > +void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
> > > > +{
> > > > +    if (!s) {
> > > > +        return;
> > > > +    }
> > > > +    virtio_blk_data_plane_stop(s);
> > > > +    g_free(s);
> > > > +}
> > > > +
> > > > +/* Block until pending requests have completed
> > > > + *
> > > > + * The vring continues to be serviced so ensure no new
> > > > requests
> > > > will be added
> > > > + * to avoid races.
> > > 
> > > This comment confuses me. "avoid races" is a kind of vague
> > > comment that does not really help.
> > > 
> > > This function does not seem to ensure
> > > no new requests - it simply waits until num requests
> > > gets to 0. But requests could get added right afterwards
> > > and it won't help.
> > > 
> > > Could be comment be made more clear please?
> > > 
> > > > + */
> > > > +void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s)
> > > > +{
> > > > +    qemu_mutex_lock(&s->num_reqs_lock);
> > > > +    while (s->num_reqs > 0) {
> > > > +        qemu_cond_wait(&s->no_reqs_cond, &s->num_reqs_lock);
> > > > +    }
> > > > +    qemu_mutex_unlock(&s->num_reqs_lock);
> > > > +}
> > > > +
> > > > +void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
> > > > +{
> > > > +    VirtQueue *vq;
> > > > +    int i;
> > > > +
> > > > +    if (s->started) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    vq = virtio_get_queue(s->vdev, 0);
> > > > +    if (!vring_setup(&s->vring, s->vdev, 0)) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    event_poll_init(&s->event_poll);
> > > > +
> > > > +    /* Set up guest notifier (irq) */
> > > > +    if
> > > > (s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque,
> > > > +                                              true) != 0) {
> > > > +        fprintf(stderr, "virtio-blk failed to set guest
> > > > notifier,
> > > > "
> > > > +                "ensure -enable-kvm is set\n");
> > > > +        exit(1);
> > > > +    }
> > > > +    s->guest_notifier = virtio_queue_get_guest_notifier(vq);
> > > > +
> > > > +    /* Set up virtqueue notify */
> > > > +    if
> > > > (s->vdev->binding->set_host_notifier(s->vdev->binding_opaque,
> > > > +                                            0, true) != 0) {
> > > > +        fprintf(stderr, "virtio-blk failed to set host
> > > > notifier\n");
> > > > +        exit(1);
> > > > +    }
> > > > +    event_poll_add(&s->event_poll, &s->notify_handler,
> > > > +                   virtio_queue_get_host_notifier(vq),
> > > > +                   handle_notify);
> > > > +
> > > > +    /* Set up ioqueue */
> > > > +    ioq_init(&s->ioqueue, s->fd, REQ_MAX);
> > > > +    for (i = 0; i < ARRAY_SIZE(s->requests); i++) {
> > > > +        ioq_put_iocb(&s->ioqueue, &s->requests[i].iocb);
> > > > +    }
> > > > +    event_poll_add(&s->event_poll, &s->io_handler,
> > > > +                   ioq_get_notifier(&s->ioqueue), handle_io);
> > > > +
> > > > +    s->started = true;
> > > > +    trace_virtio_blk_data_plane_start(s);
> > > > +
> > > > +    /* Kick right away to begin processing requests already in
> > > > vring */
> > > > +    event_notifier_set(virtio_queue_get_host_notifier(vq));
> > > > +
> > > > +    /* Spawn thread in BH so it inherits iothread cpusets */
> > > > +    s->start_bh = qemu_bh_new(start_data_plane_bh, s);
> > > > +    qemu_bh_schedule(s->start_bh);
> > > > +}
> > > > +
> > > > +void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
> > > > +{
> > > > +    if (!s->started) {
> > > > +        return;
> > > > +    }
> > > > +    s->started = false;
> > > > +    trace_virtio_blk_data_plane_stop(s);
> > > > +
> > > > +    /* Stop thread or cancel pending thread creation BH */
> > > > +    if (s->start_bh) {
> > > > +        qemu_bh_delete(s->start_bh);
> > > > +        s->start_bh = NULL;
> > > > +    } else {
> > > > +        virtio_blk_data_plane_drain(s);
> > > > +        event_poll_stop(&s->event_poll);
> > > > +        qemu_thread_join(&s->thread);
> > > > +    }
> > > > +
> > > > +    ioq_cleanup(&s->ioqueue);
> > > > +
> > > > +
> > > >    s->vdev->binding->set_host_notifier(s->vdev->binding_opaque,
> > > > 0, false);
> > > > +
> > > > +    event_poll_cleanup(&s->event_poll);
> > > > +
> > > > +    /* Clean up guest notifier (irq) */
> > > > +
> > > >    s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque,
> > > > false);
> > > > +
> > > > +    vring_teardown(&s->vring);
> > > > +}
> > > > diff --git a/hw/dataplane/virtio-blk.h
> > > > b/hw/dataplane/virtio-blk.h
> > > > new file mode 100644
> > > > index 0000000..ddf1115
> > > > --- /dev/null
> > > > +++ b/hw/dataplane/virtio-blk.h
> > > > @@ -0,0 +1,41 @@
> > > > +/*
> > > > + * Dedicated thread for virtio-blk I/O processing
> > > > + *
> > > > + * Copyright 2012 IBM, Corp.
> > > > + * Copyright 2012 Red Hat, Inc. and/or its affiliates
> > > > + *
> > > > + * Authors:
> > > > + *   Stefan Hajnoczi <stefanha@redhat.com>
> > > > + *
> > > > + * This work is licensed under the terms of the GNU GPL,
> > > > version 2
> > > > or later.
> > > > + * See the COPYING file in the top-level directory.
> > > > + *
> > > > + */
> > > > +
> > > > +#ifndef HW_DATAPLANE_VIRTIO_BLK_H
> > > > +#define HW_DATAPLANE_VIRTIO_BLK_H
> > > > +
> > > > +#include "hw/virtio.h"
> > > > +
> > > > +typedef struct VirtIOBlockDataPlane VirtIOBlockDataPlane;
> > > > +
> > > > +#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> > > > +VirtIOBlockDataPlane
> > > > *virtio_blk_data_plane_create(VirtIODevice
> > > > *vdev, int fd);
> > > > +void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s);
> > > > +void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s);
> > > > +void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s);
> > > > +void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s);
> > > > +#else
> > > > +static inline VirtIOBlockDataPlane
> > > > *virtio_blk_data_plane_create(
> > > > +        VirtIODevice *vdev, int fd)
> > > > +{
> > > > +    return NULL;
> > > > +}
> > > > +
> > > > +static inline void
> > > > virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) {}
> > > > +static inline void
> > > > virtio_blk_data_plane_start(VirtIOBlockDataPlane *s) {}
> > > > +static inline void
> > > > virtio_blk_data_plane_stop(VirtIOBlockDataPlane
> > > > *s) {}
> > > > +static inline void
> > > > virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s) {}
> > > > +#endif
> > > > +
> > > > +#endif /* HW_DATAPLANE_VIRTIO_BLK_H */
> > > > diff --git a/trace-events b/trace-events
> > > > index a9a791b..1edc2ae 100644
> > > > --- a/trace-events
> > > > +++ b/trace-events
> > > > @@ -98,6 +98,12 @@ virtio_blk_rw_complete(void *req, int ret)
> > > > "req
> > > > %p ret %d"
> > > >  virtio_blk_handle_write(void *req, uint64_t sector, size_t
> > > >  nsectors) "req %p sector %"PRIu64" nsectors %zu"
> > > >  virtio_blk_handle_read(void *req, uint64_t sector, size_t
> > > >  nsectors) "req %p sector %"PRIu64" nsectors %zu"
> > > >  
> > > > +# hw/dataplane/virtio-blk.c
> > > > +virtio_blk_data_plane_start(void *s) "dataplane %p"
> > > > +virtio_blk_data_plane_stop(void *s) "dataplane %p"
> > > > +virtio_blk_data_plane_process_request(void *s, unsigned int
> > > > out_num, unsigned int in_num, unsigned int head) "dataplane %p
> > > > out_num %u in_num %u head %u"
> > > > +virtio_blk_data_plane_complete_request(void *s, unsigned int
> > > > head,
> > > > int ret) "dataplane %p head %u ret %d"
> > > > +
> > > >  # hw/dataplane/vring.c
> > > >  vring_setup(uint64_t physical, void *desc, void *avail, void
> > > >  *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
> > > >  
> > > > --
> > > > 1.8.0
> > > 
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 10/11] dataplane: add virtio-blk data plane code
  2012-11-29 15:47         ` Paolo Bonzini
@ 2012-11-30 13:57           ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-30 13:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Blue Swirl, Khoa Huynh, Stefan Hajnoczi, Asias He

On Thu, Nov 29, 2012 at 4:47 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Stefan, perhaps you could replace the stop_notifier mechanism of
> event-poll.c with something similar to aio_notify/qemu_notify_event,
> and even remove event_poll_run in favor of event_poll (aka aio_wait...).
> And also remove the return value from the callback, since your remaining
> callbacks always return true.
>
> The main thread can just signal the event loop:
>
>     s->stopping = true;
>     event_poll_notify(&s->event_poll);
>     qemu_thread_join(&s->thread);
>
> and the dataplane thread will drain the queue before exiting.

Thanks Michael and Paolo, this is a cool idea.  I didn't see a way to
communicate back that the data plane thread is quiesced without more
locking.  Stopping the thread is a neat solution.

Stefan

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v4 11/11] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (9 preceding siblings ...)
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 10/11] dataplane: add virtio-blk data plane code Stefan Hajnoczi
@ 2012-11-22 15:16 ` Stefan Hajnoczi
  2012-11-29 13:12   ` Michael S. Tsirkin
  2012-11-29  9:18 ` [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
  11 siblings, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The virtio-blk-data-plane feature is easy to integrate into
hw/virtio-blk.c.  The data plane can be started and stopped similar to
vhost-net.

Users can take advantage of the virtio-blk-data-plane feature using the
new -device virtio-blk-pci,x-data-plane=on property.

The x-data-plane name was chosen because at this stage the feature is
experimental and likely to see changes in the future.

If the VM configuration does not support virtio-blk-data-plane an error
message is printed.  Although we could fall back to regular virtio-blk,
I prefer the explicit approach since it prompts the user to fix their
configuration if they want the performance benefit of
virtio-blk-data-plane.

Limitations:
 * Only format=raw is supported
 * Live migration is not supported
 * Block jobs, hot unplug, and other operations fail with -EBUSY
 * I/O throttling limits are ignored
 * Only Linux hosts are supported due to Linux AIO usage

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/virtio-blk.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 hw/virtio-blk.h |  1 +
 hw/virtio-pci.c |  3 +++
 3 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index e25cc96..7f6004e 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -17,6 +17,8 @@
 #include "hw/block-common.h"
 #include "blockdev.h"
 #include "virtio-blk.h"
+#include "hw/dataplane/virtio-blk.h"
+#include "migration.h"
 #include "scsi-defs.h"
 #ifdef __linux__
 # include <scsi/sg.h>
@@ -33,6 +35,8 @@ typedef struct VirtIOBlock
     VirtIOBlkConf *blk;
     unsigned short sector_mask;
     DeviceState *qdev;
+    VirtIOBlockDataPlane *dataplane;
+    Error *migration_blocker;
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -407,6 +411,14 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
         .num_writes = 0,
     };
 
+    /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
+     * dataplane here instead of waiting for .set_status().
+     */
+    if (s->dataplane) {
+        virtio_blk_data_plane_start(s->dataplane);
+        return;
+    }
+
     while ((req = virtio_blk_get_request(s))) {
         virtio_blk_handle_request(req, &mrb);
     }
@@ -446,8 +458,13 @@ static void virtio_blk_dma_restart_cb(void *opaque, int running,
 {
     VirtIOBlock *s = opaque;
 
-    if (!running)
+    if (!running) {
+        /* qemu_drain_all() doesn't know about data plane, quiesce here */
+        if (s->dataplane) {
+            virtio_blk_data_plane_drain(s->dataplane);
+        }
         return;
+    }
 
     if (!s->bh) {
         s->bh = qemu_bh_new(virtio_blk_dma_restart_bh, s);
@@ -538,6 +555,10 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t status)
     VirtIOBlock *s = to_virtio_blk(vdev);
     uint32_t features;
 
+    if (s->dataplane && !(status & VIRTIO_CONFIG_S_DRIVER)) {
+        virtio_blk_data_plane_stop(s->dataplane);
+    }
+
     if (!(status & VIRTIO_CONFIG_S_DRIVER_OK)) {
         return;
     }
@@ -604,6 +625,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
 {
     VirtIOBlock *s;
     static int virtio_blk_id;
+    int fd = -1;
 
     if (!blk->conf.bs) {
         error_report("drive property not set");
@@ -619,6 +641,21 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
         return NULL;
     }
 
+    if (blk->data_plane) {
+        if (blk->scsi) {
+            error_report("device is incompatible with x-data-plane, "
+                         "use scsi=off");
+            return NULL;
+        }
+
+        fd = raw_get_aio_fd(blk->conf.bs);
+        if (fd < 0) {
+            error_report("drive is incompatible with x-data-plane, "
+                         "use format=raw,cache=none,aio=native");
+            return NULL;
+        }
+    }
+
     s = (VirtIOBlock *)virtio_common_init("virtio-blk", VIRTIO_ID_BLOCK,
                                           sizeof(struct virtio_blk_config),
                                           sizeof(VirtIOBlock));
@@ -636,6 +673,17 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
 
     s->vq = virtio_add_queue(&s->vdev, 128, virtio_blk_handle_output);
 
+    if (fd >= 0) {
+        s->dataplane = virtio_blk_data_plane_create(&s->vdev, fd);
+
+        /* Prevent block operations that conflict with data plane thread */
+        bdrv_set_in_use(s->bs, 1);
+
+        error_setg(&s->migration_blocker,
+                   "x-data-plane does not support migration");
+        migrate_add_blocker(s->migration_blocker);
+    }
+
     qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s);
     s->qdev = dev;
     register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
@@ -652,6 +700,15 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
 void virtio_blk_exit(VirtIODevice *vdev)
 {
     VirtIOBlock *s = to_virtio_blk(vdev);
+
+    if (s->dataplane) {
+        migrate_del_blocker(s->migration_blocker);
+        error_free(s->migration_blocker);
+        bdrv_set_in_use(s->bs, 0);
+        virtio_blk_data_plane_destroy(s->dataplane);
+        s->dataplane = NULL;
+    }
+
     unregister_savevm(s->qdev, "virtio-blk", s);
     blockdev_mark_auto_del(s->bs);
     virtio_cleanup(vdev);
diff --git a/hw/virtio-blk.h b/hw/virtio-blk.h
index f0740d0..53d7971 100644
--- a/hw/virtio-blk.h
+++ b/hw/virtio-blk.h
@@ -105,6 +105,7 @@ struct VirtIOBlkConf
     char *serial;
     uint32_t scsi;
     uint32_t config_wce;
+    uint32_t data_plane;
 };
 
 #define DEFINE_VIRTIO_BLK_FEATURES(_state, _field) \
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 71f4fb5..32cc910 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -897,6 +897,9 @@ static Property virtio_blk_properties[] = {
 #endif
     DEFINE_PROP_BIT("config-wce", VirtIOPCIProxy, blk.config_wce, 0, true),
     DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags, VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
+#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
+    DEFINE_PROP_BIT("x-data-plane", VirtIOPCIProxy, blk.data_plane, 0, false),
+#endif
     DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 2),
     DEFINE_VIRTIO_BLK_FEATURES(VirtIOPCIProxy, host_features),
     DEFINE_PROP_END_OF_LIST(),
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 11/11] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 11/11] virtio-blk: add x-data-plane=on|off performance feature Stefan Hajnoczi
@ 2012-11-29 13:12   ` Michael S. Tsirkin
  2012-11-29 14:45     ` Stefan Hajnoczi
  0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 13:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Paolo Bonzini, Asias He

On Thu, Nov 22, 2012 at 04:16:52PM +0100, Stefan Hajnoczi wrote:
> The virtio-blk-data-plane feature is easy to integrate into
> hw/virtio-blk.c.  The data plane can be started and stopped similar to
> vhost-net.
> 
> Users can take advantage of the virtio-blk-data-plane feature using the
> new -device virtio-blk-pci,x-data-plane=on property.
> 
> The x-data-plane name was chosen because at this stage the feature is
> experimental and likely to see changes in the future.
> 
> If the VM configuration does not support virtio-blk-data-plane an error
> message is printed.  Although we could fall back to regular virtio-blk,
> I prefer the explicit approach since it prompts the user to fix their
> configuration if they want the performance benefit of
> virtio-blk-data-plane.

Not only that, this affects features exposed to guest so it really can't be
trasparent.

Which reminds me - shouldn't some features be turned off?
For example, VIRTIO_BLK_F_SCSI?

> Limitations:
>  * Only format=raw is supported
>  * Live migration is not supported

This is probably fixable long term?

>  * Block jobs, hot unplug, and other operations fail with -EBUSY

Hmm I don't see code to disable PCU unplug in this patch.
I expected no_hotplug to be set.
Where is it?

>  * I/O throttling limits are ignored

And this?
Meanwhile can we have attempts to set them fail?

>  * Only Linux hosts are supported due to Linux AIO usage
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  hw/virtio-blk.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/virtio-blk.h |  1 +
>  hw/virtio-pci.c |  3 +++
>  3 files changed, 62 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index e25cc96..7f6004e 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -17,6 +17,8 @@
>  #include "hw/block-common.h"
>  #include "blockdev.h"
>  #include "virtio-blk.h"
> +#include "hw/dataplane/virtio-blk.h"
> +#include "migration.h"
>  #include "scsi-defs.h"
>  #ifdef __linux__
>  # include <scsi/sg.h>
> @@ -33,6 +35,8 @@ typedef struct VirtIOBlock
>      VirtIOBlkConf *blk;
>      unsigned short sector_mask;
>      DeviceState *qdev;
> +    VirtIOBlockDataPlane *dataplane;
> +    Error *migration_blocker;

Would be nice to move the migration disabling
checking supported formats
and all the rest of it out to dataplane code.

>  } VirtIOBlock;
>  
>  static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
> @@ -407,6 +411,14 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>          .num_writes = 0,
>      };
>  
> +    /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
> +     * dataplane here instead of waiting for .set_status().
> +     */
> +    if (s->dataplane) {
> +        virtio_blk_data_plane_start(s->dataplane);
> +        return;
> +    }
> +
>      while ((req = virtio_blk_get_request(s))) {
>          virtio_blk_handle_request(req, &mrb);
>      }
> @@ -446,8 +458,13 @@ static void virtio_blk_dma_restart_cb(void *opaque, int running,
>  {
>      VirtIOBlock *s = opaque;
>  
> -    if (!running)
> +    if (!running) {
> +        /* qemu_drain_all() doesn't know about data plane, quiesce here */
> +        if (s->dataplane) {
> +            virtio_blk_data_plane_drain(s->dataplane);
> +        }
>          return;
> +    }
>  
>      if (!s->bh) {
>          s->bh = qemu_bh_new(virtio_blk_dma_restart_bh, s);
> @@ -538,6 +555,10 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t status)
>      VirtIOBlock *s = to_virtio_blk(vdev);
>      uint32_t features;
>  
> +    if (s->dataplane && !(status & VIRTIO_CONFIG_S_DRIVER)) {
> +        virtio_blk_data_plane_stop(s->dataplane);
> +    }
> +
>      if (!(status & VIRTIO_CONFIG_S_DRIVER_OK)) {
>          return;
>      }
> @@ -604,6 +625,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
>  {
>      VirtIOBlock *s;
>      static int virtio_blk_id;
> +    int fd = -1;
>  
>      if (!blk->conf.bs) {
>          error_report("drive property not set");
> @@ -619,6 +641,21 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
>          return NULL;
>      }
>  
> +    if (blk->data_plane) {
> +        if (blk->scsi) {
> +            error_report("device is incompatible with x-data-plane, "
> +                         "use scsi=off");
> +            return NULL;
> +        }
> +
> +        fd = raw_get_aio_fd(blk->conf.bs);
> +        if (fd < 0) {
> +            error_report("drive is incompatible with x-data-plane, "
> +                         "use format=raw,cache=none,aio=native");
> +            return NULL;
> +        }
> +    }
> +
>      s = (VirtIOBlock *)virtio_common_init("virtio-blk", VIRTIO_ID_BLOCK,
>                                            sizeof(struct virtio_blk_config),
>                                            sizeof(VirtIOBlock));
> @@ -636,6 +673,17 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
>  
>      s->vq = virtio_add_queue(&s->vdev, 128, virtio_blk_handle_output);
>  
> +    if (fd >= 0) {
> +        s->dataplane = virtio_blk_data_plane_create(&s->vdev, fd);
> +
> +        /* Prevent block operations that conflict with data plane thread */
> +        bdrv_set_in_use(s->bs, 1);
> +
> +        error_setg(&s->migration_blocker,
> +                   "x-data-plane does not support migration");
> +        migrate_add_blocker(s->migration_blocker);
> +    }
> +
>      qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s);
>      s->qdev = dev;
>      register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
> @@ -652,6 +700,15 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
>  void virtio_blk_exit(VirtIODevice *vdev)
>  {
>      VirtIOBlock *s = to_virtio_blk(vdev);
> +
> +    if (s->dataplane) {
> +        migrate_del_blocker(s->migration_blocker);
> +        error_free(s->migration_blocker);
> +        bdrv_set_in_use(s->bs, 0);
> +        virtio_blk_data_plane_destroy(s->dataplane);
> +        s->dataplane = NULL;
> +    }
> +
>      unregister_savevm(s->qdev, "virtio-blk", s);
>      blockdev_mark_auto_del(s->bs);
>      virtio_cleanup(vdev);
> diff --git a/hw/virtio-blk.h b/hw/virtio-blk.h
> index f0740d0..53d7971 100644
> --- a/hw/virtio-blk.h
> +++ b/hw/virtio-blk.h
> @@ -105,6 +105,7 @@ struct VirtIOBlkConf
>      char *serial;
>      uint32_t scsi;
>      uint32_t config_wce;
> +    uint32_t data_plane;
>  };
>  
>  #define DEFINE_VIRTIO_BLK_FEATURES(_state, _field) \
> diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> index 71f4fb5..32cc910 100644
> --- a/hw/virtio-pci.c
> +++ b/hw/virtio-pci.c
> @@ -897,6 +897,9 @@ static Property virtio_blk_properties[] = {
>  #endif
>      DEFINE_PROP_BIT("config-wce", VirtIOPCIProxy, blk.config_wce, 0, true),
>      DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags, VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
> +#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> +    DEFINE_PROP_BIT("x-data-plane", VirtIOPCIProxy, blk.data_plane, 0, false),
> +#endif
>      DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 2),
>      DEFINE_VIRTIO_BLK_FEATURES(VirtIOPCIProxy, host_features),
>      DEFINE_PROP_END_OF_LIST(),
> -- 
> 1.8.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 11/11] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-29 13:12   ` Michael S. Tsirkin
@ 2012-11-29 14:45     ` Stefan Hajnoczi
  2012-11-29 14:55       ` Michael S. Tsirkin
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-29 14:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 03:12:35PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 22, 2012 at 04:16:52PM +0100, Stefan Hajnoczi wrote:
> > The virtio-blk-data-plane feature is easy to integrate into
> > hw/virtio-blk.c.  The data plane can be started and stopped similar to
> > vhost-net.
> > 
> > Users can take advantage of the virtio-blk-data-plane feature using the
> > new -device virtio-blk-pci,x-data-plane=on property.
> > 
> > The x-data-plane name was chosen because at this stage the feature is
> > experimental and likely to see changes in the future.
> > 
> > If the VM configuration does not support virtio-blk-data-plane an error
> > message is printed.  Although we could fall back to regular virtio-blk,
> > I prefer the explicit approach since it prompts the user to fix their
> > configuration if they want the performance benefit of
> > virtio-blk-data-plane.
> 
> Not only that, this affects features exposed to guest so it really can't be
> trasparent.
> 
> Which reminds me - shouldn't some features be turned off?
> For example, VIRTIO_BLK_F_SCSI?

Yes, virtio-blk-data-plane only starts when you give -device
virtio-blk-pci,scsi=off,x-data-plane=on.  If you use scsi=on an error
message is printed.

> > Limitations:
> >  * Only format=raw is supported
> >  * Live migration is not supported
> 
> This is probably fixable long term?

Absolutely.  There are two parts:

1. Marking written memory dirty so live RAM migration can work.  Missing
   today, easy cheat is to switch off virtio-blk-data-plane and silently
   switch to regular virtio-blk emulation while memory dirty logging is
   enabled.  The more long-term solution is to actually communicate the
   dirty information back to the memory API.

2. Synchronizing virtio-blk-data-plane vring state with virtio-blk so
   save/load works.  This should be relatively straightforward.

I don't want to gate this patch series on live migration support but it
is on my TODO list for virtio-blk-data-plane after this initial series
has been merged.

> >  * Block jobs, hot unplug, and other operations fail with -EBUSY
> 
> Hmm I don't see code to disable PCU unplug in this patch.
> I expected no_hotplug to be set.
> Where is it?

It uses the bdrv_in_use() mechanism.

> >  * I/O throttling limits are ignored
> 
> And this?
> Meanwhile can we have attempts to set them fail?

This limitation exists because virtio-blk-data-plane today bypasses the
QEMU block layer.  The next step is to get the block layer working
inside the data plane thread.  At that point I/O limits work again.

Adding an error would be a layering violation because I/O throttling
happens in the QEMU block layer and is unaware of the emulated storage
controller (virtio-blk, IDE, SCSI, etc).

I think it's better to document the limitation and continue working on
AioContext so that we can soon support I/O throttling with
virtio-blk-data-plane.  It would be quite ugly to add checks.

> > @@ -33,6 +35,8 @@ typedef struct VirtIOBlock
> >      VirtIOBlkConf *blk;
> >      unsigned short sector_mask;
> >      DeviceState *qdev;
> > +    VirtIOBlockDataPlane *dataplane;
> > +    Error *migration_blocker;
> 
> Would be nice to move the migration disabling
> checking supported formats
> and all the rest of it out to dataplane code.

The reason to do it in virtio-blk.c is that we already have access to
the device configuration.  If we move it to hw/dataplane/virtio-blk.c
then that code needs to reach inside and check data that it doesn't
otherwise access.

IMO it's nice to keep data plane "dumb" and perform these checks where
we already have to deal with the relationship between VirtIOBlkConf and
friends.

Stefan

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 11/11] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-29 14:45     ` Stefan Hajnoczi
@ 2012-11-29 14:55       ` Michael S. Tsirkin
  2012-12-04 11:20         ` Michael S. Tsirkin
  0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 14:55 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 03:45:55PM +0100, Stefan Hajnoczi wrote:
> On Thu, Nov 29, 2012 at 03:12:35PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Nov 22, 2012 at 04:16:52PM +0100, Stefan Hajnoczi wrote:
> > > The virtio-blk-data-plane feature is easy to integrate into
> > > hw/virtio-blk.c.  The data plane can be started and stopped similar to
> > > vhost-net.
> > > 
> > > Users can take advantage of the virtio-blk-data-plane feature using the
> > > new -device virtio-blk-pci,x-data-plane=on property.
> > > 
> > > The x-data-plane name was chosen because at this stage the feature is
> > > experimental and likely to see changes in the future.
> > > 
> > > If the VM configuration does not support virtio-blk-data-plane an error
> > > message is printed.  Although we could fall back to regular virtio-blk,
> > > I prefer the explicit approach since it prompts the user to fix their
> > > configuration if they want the performance benefit of
> > > virtio-blk-data-plane.
> > 
> > Not only that, this affects features exposed to guest so it really can't be
> > trasparent.
> > 
> > Which reminds me - shouldn't some features be turned off?
> > For example, VIRTIO_BLK_F_SCSI?
> 
> Yes, virtio-blk-data-plane only starts when you give -device
> virtio-blk-pci,scsi=off,x-data-plane=on.  If you use scsi=on an error
> message is printed.
> 
> > > Limitations:
> > >  * Only format=raw is supported
> > >  * Live migration is not supported
> > 
> > This is probably fixable long term?
> 
> Absolutely.  There are two parts:
> 
> 1. Marking written memory dirty so live RAM migration can work.  Missing
>    today, easy cheat is to switch off virtio-blk-data-plane and silently
>    switch to regular virtio-blk emulation while memory dirty logging is
>    enabled.  The more long-term solution is to actually communicate the
>    dirty information back to the memory API.
> 
> 2. Synchronizing virtio-blk-data-plane vring state with virtio-blk so
>    save/load works.  This should be relatively straightforward.
> 
> I don't want to gate this patch series on live migration support but it
> is on my TODO list for virtio-blk-data-plane after this initial series
> has been merged.
> 
> > >  * Block jobs, hot unplug, and other operations fail with -EBUSY
> > 
> > Hmm I don't see code to disable PCU unplug in this patch.
> > I expected no_hotplug to be set.
> > Where is it?
> 
> It uses the bdrv_in_use() mechanism.

Hmm but PCI device can still go away if
guest ejects it. Does this work fine?

> > >  * I/O throttling limits are ignored
> > 
> > And this?
> > Meanwhile can we have attempts to set them fail?
> 
> This limitation exists because virtio-blk-data-plane today bypasses the
> QEMU block layer.  The next step is to get the block layer working
> inside the data plane thread.  At that point I/O limits work again.
> 
> Adding an error would be a layering violation because I/O throttling
> happens in the QEMU block layer and is unaware of the emulated storage
> controller (virtio-blk, IDE, SCSI, etc).
> 
> I think it's better to document the limitation and continue working on
> AioContext so that we can soon support I/O throttling with
> virtio-blk-data-plane.  It would be quite ugly to add checks.
> 
> > > @@ -33,6 +35,8 @@ typedef struct VirtIOBlock
> > >      VirtIOBlkConf *blk;
> > >      unsigned short sector_mask;
> > >      DeviceState *qdev;
> > > +    VirtIOBlockDataPlane *dataplane;
> > > +    Error *migration_blocker;
> > 
> > Would be nice to move the migration disabling
> > checking supported formats
> > and all the rest of it out to dataplane code.
> 
> The reason to do it in virtio-blk.c is that we already have access to
> the device configuration.  If we move it to hw/dataplane/virtio-blk.c
> then that code needs to reach inside and check data that it doesn't
> otherwise access.

Not really, just pass it all necessary data.

> IMO it's nice to keep data plane "dumb" and perform these checks where
> we already have to deal with the relationship between VirtIOBlkConf and
> friends.
> 
> Stefan

Yes but then it's not contained.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 11/11] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-29 14:55       ` Michael S. Tsirkin
@ 2012-12-04 11:20         ` Michael S. Tsirkin
  2012-12-04 14:19           ` Stefan Hajnoczi
  0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-12-04 11:20 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 04:55:48PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 29, 2012 at 03:45:55PM +0100, Stefan Hajnoczi wrote:
> > On Thu, Nov 29, 2012 at 03:12:35PM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Nov 22, 2012 at 04:16:52PM +0100, Stefan Hajnoczi wrote:
> > > > The virtio-blk-data-plane feature is easy to integrate into
> > > > hw/virtio-blk.c.  The data plane can be started and stopped similar to
> > > > vhost-net.
> > > > 
> > > > Users can take advantage of the virtio-blk-data-plane feature using the
> > > > new -device virtio-blk-pci,x-data-plane=on property.
> > > > 
> > > > The x-data-plane name was chosen because at this stage the feature is
> > > > experimental and likely to see changes in the future.
> > > > 
> > > > If the VM configuration does not support virtio-blk-data-plane an error
> > > > message is printed.  Although we could fall back to regular virtio-blk,
> > > > I prefer the explicit approach since it prompts the user to fix their
> > > > configuration if they want the performance benefit of
> > > > virtio-blk-data-plane.
> > > 
> > > Not only that, this affects features exposed to guest so it really can't be
> > > trasparent.
> > > 
> > > Which reminds me - shouldn't some features be turned off?
> > > For example, VIRTIO_BLK_F_SCSI?
> > 
> > Yes, virtio-blk-data-plane only starts when you give -device
> > virtio-blk-pci,scsi=off,x-data-plane=on.  If you use scsi=on an error
> > message is printed.
> > 
> > > > Limitations:
> > > >  * Only format=raw is supported
> > > >  * Live migration is not supported
> > > 
> > > This is probably fixable long term?
> > 
> > Absolutely.  There are two parts:
> > 
> > 1. Marking written memory dirty so live RAM migration can work.  Missing
> >    today, easy cheat is to switch off virtio-blk-data-plane and silently
> >    switch to regular virtio-blk emulation while memory dirty logging is
> >    enabled.  The more long-term solution is to actually communicate the
> >    dirty information back to the memory API.
> > 
> > 2. Synchronizing virtio-blk-data-plane vring state with virtio-blk so
> >    save/load works.  This should be relatively straightforward.
> > 
> > I don't want to gate this patch series on live migration support but it
> > is on my TODO list for virtio-blk-data-plane after this initial series
> > has been merged.
> > 
> > > >  * Block jobs, hot unplug, and other operations fail with -EBUSY
> > > 
> > > Hmm I don't see code to disable PCU unplug in this patch.
> > > I expected no_hotplug to be set.
> > > Where is it?
> > 
> > It uses the bdrv_in_use() mechanism.
> 
> Hmm but PCI device can still go away if
> guest ejects it. Does this work fine?

Any comment?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 11/11] virtio-blk: add x-data-plane=on|off performance feature
  2012-12-04 11:20         ` Michael S. Tsirkin
@ 2012-12-04 14:19           ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-12-04 14:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, qemu-devel,
	Blue Swirl, khoa, Paolo Bonzini, Asias He

On Tue, Dec 04, 2012 at 01:20:20PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 29, 2012 at 04:55:48PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Nov 29, 2012 at 03:45:55PM +0100, Stefan Hajnoczi wrote:
> > > On Thu, Nov 29, 2012 at 03:12:35PM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Nov 22, 2012 at 04:16:52PM +0100, Stefan Hajnoczi wrote:
> > > > > The virtio-blk-data-plane feature is easy to integrate into
> > > > > hw/virtio-blk.c.  The data plane can be started and stopped similar to
> > > > > vhost-net.
> > > > > 
> > > > > Users can take advantage of the virtio-blk-data-plane feature using the
> > > > > new -device virtio-blk-pci,x-data-plane=on property.
> > > > > 
> > > > > The x-data-plane name was chosen because at this stage the feature is
> > > > > experimental and likely to see changes in the future.
> > > > > 
> > > > > If the VM configuration does not support virtio-blk-data-plane an error
> > > > > message is printed.  Although we could fall back to regular virtio-blk,
> > > > > I prefer the explicit approach since it prompts the user to fix their
> > > > > configuration if they want the performance benefit of
> > > > > virtio-blk-data-plane.
> > > > 
> > > > Not only that, this affects features exposed to guest so it really can't be
> > > > trasparent.
> > > > 
> > > > Which reminds me - shouldn't some features be turned off?
> > > > For example, VIRTIO_BLK_F_SCSI?
> > > 
> > > Yes, virtio-blk-data-plane only starts when you give -device
> > > virtio-blk-pci,scsi=off,x-data-plane=on.  If you use scsi=on an error
> > > message is printed.
> > > 
> > > > > Limitations:
> > > > >  * Only format=raw is supported
> > > > >  * Live migration is not supported
> > > > 
> > > > This is probably fixable long term?
> > > 
> > > Absolutely.  There are two parts:
> > > 
> > > 1. Marking written memory dirty so live RAM migration can work.  Missing
> > >    today, easy cheat is to switch off virtio-blk-data-plane and silently
> > >    switch to regular virtio-blk emulation while memory dirty logging is
> > >    enabled.  The more long-term solution is to actually communicate the
> > >    dirty information back to the memory API.
> > > 
> > > 2. Synchronizing virtio-blk-data-plane vring state with virtio-blk so
> > >    save/load works.  This should be relatively straightforward.
> > > 
> > > I don't want to gate this patch series on live migration support but it
> > > is on my TODO list for virtio-blk-data-plane after this initial series
> > > has been merged.
> > > 
> > > > >  * Block jobs, hot unplug, and other operations fail with -EBUSY
> > > > 
> > > > Hmm I don't see code to disable PCU unplug in this patch.
> > > > I expected no_hotplug to be set.
> > > > Where is it?
> > > 
> > > It uses the bdrv_in_use() mechanism.
> > 
> > Hmm but PCI device can still go away if
> > guest ejects it. Does this work fine?
> 
> Any comment?

Sorry for the delay.

virtio_blk_exit() is called when the device is freed.  The code destroys
the data plane thread - this includes draining requests and then
terminating the thread.

I tested with pci_del so the guest is cooperating but virtio_blk_exit()
does not assume that the data plane thread is already stopped.

Is this what you were asking?

Stefan

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane
  2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (10 preceding siblings ...)
  2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 11/11] virtio-blk: add x-data-plane=on|off performance feature Stefan Hajnoczi
@ 2012-11-29  9:18 ` Stefan Hajnoczi
  2012-11-29 12:03   ` Paolo Bonzini
  2012-11-29 14:09   ` Michael S. Tsirkin
  11 siblings, 2 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-29  9:18 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Blue Swirl, Khoa Huynh, Paolo Bonzini, Asias He

On Thu, Nov 22, 2012 at 4:16 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> This series adds the -device virtio-blk-pci,x-data-plane=on property that
> enables a high performance I/O codepath.  A dedicated thread is used to process
> virtio-blk requests outside the global mutex and without going through the QEMU
> block layer.
>
> Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
> IOPS for a single VM using virtio-blk-data-plane in July:
>
>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
>
> The virtio-blk-data-plane approach was originally presented at Linux Plumbers
> Conference 2010.  The following slides contain a brief overview:
>
>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
>
> The basic approach is:
> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>    signalling when the guest kicks the virtqueue.
> 2. Requests are processed without going through the QEMU block layer using
>    Linux AIO directly.
> 3. Completion interrupts are injected via irqfd from the dedicated thread.
>
> To try it out:
>
>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
>
> Limitations:
>  * Only format=raw is supported
>  * Live migration is not supported
>  * Block jobs, hot unplug, and other operations fail with -EBUSY
>  * I/O throttling limits are ignored
>  * Only Linux hosts are supported due to Linux AIO usage
>
> The code has reached a stage where I feel it is ready to merge.  Users have
> been playing with it for some time and want the significant performance boost.
>
> We are refactoring QEMU to get rid of the global mutex.  I believe that
> virtio-blk-data-plane can eventually become the default mode of operation.
>
> Instead of waiting for global mutex removal efforts to finish, I want to use
> virtio-blk-data-plane as an example device for AioContext and threaded hw
> dispatch refactoring.  This means:
>
> 1. When the block layer can bind to an AioContext and execute I/O outside the
>    global mutex, virtio-blk-data-plane can use this (and gain image format
>    support).
>
> 2. When hw dispatch no longer needs the global mutex we can use hw/virtio.c
>    again and perhaps run a pool of iothreads instead of dedicated data plane
>    threads.
>
> But in the meantime, I have cleaned up the virtio-blk-data-plane code so that
> it can be merged as an experimental feature.
>
> v4:
>  * Add qemu_iovec_concat_iov() [Paolo]
>  * Use QEMUIOVector to copy out virtio_blk_inhdr [Michael, Paolo]
>
> v3:
>  * Don't assume iovec layout [Michael]
>  * Better naming for hostmem.c MemoryListener callbacks [Don]
>  * More vring quarantining if commands are bogus instead of exiting [Blue]
>
> v2:
>  * Use MemoryListener for thread-safe memory mapping [Paolo, Anthony, and everyone else pointed this out ;-)]
>  * Quarantine invalid vring instead of exiting [Blue]
>  * Replace __u16 kernel types with uint16_t [Blue]
>
> Changes from the RFC v9:
>  * Add x-data-plane=on|off option and coexist with regular virtio-blk code
>  * Create thread from BH so it inherits iothread cpusets
>  * Drain requests on vm_stop() so stopped guest does not access image file
>  * Add migration blocker
>  * Add bdrv_in_use() to prevent block jobs and other operations that can interfere
>  * Drop IOQueue request merging for simplicity
>  * Drop ioctl interrupt injection and always use irqfd for simplicity
>  * Major cleanup to split up source files
>  * Rebase from qemu-kvm.git onto qemu.git
>  * Address Michael Tsirkin's review comments
>
> Stefan Hajnoczi (11):
>   raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane
>   configure: add CONFIG_VIRTIO_BLK_DATA_PLANE
>   dataplane: add host memory mapping code
>   dataplane: add virtqueue vring code
>   dataplane: add event loop
>   dataplane: add Linux AIO request queue
>   iov: add iov_discard() to remove data
>   test-iov: add iov_discard() testcase
>   iov: add qemu_iovec_concat_iov()
>   dataplane: add virtio-blk data plane code
>   virtio-blk: add x-data-plane=on|off performance feature
>
>  block.h                    |   9 +
>  block/raw-posix.c          |  34 ++++
>  configure                  |  21 +++
>  hw/Makefile.objs           |   2 +-
>  hw/dataplane/Makefile.objs |   3 +
>  hw/dataplane/event-poll.c  | 109 ++++++++++++
>  hw/dataplane/event-poll.h  |  40 +++++
>  hw/dataplane/hostmem.c     | 165 ++++++++++++++++++
>  hw/dataplane/hostmem.h     |  52 ++++++
>  hw/dataplane/ioq.c         | 118 +++++++++++++
>  hw/dataplane/ioq.h         |  57 ++++++
>  hw/dataplane/virtio-blk.c  | 427 +++++++++++++++++++++++++++++++++++++++++++++
>  hw/dataplane/virtio-blk.h  |  41 +++++
>  hw/dataplane/vring.c       | 344 ++++++++++++++++++++++++++++++++++++
>  hw/dataplane/vring.h       |  62 +++++++
>  hw/virtio-blk.c            |  59 ++++++-
>  hw/virtio-blk.h            |   1 +
>  hw/virtio-pci.c            |   3 +
>  iov.c                      |  80 +++++++--
>  iov.h                      |  13 ++
>  qemu-common.h              |   3 +
>  tests/test-iov.c           | 129 ++++++++++++++
>  trace-events               |   9 +
>  23 files changed, 1767 insertions(+), 14 deletions(-)
>  create mode 100644 hw/dataplane/Makefile.objs
>  create mode 100644 hw/dataplane/event-poll.c
>  create mode 100644 hw/dataplane/event-poll.h
>  create mode 100644 hw/dataplane/hostmem.c
>  create mode 100644 hw/dataplane/hostmem.h
>  create mode 100644 hw/dataplane/ioq.c
>  create mode 100644 hw/dataplane/ioq.h
>  create mode 100644 hw/dataplane/virtio-blk.c
>  create mode 100644 hw/dataplane/virtio-blk.h
>  create mode 100644 hw/dataplane/vring.c
>  create mode 100644 hw/dataplane/vring.h

Michael, Paolo: Are you happy with v4?

Kevin: Do you want to take this series through the block tree?

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane
  2012-11-29  9:18 ` [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
@ 2012-11-29 12:03   ` Paolo Bonzini
  2012-11-29 14:09   ` Michael S. Tsirkin
  1 sibling, 0 replies; 43+ messages in thread
From: Paolo Bonzini @ 2012-11-29 12:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Blue Swirl, Khoa Huynh, Stefan Hajnoczi, Asias He

Il 29/11/2012 10:18, Stefan Hajnoczi ha scritto:
> Michael, Paolo: Are you happy with v4?

Sure.

> Kevin: Do you want to take this series through the block tree?

Paolo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane
  2012-11-29  9:18 ` [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
  2012-11-29 12:03   ` Paolo Bonzini
@ 2012-11-29 14:09   ` Michael S. Tsirkin
  2012-11-29 14:48     ` Stefan Hajnoczi
  1 sibling, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 14:09 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, Khoa Huynh,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 10:18:59AM +0100, Stefan Hajnoczi wrote:
> Michael, Paolo: Are you happy with v4?

Looks pretty clean by itself. I sent some comments but they can be
addressed later.  What worries me most is the code duplication with
regular virtio.

I see two ways to reduce the maintainance somewhat
- split out ring handling code in virtio-blk
  to a separate file to make it more obvious which part
  is inactive when data plane runs.
- share ring processing code with virtio/virtio-blk
  (e.g. use callbacks)

Was any thought given to implementing one of these two
approaches? 

-- 
MST

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane
  2012-11-29 14:09   ` Michael S. Tsirkin
@ 2012-11-29 14:48     ` Stefan Hajnoczi
  2012-11-29 15:19       ` Michael S. Tsirkin
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2012-11-29 14:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, Khoa Huynh,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 04:09:28PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 29, 2012 at 10:18:59AM +0100, Stefan Hajnoczi wrote:
> > Michael, Paolo: Are you happy with v4?
> 
> Looks pretty clean by itself. I sent some comments but they can be
> addressed later.  What worries me most is the code duplication with
> regular virtio.
> 
> I see two ways to reduce the maintainance somewhat
> - split out ring handling code in virtio-blk
>   to a separate file to make it more obvious which part
>   is inactive when data plane runs.
> - share ring processing code with virtio/virtio-blk
>   (e.g. use callbacks)
> 
> Was any thought given to implementing one of these two
> approaches? 

Yes, your option #2 is where I'd like to move once threaded memory
dispatch is working.  I hope we can run virtio.c code in a thread
outside the global mutex soon.  That way we can kill
hw/dataplane/vring.[ch].

Ping Fan Liu has been working on the memory API and device emulation
stuff that we need in order to eventually use virtio.c outside the
global mutex.

Stefan

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane
  2012-11-29 14:48     ` Stefan Hajnoczi
@ 2012-11-29 15:19       ` Michael S. Tsirkin
  0 siblings, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2012-11-29 15:19 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Blue Swirl, Khoa Huynh,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 29, 2012 at 03:48:04PM +0100, Stefan Hajnoczi wrote:
> On Thu, Nov 29, 2012 at 04:09:28PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Nov 29, 2012 at 10:18:59AM +0100, Stefan Hajnoczi wrote:
> > > Michael, Paolo: Are you happy with v4?
> > 
> > Looks pretty clean by itself. I sent some comments but they can be
> > addressed later.  What worries me most is the code duplication with
> > regular virtio.
> > 
> > I see two ways to reduce the maintainance somewhat
> > - split out ring handling code in virtio-blk
> >   to a separate file to make it more obvious which part
> >   is inactive when data plane runs.
> > - share ring processing code with virtio/virtio-blk
> >   (e.g. use callbacks)
> > 
> > Was any thought given to implementing one of these two
> > approaches? 
> 
> Yes, your option #2 is where I'd like to move once threaded memory
> dispatch is working.  I hope we can run virtio.c code in a thread
> outside the global mutex soon.  That way we can kill
> hw/dataplane/vring.[ch].
> 
> Ping Fan Liu has been working on the memory API and device emulation
> stuff that we need in order to eventually use virtio.c outside the
> global mutex.
> 
> Stefan

I guess we can live with this short term.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2012-12-05 12:57 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-22 15:16 [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 01/11] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane Stefan Hajnoczi
2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 02/11] configure: add CONFIG_VIRTIO_BLK_DATA_PLANE Stefan Hajnoczi
2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 03/11] dataplane: add host memory mapping code Stefan Hajnoczi
2012-11-29 12:33   ` Michael S. Tsirkin
2012-11-29 12:45     ` Stefan Hajnoczi
2012-11-29 12:54       ` Michael S. Tsirkin
2012-11-29 12:57         ` Michael S. Tsirkin
2012-12-05  8:13           ` Stefan Hajnoczi
2012-11-29 13:54   ` Michael S. Tsirkin
2012-11-29 14:26     ` Stefan Hajnoczi
2012-11-29 14:36       ` Michael S. Tsirkin
2012-11-29 15:26         ` Paolo Bonzini
2012-12-05  8:31         ` Stefan Hajnoczi
2012-12-05 11:22           ` Michael S. Tsirkin
2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 04/11] dataplane: add virtqueue vring code Stefan Hajnoczi
2012-11-29 12:50   ` Michael S. Tsirkin
2012-11-29 15:17     ` Paolo Bonzini
2012-12-05 12:57     ` Stefan Hajnoczi
2012-11-29 13:48   ` Michael S. Tsirkin
2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 05/11] dataplane: add event loop Stefan Hajnoczi
2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 06/11] dataplane: add Linux AIO request queue Stefan Hajnoczi
2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 07/11] iov: add iov_discard() to remove data Stefan Hajnoczi
2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 08/11] test-iov: add iov_discard() testcase Stefan Hajnoczi
2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 09/11] iov: add qemu_iovec_concat_iov() Stefan Hajnoczi
2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 10/11] dataplane: add virtio-blk data plane code Stefan Hajnoczi
2012-11-29 13:41   ` Michael S. Tsirkin
2012-11-29 14:02   ` Michael S. Tsirkin
2012-11-29 15:21     ` Paolo Bonzini
2012-11-29 15:27       ` Michael S. Tsirkin
2012-11-29 15:47         ` Paolo Bonzini
2012-11-30 13:57           ` Stefan Hajnoczi
2012-11-22 15:16 ` [Qemu-devel] [PATCH v4 11/11] virtio-blk: add x-data-plane=on|off performance feature Stefan Hajnoczi
2012-11-29 13:12   ` Michael S. Tsirkin
2012-11-29 14:45     ` Stefan Hajnoczi
2012-11-29 14:55       ` Michael S. Tsirkin
2012-12-04 11:20         ` Michael S. Tsirkin
2012-12-04 14:19           ` Stefan Hajnoczi
2012-11-29  9:18 ` [Qemu-devel] [PATCH v4 00/11] virtio: virtio-blk data plane Stefan Hajnoczi
2012-11-29 12:03   ` Paolo Bonzini
2012-11-29 14:09   ` Michael S. Tsirkin
2012-11-29 14:48     ` Stefan Hajnoczi
2012-11-29 15:19       ` Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).