qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read
@ 2011-11-23 11:47 Stefan Hajnoczi
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros Stefan Hajnoczi
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

The new -drive copy-on-read=on|off feature populates the image file with data
from the backing file on read.  This is useful when accessing images backed
over a slow medium (e.g. http over internet).  All read data will be stored in
the local image file so it does not need to be fetched again in the future.

This series is a prerequisite for the image streaming feature, which uses
copy-on-read to populate the image file in the background while the VM is
running.  However, the copy-on-read feature is useful on its own.

Copy-on-read is implemented by checking whether or not data is allocated in the
image file before reading it.  If data is not allocated then it needs to be
read and written back to the image file.

The tricky bit is avoiding races with other I/O requests.  These patches add
request tracking to BlockDriverState so that the list of pending requests is
available.  Copy-on-read prevents races by serializing overlapping requests.

Finally, there is a performance impact when enabling this feature since an
additional write is performed.  Serializing overlapping requests also means
that I/O patterns where multiple requests access the same cluster will see a
loss in parallelism.  Perhaps we can be smarter about preventing corruption in
the future and win back some performance.

v4:
 * No changes, sorry for the spam v3 emails that were sent out

v3:
 * Improve wait_for_overlapping_requests() comment [Kevin]

v2:
 * Based on bdrv_co_is_allocated patch series - now safe in coroutine context
 * Use QEMU_ALIGN_DOWN/UP() macros for copy-on-read cluster calculations [Zhi Yong]
 * Reset bs->copy_on_read on bdrv_close() [Kevin]
 * Refcount bs->copy_on_read so it doesn't get clobbered by multiple users [Marcelo]
 * Use bool instead of int where appropriate [Kevin]
 * Use compound literal assignment to ensure BdrvTrackedRequest fields always get zeroed [Kevin]
 * Comment rationale for copy-on-read bounce buffer [Kevin]

Stefan Hajnoczi (8):
  qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros
  coroutine: add qemu_co_queue_restart_all()
  block: add request tracking
  block: add bdrv_set_copy_on_read()
  block: wait for overlapping requests
  block: request overlap detection
  block: core copy-on-read logic
  block: add -drive copy-on-read=on|off

 block.c               |  218 ++++++++++++++++++++++++++++++++++++++++++++++++-
 block.h               |    3 +
 block/qcow2.c         |    2 +-
 block_int.h           |    6 ++
 blockdev.c            |    6 ++
 hmp-commands.hx       |    5 +-
 qemu-common.h         |    6 ++
 qemu-config.c         |    4 +
 qemu-coroutine-lock.c |   15 ++--
 qemu-coroutine.h      |    5 +
 qemu-options.hx       |    9 ++-
 trace-events          |    1 +
 12 files changed, 268 insertions(+), 12 deletions(-)

-- 
1.7.7.1

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v4 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros
  2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 2/8] coroutine: add qemu_co_queue_restart_all() Stefan Hajnoczi
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

Add macros for aligning a number to a multiple, for example:

QEMU_ALIGN_DOWN(500, 2000) = 0
QEMU_ALIGN_UP(500, 2000) = 2000

Since ALIGN_UP() is a common macro name use the QEMU_* namespace prefix.
Hopefully this will protect us from included headers that leak something
with a similar name.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 qemu-common.h |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/qemu-common.h b/qemu-common.h
index 2ce47aa..44870fe 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -341,6 +341,12 @@ static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
     return res.ll;
 }
 
+/* Round number down to multiple */
+#define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
+
+/* Round number up to multiple */
+#define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
+
 #include "module.h"
 
 #endif
-- 
1.7.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v4 2/8] coroutine: add qemu_co_queue_restart_all()
  2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 3/8] block: add request tracking Stefan Hajnoczi
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

It's common to wake up all waiting coroutines.  Introduce the
qemu_co_queue_restart_all() function to do this instead of looping over
qemu_co_queue_next() in every caller.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block/qcow2.c         |    2 +-
 qemu-coroutine-lock.c |   15 ++++++++-------
 qemu-coroutine.h      |    5 +++++
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index eab35d1..195e1b1 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -514,7 +514,7 @@ static void run_dependent_requests(BDRVQcowState *s, QCowL2Meta *m)
     /* Restart all dependent requests */
     if (!qemu_co_queue_empty(&m->dependent_requests)) {
         qemu_co_mutex_unlock(&s->lock);
-        while(qemu_co_queue_next(&m->dependent_requests));
+        qemu_co_queue_restart_all(&m->dependent_requests);
         qemu_co_mutex_lock(&s->lock);
     }
 }
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index 9549c07..26ad76b 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -84,6 +84,13 @@ bool qemu_co_queue_next(CoQueue *queue)
     return (next != NULL);
 }
 
+void qemu_co_queue_restart_all(CoQueue *queue)
+{
+    while (qemu_co_queue_next(queue)) {
+        /* Do nothing */
+    }
+}
+
 bool qemu_co_queue_empty(CoQueue *queue)
 {
     return (QTAILQ_FIRST(&queue->entries) == NULL);
@@ -144,13 +151,7 @@ void qemu_co_rwlock_unlock(CoRwlock *lock)
     assert(qemu_in_coroutine());
     if (lock->writer) {
         lock->writer = false;
-        while (!qemu_co_queue_empty(&lock->queue)) {
-            /*
-             * Wakeup every body. This will include some
-             * writers too.
-             */
-            qemu_co_queue_next(&lock->queue);
-        }
+        qemu_co_queue_restart_all(&lock->queue);
     } else {
         lock->reader--;
         assert(lock->reader >= 0);
diff --git a/qemu-coroutine.h b/qemu-coroutine.h
index 8a2e5d2..8a55fe1 100644
--- a/qemu-coroutine.h
+++ b/qemu-coroutine.h
@@ -131,6 +131,11 @@ void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue *queue);
 bool qemu_co_queue_next(CoQueue *queue);
 
 /**
+ * Restarts all coroutines in the CoQueue and leaves the queue empty.
+ */
+void qemu_co_queue_restart_all(CoQueue *queue);
+
+/**
  * Checks if the CoQueue is empty.
  */
 bool qemu_co_queue_empty(CoQueue *queue);
-- 
1.7.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v4 3/8] block: add request tracking
  2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros Stefan Hajnoczi
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 2/8] coroutine: add qemu_co_queue_restart_all() Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
  2011-12-05 12:17   ` Marcelo Tosatti
  2011-12-05 16:09   ` Marcelo Tosatti
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 4/8] block: add bdrv_set_copy_on_read() Stefan Hajnoczi
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

The block layer does not know about pending requests.  This information
is necessary for copy-on-read since overlapping requests must be
serialized to prevent races that corrupt the image.

The BlockDriverState gets a new tracked_request list field which
contains all pending requests.  Each request is a BdrvTrackedRequest
record with sector_num, nb_sectors, and is_write fields.

Note that request tracking is always enabled but hopefully this extra
work is so small that it doesn't justify adding an enable/disable flag.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c     |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
 block_int.h |    4 ++++
 2 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/block.c b/block.c
index 0df7eb9..27c4e84 100644
--- a/block.c
+++ b/block.c
@@ -1071,6 +1071,42 @@ void bdrv_commit_all(void)
     }
 }
 
+struct BdrvTrackedRequest {
+    BlockDriverState *bs;
+    int64_t sector_num;
+    int nb_sectors;
+    bool is_write;
+    QLIST_ENTRY(BdrvTrackedRequest) list;
+};
+
+/**
+ * Remove an active request from the tracked requests list
+ *
+ * This function should be called when a tracked request is completing.
+ */
+static void tracked_request_end(BdrvTrackedRequest *req)
+{
+    QLIST_REMOVE(req, list);
+}
+
+/**
+ * Add an active request to the tracked requests list
+ */
+static void tracked_request_begin(BdrvTrackedRequest *req,
+                                  BlockDriverState *bs,
+                                  int64_t sector_num,
+                                  int nb_sectors, bool is_write)
+{
+    *req = (BdrvTrackedRequest){
+        .bs = bs,
+        .sector_num = sector_num,
+        .nb_sectors = nb_sectors,
+        .is_write = is_write,
+    };
+
+    QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
+}
+
 /*
  * Return values:
  * 0        - success
@@ -1350,6 +1386,8 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
 {
     BlockDriver *drv = bs->drv;
+    BdrvTrackedRequest req;
+    int ret;
 
     if (!drv) {
         return -ENOMEDIUM;
@@ -1363,7 +1401,10 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, false, nb_sectors);
     }
 
-    return drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+    tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
+    ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+    tracked_request_end(&req);
+    return ret;
 }
 
 int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
@@ -1381,6 +1422,7 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
 {
     BlockDriver *drv = bs->drv;
+    BdrvTrackedRequest req;
     int ret;
 
     if (!bs->drv) {
@@ -1398,6 +1440,8 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, true, nb_sectors);
     }
 
+    tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
+
     ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
 
     if (bs->dirty_bitmap) {
@@ -1408,6 +1452,8 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         bs->wr_highest_sector = sector_num + nb_sectors - 1;
     }
 
+    tracked_request_end(&req);
+
     return ret;
 }
 
diff --git a/block_int.h b/block_int.h
index f9e2c9a..788fde9 100644
--- a/block_int.h
+++ b/block_int.h
@@ -51,6 +51,8 @@
 #define BLOCK_OPT_PREALLOC      "preallocation"
 #define BLOCK_OPT_SUBFMT        "subformat"
 
+typedef struct BdrvTrackedRequest BdrvTrackedRequest;
+
 typedef struct AIOPool {
     void (*cancel)(BlockDriverAIOCB *acb);
     int aiocb_size;
@@ -250,6 +252,8 @@ struct BlockDriverState {
     int in_use; /* users other than guest access, eg. block migration */
     QTAILQ_ENTRY(BlockDriverState) list;
     void *private;
+
+    QLIST_HEAD(, BdrvTrackedRequest) tracked_requests;
 };
 
 struct BlockDriverAIOCB {
-- 
1.7.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v4 4/8] block: add bdrv_set_copy_on_read()
  2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 3/8] block: add request tracking Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 5/8] block: wait for overlapping requests Stefan Hajnoczi
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

The bdrv_set_copy_on_read() function can be used to programmatically
enable or disable copy-on-read for a block device.  Later patches add
the actual copy-on-read logic.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c     |   22 ++++++++++++++++++++++
 block.h     |    3 +++
 block_int.h |    2 ++
 3 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 27c4e84..c90880b 100644
--- a/block.c
+++ b/block.c
@@ -538,6 +538,22 @@ int bdrv_parse_cache_flags(const char *mode, int *flags)
     return 0;
 }
 
+/**
+ * Enable/disable copy-on-read
+ *
+ * This is based on a reference count so multiple users may call this function
+ * without worrying about clobbering the previous state.  Copy-on-read stays
+ * enabled until all users have called to disable it.
+ */
+void bdrv_set_copy_on_read(BlockDriverState *bs, bool enable)
+{
+    if (enable) {
+        bs->copy_on_read++;
+    } else {
+        bs->copy_on_read--;
+    }
+}
+
 /*
  * Common part for opening disk images and files
  */
@@ -559,6 +575,11 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename,
     bs->growable = 0;
     bs->buffer_alignment = 512;
 
+    assert(bs->copy_on_read == 0); /* bdrv_new() and bdrv_close() make it so */
+    if (flags & BDRV_O_RDWR) {
+        bdrv_set_copy_on_read(bs, !!(flags & BDRV_O_COPY_ON_READ));
+    }
+
     pstrcpy(bs->filename, sizeof(bs->filename), filename);
     bs->backing_file[0] = '\0';
 
@@ -801,6 +822,7 @@ void bdrv_close(BlockDriverState *bs)
 #endif
         bs->opaque = NULL;
         bs->drv = NULL;
+        bs->copy_on_read = 0;
 
         if (bs->file != NULL) {
             bdrv_close(bs->file);
diff --git a/block.h b/block.h
index ad8dd48..68b4b14 100644
--- a/block.h
+++ b/block.h
@@ -70,6 +70,7 @@ typedef struct BlockDevOps {
 #define BDRV_O_NATIVE_AIO  0x0080 /* use native AIO instead of the thread pool */
 #define BDRV_O_NO_BACKING  0x0100 /* don't open the backing file */
 #define BDRV_O_NO_FLUSH    0x0200 /* disable flushing on this disk */
+#define BDRV_O_COPY_ON_READ 0x0400 /* copy read backing sectors into image */
 
 #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NO_FLUSH)
 
@@ -308,6 +309,8 @@ void bdrv_reset_dirty(BlockDriverState *bs, int64_t cur_sector,
                       int nr_sectors);
 int64_t bdrv_get_dirty_count(BlockDriverState *bs);
 
+void bdrv_set_copy_on_read(BlockDriverState *bs, bool enable);
+
 void bdrv_set_in_use(BlockDriverState *bs, int in_use);
 int bdrv_in_use(BlockDriverState *bs);
 
diff --git a/block_int.h b/block_int.h
index 788fde9..3c5bacb 100644
--- a/block_int.h
+++ b/block_int.h
@@ -193,6 +193,8 @@ struct BlockDriverState {
     int encrypted; /* if true, the media is encrypted */
     int valid_key; /* if true, a valid encryption key has been set */
     int sg;        /* if true, the device is a /dev/sg* */
+    int copy_on_read; /* if true, copy read backing sectors into image
+                         note this is a reference count */
 
     BlockDriver *drv; /* NULL means no media */
     void *opaque;
-- 
1.7.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v4 5/8] block: wait for overlapping requests
  2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 4/8] block: add bdrv_set_copy_on_read() Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 6/8] block: request overlap detection Stefan Hajnoczi
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

When copy-on-read is enabled it is necessary to wait for overlapping
requests before issuing new requests.  This prevents races between the
copy-on-read and a write request.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c |   35 +++++++++++++++++++++++++++++++++++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index c90880b..da7aaa2 100644
--- a/block.c
+++ b/block.c
@@ -1099,6 +1099,7 @@ struct BdrvTrackedRequest {
     int nb_sectors;
     bool is_write;
     QLIST_ENTRY(BdrvTrackedRequest) list;
+    CoQueue wait_queue; /* coroutines blocked on this request */
 };
 
 /**
@@ -1109,6 +1110,7 @@ struct BdrvTrackedRequest {
 static void tracked_request_end(BdrvTrackedRequest *req)
 {
     QLIST_REMOVE(req, list);
+    qemu_co_queue_restart_all(&req->wait_queue);
 }
 
 /**
@@ -1126,9 +1128,34 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
         .is_write = is_write,
     };
 
+    qemu_co_queue_init(&req->wait_queue);
+
     QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
 }
 
+static bool tracked_request_overlaps(BdrvTrackedRequest *req,
+                                     int64_t sector_num, int nb_sectors) {
+    return false; /* not yet implemented */
+}
+
+static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
+        int64_t sector_num, int nb_sectors)
+{
+    BdrvTrackedRequest *req;
+    bool retry;
+
+    do {
+        retry = false;
+        QLIST_FOREACH(req, &bs->tracked_requests, list) {
+            if (tracked_request_overlaps(req, sector_num, nb_sectors)) {
+                qemu_co_queue_wait(&req->wait_queue);
+                retry = true;
+                break;
+            }
+        }
+    } while (retry);
+}
+
 /*
  * Return values:
  * 0        - success
@@ -1423,6 +1450,10 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, false, nb_sectors);
     }
 
+    if (bs->copy_on_read) {
+        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+    }
+
     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
     ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
     tracked_request_end(&req);
@@ -1462,6 +1493,10 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, true, nb_sectors);
     }
 
+    if (bs->copy_on_read) {
+        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+    }
+
     tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
 
     ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
-- 
1.7.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v4 6/8] block: request overlap detection
  2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 5/8] block: wait for overlapping requests Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic Stefan Hajnoczi
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 8/8] block: add -drive copy-on-read=on|off Stefan Hajnoczi
  7 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

Detect overlapping requests and remember to align to cluster boundaries
if the image format uses them.  This assumes that allocating I/O is
performed in cluster granularity - which is true for qcow2, qed, etc.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c |   45 +++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index da7aaa2..c30c8f2 100644
--- a/block.c
+++ b/block.c
@@ -1133,21 +1133,62 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
     QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
 }
 
+/**
+ * Round a region to cluster boundaries
+ */
+static void round_to_clusters(BlockDriverState *bs,
+                              int64_t sector_num, int nb_sectors,
+                              int64_t *cluster_sector_num,
+                              int *cluster_nb_sectors)
+{
+    BlockDriverInfo bdi;
+
+    if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
+        *cluster_sector_num = sector_num;
+        *cluster_nb_sectors = nb_sectors;
+    } else {
+        int64_t c = bdi.cluster_size / BDRV_SECTOR_SIZE;
+        *cluster_sector_num = QEMU_ALIGN_DOWN(sector_num, c);
+        *cluster_nb_sectors = QEMU_ALIGN_UP(sector_num - *cluster_sector_num +
+                                            nb_sectors, c);
+    }
+}
+
 static bool tracked_request_overlaps(BdrvTrackedRequest *req,
                                      int64_t sector_num, int nb_sectors) {
-    return false; /* not yet implemented */
+    /*        aaaa   bbbb */
+    if (sector_num >= req->sector_num + req->nb_sectors) {
+        return false;
+    }
+    /* bbbb   aaaa        */
+    if (req->sector_num >= sector_num + nb_sectors) {
+        return false;
+    }
+    return true;
 }
 
 static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors)
 {
     BdrvTrackedRequest *req;
+    int64_t cluster_sector_num;
+    int cluster_nb_sectors;
     bool retry;
 
+    /* If we touch the same cluster it counts as an overlap.  This guarantees
+     * that allocating writes will be serialized and not race with each other
+     * for the same cluster.  For example, in copy-on-read it ensures that the
+     * CoR read and write operations are atomic and guest writes cannot
+     * interleave between them.
+     */
+    round_to_clusters(bs, sector_num, nb_sectors,
+                      &cluster_sector_num, &cluster_nb_sectors);
+
     do {
         retry = false;
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
-            if (tracked_request_overlaps(req, sector_num, nb_sectors)) {
+            if (tracked_request_overlaps(req, cluster_sector_num,
+                                         cluster_nb_sectors)) {
                 qemu_co_queue_wait(&req->wait_queue);
                 retry = true;
                 break;
-- 
1.7.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic
  2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
                   ` (5 preceding siblings ...)
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 6/8] block: request overlap detection Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
  2011-12-02 17:13   ` Marcelo Tosatti
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 8/8] block: add -drive copy-on-read=on|off Stefan Hajnoczi
  7 siblings, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c      |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 trace-events |    1 +
 2 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index c30c8f2..a598a19 100644
--- a/block.c
+++ b/block.c
@@ -1469,6 +1469,61 @@ int bdrv_pwrite_sync(BlockDriverState *bs, int64_t offset,
     return 0;
 }
 
+static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
+        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
+{
+    /* Perform I/O through a temporary buffer so that users who scribble over
+     * their read buffer while the operation is in progress do not end up
+     * modifying the image file.  This is critical for zero-copy guest I/O
+     * where anything might happen inside guest memory.
+     */
+    void *bounce_buffer;
+
+    struct iovec iov;
+    QEMUIOVector bounce_qiov;
+    int64_t cluster_sector_num;
+    int cluster_nb_sectors;
+    size_t skip_bytes;
+    int ret;
+
+    /* Cover entire cluster so no additional backing file I/O is required when
+     * allocating cluster in the image file.
+     */
+    round_to_clusters(bs, sector_num, nb_sectors,
+                      &cluster_sector_num, &cluster_nb_sectors);
+
+    trace_bdrv_co_copy_on_readv(bs, sector_num, nb_sectors,
+                                cluster_sector_num, cluster_nb_sectors);
+
+    iov.iov_len = cluster_nb_sectors * BDRV_SECTOR_SIZE;
+    iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
+    qemu_iovec_init_external(&bounce_qiov, &iov, 1);
+
+    ret = bs->drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
+                                 &bounce_qiov);
+    if (ret < 0) {
+        goto err;
+    }
+
+    ret = bs->drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
+                                  &bounce_qiov);
+    if (ret < 0) {
+        /* It might be okay to ignore write errors for guest requests.  If this
+         * is a deliberate copy-on-read then we don't want to ignore the error.
+         * Simply report it in all cases.
+         */
+        goto err;
+    }
+
+    skip_bytes = (sector_num - cluster_sector_num) * BDRV_SECTOR_SIZE;
+    qemu_iovec_from_buffer(qiov, bounce_buffer + skip_bytes,
+                           nb_sectors * BDRV_SECTOR_SIZE);
+
+err:
+    qemu_vfree(bounce_buffer);
+    return ret;
+}
+
 /*
  * Handle a read request in coroutine context
  */
@@ -1496,7 +1551,24 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
     }
 
     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
+
+    if (bs->copy_on_read) {
+        int pnum;
+
+        ret = bdrv_co_is_allocated(bs, sector_num, nb_sectors, &pnum);
+        if (ret < 0) {
+            goto out;
+        }
+
+        if (!ret || pnum != nb_sectors) {
+            ret = bdrv_co_copy_on_readv(bs, sector_num, nb_sectors, qiov);
+            goto out;
+        }
+    }
+
     ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+
+out:
     tracked_request_end(&req);
     return ret;
 }
diff --git a/trace-events b/trace-events
index 962caca..518b76b 100644
--- a/trace-events
+++ b/trace-events
@@ -69,6 +69,7 @@ bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
 bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
 bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
 bdrv_co_io_em(void *bs, int64_t sector_num, int nb_sectors, int is_write, void *acb) "bs %p sector_num %"PRId64" nb_sectors %d is_write %d acb %p"
+bdrv_co_copy_on_readv(void *bs, int64_t sector_num, int nb_sectors, int64_t cluster_sector_num, int cluster_nb_sectors) "bs %p sector_num %"PRId64" nb_sectors %d cluster_sector_num %"PRId64" cluster_nb_sectors %d"
 
 # hw/virtio-blk.c
 virtio_blk_req_complete(void *req, int status) "req %p status %d"
-- 
1.7.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v4 8/8] block: add -drive copy-on-read=on|off
  2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
                   ` (6 preceding siblings ...)
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
  7 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

This patch adds the -drive copy-on-read=on|off command-line option:

  copy-on-read=on|off
  copy-on-read is "on" or "off" and enables whether to copy read backing
  file sectors into the image file.  Copy-on-read avoids accessing the
  same backing file sectors repeatedly and is useful when the backing
  file is over a slow network.  By default copy-on-read is off.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 blockdev.c      |    6 ++++++
 hmp-commands.hx |    5 +++--
 qemu-config.c   |    4 ++++
 qemu-options.hx |    9 ++++++++-
 4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 9068c5b..af4e239 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -257,6 +257,7 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
     DriveInfo *dinfo;
     BlockIOLimit io_limits;
     int snapshot = 0;
+    bool copy_on_read;
     int ret;
 
     translation = BIOS_ATA_TRANSLATION_AUTO;
@@ -273,6 +274,7 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
 
     snapshot = qemu_opt_get_bool(opts, "snapshot", 0);
     ro = qemu_opt_get_bool(opts, "readonly", 0);
+    copy_on_read = qemu_opt_get_bool(opts, "copy-on-read", false);
 
     file = qemu_opt_get(opts, "file");
     serial = qemu_opt_get(opts, "serial");
@@ -546,6 +548,10 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
         bdrv_flags |= (BDRV_O_SNAPSHOT|BDRV_O_CACHE_WB|BDRV_O_NO_FLUSH);
     }
 
+    if (copy_on_read) {
+        bdrv_flags |= BDRV_O_COPY_ON_READ;
+    }
+
     if (media == MEDIA_CDROM) {
         /* CDROM is fine for any interface, don't check.  */
         ro = 1;
diff --git a/hmp-commands.hx b/hmp-commands.hx
index f8d855e..79a9195 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -860,9 +860,10 @@ ETEXI
         .args_type  = "pci_addr:s,opts:s",
         .params     = "[[<domain>:]<bus>:]<slot>\n"
                       "[file=file][,if=type][,bus=n]\n"
-                      "[,unit=m][,media=d][index=i]\n"
+                      "[,unit=m][,media=d][,index=i]\n"
                       "[,cyls=c,heads=h,secs=s[,trans=t]]\n"
-                      "[snapshot=on|off][,cache=on|off]",
+                      "[,snapshot=on|off][,cache=on|off]\n"
+                      "[,readonly=on|off][,copy-on-read=on|off]",
         .help       = "add drive to PCI storage controller",
         .mhandler.cmd = drive_hot_add,
     },
diff --git a/qemu-config.c b/qemu-config.c
index 1aa080f..18f3020 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -109,6 +109,10 @@ static QemuOptsList qemu_drive_opts = {
             .name = "bps_wr",
             .type = QEMU_OPT_NUMBER,
             .help = "limit write bytes per second",
+        },{
+            .name = "copy-on-read",
+            .type = QEMU_OPT_BOOL,
+            .help = "copy read data from backing file into image file",
         },
         { /* end of list */ }
     },
diff --git a/qemu-options.hx b/qemu-options.hx
index 25a7be7..b3db10c 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -135,7 +135,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,cyls=c,heads=h,secs=s[,trans=t]][,snapshot=on|off]\n"
     "       [,cache=writethrough|writeback|none|directsync|unsafe][,format=f]\n"
     "       [,serial=s][,addr=A][,id=name][,aio=threads|native]\n"
-    "       [,readonly=on|off]\n"
+    "       [,readonly=on|off][,copy-on-read=on|off]\n"
     "       [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]][[,iops=i]|[[,iops_rd=r][,iops_wr=w]]\n"
     "                use 'file' as a drive image\n", QEMU_ARCH_ALL)
 STEXI
@@ -187,6 +187,9 @@ host disk is full; report the error to the guest otherwise).
 The default setting is @option{werror=enospc} and @option{rerror=report}.
 @item readonly
 Open drive @option{file} as read-only. Guest write attempts will fail.
+@item copy-on-read=@var{copy-on-read}
+@var{copy-on-read} is "on" or "off" and enables whether to copy read backing
+file sectors into the image file.
 @end table
 
 By default, writethrough caching is used for all block device.  This means that
@@ -218,6 +221,10 @@ like your host losing power, the disk storage getting disconnected accidently,
 etc. you're image will most probably be rendered unusable.   When using
 the @option{-snapshot} option, unsafe caching is always used.
 
+Copy-on-read avoids accessing the same backing file sectors repeatedly and is
+useful when the backing file is over a slow network.  By default copy-on-read
+is off.
+
 Instead of @option{-cdrom} you can use:
 @example
 qemu -drive file=file,index=2,media=cdrom
-- 
1.7.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic Stefan Hajnoczi
@ 2011-12-02 17:13   ` Marcelo Tosatti
  2011-12-05 11:35     ` Stefan Hajnoczi
  0 siblings, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2011-12-02 17:13 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel

On Wed, Nov 23, 2011 at 11:47:57AM +0000, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  block.c      |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  trace-events |    1 +
>  2 files changed, 73 insertions(+), 0 deletions(-)
> 
> diff --git a/block.c b/block.c
> index c30c8f2..a598a19 100644
> --- a/block.c
> +++ b/block.c
> @@ -1469,6 +1469,61 @@ int bdrv_pwrite_sync(BlockDriverState *bs, int64_t offset,
>      return 0;
>  }
>  
> +static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
> +        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> +{
> +    /* Perform I/O through a temporary buffer so that users who scribble over
> +     * their read buffer while the operation is in progress do not end up
> +     * modifying the image file.  This is critical for zero-copy guest I/O
> +     * where anything might happen inside guest memory.
> +     */
> +    void *bounce_buffer;
> +
> +    struct iovec iov;
> +    QEMUIOVector bounce_qiov;
> +    int64_t cluster_sector_num;
> +    int cluster_nb_sectors;
> +    size_t skip_bytes;
> +    int ret;
> +
> +    /* Cover entire cluster so no additional backing file I/O is required when
> +     * allocating cluster in the image file.
> +     */
> +    round_to_clusters(bs, sector_num, nb_sectors,
> +                      &cluster_sector_num, &cluster_nb_sectors);
> +
> +    trace_bdrv_co_copy_on_readv(bs, sector_num, nb_sectors,
> +                                cluster_sector_num, cluster_nb_sectors);
> +
> +    iov.iov_len = cluster_nb_sectors * BDRV_SECTOR_SIZE;
> +    iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
> +    qemu_iovec_init_external(&bounce_qiov, &iov, 1);
> +
> +    ret = bs->drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
> +                                 &bounce_qiov);
> +    if (ret < 0) {
> +        goto err;
> +    }
> +
> +    ret = bs->drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
> +                                  &bounce_qiov);
> +    if (ret < 0) {
> +        /* It might be okay to ignore write errors for guest requests.  If this
> +         * is a deliberate copy-on-read then we don't want to ignore the error.
> +         * Simply report it in all cases.
> +         */
> +        goto err;
> +    }
> +
> +    skip_bytes = (sector_num - cluster_sector_num) * BDRV_SECTOR_SIZE;
> +    qemu_iovec_from_buffer(qiov, bounce_buffer + skip_bytes,
> +                           nb_sectors * BDRV_SECTOR_SIZE);
> +
> +err:
> +    qemu_vfree(bounce_buffer);
> +    return ret;
> +}
> +
>  /*
>   * Handle a read request in coroutine context
>   */
> @@ -1496,7 +1551,24 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>      }
>  
>      tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
> +
> +    if (bs->copy_on_read) {
> +        int pnum;
> +
> +        ret = bdrv_co_is_allocated(bs, sector_num, nb_sectors, &pnum);
> +        if (ret < 0) {
> +            goto out;
> +        }

Stefan,

It is not clear where support for shared backing files would fit.
Let us consider the following block copy example:

1) Original chain:
[ BASE ] -> [ IMAGE-1 ] -> [ IMAGE-2 ] -> [ IMAGE-3 ]

2) Final chain:
[ BASE ] -> [ IMAGE-3 ]

I was talking to Kevin and we don't have code/monitor command to the
switch from 1) to 2). But that is a separate issue.

Question is, how do you plan to stream the contents of IMAGE-1 and
IMAGE-2 (but not BASE), into IMAGE-3 ? That is an important use case.

Also, do you have status on the external COW file work, for raw images?

Thanks

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic
  2011-12-02 17:13   ` Marcelo Tosatti
@ 2011-12-05 11:35     ` Stefan Hajnoczi
  0 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-12-05 11:35 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Kevin Wolf, Paolo Bonzini, Stefan Hajnoczi, qemu-devel

On Fri, Dec 2, 2011 at 5:13 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Wed, Nov 23, 2011 at 11:47:57AM +0000, Stefan Hajnoczi wrote:
>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>> ---
>>  block.c      |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  trace-events |    1 +
>>  2 files changed, 73 insertions(+), 0 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index c30c8f2..a598a19 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -1469,6 +1469,61 @@ int bdrv_pwrite_sync(BlockDriverState *bs, int64_t offset,
>>      return 0;
>>  }
>>
>> +static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
>> +        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
>> +{
>> +    /* Perform I/O through a temporary buffer so that users who scribble over
>> +     * their read buffer while the operation is in progress do not end up
>> +     * modifying the image file.  This is critical for zero-copy guest I/O
>> +     * where anything might happen inside guest memory.
>> +     */
>> +    void *bounce_buffer;
>> +
>> +    struct iovec iov;
>> +    QEMUIOVector bounce_qiov;
>> +    int64_t cluster_sector_num;
>> +    int cluster_nb_sectors;
>> +    size_t skip_bytes;
>> +    int ret;
>> +
>> +    /* Cover entire cluster so no additional backing file I/O is required when
>> +     * allocating cluster in the image file.
>> +     */
>> +    round_to_clusters(bs, sector_num, nb_sectors,
>> +                      &cluster_sector_num, &cluster_nb_sectors);
>> +
>> +    trace_bdrv_co_copy_on_readv(bs, sector_num, nb_sectors,
>> +                                cluster_sector_num, cluster_nb_sectors);
>> +
>> +    iov.iov_len = cluster_nb_sectors * BDRV_SECTOR_SIZE;
>> +    iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
>> +    qemu_iovec_init_external(&bounce_qiov, &iov, 1);
>> +
>> +    ret = bs->drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
>> +                                 &bounce_qiov);
>> +    if (ret < 0) {
>> +        goto err;
>> +    }
>> +
>> +    ret = bs->drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
>> +                                  &bounce_qiov);
>> +    if (ret < 0) {
>> +        /* It might be okay to ignore write errors for guest requests.  If this
>> +         * is a deliberate copy-on-read then we don't want to ignore the error.
>> +         * Simply report it in all cases.
>> +         */
>> +        goto err;
>> +    }
>> +
>> +    skip_bytes = (sector_num - cluster_sector_num) * BDRV_SECTOR_SIZE;
>> +    qemu_iovec_from_buffer(qiov, bounce_buffer + skip_bytes,
>> +                           nb_sectors * BDRV_SECTOR_SIZE);
>> +
>> +err:
>> +    qemu_vfree(bounce_buffer);
>> +    return ret;
>> +}
>> +
>>  /*
>>   * Handle a read request in coroutine context
>>   */
>> @@ -1496,7 +1551,24 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>>      }
>>
>>      tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
>> +
>> +    if (bs->copy_on_read) {
>> +        int pnum;
>> +
>> +        ret = bdrv_co_is_allocated(bs, sector_num, nb_sectors, &pnum);
>> +        if (ret < 0) {
>> +            goto out;
>> +        }
>
> Stefan,
>
> It is not clear where support for shared backing files would fit.
> Let us consider the following block copy example:
>
> 1) Original chain:
> [ BASE ] -> [ IMAGE-1 ] -> [ IMAGE-2 ] -> [ IMAGE-3 ]
>
> 2) Final chain:
> [ BASE ] -> [ IMAGE-3 ]
>
> I was talking to Kevin and we don't have code/monitor command to the
> switch from 1) to 2). But that is a separate issue.
>
> Question is, how do you plan to stream the contents of IMAGE-1 and
> IMAGE-2 (but not BASE), into IMAGE-3 ? That is an important use case.

We need to introduce a "deep" bdrv_is_allocated() which takes a "base"
argument at which to terminate the search.  Here's what it will do:

[ BASE ] -> [ IMAGE-1 ] -> [ IMAGE-2 ] -> [ IMAGE-3 ]

def bdrv_co_is_allocated(IMAGE-3, sector_num=0, base=BASE):
    if IMAGE-3 sector_num=0 is allocated:
        return True  # already allocated
    if IMAGE-2 sector_num=0 is allocated:
        return False # copy it!
    if IMAGE-1 sector_num=0 is allocated:
        return False # copy it!
    return True      # reached base image so we don't care

This is the hardcoded version of the function, in reality it will be a
loop but it shows the idea.

> Also, do you have status on the external COW file work, for raw images?

Sure, please help review Dong Xu Wang's add-cow patches on the list,
he is looking for feedback.

Stefan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/8] block: add request tracking
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 3/8] block: add request tracking Stefan Hajnoczi
@ 2011-12-05 12:17   ` Marcelo Tosatti
  2011-12-05 12:20     ` Paolo Bonzini
  2011-12-05 16:09   ` Marcelo Tosatti
  1 sibling, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2011-12-05 12:17 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel

On Wed, Nov 23, 2011 at 11:47:53AM +0000, Stefan Hajnoczi wrote:
> The block layer does not know about pending requests.  This information
> is necessary for copy-on-read since overlapping requests must be
> serialized to prevent races that corrupt the image.
> 
> The BlockDriverState gets a new tracked_request list field which
> contains all pending requests.  Each request is a BdrvTrackedRequest
> record with sector_num, nb_sectors, and is_write fields.
> 
> Note that request tracking is always enabled but hopefully this extra
> work is so small that it doesn't justify adding an enable/disable flag.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  block.c     |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
>  block_int.h |    4 ++++
>  2 files changed, 51 insertions(+), 1 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 0df7eb9..27c4e84 100644
> --- a/block.c
> +++ b/block.c
> @@ -1071,6 +1071,42 @@ void bdrv_commit_all(void)
>      }
>  }
>  
> +struct BdrvTrackedRequest {
> +    BlockDriverState *bs;
> +    int64_t sector_num;
> +    int nb_sectors;
> +    bool is_write;
> +    QLIST_ENTRY(BdrvTrackedRequest) list;
> +};
> +
> +/**
> + * Remove an active request from the tracked requests list
> + *
> + * This function should be called when a tracked request is completing.
> + */
> +static void tracked_request_end(BdrvTrackedRequest *req)
> +{
> +    QLIST_REMOVE(req, list);
> +}
> +
> +/**
> + * Add an active request to the tracked requests list
> + */
> +static void tracked_request_begin(BdrvTrackedRequest *req,
> +                                  BlockDriverState *bs,
> +                                  int64_t sector_num,
> +                                  int nb_sectors, bool is_write)
> +{
> +    *req = (BdrvTrackedRequest){
> +        .bs = bs,
> +        .sector_num = sector_num,
> +        .nb_sectors = nb_sectors,
> +        .is_write = is_write,
> +    };
> +
> +    QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
> +}
> +
>  /*
>   * Return values:
>   * 0        - success
> @@ -1350,6 +1386,8 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>      int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
>  {
>      BlockDriver *drv = bs->drv;
> +    BdrvTrackedRequest req;
> +    int ret;
>  
>      if (!drv) {
>          return -ENOMEDIUM;
> @@ -1363,7 +1401,10 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>          bdrv_io_limits_intercept(bs, false, nb_sectors);
>      }
>  
> -    return drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> +    tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
> +    ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> +    tracked_request_end(&req);
> +    return ret;
>  }
>  
>  int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
> @@ -1381,6 +1422,7 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
>      int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
>  {
>      BlockDriver *drv = bs->drv;
> +    BdrvTrackedRequest req;
>      int ret;
>  
>      if (!bs->drv) {
> @@ -1398,6 +1440,8 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
>          bdrv_io_limits_intercept(bs, true, nb_sectors);
>      }
>  
> +    tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
> +
>      ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
>  
>      if (bs->dirty_bitmap) {
> @@ -1408,6 +1452,8 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
>          bs->wr_highest_sector = sector_num + nb_sectors - 1;
>      }
>  
> +    tracked_request_end(&req);
> +
>      return ret;
>  }

There is no need to worry about synchronous read/write requests
bypassing this interface, correct?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/8] block: add request tracking
  2011-12-05 12:17   ` Marcelo Tosatti
@ 2011-12-05 12:20     ` Paolo Bonzini
  0 siblings, 0 replies; 16+ messages in thread
From: Paolo Bonzini @ 2011-12-05 12:20 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

On 12/05/2011 01:17 PM, Marcelo Tosatti wrote:
> There is no need to worry about synchronous read/write requests
> bypassing this interface, correct?
>

Synchronous read/write requests do not exist anymore (in the sense that 
they also go through coroutines).

Paolo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/8] block: add request tracking
  2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 3/8] block: add request tracking Stefan Hajnoczi
  2011-12-05 12:17   ` Marcelo Tosatti
@ 2011-12-05 16:09   ` Marcelo Tosatti
  2011-12-05 16:20     ` Stefan Hajnoczi
  1 sibling, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2011-12-05 16:09 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel

On Wed, Nov 23, 2011 at 11:47:53AM +0000, Stefan Hajnoczi wrote:
> The block layer does not know about pending requests.  This information
> is necessary for copy-on-read since overlapping requests must be
> serialized to prevent races that corrupt the image.
> 
> The BlockDriverState gets a new tracked_request list field which
> contains all pending requests.  Each request is a BdrvTrackedRequest
> record with sector_num, nb_sectors, and is_write fields.
> 
> Note that request tracking is always enabled but hopefully this extra
> work is so small that it doesn't justify adding an enable/disable flag.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  block.c     |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
>  block_int.h |    4 ++++
>  2 files changed, 51 insertions(+), 1 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 0df7eb9..27c4e84 100644
> --- a/block.c
> +++ b/block.c
> @@ -1071,6 +1071,42 @@ void bdrv_commit_all(void)
>      }
>  }
>  
> +struct BdrvTrackedRequest {
> +    BlockDriverState *bs;
> +    int64_t sector_num;
> +    int nb_sectors;
> +    bool is_write;
> +    QLIST_ENTRY(BdrvTrackedRequest) list;
> +};
> +
> +/**
> + * Remove an active request from the tracked requests list
> + *
> + * This function should be called when a tracked request is completing.
> + */
> +static void tracked_request_end(BdrvTrackedRequest *req)
> +{
> +    QLIST_REMOVE(req, list);
> +}
> +
> +/**
> + * Add an active request to the tracked requests list
> + */
> +static void tracked_request_begin(BdrvTrackedRequest *req,
> +                                  BlockDriverState *bs,
> +                                  int64_t sector_num,
> +                                  int nb_sectors, bool is_write)
> +{
> +    *req = (BdrvTrackedRequest){
> +        .bs = bs,
> +        .sector_num = sector_num,
> +        .nb_sectors = nb_sectors,
> +        .is_write = is_write,
> +    };
> +
> +    QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
> +}
> +
>  /*
>   * Return values:
>   * 0        - success
> @@ -1350,6 +1386,8 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>      int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
>  {
>      BlockDriver *drv = bs->drv;
> +    BdrvTrackedRequest req;
> +    int ret;
>  
>      if (!drv) {
>          return -ENOMEDIUM;
> @@ -1363,7 +1401,10 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>          bdrv_io_limits_intercept(bs, false, nb_sectors);
>      }
>  
> -    return drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> +    tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
> +    ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> +    tracked_request_end(&req);
> +    return ret;
>  }

Stefan,

On earlier discussion, you replied to me:

"
>>      req = tracked_request_add(bs, sector_num, nb_sectors, false);
>
> The tracked request should include cluster round info?

When checking A and B for overlap, only one of them needs to be
rounded in order for overlap detection to be correct.  Therefore we
can store the original request [start, length) in tracked_requests and
only round the new request.
"

The problem AFAICS is this:

- Store a non-cluster-aligned request in the tracked request list.
- Wait on that non-cluster-aligned request
  (wait_for_overlapping_requests).
- Submit cluster-aligned request for COR request.

So, the tracked request list does not properly reflect the in-flight 
COR requests. Which can result in:

1) guest reads sector 10.
2) <sector_num=10,nb_sectors=2> added to tracked request list.
3) COR code submits read for <sector_num=10,nb_sectors=2+cluster_align>
4) unrelated guest operation writes to sector 13, nb_sectors=1. That is
allowed to proceed without waiting because tracked request list does not
reflect what COR in-flight requests.

Am i missing something here?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/8] block: add request tracking
  2011-12-05 16:09   ` Marcelo Tosatti
@ 2011-12-05 16:20     ` Stefan Hajnoczi
  2011-12-05 16:31       ` Marcelo Tosatti
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-12-05 16:20 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Kevin Wolf, Paolo Bonzini, Stefan Hajnoczi, qemu-devel

On Mon, Dec 5, 2011 at 4:09 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Wed, Nov 23, 2011 at 11:47:53AM +0000, Stefan Hajnoczi wrote:
> On earlier discussion, you replied to me:
>
> "
>>>      req = tracked_request_add(bs, sector_num, nb_sectors, false);
>>
>> The tracked request should include cluster round info?
>
> When checking A and B for overlap, only one of them needs to be
> rounded in order for overlap detection to be correct.  Therefore we
> can store the original request [start, length) in tracked_requests and
> only round the new request.
> "
>
> The problem AFAICS is this:
>
> - Store a non-cluster-aligned request in the tracked request list.
> - Wait on that non-cluster-aligned request
>  (wait_for_overlapping_requests).
> - Submit cluster-aligned request for COR request.
>
> So, the tracked request list does not properly reflect the in-flight
> COR requests. Which can result in:
>
> 1) guest reads sector 10.
> 2) <sector_num=10,nb_sectors=2> added to tracked request list.
> 3) COR code submits read for <sector_num=10,nb_sectors=2+cluster_align>

It will also round down to sector_num=0 when cluster size is 128
sectors (e.g. qcow2 and qed).

> 4) unrelated guest operation writes to sector 13, nb_sectors=1. That is
> allowed to proceed without waiting because tracked request list does not
> reflect what COR in-flight requests.

The tracked request list has <sector_num=10, nb_sectors=2> and the
candidate write request is <sector_num=13, nb_sectors=1>.

In wait_for_overlapping_requests() we round the candidate request to
<sector_num=0, nb_sectors=cluster_size>.  This rounded request does
overlap <sector_num=10, sectors=2> so it will need to wait.

Therefore CoR and write will not execute at the same time.

Does this make more sense?

Stefan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/8] block: add request tracking
  2011-12-05 16:20     ` Stefan Hajnoczi
@ 2011-12-05 16:31       ` Marcelo Tosatti
  0 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2011-12-05 16:31 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, Stefan Hajnoczi, qemu-devel

On Mon, Dec 05, 2011 at 04:20:55PM +0000, Stefan Hajnoczi wrote:
> On Mon, Dec 5, 2011 at 4:09 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Wed, Nov 23, 2011 at 11:47:53AM +0000, Stefan Hajnoczi wrote:
> > On earlier discussion, you replied to me:
> >
> > "
> >>>      req = tracked_request_add(bs, sector_num, nb_sectors, false);
> >>
> >> The tracked request should include cluster round info?
> >
> > When checking A and B for overlap, only one of them needs to be
> > rounded in order for overlap detection to be correct.  Therefore we
> > can store the original request [start, length) in tracked_requests and
> > only round the new request.
> > "
> >
> > The problem AFAICS is this:
> >
> > - Store a non-cluster-aligned request in the tracked request list.
> > - Wait on that non-cluster-aligned request
> >  (wait_for_overlapping_requests).
> > - Submit cluster-aligned request for COR request.
> >
> > So, the tracked request list does not properly reflect the in-flight
> > COR requests. Which can result in:
> >
> > 1) guest reads sector 10.
> > 2) <sector_num=10,nb_sectors=2> added to tracked request list.
> > 3) COR code submits read for <sector_num=10,nb_sectors=2+cluster_align>
> 
> It will also round down to sector_num=0 when cluster size is 128
> sectors (e.g. qcow2 and qed).
> 
> > 4) unrelated guest operation writes to sector 13, nb_sectors=1. That is
> > allowed to proceed without waiting because tracked request list does not
> > reflect what COR in-flight requests.
> 
> The tracked request list has <sector_num=10, nb_sectors=2> and the
> candidate write request is <sector_num=13, nb_sectors=1>.
> 
> In wait_for_overlapping_requests() we round the candidate request to
> <sector_num=0, nb_sectors=cluster_size>.  This rounded request does
> overlap <sector_num=10, sectors=2> so it will need to wait.
> 
> Therefore CoR and write will not execute at the same time.
> 
> Does this make more sense?
> 
> Stefan

Ah yes same mistake on my part, again. Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2011-12-05 16:32 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 2/8] coroutine: add qemu_co_queue_restart_all() Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 3/8] block: add request tracking Stefan Hajnoczi
2011-12-05 12:17   ` Marcelo Tosatti
2011-12-05 12:20     ` Paolo Bonzini
2011-12-05 16:09   ` Marcelo Tosatti
2011-12-05 16:20     ` Stefan Hajnoczi
2011-12-05 16:31       ` Marcelo Tosatti
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 4/8] block: add bdrv_set_copy_on_read() Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 5/8] block: wait for overlapping requests Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 6/8] block: request overlap detection Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic Stefan Hajnoczi
2011-12-02 17:13   ` Marcelo Tosatti
2011-12-05 11:35     ` Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 8/8] block: add -drive copy-on-read=on|off Stefan Hajnoczi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).