* [Qemu-devel] [PATCH v4 1/6] cutils: extract buffer_is_zero() from qemu-img.c
From: Stefan Hajnoczi @ 2012-01-18 14:59 UTC
To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi
The qemu-img.c:is_not_zero() function checks if a buffer contains all
zeroes. This function will come in handy for zero-detection in the
block layer, so clean it up and move it to cutils.c.
Note that the function now returns true if the buffer is all zeroes.
This avoids the double negatives (i.e. !is_not_zero()) that the old
function caused in callers.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
cutils.c | 34 ++++++++++++++++++++++++++++++++++
qemu-common.h | 2 ++
qemu-img.c | 46 +++++++---------------------------------------
3 files changed, 43 insertions(+), 39 deletions(-)
diff --git a/cutils.c b/cutils.c
index a6ffd46..2ea9c3c 100644
--- a/cutils.c
+++ b/cutils.c
@@ -303,6 +303,40 @@ void qemu_iovec_memset_skip(QEMUIOVector *qiov, int c, size_t count,
}
}
+/*
+ * Checks if a buffer is all zeroes
+ *
+ * Attention! The len must be a multiple of 4 * sizeof(long) due to
+ * restriction of optimizations in this function.
+ */
+bool buffer_is_zero(const void *buf, size_t len)
+{
+ /*
+ * Use long as the biggest available internal data type that fits into the
+ * CPU register and unroll the loop to smooth out the effect of memory
+ * latency.
+ */
+
+ size_t i;
+ long d0, d1, d2, d3;
+ const long * const data = buf;
+
+ len /= sizeof(long);
+
+ for (i = 0; i < len; i += 4) {
+ d0 = data[i + 0];
+ d1 = data[i + 1];
+ d2 = data[i + 2];
+ d3 = data[i + 3];
+
+ if (d0 || d1 || d2 || d3) {
+ return false;
+ }
+ }
+
+ return true;
+}
+
#ifndef _WIN32
/* Sets a specific flag */
int fcntl_setfl(int fd, int flag)
diff --git a/qemu-common.h b/qemu-common.h
index 6ab7dfb..333f42f 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -327,6 +327,8 @@ void qemu_iovec_memset(QEMUIOVector *qiov, int c, size_t count);
void qemu_iovec_memset_skip(QEMUIOVector *qiov, int c, size_t count,
size_t skip);
+bool buffer_is_zero(const void *buf, size_t len);
+
void qemu_progress_init(int enabled, float min_skip);
void qemu_progress_end(void);
void qemu_progress_print(float delta, int max);
diff --git a/qemu-img.c b/qemu-img.c
index 01cc0d3..c4bcf41 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -515,40 +515,6 @@ static int img_commit(int argc, char **argv)
}
/*
- * Checks whether the sector is not a zero sector.
- *
- * Attention! The len must be a multiple of 4 * sizeof(long) due to
- * restriction of optimizations in this function.
- */
-static int is_not_zero(const uint8_t *sector, int len)
-{
- /*
- * Use long as the biggest available internal data type that fits into the
- * CPU register and unroll the loop to smooth out the effect of memory
- * latency.
- */
-
- int i;
- long d0, d1, d2, d3;
- const long * const data = (const long *) sector;
-
- len /= sizeof(long);
-
- for(i = 0; i < len; i += 4) {
- d0 = data[i + 0];
- d1 = data[i + 1];
- d2 = data[i + 2];
- d3 = data[i + 3];
-
- if (d0 || d1 || d2 || d3) {
- return 1;
- }
- }
-
- return 0;
-}
-
-/*
* Returns true iff the first sector pointed to by 'buf' contains at least
* a non-NUL byte.
*
@@ -557,20 +523,22 @@ static int is_not_zero(const uint8_t *sector, int len)
*/
static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum)
{
- int v, i;
+ bool is_zero;
+ int i;
if (n <= 0) {
*pnum = 0;
return 0;
}
- v = is_not_zero(buf, 512);
+ is_zero = buffer_is_zero(buf, 512);
for(i = 1; i < n; i++) {
buf += 512;
- if (v != is_not_zero(buf, 512))
+ if (is_zero != buffer_is_zero(buf, 512)) {
break;
+ }
}
*pnum = i;
- return v;
+ return !is_zero;
}
/*
@@ -955,7 +923,7 @@ static int img_convert(int argc, char **argv)
if (n < cluster_sectors) {
memset(buf + n * 512, 0, cluster_size - n * 512);
}
- if (is_not_zero(buf, cluster_size)) {
+ if (!buffer_is_zero(buf, cluster_size)) {
ret = bdrv_write_compressed(out_bs, sector_num, buf,
cluster_sectors);
if (ret != 0) {
--
1.7.8.3
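To make the double-negative point concrete, here is how a call site reads before and after the rename (an illustrative sketch using the 512-byte sector size from qemu-img.c, not a hunk from the patch):

    /* old: "is the sector not not-zero?" */
    if (!is_not_zero(buf, 512)) {
        /* sector is all zeroes */
    }

    /* new: reads the way it is meant */
    if (buffer_is_zero(buf, 512)) {
        /* sector is all zeroes */
    }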
* Re: [Qemu-devel] [PATCH v4 1/6] cutils: extract buffer_is_zero() from qemu-img.c
From: Kevin Wolf @ 2012-01-24 15:03 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel
On 18.01.2012 15:59, Stefan Hajnoczi wrote:
> The qemu-img.c:is_not_zero() function checks if a buffer contains all
> zeroes. This function will come in handy for zero-detection in the
> block layer, so clean it up and move it to cutils.c.
>
> Note that the function now returns true if the buffer is all zeroes.
> This avoids the double negatives (i.e. !is_not_zero()) that the old
> function caused in callers.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
> cutils.c | 34 ++++++++++++++++++++++++++++++++++
> qemu-common.h | 2 ++
> qemu-img.c | 46 +++++++---------------------------------------
> 3 files changed, 43 insertions(+), 39 deletions(-)
>
> diff --git a/cutils.c b/cutils.c
> index a6ffd46..2ea9c3c 100644
> --- a/cutils.c
> +++ b/cutils.c
> @@ -303,6 +303,40 @@ void qemu_iovec_memset_skip(QEMUIOVector *qiov, int c, size_t count,
> }
> }
>
> +/*
> + * Checks if a buffer is all zeroes
> + *
> + * Attention! The len must be a multiple of 4 * sizeof(long) due to
> + * restriction of optimizations in this function.
> + */
Sounds like something to assert(), while you're touching the code.
Kevin
* Re: [Qemu-devel] [PATCH v4 1/6] cutils: extract buffer_is_zero() from qemu-img.c
From: Stefan Hajnoczi @ 2012-02-06 15:37 UTC
To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel
On Tue, Jan 24, 2012 at 3:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 18.01.2012 15:59, Stefan Hajnoczi wrote:
>> @@ -303,6 +303,40 @@ void qemu_iovec_memset_skip(QEMUIOVector *qiov, int c, size_t count,
>> }
>> }
>>
>> +/*
>> + * Checks if a buffer is all zeroes
>> + *
>> + * Attention! The len must be a multiple of 4 * sizeof(long) due to
>> + * restriction of optimizations in this function.
>> + */
>
> Sounds like something to assert(), while you're touching the code.
Okay, will fix.
Stefan
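A minimal sketch of what the promised follow-up might look like, assuming the 4 * sizeof(long) restriction stays as documented (this is an illustration of Kevin's suggestion, not necessarily the exact patch that was merged):

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    bool buffer_is_zero(const void *buf, size_t len)
    {
        size_t i;
        long d0, d1, d2, d3;
        const long * const data = buf;

        /* The unrolled loop below reads 4 longs per iteration, so the
         * length must be a multiple of 4 * sizeof(long). */
        assert(len % (4 * sizeof(long)) == 0);

        len /= sizeof(long);

        for (i = 0; i < len; i += 4) {
            d0 = data[i + 0];
            d1 = data[i + 1];
            d2 = data[i + 2];
            d3 = data[i + 3];

            if (d0 || d1 || d2 || d3) {
                return false;
            }
        }

        return true;
    }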
* [Qemu-devel] [PATCH v4 2/6] block: add .bdrv_co_write_zeroes() interface
From: Stefan Hajnoczi @ 2012-01-18 14:59 UTC
To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi
The ability to zero regions of an image file is a useful primitive for
higher-level features such as image streaming or zero write detection.
Image formats may support an optimized metadata representation instead
of writing zeroes into the image file. This allows zero writes to be
potentially faster than regular write operations and also preserves the
sparseness of the image file.
The .bdrv_co_write_zeroes() interface should be implemented by block
drivers that wish to provide efficient zeroing.
Note that this operation is different from the discard operation, which
may leave the contents of the region indeterminate. That means
discarded blocks are not guaranteed to contain zeroes and may contain
junk data instead.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++------
block.h | 7 +++++++
block_int.h | 8 ++++++++
trace-events | 1 +
4 files changed, 63 insertions(+), 6 deletions(-)
diff --git a/block.c b/block.c
index 3621d11..c9fa5c1 100644
--- a/block.c
+++ b/block.c
@@ -50,6 +50,7 @@
typedef enum {
BDRV_REQ_COPY_ON_READ = 0x1,
+ BDRV_REQ_ZERO_WRITE = 0x2,
} BdrvRequestFlags;
static void bdrv_dev_change_media_cb(BlockDriverState *bs, bool load);
@@ -69,7 +70,8 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
BdrvRequestFlags flags);
static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
- int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
+ int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+ BdrvRequestFlags flags);
static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
int64_t sector_num,
QEMUIOVector *qiov,
@@ -1300,7 +1302,7 @@ static void coroutine_fn bdrv_rw_co_entry(void *opaque)
rwco->nb_sectors, rwco->qiov, 0);
} else {
rwco->ret = bdrv_co_do_writev(rwco->bs, rwco->sector_num,
- rwco->nb_sectors, rwco->qiov);
+ rwco->nb_sectors, rwco->qiov, 0);
}
}
@@ -1639,11 +1641,37 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
BDRV_REQ_COPY_ON_READ);
}
+static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
+ int64_t sector_num, int nb_sectors)
+{
+ BlockDriver *drv = bs->drv;
+ QEMUIOVector qiov;
+ struct iovec iov;
+ int ret;
+
+ /* First try the efficient write zeroes operation */
+ if (drv->bdrv_co_write_zeroes) {
+ return drv->bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
+ }
+
+ /* Fall back to bounce buffer if write zeroes is unsupported */
+ iov.iov_len = nb_sectors * BDRV_SECTOR_SIZE;
+ iov.iov_base = qemu_blockalign(bs, iov.iov_len);
+ memset(iov.iov_base, 0, iov.iov_len);
+ qemu_iovec_init_external(&qiov, &iov, 1);
+
+ ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, &qiov);
+
+ qemu_vfree(iov.iov_base);
+ return ret;
+}
+
/*
* Handle a write request in coroutine context
*/
static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
- int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
+ int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+ BdrvRequestFlags flags)
{
BlockDriver *drv = bs->drv;
BdrvTrackedRequest req;
@@ -1670,7 +1698,11 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
- ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
+ if (flags & BDRV_REQ_ZERO_WRITE) {
+ ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors);
+ } else {
+ ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
+ }
if (bs->dirty_bitmap) {
set_dirty_bitmap(bs, sector_num, nb_sectors, 1);
@@ -1690,7 +1722,16 @@ int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
{
trace_bdrv_co_writev(bs, sector_num, nb_sectors);
- return bdrv_co_do_writev(bs, sector_num, nb_sectors, qiov);
+ return bdrv_co_do_writev(bs, sector_num, nb_sectors, qiov, 0);
+}
+
+int coroutine_fn bdrv_co_write_zeroes(BlockDriverState *bs,
+ int64_t sector_num, int nb_sectors)
+{
+ trace_bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
+
+ return bdrv_co_do_writev(bs, sector_num, nb_sectors, NULL,
+ BDRV_REQ_ZERO_WRITE);
}
/**
@@ -3192,7 +3233,7 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
acb->req.nb_sectors, acb->req.qiov, 0);
} else {
acb->req.error = bdrv_co_do_writev(bs, acb->req.sector,
- acb->req.nb_sectors, acb->req.qiov);
+ acb->req.nb_sectors, acb->req.qiov, 0);
}
acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
diff --git a/block.h b/block.h
index cae289b..9f3aad3 100644
--- a/block.h
+++ b/block.h
@@ -146,6 +146,13 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
int nb_sectors, QEMUIOVector *qiov);
+/*
+ * Efficiently zero a region of the disk image. Note that this is a regular
+ * I/O request like read or write and should have a reasonable size. This
+ * function is not suitable for zeroing the entire image in a single request.
+ */
+int coroutine_fn bdrv_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
+ int nb_sectors);
int coroutine_fn bdrv_co_is_allocated(BlockDriverState *bs, int64_t sector_num,
int nb_sectors, int *pnum);
BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
diff --git a/block_int.h b/block_int.h
index 7be2988..7946cf6 100644
--- a/block_int.h
+++ b/block_int.h
@@ -131,6 +131,14 @@ struct BlockDriver {
int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs,
int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
+ /*
+ * Efficiently zero a region of the disk image. Typically an image format
+ * would use a compact metadata representation to implement this. This
+ * function pointer may be NULL and .bdrv_co_writev() will be called
+ * instead.
+ */
+ int coroutine_fn (*bdrv_co_write_zeroes)(BlockDriverState *bs,
+ int64_t sector_num, int nb_sectors);
int coroutine_fn (*bdrv_co_discard)(BlockDriverState *bs,
int64_t sector_num, int nb_sectors);
int coroutine_fn (*bdrv_co_is_allocated)(BlockDriverState *bs,
diff --git a/trace-events b/trace-events
index 6faedf3..faab38c 100644
--- a/trace-events
+++ b/trace-events
@@ -67,6 +67,7 @@ bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
bdrv_co_copy_on_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
+bdrv_co_write_zeroes(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
bdrv_co_io_em(void *bs, int64_t sector_num, int nb_sectors, int is_write, void *acb) "bs %p sector_num %"PRId64" nb_sectors %d is_write %d acb %p"
bdrv_co_do_copy_on_readv(void *bs, int64_t sector_num, int nb_sectors, int64_t cluster_sector_num, int cluster_nb_sectors) "bs %p sector_num %"PRId64" nb_sectors %d cluster_sector_num %"PRId64" cluster_nb_sectors %d"
--
1.7.8.3
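For a format driver, hooking into the new callback means providing a coroutine function with the signature above and filling in the BlockDriver field. A hypothetical "foo" format that tracks zero clusters purely in metadata might look roughly like this (BDRVFooState and foo_mark_zero_clusters are made-up names for illustration, not part of this series):

    static int coroutine_fn bdrv_foo_co_write_zeroes(BlockDriverState *bs,
                                                     int64_t sector_num,
                                                     int nb_sectors)
    {
        BDRVFooState *s = bs->opaque;

        /* Record the range as zero in the format's metadata instead of
         * writing nb_sectors * BDRV_SECTOR_SIZE bytes of zeroes. */
        return foo_mark_zero_clusters(s, sector_num, nb_sectors);
    }

    static BlockDriver bdrv_foo = {
        .format_name          = "foo",
        /* ... other callbacks ... */
        .bdrv_co_write_zeroes = bdrv_foo_co_write_zeroes,
    };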
* Re: [Qemu-devel] [PATCH v4 2/6] block: add .bdrv_co_write_zeroes() interface
From: Kevin Wolf @ 2012-01-24 15:16 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel
On 18.01.2012 15:59, Stefan Hajnoczi wrote:
> The ability to zero regions of an image file is a useful primitive for
> higher-level features such as image streaming or zero write detection.
>
> Image formats may support an optimized metadata representation instead
> of writing zeroes into the image file. This allows zero writes to be
> potentially faster than regular write operations and also preserves the
> sparseness of the image file.
>
> The .bdrv_co_write_zeroes() interface should be implemented by block
> drivers that wish to provide efficient zeroing.
>
> Note that this operation is different from the discard operation, which
> may leave the contents of the region indeterminate. That means
> discarded blocks are not guaranteed to contain zeroes and may contain
> junk data instead.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
> block.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++------
> block.h | 7 +++++++
> block_int.h | 8 ++++++++
> trace-events | 1 +
> 4 files changed, 63 insertions(+), 6 deletions(-)
>
> diff --git a/block.c b/block.c
> index 3621d11..c9fa5c1 100644
> --- a/block.c
> +++ b/block.c
> @@ -50,6 +50,7 @@
>
> typedef enum {
> BDRV_REQ_COPY_ON_READ = 0x1,
> + BDRV_REQ_ZERO_WRITE = 0x2,
> } BdrvRequestFlags;
>
> static void bdrv_dev_change_media_cb(BlockDriverState *bs, bool load);
> @@ -69,7 +70,8 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> BdrvRequestFlags flags);
> static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> - int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
> + int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> + BdrvRequestFlags flags);
> static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
> int64_t sector_num,
> QEMUIOVector *qiov,
> @@ -1300,7 +1302,7 @@ static void coroutine_fn bdrv_rw_co_entry(void *opaque)
> rwco->nb_sectors, rwco->qiov, 0);
> } else {
> rwco->ret = bdrv_co_do_writev(rwco->bs, rwco->sector_num,
> - rwco->nb_sectors, rwco->qiov);
> + rwco->nb_sectors, rwco->qiov, 0);
> }
> }
>
> @@ -1639,11 +1641,37 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
> BDRV_REQ_COPY_ON_READ);
> }
>
> +static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
> + int64_t sector_num, int nb_sectors)
> +{
> + BlockDriver *drv = bs->drv;
> + QEMUIOVector qiov;
> + struct iovec iov;
> + int ret;
> +
> + /* First try the efficient write zeroes operation */
> + if (drv->bdrv_co_write_zeroes) {
> + return drv->bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
> + }
> +
> + /* Fall back to bounce buffer if write zeroes is unsupported */
> + iov.iov_len = nb_sectors * BDRV_SECTOR_SIZE;
> + iov.iov_base = qemu_blockalign(bs, iov.iov_len);
> + memset(iov.iov_base, 0, iov.iov_len);
> + qemu_iovec_init_external(&qiov, &iov, 1);
> +
> + ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, &qiov);
> +
> + qemu_vfree(iov.iov_base);
> + return ret;
> +}
> +
> /*
> * Handle a write request in coroutine context
> */
> static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> - int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> + int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> + BdrvRequestFlags flags)
> {
> BlockDriver *drv = bs->drv;
> BdrvTrackedRequest req;
> @@ -1670,7 +1698,11 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
>
> tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
>
> - ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
> + if (flags & BDRV_REQ_ZERO_WRITE) {
> + ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors);
> + } else {
> + ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
> + }
>
> if (bs->dirty_bitmap) {
> set_dirty_bitmap(bs, sector_num, nb_sectors, 1);
> @@ -1690,7 +1722,16 @@ int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
> {
> trace_bdrv_co_writev(bs, sector_num, nb_sectors);
>
> - return bdrv_co_do_writev(bs, sector_num, nb_sectors, qiov);
> + return bdrv_co_do_writev(bs, sector_num, nb_sectors, qiov, 0);
> +}
> +
> +int coroutine_fn bdrv_co_write_zeroes(BlockDriverState *bs,
> + int64_t sector_num, int nb_sectors)
> +{
> + trace_bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
> +
> + return bdrv_co_do_writev(bs, sector_num, nb_sectors, NULL,
> + BDRV_REQ_ZERO_WRITE);
> }
>
> /**
> @@ -3192,7 +3233,7 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
> acb->req.nb_sectors, acb->req.qiov, 0);
> } else {
> acb->req.error = bdrv_co_do_writev(bs, acb->req.sector,
> - acb->req.nb_sectors, acb->req.qiov);
> + acb->req.nb_sectors, acb->req.qiov, 0);
> }
>
> acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
> diff --git a/block.h b/block.h
> index cae289b..9f3aad3 100644
> --- a/block.h
> +++ b/block.h
> @@ -146,6 +146,13 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
> int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
> int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
> int nb_sectors, QEMUIOVector *qiov);
> +/*
> + * Efficiently zero a region of the disk image. Note that this is a regular
> + * I/O request like read or write and should have a reasonable size. This
> + * function is not suitable for zeroing the entire image in a single request.
> + */
The reason for this is that in the fallback case you allocate memory for
the whole request, right? So what about just limiting the allocation in
bdrv_co_do_write_zeroes() to a reasonable size and putting a loop there
for large requests?
Kevin
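A rough sketch of the bounded fallback Kevin is describing, with the bounce buffer capped and the request split into chunks (the MAX_WRITE_ZEROES_CHUNK constant is invented for illustration; this is not the code that was eventually merged):

    #define MAX_WRITE_ZEROES_CHUNK 4096   /* sectors, i.e. a 2 MB bounce buffer */

    static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors)
    {
        BlockDriver *drv = bs->drv;
        QEMUIOVector qiov;
        struct iovec iov;
        int ret = 0;

        /* First try the efficient write zeroes operation */
        if (drv->bdrv_co_write_zeroes) {
            return drv->bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
        }

        /* Allocate one zeroed bounce buffer of bounded size and reuse it */
        iov.iov_len = MIN(nb_sectors, MAX_WRITE_ZEROES_CHUNK) * BDRV_SECTOR_SIZE;
        iov.iov_base = qemu_blockalign(bs, iov.iov_len);
        memset(iov.iov_base, 0, iov.iov_len);

        while (nb_sectors > 0) {
            int n = MIN(nb_sectors, MAX_WRITE_ZEROES_CHUNK);

            iov.iov_len = n * BDRV_SECTOR_SIZE;
            qemu_iovec_init_external(&qiov, &iov, 1);

            ret = drv->bdrv_co_writev(bs, sector_num, n, &qiov);
            if (ret < 0) {
                break;
            }
            sector_num += n;
            nb_sectors -= n;
        }

        qemu_vfree(iov.iov_base);
        return ret;
    }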
* Re: [Qemu-devel] [PATCH v4 2/6] block: add .bdrv_co_write_zeroes() interface
From: Stefan Hajnoczi @ 2012-02-06 15:50 UTC
To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel
On Tue, Jan 24, 2012 at 3:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> +/*
>> + * Efficiently zero a region of the disk image. Note that this is a regular
>> + * I/O request like read or write and should have a reasonable size. This
>> + * function is not suitable for zeroing the entire image in a single request.
>> + */
>
> The reason for this is that in the fallback case you allocate memory for
> the whole request, right? So what about just limiting the allocation in
> bdrv_co_do_write_zeroes() to a reasonable size and putting a loop there
> for large requests?
I'd rather not do that yet. We don't have a reasonable way to cancel
such a request yet. In the future, if we decide we'd like to do huge
bdrv_co_write_zeroes(), we could look at the details of making this
work.
Stefan
* Re: [Qemu-devel] [PATCH v4 2/6] block: add .bdrv_co_write_zeroes() interface
From: Kevin Wolf @ 2012-02-06 16:00 UTC
To: Stefan Hajnoczi; +Cc: Stefan Hajnoczi, qemu-devel
On 06.02.2012 16:50, Stefan Hajnoczi wrote:
> On Tue, Jan 24, 2012 at 3:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> +/*
>>> + * Efficiently zero a region of the disk image. Note that this is a regular
>>> + * I/O request like read or write and should have a reasonable size. This
>>> + * function is not suitable for zeroing the entire image in a single request.
>>> + */
>>
>> The reason for this is that in the fallback case you allocate memory for
>> the whole request, right? So what about just limiting the allocation in
>> bdrv_co_do_write_zeroes() to a reasonable size and putting a loop there
>> for large requests?
>
> I'd rather not do that yet. We don't have a reasonable way to cancel
> such a request yet. In the future, if we decide we'd like to do huge
> bdrv_co_write_zeroes(), we could look at the details of making this
> work.
Do we even need a way to cancel such requests?
But anyway, include the reason in the comment, then we can leave it as
it is.
Kevin
* Re: [Qemu-devel] [PATCH v4 2/6] block: add .bdrv_co_write_zeroes() interface
From: Stefan Hajnoczi @ 2012-02-06 16:16 UTC
To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel
On Mon, Feb 6, 2012 at 4:00 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 06.02.2012 16:50, Stefan Hajnoczi wrote:
>> On Tue, Jan 24, 2012 at 3:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> +/*
>>>> + * Efficiently zero a region of the disk image. Note that this is a regular
>>>> + * I/O request like read or write and should have a reasonable size. This
>>>> + * function is not suitable for zeroing the entire image in a single request.
>>>> + */
>>>
>>> The reason for this is that in the fallback case you allocate memory for
>>> the whole request, right? So what about just limiting the allocation in
>>> bdrv_co_do_write_zeroes() to a reasonable size and putting a loop there
>>> for large requests?
>>
>> I'd rather not do that yet. We don't have a reasonable way to cancel
>> such a request yet. In the future, if we decide we'd like to do huge
>> bdrv_co_write_zeroes(), we could look at the details of making this
>> work.
>
> Do we even need a way to cancel such requests?
I think so. Any long-running operation tends to come with cancel/abort.
> But anyway, include the reason in the comment, then we can leave it as
> it is.
Okay, will add it.
Stefan
* [Qemu-devel] [PATCH v4 3/6] block: perform zero-detection during copy-on-read
From: Stefan Hajnoczi @ 2012-01-18 14:59 UTC
To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi
Copy-on-Read populates the image file with data read from a backing
image. In order to avoid bloating the image file when all zeroes are
read, we should scan the buffer and perform an optimized zero write
operation.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block.c | 14 +++++++++++---
1 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/block.c b/block.c
index c9fa5c1..ae297bb 100644
--- a/block.c
+++ b/block.c
@@ -1517,6 +1517,7 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BlockDriverState *bs,
*/
void *bounce_buffer;
+ BlockDriver *drv = bs->drv;
struct iovec iov;
QEMUIOVector bounce_qiov;
int64_t cluster_sector_num;
@@ -1537,14 +1538,21 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BlockDriverState *bs,
iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
qemu_iovec_init_external(&bounce_qiov, &iov, 1);
- ret = bs->drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
- &bounce_qiov);
+ ret = drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
+ &bounce_qiov);
if (ret < 0) {
goto err;
}
- ret = bs->drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
+ if (drv->bdrv_co_write_zeroes &&
+ buffer_is_zero(bounce_buffer, iov.iov_len)) {
+ ret = drv->bdrv_co_write_zeroes(bs, cluster_sector_num,
+ cluster_nb_sectors);
+ } else {
+ ret = drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
&bounce_qiov);
+ }
+
if (ret < 0) {
/* It might be okay to ignore write errors for guest requests. If this
* is a deliberate copy-on-read then we don't want to ignore the error.
--
1.7.8.3
* [Qemu-devel] [PATCH v4 4/6] qed: replace is_write with flags field
From: Stefan Hajnoczi @ 2012-01-18 14:59 UTC
To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi
Per-request attributes like read/write are currently implemented as bool
fields in the QEDAIOCB struct. This becomes unwieldy as the number of
attributes grows. For example, the qed_aio_setup() function would have
to take multiple bool arguments, and at call sites it would be hard to
distinguish the meaning of each bool.
Instead use a flags field with bitmask constants. This will be used
when zero write support is added.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block/qed.c | 15 ++++++++-------
block/qed.h | 6 +++++-
trace-events | 2 +-
3 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/block/qed.c b/block/qed.c
index 8da3ebe..b66dd17 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -1233,8 +1233,8 @@ static void qed_aio_next_io(void *opaque, int ret)
{
QEDAIOCB *acb = opaque;
BDRVQEDState *s = acb_to_s(acb);
- QEDFindClusterFunc *io_fn =
- acb->is_write ? qed_aio_write_data : qed_aio_read_data;
+ QEDFindClusterFunc *io_fn = (acb->flags & QED_AIOCB_WRITE) ?
+ qed_aio_write_data : qed_aio_read_data;
trace_qed_aio_next_io(s, acb, ret, acb->cur_pos + acb->cur_qiov.size);
@@ -1264,14 +1264,14 @@ static BlockDriverAIOCB *qed_aio_setup(BlockDriverState *bs,
int64_t sector_num,
QEMUIOVector *qiov, int nb_sectors,
BlockDriverCompletionFunc *cb,
- void *opaque, bool is_write)
+ void *opaque, int flags)
{
QEDAIOCB *acb = qemu_aio_get(&qed_aio_pool, bs, cb, opaque);
trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors,
- opaque, is_write);
+ opaque, flags);
- acb->is_write = is_write;
+ acb->flags = flags;
acb->finished = NULL;
acb->qiov = qiov;
acb->qiov_offset = 0;
@@ -1291,7 +1291,7 @@ static BlockDriverAIOCB *bdrv_qed_aio_readv(BlockDriverState *bs,
BlockDriverCompletionFunc *cb,
void *opaque)
{
- return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, false);
+ return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 0);
}
static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
@@ -1300,7 +1300,8 @@ static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
BlockDriverCompletionFunc *cb,
void *opaque)
{
- return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, true);
+ return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb,
+ opaque, QED_AIOCB_WRITE);
}
static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
diff --git a/block/qed.h b/block/qed.h
index 62cbd3b..abed147 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -123,12 +123,16 @@ typedef struct QEDRequest {
CachedL2Table *l2_table;
} QEDRequest;
+enum {
+ QED_AIOCB_WRITE = 0x0001, /* read or write? */
+};
+
typedef struct QEDAIOCB {
BlockDriverAIOCB common;
QEMUBH *bh;
int bh_ret; /* final return status for completion bh */
QSIMPLEQ_ENTRY(QEDAIOCB) next; /* next request */
- bool is_write; /* false - read, true - write */
+ int flags; /* QED_AIOCB_* bits ORed together */
bool *finished; /* signal for cancel completion */
uint64_t end_pos; /* request end on block device, in bytes */
diff --git a/trace-events b/trace-events
index faab38c..2e68c2c 100644
--- a/trace-events
+++ b/trace-events
@@ -319,7 +319,7 @@ qed_need_check_timer_cb(void *s) "s %p"
qed_start_need_check_timer(void *s) "s %p"
qed_cancel_need_check_timer(void *s) "s %p"
qed_aio_complete(void *s, void *acb, int ret) "s %p acb %p ret %d"
-qed_aio_setup(void *s, void *acb, int64_t sector_num, int nb_sectors, void *opaque, int is_write) "s %p acb %p sector_num %"PRId64" nb_sectors %d opaque %p is_write %d"
+qed_aio_setup(void *s, void *acb, int64_t sector_num, int nb_sectors, void *opaque, int flags) "s %p acb %p sector_num %"PRId64" nb_sectors %d opaque %p flags %#x"
qed_aio_next_io(void *s, void *acb, int ret, uint64_t cur_pos) "s %p acb %p ret %d cur_pos %"PRIu64
qed_aio_read_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
qed_aio_write_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
--
1.7.8.3
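The readability argument is easiest to see at a call site. With a second bool attribute the call would have looked like the first line below (a hypothetical two-bool variant, not code from this series), while the flags version stays self-describing (QED_AIOCB_ZERO is introduced by the next patch):

    /* hypothetical two-bool variant: which argument means what? */
    qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, true, false);

    /* flags version: the request attributes are named at the call site */
    qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque,
                  QED_AIOCB_WRITE | QED_AIOCB_ZERO);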
* [Qemu-devel] [PATCH v4 5/6] qed: add .bdrv_co_write_zeroes() support
From: Stefan Hajnoczi @ 2012-01-18 14:59 UTC
To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi
Zero writes are a dedicated interface for writing regions of zeroes into
the image file. If clusters are not yet allocated, it is possible to use
an efficient metadata representation which keeps the image file compact
and does not store individual zero bytes.
Implementing this for the QED image format is fairly straightforward.
The only issue is that when a zero write touches an existing cluster, we
have to allocate a bounce buffer and perform a regular write.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block/qed.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
block/qed.h | 1 +
2 files changed, 103 insertions(+), 8 deletions(-)
diff --git a/block/qed.c b/block/qed.c
index b66dd17..a041d31 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -875,6 +875,12 @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
qemu_iovec_destroy(&acb->cur_qiov);
qed_unref_l2_cache_entry(acb->request.l2_table);
+ /* Free the buffer we may have allocated for zero writes */
+ if (acb->flags & QED_AIOCB_ZERO) {
+ qemu_vfree(acb->qiov->iov[0].iov_base);
+ acb->qiov->iov[0].iov_base = NULL;
+ }
+
/* Arrange for a bh to invoke the completion function */
acb->bh_ret = ret;
acb->bh = qemu_bh_new(qed_aio_complete_bh, acb);
@@ -941,9 +947,8 @@ static void qed_aio_write_l1_update(void *opaque, int ret)
/**
* Update L2 table with new cluster offsets and write them out
*/
-static void qed_aio_write_l2_update(void *opaque, int ret)
+static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
{
- QEDAIOCB *acb = opaque;
BDRVQEDState *s = acb_to_s(acb);
bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
int index;
@@ -959,7 +964,7 @@ static void qed_aio_write_l2_update(void *opaque, int ret)
index = qed_l2_index(s, acb->cur_pos);
qed_update_l2_table(s, acb->request.l2_table->table, index, acb->cur_nclusters,
- acb->cur_cluster);
+ offset);
if (need_alloc) {
/* Write out the whole new L2 table */
@@ -976,6 +981,12 @@ err:
qed_aio_complete(acb, ret);
}
+static void qed_aio_write_l2_update_cb(void *opaque, int ret)
+{
+ QEDAIOCB *acb = opaque;
+ qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
+}
+
/**
* Flush new data clusters before updating the L2 table
*
@@ -990,7 +1001,7 @@ static void qed_aio_write_flush_before_l2_update(void *opaque, int ret)
QEDAIOCB *acb = opaque;
BDRVQEDState *s = acb_to_s(acb);
- if (!bdrv_aio_flush(s->bs->file, qed_aio_write_l2_update, opaque)) {
+ if (!bdrv_aio_flush(s->bs->file, qed_aio_write_l2_update_cb, opaque)) {
qed_aio_complete(acb, -EIO);
}
}
@@ -1019,7 +1030,7 @@ static void qed_aio_write_main(void *opaque, int ret)
if (s->bs->backing_hd) {
next_fn = qed_aio_write_flush_before_l2_update;
} else {
- next_fn = qed_aio_write_l2_update;
+ next_fn = qed_aio_write_l2_update_cb;
}
}
@@ -1081,6 +1092,18 @@ static bool qed_should_set_need_check(BDRVQEDState *s)
return !(s->header.features & QED_F_NEED_CHECK);
}
+static void qed_aio_write_zero_cluster(void *opaque, int ret)
+{
+ QEDAIOCB *acb = opaque;
+
+ if (ret) {
+ qed_aio_complete(acb, ret);
+ return;
+ }
+
+ qed_aio_write_l2_update(acb, 0, 1);
+}
+
/**
* Write new data cluster
*
@@ -1092,6 +1115,7 @@ static bool qed_should_set_need_check(BDRVQEDState *s)
static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
{
BDRVQEDState *s = acb_to_s(acb);
+ BlockDriverCompletionFunc *cb;
/* Cancel timer when the first allocating request comes in */
if (QSIMPLEQ_EMPTY(&s->allocating_write_reqs)) {
@@ -1109,14 +1133,26 @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
acb->cur_nclusters = qed_bytes_to_clusters(s,
qed_offset_into_cluster(s, acb->cur_pos) + len);
- acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
qemu_iovec_copy(&acb->cur_qiov, acb->qiov, acb->qiov_offset, len);
+ if (acb->flags & QED_AIOCB_ZERO) {
+ /* Skip ahead if the clusters are already zero */
+ if (acb->find_cluster_ret == QED_CLUSTER_ZERO) {
+ qed_aio_next_io(acb, 0);
+ return;
+ }
+
+ cb = qed_aio_write_zero_cluster;
+ } else {
+ cb = qed_aio_write_prefill;
+ acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
+ }
+
if (qed_should_set_need_check(s)) {
s->header.features |= QED_F_NEED_CHECK;
- qed_write_header(s, qed_aio_write_prefill, acb);
+ qed_write_header(s, cb, acb);
} else {
- qed_aio_write_prefill(acb, 0);
+ cb(acb, 0);
}
}
@@ -1131,6 +1167,16 @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
*/
static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
{
+ /* Allocate buffer for zero writes */
+ if (acb->flags & QED_AIOCB_ZERO) {
+ struct iovec *iov = acb->qiov->iov;
+
+ if (!iov->iov_base) {
+ iov->iov_base = qemu_blockalign(acb->common.bs, iov->iov_len);
+ memset(iov->iov_base, 0, iov->iov_len);
+ }
+ }
+
/* Calculate the I/O vector */
acb->cur_cluster = offset;
qemu_iovec_copy(&acb->cur_qiov, acb->qiov, acb->qiov_offset, len);
@@ -1311,6 +1357,53 @@ static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
return bdrv_aio_flush(bs->file, cb, opaque);
}
+typedef struct {
+ Coroutine *co;
+ int ret;
+ bool done;
+} QEDWriteZeroesCB;
+
+static void coroutine_fn qed_co_write_zeroes_cb(void *opaque, int ret)
+{
+ QEDWriteZeroesCB *cb = opaque;
+
+ cb->done = true;
+ cb->ret = ret;
+ if (cb->co) {
+ qemu_coroutine_enter(cb->co, NULL);
+ }
+}
+
+static int coroutine_fn bdrv_qed_co_write_zeroes(BlockDriverState *bs,
+ int64_t sector_num,
+ int nb_sectors)
+{
+ BlockDriverAIOCB *blockacb;
+ QEDWriteZeroesCB cb = { .done = false };
+ QEMUIOVector qiov;
+ struct iovec iov;
+
+ /* Zero writes start without an I/O buffer. If a buffer becomes necessary
+ * then it will be allocated during request processing.
+ */
+ iov.iov_base = NULL,
+ iov.iov_len = nb_sectors * BDRV_SECTOR_SIZE,
+
+ qemu_iovec_init_external(&qiov, &iov, 1);
+ blockacb = qed_aio_setup(bs, sector_num, &qiov, nb_sectors,
+ qed_co_write_zeroes_cb, &cb,
+ QED_AIOCB_WRITE | QED_AIOCB_ZERO);
+ if (!blockacb) {
+ return -EIO;
+ }
+ if (!cb.done) {
+ cb.co = qemu_coroutine_self();
+ qemu_coroutine_yield();
+ }
+ assert(cb.done);
+ return cb.ret;
+}
+
static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset)
{
BDRVQEDState *s = bs->opaque;
@@ -1470,6 +1563,7 @@ static BlockDriver bdrv_qed = {
.bdrv_aio_readv = bdrv_qed_aio_readv,
.bdrv_aio_writev = bdrv_qed_aio_writev,
.bdrv_aio_flush = bdrv_qed_aio_flush,
+ .bdrv_co_write_zeroes = bdrv_qed_co_write_zeroes,
.bdrv_truncate = bdrv_qed_truncate,
.bdrv_getlength = bdrv_qed_getlength,
.bdrv_get_info = bdrv_qed_get_info,
diff --git a/block/qed.h b/block/qed.h
index abed147..62624a1 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -125,6 +125,7 @@ typedef struct QEDRequest {
enum {
QED_AIOCB_WRITE = 0x0001, /* read or write? */
+ QED_AIOCB_ZERO = 0x0002, /* zero write, used with QED_AIOCB_WRITE */
};
typedef struct QEDAIOCB {
--
1.7.8.3
* [Qemu-devel] [PATCH v4 6/6] qemu-io: add write -z option for bdrv_co_write_zeroes
From: Stefan Hajnoczi @ 2012-01-18 14:59 UTC
To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi
Extend the qemu-io write command with the -z option to call
bdrv_co_write_zeroes(). Exposing the zero write interface from qemu-io
allows us to write tests that exercise this new block layer interface.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
qemu-io.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 69 insertions(+), 8 deletions(-)
diff --git a/qemu-io.c b/qemu-io.c
index 7c446b6..c02a702 100644
--- a/qemu-io.c
+++ b/qemu-io.c
@@ -218,6 +218,51 @@ static int do_pwrite(char *buf, int64_t offset, int count, int *total)
return 1;
}
+typedef struct {
+ int64_t offset;
+ int count;
+ int *total;
+ int ret;
+ bool done;
+} CoWriteZeroes;
+
+static void coroutine_fn co_write_zeroes_entry(void *opaque)
+{
+ CoWriteZeroes *data = opaque;
+
+ data->ret = bdrv_co_write_zeroes(bs, data->offset / BDRV_SECTOR_SIZE,
+ data->count / BDRV_SECTOR_SIZE);
+ data->done = true;
+ if (data->ret < 0) {
+ *data->total = data->ret;
+ return;
+ }
+
+ *data->total = data->count;
+}
+
+static int do_co_write_zeroes(int64_t offset, int count, int *total)
+{
+ Coroutine *co;
+ CoWriteZeroes data = {
+ .offset = offset,
+ .count = count,
+ .total = total,
+ .done = false,
+ };
+
+ co = qemu_coroutine_create(co_write_zeroes_entry);
+ qemu_coroutine_enter(co, &data);
+ while (!data.done) {
+ qemu_aio_wait();
+ }
+ if (data.ret < 0) {
+ return data.ret;
+ } else {
+ return 1;
+ }
+}
+
static int do_load_vmstate(char *buf, int64_t offset, int count, int *total)
{
*total = bdrv_load_vmstate(bs, (uint8_t *)buf, offset, count);
@@ -643,6 +688,7 @@ static void write_help(void)
" -P, -- use different pattern to fill file\n"
" -C, -- report statistics in a machine parsable format\n"
" -q, -- quiet mode, do not show I/O statistics\n"
+" -z, -- write zeroes using bdrv_co_write_zeroes\n"
"\n");
}
@@ -654,7 +700,7 @@ static const cmdinfo_t write_cmd = {
.cfunc = write_f,
.argmin = 2,
.argmax = -1,
- .args = "[-abCpq] [-P pattern ] off len",
+ .args = "[-bCpqz] [-P pattern ] off len",
.oneline = "writes a number of bytes at a specified offset",
.help = write_help,
};
@@ -662,16 +708,16 @@ static const cmdinfo_t write_cmd = {
static int write_f(int argc, char **argv)
{
struct timeval t1, t2;
- int Cflag = 0, pflag = 0, qflag = 0, bflag = 0;
+ int Cflag = 0, pflag = 0, qflag = 0, bflag = 0, Pflag = 0, zflag = 0;
int c, cnt;
- char *buf;
+ char *buf = NULL;
int64_t offset;
int count;
/* Some compilers get confused and warn if this is not initialized. */
int total = 0;
int pattern = 0xcd;
- while ((c = getopt(argc, argv, "bCpP:q")) != EOF) {
+ while ((c = getopt(argc, argv, "bCpP:qz")) != EOF) {
switch (c) {
case 'b':
bflag = 1;
@@ -683,6 +729,7 @@ static int write_f(int argc, char **argv)
pflag = 1;
break;
case 'P':
+ Pflag = 1;
pattern = parse_pattern(optarg);
if (pattern < 0) {
return 0;
@@ -691,6 +738,9 @@ static int write_f(int argc, char **argv)
case 'q':
qflag = 1;
break;
+ case 'z':
+ zflag = 1;
+ break;
default:
return command_usage(&write_cmd);
}
@@ -700,8 +750,13 @@ static int write_f(int argc, char **argv)
return command_usage(&write_cmd);
}
- if (bflag && pflag) {
- printf("-b and -p cannot be specified at the same time\n");
+ if (bflag + pflag + zflag > 1) {
+ printf("-b, -p, or -z cannot be specified at the same time\n");
+ return 0;
+ }
+
+ if (zflag && Pflag) {
+ printf("-z and -P cannot be specified at the same time\n");
return 0;
}
@@ -732,13 +787,17 @@ static int write_f(int argc, char **argv)
}
}
- buf = qemu_io_alloc(count, pattern);
+ if (!zflag) {
+ buf = qemu_io_alloc(count, pattern);
+ }
gettimeofday(&t1, NULL);
if (pflag) {
cnt = do_pwrite(buf, offset, count, &total);
} else if (bflag) {
cnt = do_save_vmstate(buf, offset, count, &total);
+ } else if (zflag) {
+ cnt = do_co_write_zeroes(offset, count, &total);
} else {
cnt = do_write(buf, offset, count, &total);
}
@@ -758,7 +817,9 @@ static int write_f(int argc, char **argv)
print_report("wrote", &t2, offset, count, total, cnt, Cflag);
out:
- qemu_io_free(buf);
+ if (!zflag) {
+ qemu_io_free(buf);
+ }
return 0;
}
--
1.7.8.3
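With the new option a zero write can be exercised end to end from the shell, for example along these lines (the image name is arbitrary; the read -P step uses qemu-io's existing pattern-verification option):

    $ qemu-img create -f qed test.qed 1G
    $ qemu-io -c 'write -z 0 1M' -c 'read -P 0 0 1M' test.qed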