[Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c
@ 2014-12-30  9:20 Denis V. Lunev
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes Denis V. Lunev
                   ` (8 more replies)
  0 siblings, 9 replies; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30  9:20 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
	Stefan Hajnoczi

These patches eliminate data writes completely on Linux if fallocate
FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are  supported on
underlying filesystem.

I have performed several tests with non-aligned fallocate calls and
in all cases (with non-aligned fallocates) Linux performs fine, i.e.
areas are zeroed correctly. Checks were made on
   Linux 3.16.0-28-generic #38-Ubuntu SMP

This should seriously increase performance in some special cases.

Changes from v2:
- added Peter Lieven to CC
- added CONFIG_FALLOCATE check to call do_fallocate in patch 7
- dropped patch 1 as NACK-ed
- added processing of very large data areas in bdrv_co_write_zeroes (new
  patch 1)
- set bl.max_write_zeroes to INT_MAX in raw-posix.c for regular files
  (new patch 8)

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
  2014-12-30  9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
@ 2014-12-30  9:20 ` Denis V. Lunev
  2015-01-05  7:34   ` Peter Lieven
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 2/8] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30  9:20 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
	Stefan Hajnoczi

bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
16 MiB as a chunk size. This is implemented in this way to tolerate
buggy block backends which do not accept too big requests.

Though if the bdrv_co_write_zeroes callback is not good enough, we
fallback to write data explicitely using bdrv_co_writev and we
create buffer to accomodate zeroes inside. The size of this buffer
is the size of the chunk. Thus if the underlying layer will have
bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.

Actually, there is no need to allocate such a big amount of memory.
We could simply allocate 1 MiB buffer and create iovec, which will
point to the same memory.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
 block.c | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/block.c b/block.c
index 4165d42..d69c121 100644
--- a/block.c
+++ b/block.c
@@ -3173,14 +3173,18 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
  * of 32768 512-byte sectors (16 MiB) per request.
  */
 #define MAX_WRITE_ZEROES_DEFAULT 32768
+/* allocate iovec with zeroes using 1 MiB chunks to avoid to big allocations */
+#define MAX_ZEROES_CHUNK (1024 * 1024)
 
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
 {
     BlockDriver *drv = bs->drv;
     QEMUIOVector qiov;
-    struct iovec iov = {0};
     int ret = 0;
+    void *chunk = NULL;
+
+    qemu_iovec_init(&qiov, 0);
 
     int max_write_zeroes = bs->bl.max_write_zeroes ?
                            bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
@@ -3217,27 +3221,35 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
         }
 
         if (ret == -ENOTSUP) {
+            int64_t num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
+            int chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
+
             /* Fall back to bounce buffer if write zeroes is unsupported */
-            iov.iov_len = num * BDRV_SECTOR_SIZE;
-            if (iov.iov_base == NULL) {
-                iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
-                if (iov.iov_base == NULL) {
+            if (chunk == NULL) {
+                chunk = qemu_try_blockalign(bs, chunk_size);
+                if (chunk == NULL) {
                     ret = -ENOMEM;
                     goto fail;
                 }
-                memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
+                memset(chunk, 0, chunk_size);
+            }
+
+            while (num_bytes > 0) {
+                int to_add = MIN(chunk_size, num_bytes);
+                qemu_iovec_add(&qiov, chunk, to_add);
+                num_bytes -= to_add;
             }
-            qemu_iovec_init_external(&qiov, &iov, 1);
 
             ret = drv->bdrv_co_writev(bs, sector_num, num, &qiov);
 
             /* Keep bounce buffer around if it is big enough for all
              * all future requests.
              */
-            if (num < max_write_zeroes) {
-                qemu_vfree(iov.iov_base);
-                iov.iov_base = NULL;
+            if (chunk_size != MAX_ZEROES_CHUNK) {
+                qemu_vfree(chunk);
+                chunk = NULL;
             }
+            qemu_iovec_reset(&qiov);
         }
 
         sector_num += num;
@@ -3245,7 +3257,8 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     }
 
 fail:
-    qemu_vfree(iov.iov_base);
+    qemu_iovec_destroy(&qiov);
+    qemu_vfree(chunk);
     return ret;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [Qemu-devel] [PATCH 2/8] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes
  2014-12-30  9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes Denis V. Lunev
@ 2014-12-30  9:20 ` Denis V. Lunev
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 3/8] block/raw-posix: create do_fallocate helper Denis V. Lunev
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30  9:20 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
	Stefan Hajnoczi

This efficiently writes zeroes on Linux if the kernel is capable enough.
FALLOC_FL_ZERO_RANGE correctly handles all cases, including and not
including file expansion.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
 block/raw-posix.c | 13 ++++++++++++-
 configure         | 19 +++++++++++++++++++
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index e51293a..66ebaab 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -60,7 +60,7 @@
 #define FS_NOCOW_FL                     0x00800000 /* Do not cow file */
 #endif
 #endif
-#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
 #include <linux/falloc.h>
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -919,6 +919,17 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
             return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
         }
 #endif
+
+#ifdef CONFIG_FALLOCATE_ZERO_RANGE
+        do {
+            if (fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+                          aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
+                return 0;
+            }
+        } while (errno == EINTR);
+
+        ret = -errno;
+#endif
     }
 
     if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
diff --git a/configure b/configure
index cae588c..dfcf7b3 100755
--- a/configure
+++ b/configure
@@ -3309,6 +3309,22 @@ if compile_prog "" "" ; then
   fallocate_punch_hole=yes
 fi
 
+# check that fallocate supports range zeroing inside the file
+fallocate_zero_range=no
+cat > $TMPC << EOF
+#include <fcntl.h>
+#include <linux/falloc.h>
+
+int main(void)
+{
+    fallocate(0, FALLOC_FL_ZERO_RANGE, 0, 0);
+    return 0;
+}
+EOF
+if compile_prog "" "" ; then
+  fallocate_zero_range=yes
+fi
+
 # check for posix_fallocate
 posix_fallocate=no
 cat > $TMPC << EOF
@@ -4538,6 +4554,9 @@ fi
 if test "$fallocate_punch_hole" = "yes" ; then
   echo "CONFIG_FALLOCATE_PUNCH_HOLE=y" >> $config_host_mak
 fi
+if test "$fallocate_zero_range" = "yes" ; then
+  echo "CONFIG_FALLOCATE_ZERO_RANGE=y" >> $config_host_mak
+fi
 if test "$posix_fallocate" = "yes" ; then
   echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak
 fi
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [Qemu-devel] [PATCH 3/8] block/raw-posix: create do_fallocate helper
  2014-12-30  9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes Denis V. Lunev
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 2/8] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
@ 2014-12-30  9:20 ` Denis V. Lunev
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30  9:20 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
	Stefan Hajnoczi

The pattern
    do {
        if (fallocate(s->fd, mode, offset, len) == 0) {
            return 0;
        }
    } while (errno == EINTR);
is used twice at the moment and I am going to add more usages. Move it
to the helper function.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
 block/raw-posix.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 66ebaab..a7c8816 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -893,6 +893,19 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
 }
 #endif
 
+
+#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
+static int do_fallocate(int fd, int mode, off_t offset, off_t len)
+{
+    do {
+        if (fallocate(fd, mode, offset, len) == 0) {
+            return 0;
+        }
+    } while (errno == EINTR);
+    return -errno;
+}
+#endif
+
 static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 {
     int ret = -EOPNOTSUPP;
@@ -921,14 +934,8 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 #endif
 
 #ifdef CONFIG_FALLOCATE_ZERO_RANGE
-        do {
-            if (fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
-                          aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
-                return 0;
-            }
-        } while (errno == EINTR);
-
-        ret = -errno;
+        ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+                           aiocb->aio_offset, aiocb->aio_nbytes);
 #endif
     }
 
@@ -968,14 +975,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 #endif
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
-        do {
-            if (fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-                          aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
-                return 0;
-            }
-        } while (errno == EINTR);
-
-        ret = -errno;
+        ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+                           aiocb->aio_offset, aiocb->aio_nbytes);
 #endif
     }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values
  2014-12-30  9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
                   ` (2 preceding siblings ...)
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 3/8] block/raw-posix: create do_fallocate helper Denis V. Lunev
@ 2014-12-30  9:20 ` Denis V. Lunev
  2015-01-05  6:46   ` Fam Zheng
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30  9:20 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
	Stefan Hajnoczi

actually the code
    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
        ret == -ENOTTY) {
        ret = -ENOTSUP;
    }
is present twice and will be added a couple more times. Create helper
for this. Place it into do_fallocate() for further convinience.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
 block/raw-posix.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index a7c8816..25a6947 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -894,6 +894,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
 #endif
 
 
+static int translate_err(int err)
+{
+    if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP ||
+        err == -ENOTTY) {
+        err = -ENOTSUP;
+    }
+    return err;
+}
+
 #if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 {
@@ -902,7 +911,7 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
             return 0;
         }
     } while (errno == EINTR);
-    return -errno;
+    return translate_err(-errno);
 }
 #endif
 
@@ -939,10 +948,9 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 #endif
     }
 
-    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
-        ret == -ENOTTY) {
+    ret = translate_err(ret);
+    if (ret == -ENOTSUP) {
         s->has_write_zeroes = false;
-        ret = -ENOTSUP;
     }
     return ret;
 }
@@ -980,10 +988,9 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 #endif
     }
 
-    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
-        ret == -ENOTTY) {
+    ret = translate_err(ret);
+    if (ret == -ENOTSUP) {
         s->has_discard = false;
-        ret = -ENOTSUP;
     }
     return ret;
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit
  2014-12-30  9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
                   ` (3 preceding siblings ...)
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
@ 2014-12-30  9:20 ` Denis V. Lunev
  2015-01-05  6:57   ` Fam Zheng
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30  9:20 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
	Stefan Hajnoczi

move code dealing with a block device to a separate function. This will
allow to implement additional processing for an ordinary files.

Pls note, that xfs_code has been moved before checking for
s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside.
This makes code a bit more consistent.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
 block/raw-posix.c | 60 +++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 38 insertions(+), 22 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 25a6947..7866d31 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -915,46 +915,62 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 }
 #endif
 
-static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
+static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
 {
     int ret = -EOPNOTSUPP;
     BDRVRawState *s = aiocb->bs->opaque;
 
-    if (s->has_write_zeroes == 0) {
+    if (!s->has_write_zeroes) {
         return -ENOTSUP;
     }
 
-    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
 #ifdef BLKZEROOUT
-        do {
-            uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
-            if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
-                return 0;
-            }
-        } while (errno == EINTR);
-
-        ret = -errno;
-#endif
-    } else {
-#ifdef CONFIG_XFS
-        if (s->is_xfs) {
-            return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+    do {
+        uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
+        if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
+            return 0;
         }
-#endif
+    } while (errno == EINTR);
 
-#ifdef CONFIG_FALLOCATE_ZERO_RANGE
-        ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
-                           aiocb->aio_offset, aiocb->aio_nbytes);
+    ret = translate_err(-errno);
 #endif
-    }
 
-    ret = translate_err(ret);
     if (ret == -ENOTSUP) {
         s->has_write_zeroes = false;
     }
     return ret;
 }
 
+static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
+{
+    BDRVRawState *s;
+
+    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
+        return handle_aiocb_write_zeroes_block(aiocb);
+    }
+
+#ifdef CONFIG_XFS
+    if (s->is_xfs) {
+        return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+    }
+#endif
+
+    s = aiocb->bs->opaque;
+
+#ifdef CONFIG_FALLOCATE_ZERO_RANGE
+    if (s->has_write_zeroes) {
+        int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+                               aiocb->aio_offset, aiocb->aio_nbytes);
+        if (ret == 0 && ret != -ENOTSUP) {
+            return ret;
+        }
+    }
+#endif
+
+    s->has_write_zeroes = false;
+    return -ENOTSUP;
+}
+
 static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 {
     int ret = -EOPNOTSUPP;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
  2014-12-30  9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
                   ` (4 preceding siblings ...)
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
@ 2014-12-30  9:20 ` Denis V. Lunev
  2015-01-05  7:02   ` Fam Zheng
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 7/8] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30  9:20 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
	Stefan Hajnoczi

This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.

Simple fallocate(0) will extend file with zeroes when appropriate in the
middle of the file if there is a hole there and at the end of the file.
Unfortunately fallocate(0) does not drop the content of the file if
there is a data on this offset. Therefore to make the situation consistent
we should drop the data beforehand. This is done using FALLOC_FL_PUNCH_HOLE

This should increase the performance a bit for not-so-modern kernels or for
filesystems which do not support FALLOC_FL_ZERO_RANGE.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
 block/raw-posix.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 7866d31..96a8678 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -968,6 +968,23 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 #endif
 
     s->has_write_zeroes = false;
+
+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+    if (s->has_discard) {
+        int ret;
+        ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+                           aiocb->aio_offset, aiocb->aio_nbytes);
+        if (ret < 0) {
+            if (ret == -ENOTSUP) {
+                s->has_discard = false;
+            }
+            return ret;
+        }
+        return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
+    }
+#endif
+
+    s->has_discard = false;
     return -ENOTSUP;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [Qemu-devel] [PATCH 7/8] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
  2014-12-30  9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
                   ` (5 preceding siblings ...)
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
@ 2014-12-30  9:20 ` Denis V. Lunev
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 8/8] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev
  2014-12-30 10:55 ` [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Peter Lieven
  8 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30  9:20 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
	Stefan Hajnoczi

There is a possibility that we are extending our image and thus writing
zeroes beyond end of the file. In this case we do not need to care
about the hole to make sure that there is no data in the file under
this offset (pre-condition to fallocate(0) to work). We could simply call
fallocate(0).

This improves the performance of writing zeroes even on really old
platforms which do not have even FALLOC_FL_PUNCH_HOLE.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
 block/raw-posix.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 96a8678..57b94ad 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -60,7 +60,7 @@
 #define FS_NOCOW_FL                     0x00800000 /* Do not cow file */
 #endif
 #endif
-#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
+#ifdef CONFIG_FALLOCATE
 #include <linux/falloc.h>
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -903,7 +903,7 @@ static int translate_err(int err)
     return err;
 }
 
-#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
+#ifdef CONFIG_FALLOCATE
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 {
     do {
@@ -957,6 +957,12 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 
     s = aiocb->bs->opaque;
 
+#ifdef CONFIG_FALLOCATE
+    if (aiocb->aio_offset >= aiocb->bs->total_sectors << BDRV_SECTOR_BITS) {
+        return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
+    }
+#endif
+
 #ifdef CONFIG_FALLOCATE_ZERO_RANGE
     if (s->has_write_zeroes) {
         int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [Qemu-devel] [PATCH 8/8] block/raw-posix: set max_write_zeroes to INT_MAX for regular files
  2014-12-30  9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
                   ` (6 preceding siblings ...)
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 7/8] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
@ 2014-12-30  9:20 ` Denis V. Lunev
  2014-12-30 10:55 ` [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Peter Lieven
  8 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30  9:20 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
	Stefan Hajnoczi

fallocate() works fine and could handle properly with arbitrary size
requests. There is no sense to reduce the amount of space to fallocate.
The bigger the size, the better is performance as the amount of journal
updates is reduced.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
 block/raw-posix.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 57b94ad..3e156c4 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -292,6 +292,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
     }
 }
 
+static void raw_probe_max_write_zeroes(BlockDriverState *bs)
+{
+    BDRVRawState *s = bs->opaque;
+    struct stat st;
+
+    if (fstat(s->fd, &st) < 0) {
+        return; /* no problem, keep default value */
+    }
+    if (!S_ISREG(st.st_mode) || !s->discard_zeroes) {
+        return;
+    }
+    bs->bl.max_write_zeroes = INT_MAX;
+}
+
 static void raw_parse_flags(int bdrv_flags, int *open_flags)
 {
     assert(open_flags != NULL);
@@ -598,6 +612,7 @@ static int raw_reopen_prepare(BDRVReopenState *state,
     /* Fail already reopen_prepare() if we can't get a working O_DIRECT
      * alignment with the new fd. */
     if (raw_s->fd != -1) {
+        raw_probe_max_write_zeroes(state->bs);
         raw_probe_alignment(state->bs, raw_s->fd, &local_err);
         if (local_err) {
             qemu_close(raw_s->fd);
@@ -651,6 +666,8 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 
     raw_probe_alignment(bs, s->fd, errp);
     bs->bl.opt_mem_alignment = s->buf_align;
+
+    raw_probe_max_write_zeroes(bs);
 }
 
 static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c
  2014-12-30  9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
                   ` (7 preceding siblings ...)
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 8/8] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev
@ 2014-12-30 10:55 ` Peter Lieven
  2014-12-30 11:07   ` Denis V. Lunev
  8 siblings, 1 reply; 24+ messages in thread
From: Peter Lieven @ 2014-12-30 10:55 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

Am 30.12.2014 um 10:20 schrieb Denis V. Lunev:
> These patches eliminate data writes completely on Linux if fallocate
> FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are  supported on
> underlying filesystem.
>
> I have performed several tests with non-aligned fallocate calls and
> in all cases (with non-aligned fallocates) Linux performs fine, i.e.
> areas are zeroed correctly. Checks were made on
>    Linux 3.16.0-28-generic #38-Ubuntu SMP
>
> This should seriously increase performance in some special cases.

Could you give a hint what that special cases are? It would help
to evaluate and test the performance difference.

Thanks,
Peter

>
> Changes from v2:
> - added Peter Lieven to CC
> - added CONFIG_FALLOCATE check to call do_fallocate in patch 7
> - dropped patch 1 as NACK-ed
> - added processing of very large data areas in bdrv_co_write_zeroes (new
>   patch 1)
> - set bl.max_write_zeroes to INT_MAX in raw-posix.c for regular files
>   (new patch 8)
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c
  2014-12-30 10:55 ` [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Peter Lieven
@ 2014-12-30 11:07   ` Denis V. Lunev
  2015-01-05  6:55     ` Peter Lieven
  0 siblings, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30 11:07 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On 30/12/14 13:55, Peter Lieven wrote:
> Am 30.12.2014 um 10:20 schrieb Denis V. Lunev:
>> These patches eliminate data writes completely on Linux if fallocate
>> FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are  supported on
>> underlying filesystem.
>>
>> I have performed several tests with non-aligned fallocate calls and
>> in all cases (with non-aligned fallocates) Linux performs fine, i.e.
>> areas are zeroed correctly. Checks were made on
>>     Linux 3.16.0-28-generic #38-Ubuntu SMP
>>
>> This should seriously increase performance in some special cases.
> Could you give a hint what that special cases are? It would help
> to evaluate and test the performance difference.
>
> Thanks,
> Peter

- 15% in Parallels Image expansion, see my side patchset
- writing zeroes to raw image with BLOCKDEV_DETECT_ZEROES_OPTIONS_ON set
    (actually I have kludged raw-posix.c to have this flag always set
    to perform independent testing)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
@ 2015-01-05  6:46   ` Fam Zheng
  2015-01-05 11:17     ` Denis V. Lunev
  0 siblings, 1 reply; 24+ messages in thread
From: Fam Zheng @ 2015-01-05  6:46 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi

On Tue, 12/30 12:20, Denis V. Lunev wrote:
> actually the code
>     if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
>         ret == -ENOTTY) {
>         ret = -ENOTSUP;
>     }
> is present twice and will be added a couple more times. Create helper
> for this. Place it into do_fallocate() for further convinience.

s/convinience/convenience/

> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> ---
>  block/raw-posix.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index a7c8816..25a6947 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -894,6 +894,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
>  #endif
>  
>  
> +static int translate_err(int err)
> +{
> +    if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP ||
> +        err == -ENOTTY) {
> +        err = -ENOTSUP;
> +    }
> +    return err;
> +}
> +
>  #if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
>  static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>  {
> @@ -902,7 +911,7 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>              return 0;
>          }
>      } while (errno == EINTR);
> -    return -errno;
> +    return translate_err(-errno);

This could be a separate patch. Maybe reorder the previous patches as:

1) Introduce translate_err.
2) Refactor out do_fallocate.
3) Introduce FALLOC_FL_ZERO_RANGE calling code.

Fam

>  }
>  #endif
>  
> @@ -939,10 +948,9 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>  #endif
>      }
>  
> -    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
> -        ret == -ENOTTY) {
> +    ret = translate_err(ret);
> +    if (ret == -ENOTSUP) {
>          s->has_write_zeroes = false;
> -        ret = -ENOTSUP;
>      }
>      return ret;
>  }
> @@ -980,10 +988,9 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
>  #endif
>      }
>  
> -    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
> -        ret == -ENOTTY) {
> +    ret = translate_err(ret);
> +    if (ret == -ENOTSUP) {
>          s->has_discard = false;
> -        ret = -ENOTSUP;
>      }
>      return ret;
>  }
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c
  2014-12-30 11:07   ` Denis V. Lunev
@ 2015-01-05  6:55     ` Peter Lieven
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Lieven @ 2015-01-05  6:55 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On 30.12.2014 12:07, Denis V. Lunev wrote:
> On 30/12/14 13:55, Peter Lieven wrote:
>> Am 30.12.2014 um 10:20 schrieb Denis V. Lunev:
>>> These patches eliminate data writes completely on Linux if fallocate
>>> FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are  supported on
>>> underlying filesystem.
>>>
>>> I have performed several tests with non-aligned fallocate calls and
>>> in all cases (with non-aligned fallocates) Linux performs fine, i.e.
>>> areas are zeroed correctly. Checks were made on
>>>     Linux 3.16.0-28-generic #38-Ubuntu SMP
>>>
>>> This should seriously increase performance in some special cases.
>> Could you give a hint what that special cases are? It would help
>> to evaluate and test the performance difference.
>>
>> Thanks,
>> Peter
>
> - 15% in Parallels Image expansion, see my side patchset

I will have a look.

> - writing zeroes to raw image with BLOCKDEV_DETECT_ZEROES_OPTIONS_ON set
>    (actually I have kludged raw-posix.c to have this flag always set
>    to perform independent testing)

Is there a valid reason to write chunks of >16MB blocksize from a guest? Or is this
just users performing pseudo benchmarks with dd?

Peter

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
@ 2015-01-05  6:57   ` Fam Zheng
  2015-01-05 11:07     ` Denis V. Lunev
  0 siblings, 1 reply; 24+ messages in thread
From: Fam Zheng @ 2015-01-05  6:57 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi

On Tue, 12/30 12:20, Denis V. Lunev wrote:
> move code dealing with a block device to a separate function. This will
> allow to implement additional processing for an ordinary files.

s/an//

> 
> Pls note, that xfs_code has been moved before checking for
> s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside.
> This makes code a bit more consistent.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> ---
>  block/raw-posix.c | 60 +++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 38 insertions(+), 22 deletions(-)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 25a6947..7866d31 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -915,46 +915,62 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>  }
>  #endif
>  
> -static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
> +static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
>  {
>      int ret = -EOPNOTSUPP;
>      BDRVRawState *s = aiocb->bs->opaque;
>  
> -    if (s->has_write_zeroes == 0) {
> +    if (!s->has_write_zeroes) {
>          return -ENOTSUP;
>      }
>  
> -    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
>  #ifdef BLKZEROOUT
> -        do {
> -            uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
> -            if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
> -                return 0;
> -            }
> -        } while (errno == EINTR);
> -
> -        ret = -errno;
> -#endif
> -    } else {
> -#ifdef CONFIG_XFS
> -        if (s->is_xfs) {
> -            return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
> +    do {
> +        uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
> +        if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
> +            return 0;
>          }
> -#endif
> +    } while (errno == EINTR);
>  
> -#ifdef CONFIG_FALLOCATE_ZERO_RANGE
> -        ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
> -                           aiocb->aio_offset, aiocb->aio_nbytes);
> +    ret = translate_err(-errno);
>  #endif
> -    }
>  
> -    ret = translate_err(ret);
>      if (ret == -ENOTSUP) {
>          s->has_write_zeroes = false;
>      }
>      return ret;
>  }
>  
> +static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
> +{
> +    BDRVRawState *s;
> +
> +    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
> +        return handle_aiocb_write_zeroes_block(aiocb);
> +    }
> +
> +#ifdef CONFIG_XFS
> +    if (s->is_xfs) {

s is not initialized here.

Also, could you please do these refactoring before adding new code in the
series? That way the new code needn't to be moved, which makes the patches
easier to review.

Fam

> +        return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
> +    }
> +#endif
> +
> +    s = aiocb->bs->opaque;
> +
> +#ifdef CONFIG_FALLOCATE_ZERO_RANGE
> +    if (s->has_write_zeroes) {
> +        int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
> +                               aiocb->aio_offset, aiocb->aio_nbytes);
> +        if (ret == 0 && ret != -ENOTSUP) {
> +            return ret;
> +        }
> +    }
> +#endif
> +
> +    s->has_write_zeroes = false;
> +    return -ENOTSUP;
> +}
> +
>  static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
>  {
>      int ret = -EOPNOTSUPP;
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
@ 2015-01-05  7:02   ` Fam Zheng
  2015-01-05 11:14     ` Denis V. Lunev
  0 siblings, 1 reply; 24+ messages in thread
From: Fam Zheng @ 2015-01-05  7:02 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi

On Tue, 12/30 12:20, Denis V. Lunev wrote:
> This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.
> 
> Simple fallocate(0) will extend file with zeroes when appropriate in the
> middle of the file if there is a hole there and at the end of the file.
> Unfortunately fallocate(0) does not drop the content of the file if
> there is a data on this offset. Therefore to make the situation consistent
> we should drop the data beforehand. This is done using FALLOC_FL_PUNCH_HOLE
> 
> This should increase the performance a bit for not-so-modern kernels or for
> filesystems which do not support FALLOC_FL_ZERO_RANGE.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> ---
>  block/raw-posix.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 7866d31..96a8678 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -968,6 +968,23 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>  #endif
>  
>      s->has_write_zeroes = false;
> +
> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +    if (s->has_discard) {
> +        int ret;
> +        ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +                           aiocb->aio_offset, aiocb->aio_nbytes);
> +        if (ret < 0) {
> +            if (ret == -ENOTSUP) {
> +                s->has_discard = false;
> +            }
> +            return ret;
> +        }
> +        return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);

Why is fallocate(0) necessary here? The manpage says:

Deallocating file space
	Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux 2.6.38)
	in mode deallocates space (i.e., creates a hole) in the byte range
	starting at offset and continuing for len bytes.  Within the specified
	range, partial file system blocks are zeroed, and whole file system
	blocks are removed from the file.  After a successful call, subsequent
	reads from this range will return zeroes.

So the data are already zeroes after FALLOC_FL_PUNCH_HOLE.

Fam

> +    }
> +#endif
> +
> +    s->has_discard = false;
>      return -ENOTSUP;
>  }
>  
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
  2014-12-30  9:20 ` [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes Denis V. Lunev
@ 2015-01-05  7:34   ` Peter Lieven
  2015-01-05 11:06     ` Denis V. Lunev
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Lieven @ 2015-01-05  7:34 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On 30.12.2014 10:20, Denis V. Lunev wrote:
> bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
> 16 MiB as a chunk size. This is implemented in this way to tolerate
> buggy block backends which do not accept too big requests.
>
> Though if the bdrv_co_write_zeroes callback is not good enough, we
> fallback to write data explicitely using bdrv_co_writev and we
> create buffer to accomodate zeroes inside. The size of this buffer
> is the size of the chunk. Thus if the underlying layer will have
> bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.
>
> Actually, there is no need to allocate such a big amount of memory.
> We could simply allocate 1 MiB buffer and create iovec, which will
> point to the same memory.
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> ---
>   block.c | 35 ++++++++++++++++++++++++-----------
>   1 file changed, 24 insertions(+), 11 deletions(-)
>
> diff --git a/block.c b/block.c
> index 4165d42..d69c121 100644
> --- a/block.c
> +++ b/block.c
> @@ -3173,14 +3173,18 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
>    * of 32768 512-byte sectors (16 MiB) per request.
>    */
>   #define MAX_WRITE_ZEROES_DEFAULT 32768
> +/* allocate iovec with zeroes using 1 MiB chunks to avoid to big allocations */
> +#define MAX_ZEROES_CHUNK (1024 * 1024)
>   
>   static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>       int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
>   {
>       BlockDriver *drv = bs->drv;
>       QEMUIOVector qiov;
> -    struct iovec iov = {0};
>       int ret = 0;
> +    void *chunk = NULL;
> +
> +    qemu_iovec_init(&qiov, 0);
>   
>       int max_write_zeroes = bs->bl.max_write_zeroes ?
>                              bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
> @@ -3217,27 +3221,35 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>           }
>   
>           if (ret == -ENOTSUP) {
> +            int64_t num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
> +            int chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
> +
>               /* Fall back to bounce buffer if write zeroes is unsupported */
> -            iov.iov_len = num * BDRV_SECTOR_SIZE;
> -            if (iov.iov_base == NULL) {
> -                iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
> -                if (iov.iov_base == NULL) {
> +            if (chunk == NULL) {
> +                chunk = qemu_try_blockalign(bs, chunk_size);
> +                if (chunk == NULL) {
>                       ret = -ENOMEM;
>                       goto fail;
>                   }
> -                memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
> +                memset(chunk, 0, chunk_size);
> +            }
> +
> +            while (num_bytes > 0) {
> +                int to_add = MIN(chunk_size, num_bytes);
> +                qemu_iovec_add(&qiov, chunk, to_add);

This can and likely will fail for big num_bytes if you exceed IOV_MAX vectors.

I would stick to the old method and limit the num to a reasonable value e.g. MAX_WRITE_ZEROES_DEFAULT.
This becomes necessary as you set INT_MAX for max_write_zeroes. That hasn't been considered before in
the original patch.

Peter

> +                num_bytes -= to_add;
>               }
> -            qemu_iovec_init_external(&qiov, &iov, 1);
>   
>               ret = drv->bdrv_co_writev(bs, sector_num, num, &qiov);
>   
>               /* Keep bounce buffer around if it is big enough for all
>                * all future requests.
>                */
> -            if (num < max_write_zeroes) {
> -                qemu_vfree(iov.iov_base);
> -                iov.iov_base = NULL;
> +            if (chunk_size != MAX_ZEROES_CHUNK) {
> +                qemu_vfree(chunk);
> +                chunk = NULL;
>               }
> +            qemu_iovec_reset(&qiov);
>           }
>   
>           sector_num += num;
> @@ -3245,7 +3257,8 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>       }
>   
>   fail:
> -    qemu_vfree(iov.iov_base);
> +    qemu_iovec_destroy(&qiov);
> +    qemu_vfree(chunk);
>       return ret;
>   }
>   


-- 

Mit freundlichen Grüßen

Peter Lieven

...........................................................

   KAMP Netzwerkdienste GmbH
   Vestische Str. 89-91 | 46117 Oberhausen
   Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
   pl@kamp.de | http://www.kamp.de

   Geschäftsführer: Heiner Lante | Michael Lante
   Amtsgericht Duisburg | HRB Nr. 12154
   USt-Id-Nr.: DE 120607556

...........................................................

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
  2015-01-05  7:34   ` Peter Lieven
@ 2015-01-05 11:06     ` Denis V. Lunev
  2015-01-05 11:23       ` Peter Lieven
  0 siblings, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 11:06 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On 05/01/15 10:34, Peter Lieven wrote:
> On 30.12.2014 10:20, Denis V. Lunev wrote:
>> bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
>> 16 MiB as a chunk size. This is implemented in this way to tolerate
>> buggy block backends which do not accept too big requests.
>>
>> Though if the bdrv_co_write_zeroes callback is not good enough, we
>> fallback to write data explicitely using bdrv_co_writev and we
>> create buffer to accomodate zeroes inside. The size of this buffer
>> is the size of the chunk. Thus if the underlying layer will have
>> bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.
>>
>> Actually, there is no need to allocate such a big amount of memory.
>> We could simply allocate 1 MiB buffer and create iovec, which will
>> point to the same memory.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Peter Lieven <pl@kamp.de>
>> ---
>>   block.c | 35 ++++++++++++++++++++++++-----------
>>   1 file changed, 24 insertions(+), 11 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index 4165d42..d69c121 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3173,14 +3173,18 @@ int coroutine_fn 
>> bdrv_co_copy_on_readv(BlockDriverState *bs,
>>    * of 32768 512-byte sectors (16 MiB) per request.
>>    */
>>   #define MAX_WRITE_ZEROES_DEFAULT 32768
>> +/* allocate iovec with zeroes using 1 MiB chunks to avoid to big 
>> allocations */
>> +#define MAX_ZEROES_CHUNK (1024 * 1024)
>>     static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState 
>> *bs,
>>       int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
>>   {
>>       BlockDriver *drv = bs->drv;
>>       QEMUIOVector qiov;
>> -    struct iovec iov = {0};
>>       int ret = 0;
>> +    void *chunk = NULL;
>> +
>> +    qemu_iovec_init(&qiov, 0);
>>         int max_write_zeroes = bs->bl.max_write_zeroes ?
>>                              bs->bl.max_write_zeroes : 
>> MAX_WRITE_ZEROES_DEFAULT;
>> @@ -3217,27 +3221,35 @@ static int coroutine_fn 
>> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>>           }
>>             if (ret == -ENOTSUP) {
>> +            int64_t num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
>> +            int chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
>> +
>>               /* Fall back to bounce buffer if write zeroes is 
>> unsupported */
>> -            iov.iov_len = num * BDRV_SECTOR_SIZE;
>> -            if (iov.iov_base == NULL) {
>> -                iov.iov_base = qemu_try_blockalign(bs, num * 
>> BDRV_SECTOR_SIZE);
>> -                if (iov.iov_base == NULL) {
>> +            if (chunk == NULL) {
>> +                chunk = qemu_try_blockalign(bs, chunk_size);
>> +                if (chunk == NULL) {
>>                       ret = -ENOMEM;
>>                       goto fail;
>>                   }
>> -                memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
>> +                memset(chunk, 0, chunk_size);
>> +            }
>> +
>> +            while (num_bytes > 0) {
>> +                int to_add = MIN(chunk_size, num_bytes);
>> +                qemu_iovec_add(&qiov, chunk, to_add);
>
> This can and likely will fail for big num_bytes if you exceed IOV_MAX 
> vectors.
>
> I would stick to the old method and limit the num to a reasonable 
> value e.g. MAX_WRITE_ZEROES_DEFAULT.
> This becomes necessary as you set INT_MAX for max_write_zeroes. That 
> hasn't been considered before in
> the original patch.
>
> Peter
>

hmm. You are right, but I think that it would be better to limit iovec size
to 32 and this will solve the problem. Allocation of 32 Mb could be a 
real problem
on loaded system could be a problem.

What do you think on this? May be we could consider 16 as a limit...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit
  2015-01-05  6:57   ` Fam Zheng
@ 2015-01-05 11:07     ` Denis V. Lunev
  0 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 11:07 UTC (permalink / raw)
  To: Fam Zheng; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi

On 05/01/15 09:57, Fam Zheng wrote:
> On Tue, 12/30 12:20, Denis V. Lunev wrote:
>> move code dealing with a block device to a separate function. This will
>> allow to implement additional processing for an ordinary files.
> s/an//
>
>> Pls note, that xfs_code has been moved before checking for
>> s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside.
>> This makes code a bit more consistent.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Peter Lieven <pl@kamp.de>
>> ---
>>   block/raw-posix.c | 60 +++++++++++++++++++++++++++++++++++--------------------
>>   1 file changed, 38 insertions(+), 22 deletions(-)
>>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index 25a6947..7866d31 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -915,46 +915,62 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>>   }
>>   #endif
>>   
>> -static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>> +static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
>>   {
>>       int ret = -EOPNOTSUPP;
>>       BDRVRawState *s = aiocb->bs->opaque;
>>   
>> -    if (s->has_write_zeroes == 0) {
>> +    if (!s->has_write_zeroes) {
>>           return -ENOTSUP;
>>       }
>>   
>> -    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
>>   #ifdef BLKZEROOUT
>> -        do {
>> -            uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
>> -            if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
>> -                return 0;
>> -            }
>> -        } while (errno == EINTR);
>> -
>> -        ret = -errno;
>> -#endif
>> -    } else {
>> -#ifdef CONFIG_XFS
>> -        if (s->is_xfs) {
>> -            return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
>> +    do {
>> +        uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
>> +        if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
>> +            return 0;
>>           }
>> -#endif
>> +    } while (errno == EINTR);
>>   
>> -#ifdef CONFIG_FALLOCATE_ZERO_RANGE
>> -        ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
>> -                           aiocb->aio_offset, aiocb->aio_nbytes);
>> +    ret = translate_err(-errno);
>>   #endif
>> -    }
>>   
>> -    ret = translate_err(ret);
>>       if (ret == -ENOTSUP) {
>>           s->has_write_zeroes = false;
>>       }
>>       return ret;
>>   }
>>   
>> +static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>> +{
>> +    BDRVRawState *s;
>> +
>> +    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
>> +        return handle_aiocb_write_zeroes_block(aiocb);
>> +    }
>> +
>> +#ifdef CONFIG_XFS
>> +    if (s->is_xfs) {
> s is not initialized here.
>
> Also, could you please do these refactoring before adding new code in the
> series? That way the new code needn't to be moved, which makes the patches
> easier to review.
>
> Fam
ok

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
  2015-01-05  7:02   ` Fam Zheng
@ 2015-01-05 11:14     ` Denis V. Lunev
  0 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 11:14 UTC (permalink / raw)
  To: Fam Zheng; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi

On 05/01/15 10:02, Fam Zheng wrote:
> On Tue, 12/30 12:20, Denis V. Lunev wrote:
>> This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.
>>
>> Simple fallocate(0) will extend file with zeroes when appropriate in the
>> middle of the file if there is a hole there and at the end of the file.
>> Unfortunately fallocate(0) does not drop the content of the file if
>> there is a data on this offset. Therefore to make the situation consistent
>> we should drop the data beforehand. This is done using FALLOC_FL_PUNCH_HOLE
>>
>> This should increase the performance a bit for not-so-modern kernels or for
>> filesystems which do not support FALLOC_FL_ZERO_RANGE.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Peter Lieven <pl@kamp.de>
>> ---
>>   block/raw-posix.c | 17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index 7866d31..96a8678 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -968,6 +968,23 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>>   #endif
>>   
>>       s->has_write_zeroes = false;
>> +
>> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>> +    if (s->has_discard) {
>> +        int ret;
>> +        ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>> +                           aiocb->aio_offset, aiocb->aio_nbytes);
>> +        if (ret < 0) {
>> +            if (ret == -ENOTSUP) {
>> +                s->has_discard = false;
>> +            }
>> +            return ret;
>> +        }
>> +        return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
> Why is fallocate(0) necessary here? The manpage says:
>
> Deallocating file space
> 	Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux 2.6.38)
> 	in mode deallocates space (i.e., creates a hole) in the byte range
> 	starting at offset and continuing for len bytes.  Within the specified
> 	range, partial file system blocks are zeroed, and whole file system
> 	blocks are removed from the file.  After a successful call, subsequent
> 	reads from this range will return zeroes.
>
> So the data are already zeroes after FALLOC_FL_PUNCH_HOLE.
>
> Fam
These zeroes will have different properties.  FALLOC_FL_PUNCH_HOLE
deallocates disk space on that range. Thus this call work work in a
different way in respect to the method of zero writing. This does not
look good for me.

The function should keep the file in the same state using all
possible internal implementations. If the caller wants to use 
FALLOC_FL_PUNCH_HOLE
alone, it should call handle_aiocb_discard method.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values
  2015-01-05  6:46   ` Fam Zheng
@ 2015-01-05 11:17     ` Denis V. Lunev
  0 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 11:17 UTC (permalink / raw)
  To: Fam Zheng; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi

On 05/01/15 09:46, Fam Zheng wrote:
> On Tue, 12/30 12:20, Denis V. Lunev wrote:
>> actually the code
>>      if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
>>          ret == -ENOTTY) {
>>          ret = -ENOTSUP;
>>      }
>> is present twice and will be added a couple more times. Create helper
>> for this. Place it into do_fallocate() for further convinience.
> s/convinience/convenience/
>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Peter Lieven <pl@kamp.de>
>> ---
>>   block/raw-posix.c | 21 ++++++++++++++-------
>>   1 file changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index a7c8816..25a6947 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -894,6 +894,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
>>   #endif
>>   
>>   
>> +static int translate_err(int err)
>> +{
>> +    if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP ||
>> +        err == -ENOTTY) {
>> +        err = -ENOTSUP;
>> +    }
>> +    return err;
>> +}
>> +
>>   #if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
>>   static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>>   {
>> @@ -902,7 +911,7 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>>               return 0;
>>           }
>>       } while (errno == EINTR);
>> -    return -errno;
>> +    return translate_err(-errno);
> This could be a separate patch. Maybe reorder the previous patches as:
>
> 1) Introduce translate_err.
> 2) Refactor out do_fallocate.
> 3) Introduce FALLOC_FL_ZERO_RANGE calling code.
>
> Fam
ok, will try

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
  2015-01-05 11:06     ` Denis V. Lunev
@ 2015-01-05 11:23       ` Peter Lieven
  2015-01-05 11:48         ` Denis V. Lunev
  2015-01-05 12:26         ` [Qemu-devel] [PATCH v2 1/1] " Denis V. Lunev
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Lieven @ 2015-01-05 11:23 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On 05.01.2015 12:06, Denis V. Lunev wrote:
> On 05/01/15 10:34, Peter Lieven wrote:
>> On 30.12.2014 10:20, Denis V. Lunev wrote:
>>> bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
>>> 16 MiB as a chunk size. This is implemented in this way to tolerate
>>> buggy block backends which do not accept too big requests.
>>>
>>> Though if the bdrv_co_write_zeroes callback is not good enough, we
>>> fallback to write data explicitely using bdrv_co_writev and we
>>> create buffer to accomodate zeroes inside. The size of this buffer
>>> is the size of the chunk. Thus if the underlying layer will have
>>> bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.
>>>
>>> Actually, there is no need to allocate such a big amount of memory.
>>> We could simply allocate 1 MiB buffer and create iovec, which will
>>> point to the same memory.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> CC: Kevin Wolf <kwolf@redhat.com>
>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>> CC: Peter Lieven <pl@kamp.de>
>>> ---
>>>   block.c | 35 ++++++++++++++++++++++++-----------
>>>   1 file changed, 24 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index 4165d42..d69c121 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -3173,14 +3173,18 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
>>>    * of 32768 512-byte sectors (16 MiB) per request.
>>>    */
>>>   #define MAX_WRITE_ZEROES_DEFAULT 32768
>>> +/* allocate iovec with zeroes using 1 MiB chunks to avoid to big allocations */
>>> +#define MAX_ZEROES_CHUNK (1024 * 1024)
>>>     static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>>>       int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
>>>   {
>>>       BlockDriver *drv = bs->drv;
>>>       QEMUIOVector qiov;
>>> -    struct iovec iov = {0};
>>>       int ret = 0;
>>> +    void *chunk = NULL;
>>> +
>>> +    qemu_iovec_init(&qiov, 0);
>>>         int max_write_zeroes = bs->bl.max_write_zeroes ?
>>>                              bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
>>> @@ -3217,27 +3221,35 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>>>           }
>>>             if (ret == -ENOTSUP) {
>>> +            int64_t num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
>>> +            int chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
>>> +
>>>               /* Fall back to bounce buffer if write zeroes is unsupported */
>>> -            iov.iov_len = num * BDRV_SECTOR_SIZE;
>>> -            if (iov.iov_base == NULL) {
>>> -                iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
>>> -                if (iov.iov_base == NULL) {
>>> +            if (chunk == NULL) {
>>> +                chunk = qemu_try_blockalign(bs, chunk_size);
>>> +                if (chunk == NULL) {
>>>                       ret = -ENOMEM;
>>>                       goto fail;
>>>                   }
>>> -                memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
>>> +                memset(chunk, 0, chunk_size);
>>> +            }
>>> +
>>> +            while (num_bytes > 0) {
>>> +                int to_add = MIN(chunk_size, num_bytes);
>>> +                qemu_iovec_add(&qiov, chunk, to_add);
>>
>> This can and likely will fail for big num_bytes if you exceed IOV_MAX vectors.
>>
>> I would stick to the old method and limit the num to a reasonable value e.g. MAX_WRITE_ZEROES_DEFAULT.
>> This becomes necessary as you set INT_MAX for max_write_zeroes. That hasn't been considered before in
>> the original patch.
>>
>> Peter
>>
>
> hmm. You are right, but I think that it would be better to limit iovec size
> to 32 and this will solve the problem. Allocation of 32 Mb could be a real problem
> on loaded system could be a problem.
>
> What do you think on this? May be we could consider 16 as a limit...

I would do the following:

---8<---

 From 8c2a08baddbcd9e89bbb11fa83a42350bd7cc095 Mon Sep 17 00:00:00 2001
From: Peter Lieven <pl@kamp.de>
Date: Mon, 5 Jan 2015 12:14:52 +0100
Subject: [PATCH] block: limited request size in write zeroes unsupported path

If bs->bl.max_write_zeroes is large and we end up in the unsupported
path we might allocate a lot of memory for the iovector and/or even
generate an oversized requests.

Fix this by limiting the request by the minimum of the reported
maximum transfer size or 16MB (32768 sectors).

Reported-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Peter Lieven <pl@kamp.de>
---
  block.c |    5 ++++-
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index a612594..8009478 100644
--- a/block.c
+++ b/block.c
@@ -3203,6 +3203,9 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,

          if (ret == -ENOTSUP) {
              /* Fall back to bounce buffer if write zeroes is unsupported */
+            int max_xfer_len = MIN_NON_ZERO(bs->bl.max_transfer_length,
+ MAX_WRITE_ZEROES_DEFAULT);
+            num = MIN(num, max_xfer_len);
              iov.iov_len = num * BDRV_SECTOR_SIZE;
              if (iov.iov_base == NULL) {
                  iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
@@ -3219,7 +3222,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
              /* Keep bounce buffer around if it is big enough for all
               * all future requests.
               */
-            if (num < max_write_zeroes) {
+            if (num < max_xfer_len) {
                  qemu_vfree(iov.iov_base);
                  iov.iov_base = NULL;
              }
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
  2015-01-05 11:23       ` Peter Lieven
@ 2015-01-05 11:48         ` Denis V. Lunev
  2015-01-05 12:26         ` [Qemu-devel] [PATCH v2 1/1] " Denis V. Lunev
  1 sibling, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 11:48 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On 05/01/15 14:23, Peter Lieven wrote:
> On 05.01.2015 12:06, Denis V. Lunev wrote:
>> On 05/01/15 10:34, Peter Lieven wrote:
>>> On 30.12.2014 10:20, Denis V. Lunev wrote:
>>>> bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
>>>> 16 MiB as a chunk size. This is implemented in this way to tolerate
>>>> buggy block backends which do not accept too big requests.
>>>>
>>>> Though if the bdrv_co_write_zeroes callback is not good enough, we
>>>> fallback to write data explicitely using bdrv_co_writev and we
>>>> create buffer to accomodate zeroes inside. The size of this buffer
>>>> is the size of the chunk. Thus if the underlying layer will have
>>>> bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.
>>>>
>>>> Actually, there is no need to allocate such a big amount of memory.
>>>> We could simply allocate 1 MiB buffer and create iovec, which will
>>>> point to the same memory.
>>>>
>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>> CC: Kevin Wolf <kwolf@redhat.com>
>>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>>> CC: Peter Lieven <pl@kamp.de>
>>>> ---
>>>>   block.c | 35 ++++++++++++++++++++++++-----------
>>>>   1 file changed, 24 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/block.c b/block.c
>>>> index 4165d42..d69c121 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -3173,14 +3173,18 @@ int coroutine_fn 
>>>> bdrv_co_copy_on_readv(BlockDriverState *bs,
>>>>    * of 32768 512-byte sectors (16 MiB) per request.
>>>>    */
>>>>   #define MAX_WRITE_ZEROES_DEFAULT 32768
>>>> +/* allocate iovec with zeroes using 1 MiB chunks to avoid to big 
>>>> allocations */
>>>> +#define MAX_ZEROES_CHUNK (1024 * 1024)
>>>>     static int coroutine_fn 
>>>> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>>>>       int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
>>>>   {
>>>>       BlockDriver *drv = bs->drv;
>>>>       QEMUIOVector qiov;
>>>> -    struct iovec iov = {0};
>>>>       int ret = 0;
>>>> +    void *chunk = NULL;
>>>> +
>>>> +    qemu_iovec_init(&qiov, 0);
>>>>         int max_write_zeroes = bs->bl.max_write_zeroes ?
>>>>                              bs->bl.max_write_zeroes : 
>>>> MAX_WRITE_ZEROES_DEFAULT;
>>>> @@ -3217,27 +3221,35 @@ static int coroutine_fn 
>>>> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>>>>           }
>>>>             if (ret == -ENOTSUP) {
>>>> +            int64_t num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
>>>> +            int chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
>>>> +
>>>>               /* Fall back to bounce buffer if write zeroes is 
>>>> unsupported */
>>>> -            iov.iov_len = num * BDRV_SECTOR_SIZE;
>>>> -            if (iov.iov_base == NULL) {
>>>> -                iov.iov_base = qemu_try_blockalign(bs, num * 
>>>> BDRV_SECTOR_SIZE);
>>>> -                if (iov.iov_base == NULL) {
>>>> +            if (chunk == NULL) {
>>>> +                chunk = qemu_try_blockalign(bs, chunk_size);
>>>> +                if (chunk == NULL) {
>>>>                       ret = -ENOMEM;
>>>>                       goto fail;
>>>>                   }
>>>> -                memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
>>>> +                memset(chunk, 0, chunk_size);
>>>> +            }
>>>> +
>>>> +            while (num_bytes > 0) {
>>>> +                int to_add = MIN(chunk_size, num_bytes);
>>>> +                qemu_iovec_add(&qiov, chunk, to_add);
>>>
>>> This can and likely will fail for big num_bytes if you exceed 
>>> IOV_MAX vectors.
>>>
>>> I would stick to the old method and limit the num to a reasonable 
>>> value e.g. MAX_WRITE_ZEROES_DEFAULT.
>>> This becomes necessary as you set INT_MAX for max_write_zeroes. That 
>>> hasn't been considered before in
>>> the original patch.
>>>
>>> Peter
>>>
>>
>> hmm. You are right, but I think that it would be better to limit 
>> iovec size
>> to 32 and this will solve the problem. Allocation of 32 Mb could be a 
>> real problem
>> on loaded system could be a problem.
>>
>> What do you think on this? May be we could consider 16 as a limit...
>
> I would do the following:
>
> ---8<---
>
> From 8c2a08baddbcd9e89bbb11fa83a42350bd7cc095 Mon Sep 17 00:00:00 2001
> From: Peter Lieven <pl@kamp.de>
> Date: Mon, 5 Jan 2015 12:14:52 +0100
> Subject: [PATCH] block: limited request size in write zeroes 
> unsupported path
>
> If bs->bl.max_write_zeroes is large and we end up in the unsupported
> path we might allocate a lot of memory for the iovector and/or even
> generate an oversized requests.
>
> Fix this by limiting the request by the minimum of the reported
> maximum transfer size or 16MB (32768 sectors).
>
> Reported-by: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  block.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/block.c b/block.c
> index a612594..8009478 100644
> --- a/block.c
> +++ b/block.c
> @@ -3203,6 +3203,9 @@ static int coroutine_fn 
> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>
>          if (ret == -ENOTSUP) {
>              /* Fall back to bounce buffer if write zeroes is 
> unsupported */
> +            int max_xfer_len = MIN_NON_ZERO(bs->bl.max_transfer_length,
> + MAX_WRITE_ZEROES_DEFAULT);
> +            num = MIN(num, max_xfer_len);
this is not going to work IMHO. num is the number in sectors.
max_xfer_len is in bytes.

I will send my updated version using your approach in a
couple of minutes. Would like to test it a bit.

> iov.iov_len = num * BDRV_SECTOR_SIZE;
>              if (iov.iov_base == NULL) {
>                  iov.iov_base = qemu_try_blockalign(bs, num * 
> BDRV_SECTOR_SIZE);
> @@ -3219,7 +3222,7 @@ static int coroutine_fn 
> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>              /* Keep bounce buffer around if it is big enough for all
>               * all future requests.
>               */
> -            if (num < max_write_zeroes) {
> +            if (num < max_xfer_len) {
>                  qemu_vfree(iov.iov_base);
>                  iov.iov_base = NULL;
>              }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Qemu-devel] [PATCH v2 1/1] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
  2015-01-05 11:23       ` Peter Lieven
  2015-01-05 11:48         ` Denis V. Lunev
@ 2015-01-05 12:26         ` Denis V. Lunev
  2015-01-05 12:32           ` [Qemu-devel] [PATCH v3 " Denis V. Lunev
  1 sibling, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 12:26 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
	Stefan Hajnoczi

bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
16 MiB as a chunk size. This is implemented in this way to tolerate
buggy block backends which do not accept too big requests.

Though if the bdrv_co_write_zeroes callback is not good enough, we
fallback to write data explicitely using bdrv_co_writev and we
create buffer to accomodate zeroes inside. The size of this buffer
is the size of the chunk. Thus if the underlying layer will have
bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.

Actually, there is no need to allocate such a big amount of memory.
We could simply allocate 1 MiB buffer and create iovec, which will
point to the same memory.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
Changes from v1:
- using MAX_WRITE_ZEROES_DEFAULT as a limit for real write

 block.c | 40 +++++++++++++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/block.c b/block.c
index 4165d42..49fd4e3 100644
--- a/block.c
+++ b/block.c
@@ -3173,14 +3173,18 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
  * of 32768 512-byte sectors (16 MiB) per request.
  */
 #define MAX_WRITE_ZEROES_DEFAULT 32768
+/* allocate iovec with zeroes using 1 MiB chunks to avoid to big allocations */
+#define MAX_ZEROES_CHUNK (1024 * 1024)
 
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
 {
     BlockDriver *drv = bs->drv;
     QEMUIOVector qiov;
-    struct iovec iov = {0};
     int ret = 0;
+    void *chunk = NULL;
+
+    qemu_iovec_init(&qiov, 0);
 
     int max_write_zeroes = bs->bl.max_write_zeroes ?
                            bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
@@ -3217,27 +3221,40 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
         }
 
         if (ret == -ENOTSUP) {
+            int64_t num_bytes;
+            int chunk_size;
+
             /* Fall back to bounce buffer if write zeroes is unsupported */
-            iov.iov_len = num * BDRV_SECTOR_SIZE;
-            if (iov.iov_base == NULL) {
-                iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
-                if (iov.iov_base == NULL) {
+            num = MIN(num, MAX_WRITE_ZEROES_DEFAULT >> BDRV_SECTOR_BITS);
+
+            num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
+            chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
+
+            if (chunk == NULL) {
+                chunk = qemu_try_blockalign(bs, chunk_size);
+                if (chunk == NULL) {
                     ret = -ENOMEM;
                     goto fail;
                 }
-                memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
+                memset(chunk, 0, chunk_size);
+            }
+
+            while (num_bytes > 0) {
+                int to_add = MIN(chunk_size, num_bytes);
+                qemu_iovec_add(&qiov, chunk, to_add);
+                num_bytes -= to_add;
             }
-            qemu_iovec_init_external(&qiov, &iov, 1);
 
             ret = drv->bdrv_co_writev(bs, sector_num, num, &qiov);
 
             /* Keep bounce buffer around if it is big enough for all
              * all future requests.
              */
-            if (num < max_write_zeroes) {
-                qemu_vfree(iov.iov_base);
-                iov.iov_base = NULL;
+            if (chunk_size != MAX_ZEROES_CHUNK) {
+                qemu_vfree(chunk);
+                chunk = NULL;
             }
+            qemu_iovec_reset(&qiov);
         }
 
         sector_num += num;
@@ -3245,7 +3262,8 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     }
 
 fail:
-    qemu_vfree(iov.iov_base);
+    qemu_iovec_destroy(&qiov);
+    qemu_vfree(chunk);
     return ret;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [Qemu-devel] [PATCH v3 1/1] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
  2015-01-05 12:26         ` [Qemu-devel] [PATCH v2 1/1] " Denis V. Lunev
@ 2015-01-05 12:32           ` Denis V. Lunev
  0 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 12:32 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
	Stefan Hajnoczi

bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
16 MiB as a chunk size. This is implemented in this way to tolerate
buggy block backends which do not accept too big requests.

Though if the bdrv_co_write_zeroes callback is not good enough, we
fallback to write data explicitely using bdrv_co_writev and we
create buffer to accomodate zeroes inside. The size of this buffer
is the size of the chunk. Thus if the underlying layer will have
bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.

Actually, there is no need to allocate such a big amount of memory.
We could simply allocate 1 MiB buffer and create iovec, which will
point to the same memory.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
Changes from v2:
- fixed num assignment as MAX_WRITE_ZEROES_DEFAULT is already in sectors

Changes from v1:
- using MAX_WRITE_ZEROES_DEFAULT as a limit for real write

 block.c | 40 +++++++++++++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/block.c b/block.c
index 4165d42..e09220f 100644
--- a/block.c
+++ b/block.c
@@ -3173,14 +3173,18 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
  * of 32768 512-byte sectors (16 MiB) per request.
  */
 #define MAX_WRITE_ZEROES_DEFAULT 32768
+/* allocate iovec with zeroes using 1 MiB chunks to avoid to big allocations */
+#define MAX_ZEROES_CHUNK (1024 * 1024)
 
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
 {
     BlockDriver *drv = bs->drv;
     QEMUIOVector qiov;
-    struct iovec iov = {0};
     int ret = 0;
+    void *chunk = NULL;
+
+    qemu_iovec_init(&qiov, 0);
 
     int max_write_zeroes = bs->bl.max_write_zeroes ?
                            bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
@@ -3217,27 +3221,40 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
         }
 
         if (ret == -ENOTSUP) {
+            int64_t num_bytes;
+            int chunk_size;
+
             /* Fall back to bounce buffer if write zeroes is unsupported */
-            iov.iov_len = num * BDRV_SECTOR_SIZE;
-            if (iov.iov_base == NULL) {
-                iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
-                if (iov.iov_base == NULL) {
+            num = MIN(num, MAX_WRITE_ZEROES_DEFAULT);
+
+            num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
+            chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
+
+            if (chunk == NULL) {
+                chunk = qemu_try_blockalign(bs, chunk_size);
+                if (chunk == NULL) {
                     ret = -ENOMEM;
                     goto fail;
                 }
-                memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
+                memset(chunk, 0, chunk_size);
+            }
+
+            while (num_bytes > 0) {
+                int to_add = MIN(chunk_size, num_bytes);
+                qemu_iovec_add(&qiov, chunk, to_add);
+                num_bytes -= to_add;
             }
-            qemu_iovec_init_external(&qiov, &iov, 1);
 
             ret = drv->bdrv_co_writev(bs, sector_num, num, &qiov);
 
             /* Keep bounce buffer around if it is big enough for all
              * all future requests.
              */
-            if (num < max_write_zeroes) {
-                qemu_vfree(iov.iov_base);
-                iov.iov_base = NULL;
+            if (chunk_size != MAX_ZEROES_CHUNK) {
+                qemu_vfree(chunk);
+                chunk = NULL;
             }
+            qemu_iovec_reset(&qiov);
         }
 
         sector_num += num;
@@ -3245,7 +3262,8 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     }
 
 fail:
-    qemu_vfree(iov.iov_base);
+    qemu_iovec_destroy(&qiov);
+    qemu_vfree(chunk);
     return ret;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-01-05 12:32 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-12-30  9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
2014-12-30  9:20 ` [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes Denis V. Lunev
2015-01-05  7:34   ` Peter Lieven
2015-01-05 11:06     ` Denis V. Lunev
2015-01-05 11:23       ` Peter Lieven
2015-01-05 11:48         ` Denis V. Lunev
2015-01-05 12:26         ` [Qemu-devel] [PATCH v2 1/1] " Denis V. Lunev
2015-01-05 12:32           ` [Qemu-devel] [PATCH v3 " Denis V. Lunev
2014-12-30  9:20 ` [Qemu-devel] [PATCH 2/8] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
2014-12-30  9:20 ` [Qemu-devel] [PATCH 3/8] block/raw-posix: create do_fallocate helper Denis V. Lunev
2014-12-30  9:20 ` [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
2015-01-05  6:46   ` Fam Zheng
2015-01-05 11:17     ` Denis V. Lunev
2014-12-30  9:20 ` [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
2015-01-05  6:57   ` Fam Zheng
2015-01-05 11:07     ` Denis V. Lunev
2014-12-30  9:20 ` [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
2015-01-05  7:02   ` Fam Zheng
2015-01-05 11:14     ` Denis V. Lunev
2014-12-30  9:20 ` [Qemu-devel] [PATCH 7/8] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
2014-12-30  9:20 ` [Qemu-devel] [PATCH 8/8] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev
2014-12-30 10:55 ` [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Peter Lieven
2014-12-30 11:07   ` Denis V. Lunev
2015-01-05  6:55     ` Peter Lieven

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).