qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCHv2] block: introduce BDRV_O_SEQUENTIAL
@ 2014-03-21 11:49 Peter Lieven
  2014-03-21 12:06 ` Paolo Bonzini
  2014-03-24  9:18 ` Fam Zheng
  0 siblings, 2 replies; 6+ messages in thread
From: Peter Lieven @ 2014-03-21 11:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, famz, Peter Lieven, stefanha, shadowsor, pbonzini

This patch introduces a new flag to indicate that we are going to sequentially
read from a file and do not plan to reread/reuse the data after it has been read.

The flag is currently used to open the source(s) of a qemu-img convert
process. If a protocol from block/raw-posix.c is used, posix_fadvise is called
to advise the kernel that we are going to read sequentially from the file, and
POSIX_FADV_DONTNEED advice is issued after each read to indicate that there is
no advantage in keeping the blocks in the buffers.
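
As a rough illustration of the pattern described above (this is not the patch
itself), the following standalone C sketch reads a file sequentially and drops
each chunk from the page cache once it has been consumed. The helper name
copy_drop_cache and the 64 KiB chunk size are assumptions made for the
example; only posix_fadvise and the two POSIX_FADV_* constants come from the
patch.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): copy everything from src_fd
 * to dst_fd, reading sequentially and dropping each chunk from the page
 * cache after it has been written out.
 * Returns the number of bytes copied, or -1 on a read/write error.
 */
static long long copy_drop_cache(int src_fd, int dst_fd)
{
    char buf[1 << 16];          /* 64 KiB chunk size, chosen arbitrarily */
    off_t offset = 0;

    assert(src_fd >= 0 && dst_fd >= 0);

#ifdef POSIX_FADV_SEQUENTIAL
    /* Advice only, so a failure is ignored. A length of 0 means "to EOF". */
    (void)posix_fadvise(src_fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif

    for (;;) {
        ssize_t n = read(src_fd, buf, sizeof(buf));
        if (n == 0) {
            break;              /* end of file */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }

        /* write() may be short, so loop until the whole chunk is out. */
        for (ssize_t done = 0; done < n; ) {
            ssize_t w = write(dst_fd, buf + done, (size_t)(n - done));
            if (w < 0) {
                if (errno == EINTR) {
                    continue;
                }
                return -1;
            }
            done += w;
        }

#ifdef POSIX_FADV_DONTNEED
        /* The chunk will not be reread: ask the kernel to drop it. */
        (void)posix_fadvise(src_fd, offset, n, POSIX_FADV_DONTNEED);
#endif
        offset += n;
    }
    return (long long)offset;
}
```

Both calls are purely advisory, so the sketch ignores their return values and
the copy still works on file descriptors where the advice is not supported.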

Consider the following test case that was created to confirm the behaviour of
the new flag:

A 10G logical volume was created and filled with random data.
Then the logical volume was exported via qemu-img convert to an iscsi target.
Before the export was started all caches of the Linux kernel were dropped.

Old behavior:
 - The convert process took 3m45s and the buffer cache grew up to 9.67 GB close
   to the end of the conversion. After qemu-img terminated all the buffers were
   freed by the kernel.

New behavior with the -N switch:
 - The convert process took 3m43s and the buffer cache grew up to 15.48 MB close
   to the end with some small peaks up to 30 MB durine the conversion.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
v1->v2: - added test example to commit msg
        - added -N knob to qemu-img

 block/raw-posix.c     |   14 ++++++++++++++
 include/block/block.h |    1 +
 qemu-img-cmds.hx      |    4 ++--
 qemu-img.c            |   16 +++++++++++++---
 qemu-img.texi         |    9 ++++++++-
 5 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 1688e16..08f7209 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -444,6 +444,13 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
 #endif
 
+#ifdef POSIX_FADV_SEQUENTIAL
+    if (bs->open_flags & BDRV_O_SEQUENTIAL &&
+        !(bs->open_flags & BDRV_O_NOCACHE)) {
+        posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
+    }
+#endif
+
     ret = 0;
 fail:
     qemu_opts_del(opts);
@@ -913,6 +920,13 @@ static int aio_worker(void *arg)
             ret = aiocb->aio_nbytes;
         }
         if (ret == aiocb->aio_nbytes) {
+#ifdef POSIX_FADV_DONTNEED
+            if (aiocb->bs->open_flags & BDRV_O_SEQUENTIAL &&
+                !(aiocb->bs->open_flags & BDRV_O_NOCACHE)) {
+                posix_fadvise(aiocb->aio_fildes, aiocb->aio_offset,
+                              aiocb->aio_nbytes, POSIX_FADV_DONTNEED);
+            }
+#endif
             ret = 0;
         } else if (ret >= 0 && ret < aiocb->aio_nbytes) {
             ret = -EINVAL;
diff --git a/include/block/block.h b/include/block/block.h
index 1ed55d8..a60d973 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -105,6 +105,7 @@ typedef enum {
 #define BDRV_O_PROTOCOL    0x8000  /* if no block driver is explicitly given:
                                       select an appropriate protocol driver,
                                       ignoring the format layer */
+#define BDRV_O_SEQUENTIAL 0x10000  /* open device for sequential read */
 
 #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NO_FLUSH)
 
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index d029609..74c2c08 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -34,9 +34,9 @@ STEXI
 ETEXI
 
 DEF("convert", img_convert,
-    "convert [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-O output_fmt] [-o options] [-s snapshot_id_or_name] [-l snapshot_param] [-S sparse_size] filename [filename2 [...]] output_filename")
+    "convert [-c] [-p] [-q] [-n] [-N] [-f fmt] [-t cache] [-O output_fmt] [-o options] [-s snapshot_id_or_name] [-l snapshot_param] [-S sparse_size] filename [filename2 [...]] output_filename")
 STEXI
-@item convert [-c] [-p] [-q] [-n] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
+@item convert [-c] [-p] [-q] [-n] [-N] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
 ETEXI
 
 DEF("info", img_info,
diff --git a/qemu-img.c b/qemu-img.c
index 2e40cc1..ddb6c25 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -107,6 +107,8 @@ static void help(void)
            "  '--output' takes the format in which the output must be done (human or json)\n"
            "  '-n' skips the target volume creation (useful if the volume is created\n"
            "       prior to running qemu-img)\n"
+           "  '-N' opens the source file(s) for sequential reading and drops data from\n"
+           "       page cache immediately\n"
            "\n"
            "Parameters to check subcommand:\n"
            "  '-r' tries to repair any inconsistencies that are found during the check.\n"
@@ -1158,7 +1160,7 @@ static int img_convert(int argc, char **argv)
     char *options = NULL;
     const char *snapshot_name = NULL;
     int min_sparse = 8; /* Need at least 4k of zeros for sparse detection */
-    bool quiet = false;
+    bool quiet = false, sequential_read = false;
     Error *local_err = NULL;
     QemuOpts *sn_opts = NULL;
 
@@ -1169,7 +1171,7 @@ static int img_convert(int argc, char **argv)
     compress = 0;
     skip_create = 0;
     for(;;) {
-        c = getopt(argc, argv, "f:O:B:s:hce6o:pS:t:qnl:");
+        c = getopt(argc, argv, "f:O:B:s:hce6o:pS:t:qnNl:");
         if (c == -1) {
             break;
         }
@@ -1256,6 +1258,9 @@ static int img_convert(int argc, char **argv)
         case 'n':
             skip_create = 1;
             break;
+        case 'N':
+            sequential_read = true;
+            break;
         }
     }
 
@@ -1292,7 +1297,12 @@ static int img_convert(int argc, char **argv)
 
     total_sectors = 0;
     for (bs_i = 0; bs_i < bs_n; bs_i++) {
-        bs[bs_i] = bdrv_new_open(argv[optind + bs_i], fmt, BDRV_O_FLAGS, true,
+        int open_flags = BDRV_O_FLAGS;
+        if (sequential_read) {
+            open_flags |= BDRV_O_SEQUENTIAL;
+        }
+        bs[bs_i] = bdrv_new_open(argv[optind + bs_i], fmt,
+                                 open_flags, true,
                                  quiet);
         if (!bs[bs_i]) {
             error_report("Could not open '%s'", argv[optind + bs_i]);
diff --git a/qemu-img.texi b/qemu-img.texi
index f84590e..0fb63c2 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -190,7 +190,7 @@ Error on reading data
 
 @end table
 
-@item convert [-c] [-p] [-n] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
+@item convert [-c] [-p] [-n] [-N] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
 
 Convert the disk image @var{filename} or a snapshot @var{snapshot_param}(@var{snapshot_id_or_name} is deprecated)
 to disk image @var{output_filename} using format @var{output_fmt}. It can be optionally compressed (@code{-c}
@@ -220,6 +220,13 @@ skipped. This is useful for formats such as @code{rbd} if the target
 volume has already been created with site specific options that cannot
 be supplied through qemu-img.
 
+If the @code{-N} option is specified, the source image is opened
+for sequential reading. This means its contents are dropped from
+the page cache immediately after they have been read. The option
+is meant for reading in raw files or host devices and may have
+bad performance impact on other formats which read a sector more
+than once.
+
 @item info [-f @var{fmt}] [--output=@var{ofmt}] [--backing-chain] @var{filename}
 
 Give information about the disk image @var{filename}. Use it in
-- 
1.7.9.5


* Re: [Qemu-devel] [PATCHv2] block: introduce BDRV_O_SEQUENTIAL
  2014-03-21 11:49 [Qemu-devel] [PATCHv2] block: introduce BDRV_O_SEQUENTIAL Peter Lieven
@ 2014-03-21 12:06 ` Paolo Bonzini
  2014-03-21 12:42   ` Peter Lieven
  2014-03-28 10:02   ` Peter Lieven
  2014-03-24  9:18 ` Fam Zheng
  1 sibling, 2 replies; 6+ messages in thread
From: Paolo Bonzini @ 2014-03-21 12:06 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel; +Cc: kwolf, shadowsor, famz, stefanha

On 21/03/2014 12:49, Peter Lieven wrote:
> A 10G logical volume was created and filled with random data.
> Then the logical volume was exported via qemu-img convert to an iscsi target.
> Before the export was started all caches of the linux kernel where dropped.
>
> Old behavior:
>  - The convert process took 3m45s and the buffer cache grew up to 9.67 GB close
>    to the end of the conversion. After qemu-img terminated all the buffers were
>    freed by the kernel.
>
> New behavior with the -N switch:
>  - The convert process took 3m43s and the buffer cache grew up to 15.48 MB close
>    to the end with some small peaks up to 30 MB durine the conversion.
>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
> v1->v2: - added test example to commit msg
>         - added -N knob to qemu-img

I'm sorry, I cannot find the original discussion.  Why is the new knob 
needed?

Paolo


* Re: [Qemu-devel] [PATCHv2] block: introduce BDRV_O_SEQUENTIAL
  2014-03-21 12:06 ` Paolo Bonzini
@ 2014-03-21 12:42   ` Peter Lieven
  2014-03-28 10:02   ` Peter Lieven
  1 sibling, 0 replies; 6+ messages in thread
From: Peter Lieven @ 2014-03-21 12:42 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, shadowsor, famz, stefanha

On 21.03.2014 13:06, Paolo Bonzini wrote:
> On 21/03/2014 12:49, Peter Lieven wrote:
>> A 10G logical volume was created and filled with random data.
>> Then the logical volume was exported via qemu-img convert to an iscsi target.
>> Before the export was started all caches of the linux kernel where dropped.
>>
>> Old behavior:
>>  - The convert process took 3m45s and the buffer cache grew up to 9.67 GB close
>>    to the end of the conversion. After qemu-img terminated all the buffers were
>>    freed by the kernel.
>>
>> New behavior with the -N switch:
>>  - The convert process took 3m43s and the buffer cache grew up to 15.48 MB close
>>    to the end with some small peaks up to 30 MB durine the conversion.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> v1->v2: - added test example to commit msg
>>         - added -N knob to qemu-img
>
> I'm sorry, I cannot find the original discussion.  Why is the new knob needed?

The thread was named "qemu-img convert cache mode for source".

I think the two points (mainly raised by Marcus) were, first, that you would not
expect qemu-img to meddle with the page cache by default, just as you would not
expect it from cp or dd. Secondly, if a running vServer and the image being
converted share pages, it can ruin the vServer's cache.

Peter


* Re: [Qemu-devel] [PATCHv2] block: introduce BDRV_O_SEQUENTIAL
  2014-03-21 11:49 [Qemu-devel] [PATCHv2] block: introduce BDRV_O_SEQUENTIAL Peter Lieven
  2014-03-21 12:06 ` Paolo Bonzini
@ 2014-03-24  9:18 ` Fam Zheng
  2014-03-24 14:02   ` Peter Lieven
  1 sibling, 1 reply; 6+ messages in thread
From: Fam Zheng @ 2014-03-24  9:18 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, qemu-devel, stefanha, shadowsor, pbonzini

On Fri, 03/21 12:49, Peter Lieven wrote:
> this patch introduces a new flag to indicate that we are going to sequentially
> read from a file and do not plan to reread/reuse the data after it has been read.
> 
> The current use of this flag is to open the source(s) of a qemu-img convert
> process. If a protocol from block/raw-posix.c is used posix_fadvise is utilized
> to advise to the kernel that we are going to read sequentially from the
> file and a POSIX_FADV_DONTNEED advise is issued after each write to indicate
> that there is no advantage keeping the blocks in the buffers.
> 
> Consider the following test case that was created to confirm the behaviour of
> the new flag:
> 
> A 10G logical volume was created and filled with random data.
> Then the logical volume was exported via qemu-img convert to an iscsi target.
> Before the export was started all caches of the linux kernel where dropped.
> 
> Old behavior:
>  - The convert process took 3m45s and the buffer cache grew up to 9.67 GB close
>    to the end of the conversion. After qemu-img terminated all the buffers were
>    freed by the kernel.
> 
> New behavior with the -N switch:
>  - The convert process took 3m43s and the buffer cache grew up to 15.48 MB close
>    to the end with some small peaks up to 30 MB durine the conversion.

s/durine/during/

The patch looks OK, and I have no objection to this flag. But I'm still
curious about the use case: host page cache growth is not the real problem in
itself. I'm not fully persuaded by the commit message because I still don't
know _what_ useful cache would be dropped (if you don't empty the kernel cache
before starting). I don't think the whole 9.67 GB of buffers will be filled
with data from this volume, so the question is how to measure the real,
effective performance impact.
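
One possible starting point for such a measurement, sketched under
assumptions: on Linux, mincore() reports which pages of a mapping are currently
resident in memory, so mapping a file and counting the resident pages shows how
much of it sits in the page cache before and after a conversion. The helper
name resident_pages is hypothetical, mincore() is not part of POSIX, and the
sketch only handles regular files (for a block device such as a logical volume,
st_size is 0 and the size would have to be obtained another way). It shows how
much of one file is cached, not which other pages were evicted to make room.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of the regular file behind fd
 * are resident in the page cache. Stores the file's total page count in
 * *total_pages. Returns the resident count, or -1 on error.
 */
static long resident_pages(int fd, long *total_pages)
{
    struct stat st;
    long page = sysconf(_SC_PAGESIZE);
    long resident = -1;

    assert(total_pages != NULL);

    if (page <= 0 || fstat(fd, &st) != 0) {
        return -1;
    }
    *total_pages = (long)((st.st_size + page - 1) / page);
    if (st.st_size == 0) {
        return 0;               /* nothing to map */
    }

    /* Mapping the file does not by itself read its pages into the cache. */
    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        return -1;
    }

    /* mincore() fills one byte per page; bit 0 set means "resident". */
    unsigned char *vec = malloc((size_t)*total_pages);
    if (vec != NULL && mincore(map, (size_t)st.st_size, vec) == 0) {
        resident = 0;
        for (long i = 0; i < *total_pages; i++) {
            resident += vec[i] & 1;
        }
    }

    free(vec);
    munmap(map, (size_t)st.st_size);
    return resident;
}
```

Running such a check against the files backing a busy guest before and after a
conversion, with and without -N, would give a rough idea of how much of their
cached data survives.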

> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
> v1->v2: - added test example to commit msg
>         - added -N knob to qemu-img
> 
>  block/raw-posix.c     |   14 ++++++++++++++
>  include/block/block.h |    1 +
>  qemu-img-cmds.hx      |    4 ++--
>  qemu-img.c            |   16 +++++++++++++---
>  qemu-img.texi         |    9 ++++++++-
>  5 files changed, 38 insertions(+), 6 deletions(-)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 1688e16..08f7209 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -444,6 +444,13 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      }
>  #endif
>  
> +#ifdef POSIX_FADV_SEQUENTIAL
> +    if (bs->open_flags & BDRV_O_SEQUENTIAL &&
> +        !(bs->open_flags & BDRV_O_NOCACHE)) {
> +        posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
> +    }
> +#endif
> +
>      ret = 0;
>  fail:
>      qemu_opts_del(opts);
> @@ -913,6 +920,13 @@ static int aio_worker(void *arg)
>              ret = aiocb->aio_nbytes;
>          }
>          if (ret == aiocb->aio_nbytes) {
> +#ifdef POSIX_FADV_DONTNEED
> +            if (aiocb->bs->open_flags & BDRV_O_SEQUENTIAL &&
> +                !(aiocb->bs->open_flags & BDRV_O_NOCACHE)) {
> +                posix_fadvise(aiocb->aio_fildes, aiocb->aio_offset,
> +                              aiocb->aio_nbytes, POSIX_FADV_DONTNEED);
> +            }
> +#endif

I'm not familiar with posix_fadvise. Can we do this on the whole file at once
in raw_open_common, as with POSIX_FADV_SEQUENTIAL?

Thanks,
Fam


* Re: [Qemu-devel] [PATCHv2] block: introduce BDRV_O_SEQUENTIAL
  2014-03-24  9:18 ` Fam Zheng
@ 2014-03-24 14:02   ` Peter Lieven
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Lieven @ 2014-03-24 14:02 UTC (permalink / raw)
  To: Fam Zheng; +Cc: kwolf, qemu-devel, stefanha, shadowsor, pbonzini

On 24.03.2014 10:18, Fam Zheng wrote:
> On Fri, 03/21 12:49, Peter Lieven wrote:
>> this patch introduces a new flag to indicate that we are going to sequentially
>> read from a file and do not plan to reread/reuse the data after it has been read.
>>
>> The current use of this flag is to open the source(s) of a qemu-img convert
>> process. If a protocol from block/raw-posix.c is used posix_fadvise is utilized
>> to advise to the kernel that we are going to read sequentially from the
>> file and a POSIX_FADV_DONTNEED advise is issued after each write to indicate
>> that there is no advantage keeping the blocks in the buffers.
>>
>> Consider the following test case that was created to confirm the behaviour of
>> the new flag:
>>
>> A 10G logical volume was created and filled with random data.
>> Then the logical volume was exported via qemu-img convert to an iscsi target.
>> Before the export was started all caches of the linux kernel where dropped.
>>
>> Old behavior:
>>  - The convert process took 3m45s and the buffer cache grew up to 9.67 GB close
>>    to the end of the conversion. After qemu-img terminated all the buffers were
>>    freed by the kernel.
>>
>> New behavior with the -N switch:
>>  - The convert process took 3m43s and the buffer cache grew up to 15.48 MB close
>>    to the end with some small peaks up to 30 MB durine the conversion.
> s/durine/during/
>
> The patch looks OK, and I have no objection with this flag. But I'm still
> curious about the use case: Host page cache growing is not the real problem,
> I'm not fully persudaded by commit message because I still don't know _what_
> useful cache would be dropped (if you don't empty the kernel cache before
> starting). I don't think all 9.67 GB buffer will be filled by data from this
> volume, so the question is how to measure the real, effective performance
> impact?
I ran this on an idle machine, and indeed all of the 9.67 GB is buffered data
from the qemu-img process. The problem is that the growing buffers eventually
evict other pages from the cache. As for sharing: if you have a vServer's
drive on an LVM logical volume, take a snapshot, and fadvise data from the
snapshot, I think that pages shared between the logical volume and its
snapshot are dropped. However, this all depends on how it is handled
internally. Maybe Markus has more evidence. I personally would always disable
the cache entirely for my vServers' hard drives.
In general I am totally happy with having a switch, just in case
there are some side effects we don't see at this point.
>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> v1->v2: - added test example to commit msg
>>         - added -N knob to qemu-img
>>
>>  block/raw-posix.c     |   14 ++++++++++++++
>>  include/block/block.h |    1 +
>>  qemu-img-cmds.hx      |    4 ++--
>>  qemu-img.c            |   16 +++++++++++++---
>>  qemu-img.texi         |    9 ++++++++-
>>  5 files changed, 38 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index 1688e16..08f7209 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -444,6 +444,13 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>      }
>>  #endif
>>  
>> +#ifdef POSIX_FADV_SEQUENTIAL
>> +    if (bs->open_flags & BDRV_O_SEQUENTIAL &&
>> +        !(bs->open_flags & BDRV_O_NOCACHE)) {
>> +        posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
>> +    }
>> +#endif
>> +
>>      ret = 0;
>>  fail:
>>      qemu_opts_del(opts);
>> @@ -913,6 +920,13 @@ static int aio_worker(void *arg)
>>              ret = aiocb->aio_nbytes;
>>          }
>>          if (ret == aiocb->aio_nbytes) {
>> +#ifdef POSIX_FADV_DONTNEED
>> +            if (aiocb->bs->open_flags & BDRV_O_SEQUENTIAL &&
>> +                !(aiocb->bs->open_flags & BDRV_O_NOCACHE)) {
>> +                posix_fadvise(aiocb->aio_fildes, aiocb->aio_offset,
>> +                              aiocb->aio_nbytes, POSIX_FADV_DONTNEED);
>> +            }
>> +#endif
> I'm not familiar with posix_fadvise, can we do this on the whole file in once
> in raw_open_common like POSIX_FADV_SEQUENTIAL?
We could do it, but in the usage I have seen, it is called on the pages you
actually want to have dropped. At least this seems to work well.
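
Some background that may help here, hedged as my reading of the Linux
posix_fadvise man page: POSIX_FADV_DONTNEED attempts to free the pages that are
cached for the given range at the time of the call, and requests that cover
only part of a page are ignored. A single whole-file call at open time would
therefore find little to drop, which is consistent with advising per request.
The sketch below shows one way to advise only on whole pages as a sequential
read advances. The struct drop_state and the helper drop_consumed are
hypothetical names, not part of the patch.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping: everything below dropped_upto was advised. */
struct drop_state {
    off_t dropped_upto;         /* always a multiple of the page size */
};

/*
 * Hypothetical helper: the caller has consumed the file up to consumed_end.
 * Advise the kernel to drop every whole page that lies below that offset
 * and has not been advised yet. A trailing partial page is kept for a later
 * call. Returns 0 on success or an error number, like posix_fadvise().
 */
static int drop_consumed(int fd, struct drop_state *st, off_t consumed_end)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(st != NULL && consumed_end >= 0);
    if (page <= 0) {
        return EINVAL;
    }

    /* Round down so that only fully consumed pages are covered. */
    off_t aligned_end = consumed_end - (consumed_end % page);
    if (aligned_end <= st->dropped_upto) {
        return 0;               /* no new whole page yet */
    }

    int rc = posix_fadvise(fd, st->dropped_upto,
                           aligned_end - st->dropped_upto,
                           POSIX_FADV_DONTNEED);
    if (rc == 0) {
        st->dropped_upto = aligned_end;
    }
    return rc;
}
```

A caller would invoke drop_consumed after each completed read with the end
offset of that read. Whether qemu-img's requests are already page-aligned in
practice is not something this sketch assumes either way.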

Peter


* Re: [Qemu-devel] [PATCHv2] block: introduce BDRV_O_SEQUENTIAL
  2014-03-21 12:06 ` Paolo Bonzini
  2014-03-21 12:42   ` Peter Lieven
@ 2014-03-28 10:02   ` Peter Lieven
  1 sibling, 0 replies; 6+ messages in thread
From: Peter Lieven @ 2014-03-28 10:02 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, shadowsor, famz, stefanha

On 21.03.2014 13:06, Paolo Bonzini wrote:
> On 21/03/2014 12:49, Peter Lieven wrote:
>> A 10G logical volume was created and filled with random data.
>> Then the logical volume was exported via qemu-img convert to an iscsi target.
>> Before the export was started all caches of the linux kernel where dropped.
>>
>> Old behavior:
>>  - The convert process took 3m45s and the buffer cache grew up to 9.67 GB close
>>    to the end of the conversion. After qemu-img terminated all the buffers were
>>    freed by the kernel.
>>
>> New behavior with the -N switch:
>>  - The convert process took 3m43s and the buffer cache grew up to 15.48 MB close
>>    to the end with some small peaks up to 30 MB durine the conversion.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> v1->v2: - added test example to commit msg
>>         - added -N knob to qemu-img
>
> I'm sorry, I cannot find the original discussion.  Why is the new knob needed?

Hi all,

how shall we proceed with this patch? Is additional info needed?

Peter

