From: Peter Lieven <pl@kamp.de>
To: qemu-devel@nongnu.org
Cc: kwolf@redhat.com, famz@redhat.com, Peter Lieven <pl@kamp.de>,
stefanha@redhat.com, shadowsor@gmail.com, pbonzini@redhat.com
Subject: [Qemu-devel] [PATCHv3] block: introduce BDRV_O_SEQUENTIAL
Date: Tue, 13 May 2014 14:36:32 +0200
Message-ID: <1399984592-2469-1-git-send-email-pl@kamp.de>
This patch introduces a new flag to indicate that we are going to read
sequentially from a file and do not plan to reread or reuse the data after
it has been read. The flag is currently used to open the source(s) of a
qemu-img convert process. If a protocol from block/raw-posix.c is used,
posix_fadvise is used to advise the kernel that we are going to read
sequentially from the file, and a POSIX_FADV_DONTNEED advice is issued after
each read to indicate that there is no advantage in keeping the blocks in the
page cache.
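A minimal C sketch of that access pattern outside QEMU, under stated assumptions: the helper name read_sequential_fd() and its chunked pread() loop are hypothetical and not part of the patch. Like the patch, it guards each posix_fadvise() call with #ifdef, issues POSIX_FADV_SEQUENTIAL once for the whole file (offset 0 and length 0 mean "up to the end of the file"), and after every completed read advises POSIX_FADV_DONTNEED for exactly the range just consumed. The advice is best-effort, so the return values are ignored.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): read everything behind fd once,
 * front to back, in chunks of `chunk` bytes, mimicking the BDRV_O_SEQUENTIAL
 * access pattern. Returns the number of bytes read, or -1 on error.
 */
long long read_sequential_fd(int fd, size_t chunk)
{
    char *buf;
    off_t offset = 0;

    if (chunk == 0 || (buf = malloc(chunk)) == NULL) {
        return -1;
    }
#ifdef POSIX_FADV_SEQUENTIAL
    /* One hint for the whole file: offset 0, len 0 means "up to EOF". */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif
    for (;;) {
        ssize_t n = pread(fd, buf, chunk, offset);

        if (n < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted by a signal: retry */
            }
            free(buf);
            return -1;
        }
        if (n == 0) {
            break;              /* end of file */
        }
#ifdef POSIX_FADV_DONTNEED
        /* This range will not be reread, so let the kernel drop it now. */
        (void)posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);
#endif
        offset += n;
    }
    free(buf);
    return (long long)offset;
}
```

A caller would open the source read-only and pass the descriptor in; the chunk size only affects how often the advice is issued, not the total number of bytes read.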
Consider the following test case, which was created to confirm the behavior
of the new flag:
A 10G logical volume was created and filled with random data.
Then the logical volume was exported via qemu-img convert to an iSCSI target.
Before the export was started, all caches of the Linux kernel were dropped.
Old behavior:
- The convert process took 3m45s and the buffer cache grew to 9.67 GB close
to the end of the conversion. After qemu-img terminated, all the buffers were
freed by the kernel.
New behavior with the -N switch:
- The convert process took 3m43s and the buffer cache grew to 15.48 MB close
to the end, with some small peaks of up to 30 MB during the conversion.
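The test description does not say how the buffer cache size was sampled. One possible way on Linux, assumed here and not taken from the test above, is to poll the Cached field of /proc/meminfo while qemu-img runs. The hypothetical helpers below parse a named field from meminfo-formatted text and read the live value; values are in kB, as reported by the kernel.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the value (in kB) of field `key`, e.g.
 * "Cached" or "Buffers", from text in /proc/meminfo format, or -1 if absent.
 */
long long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Match "key:" at the start of a line, so "SwapCached" != "Cached". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long long kb;

            if (sscanf(line + klen + 1, "%lld", &kb) == 1) {
                return kb;
            }
            return -1;
        }
        line = strchr(line, '\n');
        if (line) {
            line++;
        }
    }
    return -1;
}

/* Hypothetical helper: current page cache size in kB (Linux only), or -1. */
long long page_cache_kb(void)
{
    char buf[8192];
    FILE *f = fopen("/proc/meminfo", "r");
    size_t n;

    if (!f) {
        return -1;
    }
    n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_kb(buf, "Cached");
}
```

Calling page_cache_kb() once per second during a conversion would give a curve comparable to the peaks quoted above.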
Signed-off-by: Peter Lieven <pl@kamp.de>
---
v2->v3: - rebased
- fixed typo in commit msg [Fam]
v1->v2: - added test example to commit msg
- added -N knob to qemu-img
block/raw-posix.c | 14 ++++++++++++++
include/block/block.h | 1 +
qemu-img-cmds.hx | 4 ++--
qemu-img.c | 15 ++++++++++++---
qemu-img.texi | 9 ++++++++-
5 files changed, 37 insertions(+), 6 deletions(-)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 6586a0c..9768cc4 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -447,6 +447,13 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
}
#endif
+#ifdef POSIX_FADV_SEQUENTIAL
+ if (bs->open_flags & BDRV_O_SEQUENTIAL &&
+ !(bs->open_flags & BDRV_O_NOCACHE)) {
+ posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
+ }
+#endif
+
ret = 0;
fail:
if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
@@ -919,6 +926,13 @@ static int aio_worker(void *arg)
ret = aiocb->aio_nbytes;
}
if (ret == aiocb->aio_nbytes) {
+#ifdef POSIX_FADV_DONTNEED
+ if (aiocb->bs->open_flags & BDRV_O_SEQUENTIAL &&
+ !(aiocb->bs->open_flags & BDRV_O_NOCACHE)) {
+ posix_fadvise(aiocb->aio_fildes, aiocb->aio_offset,
+ aiocb->aio_nbytes, POSIX_FADV_DONTNEED);
+ }
+#endif
ret = 0;
} else if (ret >= 0 && ret < aiocb->aio_nbytes) {
ret = -EINVAL;
diff --git a/include/block/block.h b/include/block/block.h
index 1b119aa..9b42d54 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -110,6 +110,7 @@ typedef enum {
#define BDRV_O_PROTOCOL 0x8000 /* if no block driver is explicitly given:
select an appropriate protocol driver,
ignoring the format layer */
+#define BDRV_O_SEQUENTIAL 0x10000 /* open device for sequential read */
#define BDRV_O_CACHE_MASK (BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NO_FLUSH)
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index d029609..74c2c08 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -34,9 +34,9 @@ STEXI
ETEXI
DEF("convert", img_convert,
- "convert [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-O output_fmt] [-o options] [-s snapshot_id_or_name] [-l snapshot_param] [-S sparse_size] filename [filename2 [...]] output_filename")
+ "convert [-c] [-p] [-q] [-n] [-N] [-f fmt] [-t cache] [-O output_fmt] [-o options] [-s snapshot_id_or_name] [-l snapshot_param] [-S sparse_size] filename [filename2 [...]] output_filename")
STEXI
-@item convert [-c] [-p] [-q] [-n] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
+@item convert [-c] [-p] [-q] [-n] [-N] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
ETEXI
DEF("info", img_info,
diff --git a/qemu-img.c b/qemu-img.c
index 04ce02a..356d4ae 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -141,6 +141,8 @@ static void QEMU_NORETURN help(void)
" '--output' takes the format in which the output must be done (human or json)\n"
" '-n' skips the target volume creation (useful if the volume is created\n"
" prior to running qemu-img)\n"
+ " '-N' opens the source file(s) for sequential reading and drops data from\n"
+ " the page cache immediately\n"
"\n"
"Parameters to check subcommand:\n"
" '-r' tries to repair any inconsistencies that are found during the check.\n"
@@ -1199,7 +1201,7 @@ static int img_convert(int argc, char **argv)
char *options = NULL;
const char *snapshot_name = NULL;
int min_sparse = 8; /* Need at least 4k of zeros for sparse detection */
- bool quiet = false;
+ bool quiet = false, sequential_read = false;
Error *local_err = NULL;
QemuOpts *sn_opts = NULL;
@@ -1210,7 +1212,7 @@ static int img_convert(int argc, char **argv)
compress = 0;
skip_create = 0;
for(;;) {
- c = getopt(argc, argv, "f:O:B:s:hce6o:pS:t:qnl:");
+ c = getopt(argc, argv, "f:O:B:s:hce6o:pS:t:qnNl:");
if (c == -1) {
break;
}
@@ -1297,6 +1299,9 @@ static int img_convert(int argc, char **argv)
case 'n':
skip_create = 1;
break;
+ case 'N':
+ sequential_read = true;
+ break;
}
}
@@ -1333,9 +1338,13 @@ static int img_convert(int argc, char **argv)
total_sectors = 0;
for (bs_i = 0; bs_i < bs_n; bs_i++) {
+ int open_flags = BDRV_O_FLAGS;
char *id = bs_n > 1 ? g_strdup_printf("source %d", bs_i)
: g_strdup("source");
- bs[bs_i] = bdrv_new_open(id, argv[optind + bs_i], fmt, BDRV_O_FLAGS,
+ if (sequential_read) {
+ open_flags |= BDRV_O_SEQUENTIAL;
+ }
+ bs[bs_i] = bdrv_new_open(id, argv[optind + bs_i], fmt, open_flags,
true, quiet);
g_free(id);
if (!bs[bs_i]) {
diff --git a/qemu-img.texi b/qemu-img.texi
index f84590e..0fb63c2 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -190,7 +190,7 @@ Error on reading data
@end table
-@item convert [-c] [-p] [-n] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
+@item convert [-c] [-p] [-n] [-N] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
Convert the disk image @var{filename} or a snapshot @var{snapshot_param}(@var{snapshot_id_or_name} is deprecated)
to disk image @var{output_filename} using format @var{output_fmt}. It can be optionally compressed (@code{-c}
@@ -220,6 +220,13 @@ skipped. This is useful for formats such as @code{rbd} if the target
volume has already been created with site specific options that cannot
be supplied through qemu-img.
+If the @code{-N} option is specified, the source image is opened
+for sequential reading, and its contents are dropped from the
+page cache immediately after they have been read. The option
+is intended for reading raw files or host devices and may hurt
+performance with other formats that read a sector more than
+once.
+
@item info [-f @var{fmt}] [--output=@var{ofmt}] [--backing-chain] @var{filename}
Give information about the disk image @var{filename}. Use it in
--
1.7.9.5
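To see how the pieces of the patch fit together, the sketch below traces -N from the command line to the posix_fadvise() decision. It is an illustrative reduction, not QEMU code: only the BDRV_O_SEQUENTIAL value (0x10000) comes from the patch; the DEMO_* names, the stand-in values for the no-cache and default flags, and both functions are hypothetical. The guard mirrors the one in raw-posix.c: the advice is skipped when BDRV_O_NOCACHE is set, since raw-posix then opens the file with O_DIRECT and the page cache is bypassed anyway.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <unistd.h>

/* Value taken from the patch's include/block/block.h hunk. */
#define DEMO_O_SEQUENTIAL 0x10000
/* Hypothetical stand-ins: illustrative values, not copied from QEMU. */
#define DEMO_O_NOCACHE    0x0020
#define DEMO_BASE_FLAGS   0x0040

/*
 * Parse a convert-like command line and return the open flags for the
 * source images, or -1 on an unknown option. Only -n and -N are handled.
 */
int demo_source_open_flags(int argc, char **argv)
{
    int flags = DEMO_BASE_FLAGS;
    int c;

    optind = 1;     /* allow repeated calls within one process */
    opterr = 0;     /* report errors through the return value instead */
    while ((c = getopt(argc, argv, "nN")) != -1) {
        switch (c) {
        case 'N':
            flags |= DEMO_O_SEQUENTIAL;     /* like sequential_read = true */
            break;
        case 'n':
            break;  /* accepted, but irrelevant to the source flags */
        default:
            return -1;
        }
    }
    return flags;
}

/* Advise only for sequential opens that still go through the page cache. */
bool demo_should_fadvise(int flags)
{
    return (flags & DEMO_O_SEQUENTIAL) && !(flags & DEMO_O_NOCACHE);
}
```

Keeping the decision in the block layer rather than in qemu-img means any other caller that sets the flag gets the same behavior without further changes.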
Thread overview: 2+ messages
2014-05-13 12:36 Peter Lieven [this message]
2014-05-14 17:31 [Qemu-devel] [PATCHv3] block: introduce BDRV_O_SEQUENTIAL Eric Blake