From: Peter Lieven <pl@kamp.de>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
Stefan Hajnoczi <stefanha@gmail.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] qemu-img convert cache mode for source
Date: Fri, 28 Feb 2014 15:35:05 +0100
Message-ID: <53109E99.3020102@kamp.de>
In-Reply-To: <20140227085711.GC21749@stefanha-thinkpad.redhat.com>
On 27.02.2014 09:57, Stefan Hajnoczi wrote:
> On Wed, Feb 26, 2014 at 05:01:52PM +0100, Peter Lieven wrote:
>> On 26.02.2014 16:41, Stefan Hajnoczi wrote:
>>> On Wed, Feb 26, 2014 at 11:14:04AM +0100, Peter Lieven wrote:
>>>> I was wondering if it would be a good idea to set the O_DIRECT mode for the source
>>>> files of a qemu-img convert process if the source is a host_device?
>>>>
>>>> Currently the backup of a host device is polluting the page cache.
>>> Points to consider:
>>>
>>> 1. O_DIRECT does not work on Linux tmpfs, you get EINVAL when opening
>>> the file. A fallback is necessary.
>>>
>>> 2. O_DIRECT has no readahead so performance could actually decrease.
>>> The question is, how important is readahead versus polluting page
>>> cache?
>>>
>>> 3. For raw files it would make sense to tell the kernel that access is
>>> sequential and data will be used only once. Then we can get the best
>>> of both worlds (avoid polluting page cache but still get readahead).
>>> This is done using posix_fadvise(2).
>>>
>>> The problem is what to do for image formats. An image file can be
>>> very fragmented so the readahead might not be a win. Does this mean
>>> that for image formats we should tell the kernel access will be
>>> random?
>>>
>>> Furthermore, maybe it's best to do readahead inside QEMU so that even
>>> network protocols (nbd, iscsi, etc) can get good performance. They
>>> act like O_DIRECT is always on.
>> your comments are regarding qemu-img convert, right?
>> How would you implement this? A new open flag because
>> the fadvise would have to go inside the protocol driver.
>>
>> I would start with host_devices first and see how it performs there.
>>
>> For qemu-img convert I would issue a FADV_DONTNEED after
>> a write for the bytes that have been written
>> (I have tested this with Linux and it seems to work quite well).
>>
>> Question is, what is the right parameter for reads? Also FADV_DONTNEED?
> I think so but this should be justified with benchmark results.
I ran some benchmarks and found that a FADV_DONTNEED issued after
a read does not hurt performance. But it keeps the buffers from
growing while I read from a host_device or raw file.
As for writing, it only works if I issue an fdatasync after each write, but
this should be equivalent to O_DIRECT. So I would limit the patch
to qemu-img convert sources, and only if they are host_device or file.
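As a rough standalone illustration of the pattern described above (this is
not QEMU code; the helper copy_drop_cache() and its parameters are made up
for the sketch), the following copies one file descriptor to another and
drops each chunk from the page cache once it has been processed. It assumes
a POSIX system; the fdatasync() before the destination fadvise reflects the
observation that dirty pages are not dropped.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper, not part of QEMU: copy src_fd to dst_fd sequentially
 * and ask the kernel to drop each chunk from the page cache afterwards.
 * Returns 0 on success or a negative errno value. */
static int copy_drop_cache(int src_fd, int dst_fd, size_t buf_size)
{
    assert(buf_size > 0);

    char *buf = malloc(buf_size);
    off_t offset = 0;
    int ret = 0;

    if (!buf) {
        return -ENOMEM;
    }

    for (;;) {
        ssize_t n = pread(src_fd, buf, buf_size, offset);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            ret = -errno;
            break;
        }
        if (n == 0) {
            break; /* end of file */
        }

        /* Write the whole chunk, handling short writes. */
        ssize_t done = 0;
        while (done < n) {
            ssize_t w = pwrite(dst_fd, buf + done, n - done, offset + done);
            if (w < 0) {
                if (errno == EINTR) {
                    continue;
                }
                ret = -errno;
                goto out;
            }
            done += w;
        }

        /* Flush first: pages that are still dirty would not be dropped. */
        if (fdatasync(dst_fd) < 0) {
            ret = -errno;
            goto out;
        }

#ifdef POSIX_FADV_DONTNEED
        /* Purely advisory, so failures are ignored here. Note that
         * posix_fadvise() returns the error number directly instead of
         * setting errno. */
        (void)posix_fadvise(src_fd, offset, n, POSIX_FADV_DONTNEED);
        (void)posix_fadvise(dst_fd, offset, n, POSIX_FADV_DONTNEED);
#endif
        offset += n;
    }

out:
    free(buf);
    return ret;
}
```

The per-chunk fdatasync() is what makes the write side behave much like
O_DIRECT; for the read side only the posix_fadvise() call on src_fd matters.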
Here is a proposal for a patch:
diff --git a/block.c b/block.c
index 2fd5482..2445433 100644
--- a/block.c
+++ b/block.c
@@ -2626,6 +2626,14 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
             qemu_aio_wait();
         }
     }
+
+#ifdef POSIX_FADV_DONTNEED
+    if (!rwco.ret && bs->open_flags & BDRV_O_SEQUENTIAL &&
+        bs->drv->bdrv_fadvise && !is_write) {
+        bs->drv->bdrv_fadvise(bs, offset, qiov->size, POSIX_FADV_DONTNEED);
+    }
+#endif
+
     return rwco.ret;
 }
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 161ea14..d8d78d8 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -1397,6 +1397,12 @@ static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
     return 0;
 }
+static int raw_fadvise(BlockDriverState *bs, off_t offset, off_t len, int advise)
+{
+    BDRVRawState *s = bs->opaque;
+    return posix_fadvise(s->fd, offset, len, advise);
+}
+
 static QEMUOptionParameter raw_create_options[] = {
     {
         .name = BLOCK_OPT_SIZE,
@@ -1433,6 +1439,7 @@ static BlockDriver bdrv_file = {
     .bdrv_get_info = raw_get_info,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
+    .bdrv_fadvise = raw_fadvise,
     .create_options = raw_create_options,
 };
@@ -1811,6 +1818,7 @@ static BlockDriver bdrv_host_device = {
     .bdrv_get_info = raw_get_info,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
+    .bdrv_fadvise = raw_fadvise,
     /* generic scsi device */
 #ifdef __linux__
diff --git a/block/raw_bsd.c b/block/raw_bsd.c
index 01ea692..f09bc70 100644
--- a/block/raw_bsd.c
+++ b/block/raw_bsd.c
@@ -171,6 +171,15 @@ static int raw_probe(const uint8_t *buf, int buf_size, const char *filename)
     return 1;
 }
+static int raw_fadvise(BlockDriverState *bs, off_t offset, off_t len, int advise)
+{
+    if (bs->file->drv->bdrv_fadvise) {
+        return bs->file->drv->bdrv_fadvise(bs->file, offset, len, advise);
+    }
+    return 0;
+}
+
+
 static BlockDriver bdrv_raw = {
     .format_name = "raw",
     .bdrv_probe = &raw_probe,
@@ -195,7 +204,8 @@ static BlockDriver bdrv_raw = {
     .bdrv_ioctl = &raw_ioctl,
     .bdrv_aio_ioctl = &raw_aio_ioctl,
     .create_options = &raw_create_options[0],
-    .bdrv_has_zero_init = &raw_has_zero_init
+    .bdrv_has_zero_init = &raw_has_zero_init,
+    .bdrv_fadvise = &raw_fadvise,
 };
 static void bdrv_raw_init(void)
diff --git a/include/block/block.h b/include/block/block.h
index 780f48b..a4dcc3c 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -105,6 +105,9 @@ typedef enum {
 #define BDRV_O_PROTOCOL    0x8000  /* if no block driver is explicitly given:
                                       select an appropriate protocol driver,
                                       ignoring the format layer */
+#define BDRV_O_SEQUENTIAL  0x10000 /* open device for sequential read/write */
+
+
 #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NO_FLUSH)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 0bcf1c9..7efad55 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -246,6 +246,8 @@ struct BlockDriver {
      * zeros, 0 otherwise.
      */
     int (*bdrv_has_zero_init)(BlockDriverState *bs);
+
+    int (*bdrv_fadvise)(BlockDriverState *bs, off_t offset, off_t len, int advise);
     QLIST_ENTRY(BlockDriver) list;
 };
diff --git a/qemu-img.c b/qemu-img.c
index 78fc868..2b900d0 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1298,7 +1298,8 @@ static int img_convert(int argc, char **argv)
     total_sectors = 0;
     for (bs_i = 0; bs_i < bs_n; bs_i++) {
-        bs[bs_i] = bdrv_new_open(argv[optind + bs_i], fmt, BDRV_O_FLAGS, true,
+        bs[bs_i] = bdrv_new_open(argv[optind + bs_i], fmt,
+                                 BDRV_O_FLAGS | BDRV_O_SEQUENTIAL, true,
                                  quiet);
         if (!bs[bs_i]) {
             error_report("Could not open '%s'", argv[optind + bs_i]);
KAMP Netzwerkdienste GmbH
Vestische Str. 89-91 | 46117 Oberhausen
Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
pl@kamp.de | http://www.kamp.de
Geschäftsführer: Heiner Lante | Michael Lante
Amtsgericht Duisburg | HRB Nr. 12154
USt-Id-Nr.: DE 120607556