qemu-devel.nongnu.org archive mirror
From: Peter Lieven <pl@kamp.de>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Stefan Hajnoczi <stefanha@gmail.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] qemu-img convert cache mode for source
Date: Fri, 28 Feb 2014 15:35:05 +0100
Message-ID: <53109E99.3020102@kamp.de>
In-Reply-To: <20140227085711.GC21749@stefanha-thinkpad.redhat.com>

On 27.02.2014 09:57, Stefan Hajnoczi wrote:
> On Wed, Feb 26, 2014 at 05:01:52PM +0100, Peter Lieven wrote:
>> On 26.02.2014 16:41, Stefan Hajnoczi wrote:
>>> On Wed, Feb 26, 2014 at 11:14:04AM +0100, Peter Lieven wrote:
>>>> I was wondering if it would be a good idea to set the O_DIRECT mode for the source
>>>> files of a qemu-img convert process if the source is a host_device?
>>>>
>>>> Currently the backup of a host device is polluting the page cache.
>>> Points to consider:
>>>
>>> 1. O_DIRECT does not work on Linux tmpfs, you get EINVAL when opening
>>>     the file.  A fallback is necessary.
>>>
>>> 2. O_DIRECT has no readahead so performance could actually decrease.
>>>     The question is, how important is readahead versus polluting page
>>>     cache?
>>>
>>> 3. For raw files it would make sense to tell the kernel that access is
>>>     sequential and data will be used only once.  Then we can get the best
>>>     of both worlds (avoid polluting page cache but still get readahead).
>>>     This is done using posix_fadvise(2).
>>>
>>>     The problem is what to do for image formats.  An image file can be
>>>     very fragmented so the readahead might not be a win.  Does this mean
>>>     that for image formats we should tell the kernel access will be
>>>     random?
>>>
>>>     Furthermore, maybe it's best to do readahead inside QEMU so that even
>>>     network protocols (nbd, iscsi, etc) can get good performance.  They
>>>     act like O_DIRECT is always on.
>> Your comments are about qemu-img convert, right?
>> How would you implement this? A new open flag, because
>> the fadvise would have to go inside the protocol driver.
>>
>> I would start with host_devices first and see how it performs there.
>>
>> For qemu-img convert I would issue a FADV_DONTNEED after
>> a write for the bytes that have been written
>> (I have tested this with Linux and it seems to work quite well).
>>
>> Question is, what is the right parameter for reads? Also FADV_DONTNEED?
> I think so but this should be justified with benchmark results.

I ran some benchmarks and found that a FADV_DONTNEED issued after
a read does not hurt performance, and it keeps the buffers from
growing while I read from a host_device or raw file.
For writes it only works if I issue an fdatasync after each write, but
that should be equivalent to O_DIRECT. So I would limit the patch
to qemu-img convert sources that are a host_device or file.
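
To illustrate the behaviour described above outside of QEMU, here is a
minimal sketch, assuming Linux with glibc. copy_drop_cache() is a
hypothetical helper and not part of the patch. It reads a source
sequentially, hints POSIX_FADV_DONTNEED for each chunk once it has been
read, and on the destination calls fdatasync() before the hint, because
the kernel does not drop pages that are still dirty. Note that
posix_fadvise() returns an error number directly instead of setting
errno; the sketch treats the hint as best effort and ignores its result.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* exposes posix_fadvise(), pread(), pwrite() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: copy src_fd to dst_fd
 * front to back and drop each chunk from the page cache once it is done.
 * Returns 0 on success or a negative errno value.
 */
static int copy_drop_cache(int src_fd, int dst_fd)
{
    char buf[64 * 1024];
    off_t off = 0;

    assert(src_fd >= 0 && dst_fd >= 0);

    for (;;) {
        ssize_t n = pread(src_fd, buf, sizeof(buf), off);
        if (n < 0) {
            if (errno == EINTR) {
                continue;           /* retry the interrupted read */
            }
            return -errno;
        }
        if (n == 0) {
            return 0;               /* end of source */
        }

        /* pwrite() may write less than requested, so loop until done. */
        for (ssize_t done = 0; done < n; ) {
            ssize_t w = pwrite(dst_fd, buf + done, n - done, off + done);
            if (w < 0) {
                if (errno == EINTR) {
                    continue;       /* retry the interrupted write */
                }
                return -errno;
            }
            done += w;
        }

        /* Source pages are clean, so the hint can take effect at once. */
        (void)posix_fadvise(src_fd, off, n, POSIX_FADV_DONTNEED);

        /* Destination pages are dirty until flushed; sync, then hint. */
        if (fdatasync(dst_fd) < 0) {
            return -errno;
        }
        (void)posix_fadvise(dst_fd, off, n, POSIX_FADV_DONTNEED);

        off += n;
    }
}
```

A caller would open both files and run copy_drop_cache(src_fd, dst_fd).
The fdatasync() per chunk is what makes the write side behave much like
O_DIRECT, which is why the patch below only covers the read side.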

Here is a proposal for a patch:

diff --git a/block.c b/block.c
index 2fd5482..2445433 100644
--- a/block.c
+++ b/block.c
@@ -2626,6 +2626,14 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
              qemu_aio_wait();
          }
      }
+
+#ifdef POSIX_FADV_DONTNEED
+    if (!rwco.ret && (bs->open_flags & BDRV_O_SEQUENTIAL) &&
+        bs->drv->bdrv_fadvise && !is_write) {
+        bs->drv->bdrv_fadvise(bs, offset, qiov->size, POSIX_FADV_DONTNEED);
+    }
+#endif
+
      return rwco.ret;
  }

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 161ea14..d8d78d8 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -1397,6 +1397,12 @@ static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
      return 0;
  }

+static int raw_fadvise(BlockDriverState *bs, off_t offset, off_t len, int advise)
+{
+    BDRVRawState *s = bs->opaque;
+    return posix_fadvise(s->fd, offset, len, advise);
+}
+
  static QEMUOptionParameter raw_create_options[] = {
      {
          .name = BLOCK_OPT_SIZE,
@@ -1433,6 +1439,7 @@ static BlockDriver bdrv_file = {
      .bdrv_get_info = raw_get_info,
      .bdrv_get_allocated_file_size
                          = raw_get_allocated_file_size,
+    .bdrv_fadvise = raw_fadvise,

      .create_options = raw_create_options,
  };
@@ -1811,6 +1818,7 @@ static BlockDriver bdrv_host_device = {
      .bdrv_get_info = raw_get_info,
      .bdrv_get_allocated_file_size
                          = raw_get_allocated_file_size,
+    .bdrv_fadvise = raw_fadvise,

      /* generic scsi device */
  #ifdef __linux__
diff --git a/block/raw_bsd.c b/block/raw_bsd.c
index 01ea692..f09bc70 100644
--- a/block/raw_bsd.c
+++ b/block/raw_bsd.c
@@ -171,6 +171,15 @@ static int raw_probe(const uint8_t *buf, int buf_size, const char *filename)
      return 1;
  }

+static int raw_fadvise(BlockDriverState *bs, off_t offset, off_t len, int advise)
+{
+    if (bs->file->drv->bdrv_fadvise) {
+        return bs->file->drv->bdrv_fadvise(bs->file, offset, len, advise);
+    }
+    return 0;
+}
+
+
  static BlockDriver bdrv_raw = {
      .format_name          = "raw",
      .bdrv_probe           = &raw_probe,
@@ -195,7 +204,8 @@ static BlockDriver bdrv_raw = {
      .bdrv_ioctl           = &raw_ioctl,
      .bdrv_aio_ioctl       = &raw_aio_ioctl,
      .create_options       = &raw_create_options[0],
-    .bdrv_has_zero_init   = &raw_has_zero_init
+    .bdrv_has_zero_init   = &raw_has_zero_init,
+    .bdrv_fadvise         = &raw_fadvise,
  };

  static void bdrv_raw_init(void)
diff --git a/include/block/block.h b/include/block/block.h
index 780f48b..a4dcc3c 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -105,6 +105,9 @@ typedef enum {
  #define BDRV_O_PROTOCOL    0x8000  /* if no block driver is explicitly given:
                                        select an appropriate protocol driver,
                                        ignoring the format layer */
+#define BDRV_O_SEQUENTIAL 0x10000  /* open device for sequential read/write */
+
+

  #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NO_FLUSH)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 0bcf1c9..7efad55 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -246,6 +246,8 @@ struct BlockDriver {
       * zeros, 0 otherwise.
       */
      int (*bdrv_has_zero_init)(BlockDriverState *bs);
+
+    int (*bdrv_fadvise)(BlockDriverState *bs, off_t offset, off_t len, int advise);

      QLIST_ENTRY(BlockDriver) list;
  };
diff --git a/qemu-img.c b/qemu-img.c
index 78fc868..2b900d0 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1298,7 +1298,8 @@ static int img_convert(int argc, char **argv)

      total_sectors = 0;
      for (bs_i = 0; bs_i < bs_n; bs_i++) {
-        bs[bs_i] = bdrv_new_open(argv[optind + bs_i], fmt, BDRV_O_FLAGS, true,
+        bs[bs_i] = bdrv_new_open(argv[optind + bs_i], fmt,
+                                 BDRV_O_FLAGS | BDRV_O_SEQUENTIAL, true,
                                   quiet);
          if (!bs[bs_i]) {
              error_report("Could not open '%s'", argv[optind + bs_i]);
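
For readers who want to follow the plumbing without a QEMU tree, the
control flow of the patch can be modelled in a few lines. This is only a
sketch: struct node, struct driver, O_SEQUENTIAL, ADV_DONTNEED and all
function names are hypothetical stand-ins for BlockDriverState,
BlockDriver, BDRV_O_SEQUENTIAL, POSIX_FADV_DONTNEED and the callbacks in
the diff, and the protocol callback just counts calls instead of calling
posix_fadvise().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define O_SEQUENTIAL 0x10000    /* stand-in for BDRV_O_SEQUENTIAL */
#define ADV_DONTNEED 4          /* placeholder advice value for the model */

struct node;

struct driver {
    /* Optional callback; NULL if the driver has no cache hint support. */
    int (*fadvise)(struct node *n, int64_t offset, int64_t len, int advice);
};

struct node {
    const struct driver *drv;
    struct node *file;          /* underlying protocol node, or NULL */
    int open_flags;
    int hints_seen;             /* instrumentation for this model only */
};

/* Protocol layer: the real code would call posix_fadvise() on its fd. */
static int proto_fadvise(struct node *n, int64_t offset, int64_t len,
                         int advice)
{
    (void)offset; (void)len; (void)advice;
    n->hints_seen++;
    return 0;
}

/* Format layer: forward to the protocol node if it implements the hook. */
static int format_fadvise(struct node *n, int64_t offset, int64_t len,
                          int advice)
{
    if (n->file && n->file->drv->fadvise) {
        return n->file->drv->fadvise(n->file, offset, len, advice);
    }
    return 0;
}

static const struct driver proto_drv  = { .fadvise = proto_fadvise };
static const struct driver format_drv = { .fadvise = format_fadvise };

/* Mirrors the check added to bdrv_prwv_co(): hint only after a successful
 * read on a node opened with the sequential flag. */
static int hint_after_io(struct node *n, int ret, int is_write,
                         int64_t offset, int64_t len)
{
    assert(n != NULL && n->drv != NULL);
    if (ret == 0 && (n->open_flags & O_SEQUENTIAL) &&
        n->drv->fadvise && !is_write) {
        n->drv->fadvise(n, offset, len, ADV_DONTNEED);
    }
    return ret;
}
```

In this model a format driver that leaves the callback NULL never
forwards a hint, which matches the patch: only raw, file and host_device
set .bdrv_fadvise, so other image formats are unaffected.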
KAMP Netzwerkdienste GmbH
Vestische Str. 89-91 | 46117 Oberhausen
Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
pl@kamp.de | http://www.kamp.de
Geschäftsführer: Heiner Lante | Michael Lante
Amtsgericht Duisburg | HRB Nr. 12154
USt-Id-Nr.: DE 120607556

