All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Denis V. Lunev" <den@parallels.com>
To: Kevin Wolf <kwolf@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign
Date: Fri, 30 Jan 2015 21:39:11 +0300	[thread overview]
Message-ID: <54CBCFCF.7090006@parallels.com> (raw)
In-Reply-To: <54CA3A7A.8090208@openvz.org>

[-- Attachment #1: Type: text/plain, Size: 7504 bytes --]

On 29/01/15 16:49, Denis V. Lunev wrote:
> On 29/01/15 16:18, Kevin Wolf wrote:
>> Am 29.01.2015 um 11:50 hat Denis V. Lunev geschrieben:
>>> The following sequence
>>>      int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
>>>      for (i = 0; i < 100000; i++)
>>>              write(fd, buf, 4096);
>>> performs 5% better if buf is aligned to 4096 bytes rather then to
>>> 512 bytes on HDD with 512/4096 logical/physical sector size.
>>>
>>> The difference is quite reliable.
>>>
>>> On the other hand we do not want at the moment to enforce bounce
>>> buffering if guest request is aligned to 512 bytes. This patch
>>> forces page alignment when we really forced to perform memory
>>> allocation.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> CC: Paolo Bonzini <pbonzini@redhat.com>
>>> CC: Kevin Wolf <kwolf@redhat.com>
>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   block.c | 9 ++++++++-
>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index d45e4dd..38cf73f 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -5293,7 +5293,11 @@ void 
>>> bdrv_set_guest_block_size(BlockDriverState *bs, int align)
>>>     void *qemu_blockalign(BlockDriverState *bs, size_t size)
>>>   {
>>> -    return qemu_memalign(bdrv_opt_mem_align(bs), size);
>>> +    size_t align = bdrv_opt_mem_align(bs);
>>> +    if (align < 4096) {
>>> +        align = 4096;
>>> +    }
>>> +    return qemu_memalign(align, size);
>>>   }
>>>     void *qemu_blockalign0(BlockDriverState *bs, size_t size)
>>> @@ -5307,6 +5311,9 @@ void *qemu_try_blockalign(BlockDriverState 
>>> *bs, size_t size)
>>>         /* Ensure that NULL is never returned on success */
>>>       assert(align > 0);
>>> +    if (align < 4096) {
>>> +        align = 4096;
>>> +    }
>>>       if (size == 0) {
>>>           size = align;
>>>       }
>> This is the wrong place to make this change. First you're duplicating
>> logic in the callers of bdrv_opt_mem_align() instead of making it return
>> the right thing in the first place.
> This has been actually done in the first iteration. bdrv_opt_mem_align
> is called actually three times in:
>   qemu_blockalign
>   qemu_try_blockalign
>   bdrv_qiov_is_aligned
> Paolo says that he does not want to have bdrv_qiov_is_aligned affected
> to avoid extra bounce buffering.
>
> From my point of view this extra bounce buffering is better than 
> unaligned
> pointer during write to the disk as 512/4096 logical/physical sectors 
> size
> disks are mainstream now. Though I don't want to specially argue here.
> Normal guest operations results in page aligned requests and this is not
> a problem at all. The amount of 512 aligned requests from guest side is
> quite negligible.
>>   Second, you're arguing with numbers
>> from a simple test case for O_DIRECT on Linux, but you're changing the
>> alignment for everyone instead of just the raw-posix driver which is
>> responsible for accessing Linux files.
> This should not be a real problem. We are allocation memory for the
> buffer. A little bit stricter alignment is not a big overhead for any 
> libc
> implementation thus this kludge will not produce any significant 
> overhead.
>> Also, what's the real reason for the performance improvement? Having
>> page alignment? If so, actually querying the page size instead of
>> assuming 4k might be worth a thought.
>>
>> Kevin
> Most likely the problem comes from the read-modify-write pattern
> either in kernel or in disk. Actually my experience says that it is a
> bad idea to supply 512 byte aligned buffer for O_DIRECT IO.
> ABI technically allows this but in general it is much less tested.
>
> Yes, this synthetic test shows some difference here. In terms of
> qemu-io the result is also visible, but less
>   qemu-img create -f qcow2 ./1.img 64G
>   qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
> performs 1% better.
>
> There is also similar kludge here
> size_t bdrv_opt_mem_align(BlockDriverState *bs)
> {
>     if (!bs || !bs->drv) {
>         /* 4k should be on the safe side */
>         return 4096;
>     }
>
>     return bs->bl.opt_mem_alignment;
> }
> which just uses 4096 constant.
>
> Yes, I could agree that queering page size could be a good idea, but
> I do not know at the moment how to do that. Can you pls share your
> opinion if you have any.
>
> Regards,
>     Den
Paolo, Kevin,

I have spent a bit more time digging the issue and found some
additional information. The same 5% difference if the buffer is
aligned to 512/4096 is observed for the following devices/filesystems

1) ext4 with block size equals to 1024 over 512/512 physical/logical
    sector size SSD disk
2) ext4 with block size equals to 4096 over 512/512 physical/logical
    sector size SSD disk
3) ext4 with block size equals to 4096 over 512/4096 physical/logical
    sector size rotational disk (WDC WD20EZRX)
4) with block size equals to 4096 over 512/512 physical/logical
    sector size SSD disk

This means that only page size (4k) matters.

Guys, you propose quite different approaches. I can extend this patch
to use sysconf(_SC_PAGESIZE) to detect page size and drop hardcoded
4096. This is not a problem. But you have different opinion about
the place to insert the check.

Could you please come into agreement?

Proper defines/configuration work to be done, I am trying to negotiate
principal approach.

Version 1)

diff --git a/block.c b/block.c
index d45e4dd..bc5d1e7 100644
--- a/block.c
+++ b/block.c
@@ -543,7 +543,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
          bs->bl.max_transfer_length = bs->file->bl.max_transfer_length;
          bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
      } else {
-        bs->bl.opt_mem_alignment = 512;
+        bs->bl.opt_mem_alignment = sysconf(_SC_PAGESIZE);
      }
  
      if (bs->backing_hd) {
diff --git a/block/raw-posix.c b/block/raw-posix.c
index ec38fee..d1b3388 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -266,7 +266,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
      if (!s->buf_align) {
          size_t align;
          buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
-        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+        for (align = sysconf(_SC_PAGESIZE); align <= MAX_BLOCKSIZE; align <<= 1) {
              if (pread(fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) {
                  s->buf_align = align;
                  break;


Version 2)
diff --git a/block.c b/block.c
index d45e4dd..e2bb3fd 100644
--- a/block.c
+++ b/block.c
@@ -5293,6 +5293,11 @@ void bdrv_set_guest_block_size(BlockDriverState 
*bs, int align)

  void *qemu_blockalign(BlockDriverState *bs, size_t size)
  {
+    int align = bdrv_opt_mem_align(bs);
+    int page_size = sysconf(_SC_PAGESIZE);
+    if (align < page_size) {
+        align = page_size;
+    }
      return qemu_memalign(bdrv_opt_mem_align(bs), size);
  }

@@ -5304,9 +5309,13 @@ void *qemu_blockalign0(BlockDriverState *bs, 
size_t size)
  void *qemu_try_blockalign(BlockDriverState *bs, size_t size)
  {
      size_t align = bdrv_opt_mem_align(bs);
+    int page_size = sysconf(_SC_PAGESIZE);

      /* Ensure that NULL is never returned on success */
      assert(align > 0);
+    if (align < page_size) {
+        align = page_size;
+    }
      if (size == 0) {
          size = align;
      }

I am totally fine with both versions.

Regards,
     Den

P.S. A bit improved version of test is attached.

[-- Attachment #2: 1.c --]
[-- Type: text/x-csrc, Size: 675 bytes --]

#define _GNU_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <malloc.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    void *buf;
    int i = 0, align = atoi(argv[2]);

    do {
        buf = memalign(align, 4096);
        if (align >= 4096)
            break;
        if ((unsigned long)buf & 4095)
            break;
        i++;
    } while (1);
    printf("%d %p\n", i, buf);

    memset(buf, 0x11, 4096);

    for (i = 0; i < 100000; i++) {
        lseek(fd, SEEK_CUR, 4096);
        write(fd, buf, 4096);
    }

    close(fd);
    return 0;
}

  reply	other threads:[~2015-01-30 18:40 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-01-29 10:50 [Qemu-devel] [PATCH v2 0/1] block: enforce minimal 4096 alignment in qemu_blockalign Denis V. Lunev
2015-01-29 10:50 ` [Qemu-devel] [PATCH 1/1] " Denis V. Lunev
2015-01-29 10:58   ` Paolo Bonzini
2015-01-29 13:18   ` Kevin Wolf
2015-01-29 13:49     ` Denis V. Lunev
2015-01-30 18:39       ` Denis V. Lunev [this message]
2015-01-30 19:48         ` Kevin Wolf
2015-01-30 20:05           ` Denis V. Lunev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54CBCFCF.7090006@parallels.com \
    --to=den@parallels.com \
    --cc=kwolf@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.