From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:49658)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <den@parallels.com>) id 1YHGUH-0006GG-OX
	for qemu-devel@nongnu.org; Fri, 30 Jan 2015 13:40:15 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <den@parallels.com>) id 1YHGUD-0007oe-J5
	for qemu-devel@nongnu.org; Fri, 30 Jan 2015 13:40:13 -0500
Received: from mx2.parallels.com ([199.115.105.18]:35599)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <den@parallels.com>) id 1YHGUD-0007Eu-BH
	for qemu-devel@nongnu.org; Fri, 30 Jan 2015 13:40:09 -0500
Message-ID: <54CBCFCF.7090006@parallels.com>
Date: Fri, 30 Jan 2015 21:39:11 +0300
From: "Denis V. Lunev" <den@parallels.com>
MIME-Version: 1.0
References: <1422528659-3121-1-git-send-email-den@openvz.org>
	<1422528659-3121-2-git-send-email-den@openvz.org>
	<20150129131848.GA3950@noname.redhat.com>
	<54CA3A7A.8090208@openvz.org>
In-Reply-To: <54CA3A7A.8090208@openvz.org>
Content-Type: multipart/mixed; boundary="------------010909030408090301010202"
Subject: Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment
	in qemu_blockalign
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>

--------------010909030408090301010202
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit

On 29/01/15 16:49, Denis V. Lunev wrote:
> On 29/01/15 16:18, Kevin Wolf wrote:
>> Am 29.01.2015 um 11:50 hat Denis V. Lunev geschrieben:
>>> The following sequence
>>>      int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
>>>      for (i = 0; i < 100000; i++)
>>>              write(fd, buf, 4096);
>>> performs 5% better if buf is aligned to 4096 bytes rather then to
>>> 512 bytes on HDD with 512/4096 logical/physical sector size.
>>>
>>> The difference is quite reliable.
>>>
>>> On the other hand we do not want at the moment to enforce bounce
>>> buffering if guest request is aligned to 512 bytes. This patch
>>> forces page alignment when we really forced to perform memory
>>> allocation.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> CC: Paolo Bonzini <pbonzini@redhat.com>
>>> CC: Kevin Wolf <kwolf@redhat.com>
>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   block.c | 9 ++++++++-
>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index d45e4dd..38cf73f 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -5293,7 +5293,11 @@ void 
>>> bdrv_set_guest_block_size(BlockDriverState *bs, int align)
>>>     void *qemu_blockalign(BlockDriverState *bs, size_t size)
>>>   {
>>> -    return qemu_memalign(bdrv_opt_mem_align(bs), size);
>>> +    size_t align = bdrv_opt_mem_align(bs);
>>> +    if (align < 4096) {
>>> +        align = 4096;
>>> +    }
>>> +    return qemu_memalign(align, size);
>>>   }
>>>     void *qemu_blockalign0(BlockDriverState *bs, size_t size)
>>> @@ -5307,6 +5311,9 @@ void *qemu_try_blockalign(BlockDriverState 
>>> *bs, size_t size)
>>>         /* Ensure that NULL is never returned on success */
>>>       assert(align > 0);
>>> +    if (align < 4096) {
>>> +        align = 4096;
>>> +    }
>>>       if (size == 0) {
>>>           size = align;
>>>       }
>> This is the wrong place to make this change. First you're duplicating
>> logic in the callers of bdrv_opt_mem_align() instead of making it return
>> the right thing in the first place.
> This has been actually done in the first iteration. bdrv_opt_mem_align
> is called actually three times in:
>   qemu_blockalign
>   qemu_try_blockalign
>   bdrv_qiov_is_aligned
> Paolo says that he does not want to have bdrv_qiov_is_aligned affected
> to avoid extra bounce buffering.
>
> From my point of view this extra bounce buffering is better than 
> unaligned
> pointer during write to the disk as 512/4096 logical/physical sectors 
> size
> disks are mainstream now. Though I don't want to specially argue here.
> Normal guest operations results in page aligned requests and this is not
> a problem at all. The amount of 512 aligned requests from guest side is
> quite negligible.
>>   Second, you're arguing with numbers
>> from a simple test case for O_DIRECT on Linux, but you're changing the
>> alignment for everyone instead of just the raw-posix driver which is
>> responsible for accessing Linux files.
> This should not be a real problem. We are allocation memory for the
> buffer. A little bit stricter alignment is not a big overhead for any 
> libc
> implementation thus this kludge will not produce any significant 
> overhead.
>> Also, what's the real reason for the performance improvement? Having
>> page alignment? If so, actually querying the page size instead of
>> assuming 4k might be worth a thought.
>>
>> Kevin
> Most likely the problem comes from the read-modify-write pattern
> either in kernel or in disk. Actually my experience says that it is a
> bad idea to supply 512 byte aligned buffer for O_DIRECT IO.
> ABI technically allows this but in general it is much less tested.
>
> Yes, this synthetic test shows some difference here. In terms of
> qemu-io the result is also visible, but less
>   qemu-img create -f qcow2 ./1.img 64G
>   qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
> performs 1% better.
>
> There is also similar kludge here
> size_t bdrv_opt_mem_align(BlockDriverState *bs)
> {
>     if (!bs || !bs->drv) {
>         /* 4k should be on the safe side */
>         return 4096;
>     }
>
>     return bs->bl.opt_mem_alignment;
> }
> which just uses 4096 constant.
>
> Yes, I could agree that queering page size could be a good idea, but
> I do not know at the moment how to do that. Can you pls share your
> opinion if you have any.
>
> Regards,
>     Den
Paolo, Kevin,

I have spent a bit more time digging the issue and found some
additional information. The same 5% difference if the buffer is
aligned to 512/4096 is observed for the following devices/filesystems

1) ext4 with block size equals to 1024 over 512/512 physical/logical
    sector size SSD disk
2) ext4 with block size equals to 4096 over 512/512 physical/logical
    sector size SSD disk
3) ext4 with block size equals to 4096 over 512/4096 physical/logical
    sector size rotational disk (WDC WD20EZRX)
4) with block size equals to 4096 over 512/512 physical/logical
    sector size SSD disk

This means that only page size (4k) matters.

Guys, you propose quite different approaches. I can extend this patch
to use sysconf(_SC_PAGESIZE) to detect page size and drop hardcoded
4096. This is not a problem. But you have different opinion about
the place to insert the check.

Could you please come into agreement?

Proper defines/configuration work to be done, I am trying to negotiate
principal approach.

Version 1)

diff --git a/block.c b/block.c
index d45e4dd..bc5d1e7 100644
--- a/block.c
+++ b/block.c
@@ -543,7 +543,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
          bs->bl.max_transfer_length = bs->file->bl.max_transfer_length;
          bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
      } else {
-        bs->bl.opt_mem_alignment = 512;
+        bs->bl.opt_mem_alignment = sysconf(_SC_PAGESIZE);
      }
  
      if (bs->backing_hd) {
diff --git a/block/raw-posix.c b/block/raw-posix.c
index ec38fee..d1b3388 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -266,7 +266,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
      if (!s->buf_align) {
          size_t align;
          buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
-        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+        for (align = sysconf(_SC_PAGESIZE); align <= MAX_BLOCKSIZE; align <<= 1) {
              if (pread(fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) {
                  s->buf_align = align;
                  break;


Version 2)
diff --git a/block.c b/block.c
index d45e4dd..e2bb3fd 100644
--- a/block.c
+++ b/block.c
@@ -5293,6 +5293,11 @@ void bdrv_set_guest_block_size(BlockDriverState 
*bs, int align)

  void *qemu_blockalign(BlockDriverState *bs, size_t size)
  {
+    int align = bdrv_opt_mem_align(bs);
+    int page_size = sysconf(_SC_PAGESIZE);
+    if (align < page_size) {
+        align = page_size;
+    }
      return qemu_memalign(bdrv_opt_mem_align(bs), size);
  }

@@ -5304,9 +5309,13 @@ void *qemu_blockalign0(BlockDriverState *bs, 
size_t size)
  void *qemu_try_blockalign(BlockDriverState *bs, size_t size)
  {
      size_t align = bdrv_opt_mem_align(bs);
+    int page_size = sysconf(_SC_PAGESIZE);

      /* Ensure that NULL is never returned on success */
      assert(align > 0);
+    if (align < page_size) {
+        align = page_size;
+    }
      if (size == 0) {
          size = align;
      }

I am totally fine with both versions.

Regards,
     Den

P.S. A bit improved version of test is attached.

--------------010909030408090301010202
Content-Type: text/x-csrc; name="1.c"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="1.c"

#define _GNU_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <malloc.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    void *buf;
    int i = 0, align = atoi(argv[2]);

    do {
        buf = memalign(align, 4096);
        if (align >= 4096)
            break;
        if ((unsigned long)buf & 4095)
            break;
        i++;
    } while (1);
    printf("%d %p\n", i, buf);

    memset(buf, 0x11, 4096);

    for (i = 0; i < 100000; i++) {
        lseek(fd, SEEK_CUR, 4096);
        write(fd, buf, 4096);
    }

    close(fd);
    return 0;
}

--------------010909030408090301010202--