[Qemu-devel] [PATCH v2 0/1] block: enforce minimal 4096 alignment in qemu

qemu-devel.nongnu.org archive mirror

* [Qemu-devel] [PATCH v2 0/1] block: enforce minimal 4096 alignment in qemu_blockalign
@ 2015-01-29 10:50 Denis V. Lunev
  2015-01-29 10:50 ` [Qemu-devel] [PATCH 1/1] " Denis V. Lunev
  0 siblings, 1 reply; 8+ messages in thread
From: Denis V. Lunev @ 2015-01-29 10:50 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini

The following sequence
    int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    for (i = 0; i < 100000; i++)
            write(fd, buf, 4096);
performs 5% better if buf is aligned to 4096 bytes rather than to
512 bytes on an HDD with 512/4096 logical/physical sector size.

The difference is quite reliable.

I used the following program for testing:
#define _GNU_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <malloc.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    void *buf;
    int i = 0;

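    /*
     * Keep allocating until memalign() returns a buffer that is aligned
     * to 512 but not to 4096 bytes. For the aligned run, call
     * memalign(4096, 4096) once instead of looping.
     */
    do {
        buf = memalign(512, 4096);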
        if ((unsigned long)buf & 4095)
            break;
        i++;
    } while (1);
    printf("%d\n", i);

    memset(buf, 0x11, 4096);

    for (i = 0; i < 100000; i++)
        write(fd, buf, 4096);

    close(fd);
    return 0;
}
time for i in `seq 1 30` ; do a.out aa ; done

The file was placed on an 8 GB partition of the HDD described below, to
avoid speed changes caused by different offsets on the disk. The results
are reliable:
- 189 vs 180 seconds (512-byte vs 4096-byte aligned buffer) on Linux 3.16

Changes from v1:
- enforce 4096 alignment in qemu_(try_)blockalign and leave the
  bdrv_qiov_is_aligned path untouched, so that no additional bounce
  buffering is enforced, as suggested by Paolo
- reduce 10% to 5% in the patch description to better fit the 180 vs 189
  difference
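
A quick check of the arithmetic behind the 5% figure, as a minimal sketch
in C. The helper names are hypothetical and not part of the patch; the only
inputs taken from this message are the two timings, 189 and 180 seconds.

```c
/* Hypothetical helper: throughput gain, in percent, when the same
 * amount of work takes t_new seconds instead of t_old seconds. */
static double speedup_percent(double t_old, double t_new)
{
    return (t_old / t_new - 1.0) * 100.0;
}

/* Hypothetical helper: reduction of wall-clock time, in percent. */
static double time_saved_percent(double t_old, double t_new)
{
    return (t_old - t_new) / t_old * 100.0;
}
```

With 189 and 180 seconds, the first helper gives about 5.0% and the second
about 4.8%, so "5%" fits either reading.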

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>

hades ~/src/qemu # hdparm -I /dev/sdg

/dev/sdg:

ATA device, with non-removable media
    Model Number:       WDC WD20EZRX-07D8PB0
    Serial Number:      WD-WCC4M5LVSAEP
    Firmware Revision:  80.00A80
    Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
    Supported: 9 8 7 6 5
    Likely used: 9
Configuration:
    Logical     max current
    cylinders   16383   16383
    heads       16  16
    sectors/track   63  63
    --
    CHS current addressable sectors:   16514064
    LBA    user addressable sectors:  268435455
    LBA48  user addressable sectors: 3907029168
    Logical  Sector size:                   512 bytes
    Physical Sector size:                  4096 bytes
    device size with M = 1024*1024:     1907729 MBytes
    device size with M = 1000*1000:     2000398 MBytes (2000 GB)
    cache/buffer size  = unknown
    Nominal Media Rotation Rate: 5400
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, with device specific minimum
    R/W multiple sector transfer: Max = 16  Current = 16
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
       *    Write cache
       *    Look-ahead
       *    Host Protected Area feature set
       *    WRITE_BUFFER command
       *    READ_BUFFER command
       *    NOP cmd
       *    DOWNLOAD_MICROCODE
            Power-Up In Standby feature set
       *    SET_FEATURES required to spinup after power up
            SET_MAX security extension
       *    48-bit Address feature set
       *    Device Configuration Overlay feature set
       *    Mandatory FLUSH_CACHE
       *    FLUSH_CACHE_EXT
       *    SMART error logging
       *    SMART self-test
       *    General Purpose Logging feature set
       *    64-bit World wide name
       *    WRITE_UNCORRECTABLE_EXT command
       *    {READ,WRITE}_DMA_EXT_GPL commands
       *    Segmented DOWNLOAD_MICROCODE
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    Native Command Queueing (NCQ)
       *    Host-initiated interface power management
       *    Phy event counters
       *    NCQ priority information
       *    READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
       *    DMA Setup Auto-Activate optimization
            Device-initiated interface power management
       *    Software settings preservation
       *    SMART Command Transport (SCT) feature set
       *    SCT Write Same (AC2)
       *    SCT Features Control (AC4)
       *    SCT Data Tables (AC5)
            unknown 206[12] (vendor specific)
            unknown 206[13] (vendor specific)
            unknown 206[14] (vendor specific)
Security:
    Master password revision code = 65534
        supported
    not enabled
    not locked
        frozen
    not expired: security count
        supported: enhanced erase
    276min for SECURITY ERASE UNIT. 276min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee2b5da838c
    NAA     : 5
    IEEE OUI    : 0014ee
    Unique ID   : 2b5da838c
Checksum: correct
hades ~/src/qemu #

* [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign
  2015-01-29 10:50 [Qemu-devel] [PATCH v2 0/1] block: enforce minimal 4096 alignment in qemu_blockalign Denis V. Lunev
@ 2015-01-29 10:50 ` Denis V. Lunev
  2015-01-29 10:58   ` Paolo Bonzini
  2015-01-29 13:18   ` Kevin Wolf
  0 siblings, 2 replies; 8+ messages in thread
From: Denis V. Lunev @ 2015-01-29 10:50 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini

The following sequence
    int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    for (i = 0; i < 100000; i++)
            write(fd, buf, 4096);
performs 5% better if buf is aligned to 4096 bytes rather than to
512 bytes on an HDD with 512/4096 logical/physical sector size.

The difference is quite reliable.

On the other hand, we do not want at the moment to enforce bounce
buffering if a guest request is aligned to 512 bytes. This patch
forces page alignment only when we are actually forced to allocate
memory.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index d45e4dd..38cf73f 100644
--- a/block.c
+++ b/block.c
@@ -5293,7 +5293,11 @@ void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
 
 void *qemu_blockalign(BlockDriverState *bs, size_t size)
 {
-    return qemu_memalign(bdrv_opt_mem_align(bs), size);
+    size_t align = bdrv_opt_mem_align(bs);
+    if (align < 4096) {
+        align = 4096;
+    }
+    return qemu_memalign(align, size);
 }
 
 void *qemu_blockalign0(BlockDriverState *bs, size_t size)
@@ -5307,6 +5311,9 @@ void *qemu_try_blockalign(BlockDriverState *bs, size_t size)
 
     /* Ensure that NULL is never returned on success */
     assert(align > 0);
+    if (align < 4096) {
+        align = 4096;
+    }
     if (size == 0) {
         size = align;
     }
-- 
1.9.1

* Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign
  2015-01-29 10:50 ` [Qemu-devel] [PATCH 1/1] " Denis V. Lunev
@ 2015-01-29 10:58   ` Paolo Bonzini
  2015-01-29 13:18   ` Kevin Wolf
  1 sibling, 0 replies; 8+ messages in thread
From: Paolo Bonzini @ 2015-01-29 10:58 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi



Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

* Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign
  2015-01-29 10:50 ` [Qemu-devel] [PATCH 1/1] " Denis V. Lunev
  2015-01-29 10:58   ` Paolo Bonzini
@ 2015-01-29 13:18   ` Kevin Wolf
  2015-01-29 13:49     ` Denis V. Lunev
  1 sibling, 1 reply; 8+ messages in thread
From: Kevin Wolf @ 2015-01-29 13:18 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Paolo Bonzini, qemu-devel, Stefan Hajnoczi

This is the wrong place to make this change. First you're duplicating
logic in the callers of bdrv_opt_mem_align() instead of making it return
the right thing in the first place. Second, you're arguing with numbers
from a simple test case for O_DIRECT on Linux, but you're changing the
alignment for everyone instead of just the raw-posix driver which is
responsible for accessing Linux files.

Also, what's the real reason for the performance improvement? Having
page alignment? If so, actually querying the page size instead of
assuming 4k might be worth a thought.

Kevin

* Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign
  2015-01-29 13:18   ` Kevin Wolf
@ 2015-01-29 13:49     ` Denis V. Lunev
  2015-01-30 18:39       ` Denis V. Lunev
  0 siblings, 1 reply; 8+ messages in thread
From: Denis V. Lunev @ 2015-01-29 13:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel, Stefan Hajnoczi

On 29/01/15 16:18, Kevin Wolf wrote:
> This is the wrong place to make this change. First you're duplicating
> logic in the callers of bdrv_opt_mem_align() instead of making it return
> the right thing in the first place.
This was actually done in the first iteration. bdrv_opt_mem_align
is called in three places:
   qemu_blockalign
   qemu_try_blockalign
   bdrv_qiov_is_aligned
Paolo says that he does not want bdrv_qiov_is_aligned to be affected,
in order to avoid extra bounce buffering.
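
To illustrate why raising the value returned by bdrv_opt_mem_align itself
would cause extra bounce buffering, here is a minimal sketch of an
alignment check over an I/O vector. It is an illustration only, not QEMU's
code: iov_is_aligned is a hypothetical stand-in for the kind of test
bdrv_qiov_is_aligned performs, and it assumes a plain POSIX struct iovec
rather than a QEMUIOVector.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical helper: true only if the base address and the length of
 * every element are multiples of align. */
static bool iov_is_aligned(const struct iovec *iov, int niov, size_t align)
{
    for (int i = 0; i < niov; i++) {
        if ((uintptr_t)iov[i].iov_base % align != 0) {
            return false;
        }
        if (iov[i].iov_len % align != 0) {
            return false;
        }
    }
    return true;
}
```

A guest buffer at a 512-byte boundary passes this check with align = 512
but fails it with align = 4096, and a failed check is what sends the
request through a bounce buffer.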

From my point of view, this extra bounce buffering is better than passing
an unaligned pointer when writing to the disk, as disks with 512/4096
logical/physical sector size are mainstream now. I do not want to argue
this point much, though. Normal guest operations result in page-aligned
requests, so this is not a problem at all. The number of requests from the
guest side that are only 512-byte aligned is negligible.
>   Second, you're arguing with numbers
> from a simple test case for O_DIRECT on Linux, but you're changing the
> alignment for everyone instead of just the raw-posix driver which is
> responsible for accessing Linux files.
This should not be a real problem. We are allocating memory for the
buffer. A slightly stricter alignment is not a big overhead for any libc
implementation, so this kludge will not add any significant overhead.
> Also, what's the real reason for the performance improvement? Having
> page alignment? If so, actually querying the page size instead of
> assuming 4k might be worth a thought.
>
> Kevin
Most likely the problem comes from a read-modify-write pattern
either in the kernel or in the disk. In my experience it is a
bad idea to supply a 512-byte aligned buffer for O_DIRECT I/O.
The ABI technically allows this, but in general it is much less tested.

Yes, this synthetic test shows some difference here. With qemu-io
the result is also visible, but smaller:
   qemu-img create -f qcow2 ./1.img 64G
   qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
performs 1% better.

There is also a similar kludge here:
size_t bdrv_opt_mem_align(BlockDriverState *bs)
{
     if (!bs || !bs->drv) {
         /* 4k should be on the safe side */
         return 4096;
     }

     return bs->bl.opt_mem_alignment;
}
which just uses the constant 4096.

Yes, I agree that querying the page size could be a good idea, but
I do not know at the moment how to do that. Could you please share your
thoughts, if you have any?
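
On POSIX hosts, one way to query the page size is sysconf(_SC_PAGESIZE).
A minimal sketch, assuming a POSIX system; host_page_size and
at_least_page are hypothetical helper names, and the 4096 fallback is an
assumption for the case where sysconf() reports an error:

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: the system page size, or 4096 (an assumed
 * fallback) if sysconf() cannot report it. */
static size_t host_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    return sz > 0 ? (size_t)sz : 4096;
}

/* Hypothetical helper: raise align to at least one page. */
static size_t at_least_page(size_t align)
{
    size_t page = host_page_size();
    return align < page ? page : align;
}
```

Calling at_least_page() on the value used for allocation would replace the
hardcoded 4096 comparison without assuming a particular page size.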

Regards,
     Den

* Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign
  2015-01-29 13:49     ` Denis V. Lunev
@ 2015-01-30 18:39       ` Denis V. Lunev
  2015-01-30 19:48         ` Kevin Wolf
  0 siblings, 1 reply; 8+ messages in thread
From: Denis V. Lunev @ 2015-01-30 18:39 UTC (permalink / raw)
  To: Kevin Wolf, Paolo Bonzini; +Cc: qemu-devel, Stefan Hajnoczi

Paolo, Kevin,

I have spent a bit more time digging into the issue and found some
additional information. The same 5% difference between buffers
aligned to 512 and to 4096 bytes is observed for the following
devices/filesystems:

1) ext4 with block size 1024 on an SSD with 512/512 logical/physical
    sector size
2) ext4 with block size 4096 on an SSD with 512/512 logical/physical
    sector size
3) ext4 with block size 4096 on a rotational disk (WDC WD20EZRX) with
    512/4096 logical/physical sector size
4) with block size 4096 on an SSD with 512/512 logical/physical
    sector size

This means that only the page size (4k) matters.

Guys, you propose quite different approaches. I can extend this patch
to use sysconf(_SC_PAGESIZE) to detect the page size and drop the
hardcoded 4096. This is not a problem. But you have different opinions
about where to insert the check.

Could you please come to an agreement?

The proper defines/configuration work is still to be done; for now I am
trying to agree on the general approach.

Version 1)

diff --git a/block.c b/block.c
index d45e4dd..bc5d1e7 100644
--- a/block.c
+++ b/block.c
@@ -543,7 +543,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
          bs->bl.max_transfer_length = bs->file->bl.max_transfer_length;
          bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
      } else {
-        bs->bl.opt_mem_alignment = 512;
+        bs->bl.opt_mem_alignment = sysconf(_SC_PAGESIZE);
      }
  
      if (bs->backing_hd) {
diff --git a/block/raw-posix.c b/block/raw-posix.c
index ec38fee..d1b3388 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -266,7 +266,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
      if (!s->buf_align) {
          size_t align;
          buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
-        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+        for (align = sysconf(_SC_PAGESIZE); align <= MAX_BLOCKSIZE; align <<= 1) {
              if (pread(fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) {
                  s->buf_align = align;
                  break;
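
The loop touched in raw-posix.c probes alignments in increasing order and
keeps the first one that works. A simplified sketch of that idea, with the
pread() attempt replaced by a caller-supplied callback so that it can run
without an O_DIRECT file; probe_buf_align, try_align_fn and needs_at_least
are hypothetical names, not QEMU functions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if an I/O using a buffer aligned
 * to exactly `align` bytes succeeds. */
typedef bool (*try_align_fn)(size_t align, void *opaque);

/* Smallest power-of-two step in [start, max] that the callback accepts,
 * or 0 if none does. */
static size_t probe_buf_align(size_t start, size_t max,
                              try_align_fn try_io, void *opaque)
{
    for (size_t align = start; align <= max; align <<= 1) {
        if (try_io(align, opaque)) {
            return align;
        }
    }
    return 0;
}

/* Stand-in for a device: accepts any alignment of at least *opaque bytes. */
static bool needs_at_least(size_t align, void *opaque)
{
    return align >= *(size_t *)opaque;
}
```

Starting the loop at the page size instead of 512 means the probe can never
report less than a page, even for a device that would accept 512-byte
aligned buffers.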


Version 2)
diff --git a/block.c b/block.c
index d45e4dd..e2bb3fd 100644
--- a/block.c
+++ b/block.c
@@ -5293,6 +5293,11 @@ void bdrv_set_guest_block_size(BlockDriverState *bs, int align)

  void *qemu_blockalign(BlockDriverState *bs, size_t size)
  {
+    int align = bdrv_opt_mem_align(bs);
+    int page_size = sysconf(_SC_PAGESIZE);
+    if (align < page_size) {
+        align = page_size;
+    }
-    return qemu_memalign(bdrv_opt_mem_align(bs), size);
+    return qemu_memalign(align, size);
  }

@@ -5304,9 +5309,13 @@ void *qemu_blockalign0(BlockDriverState *bs, size_t size)
  void *qemu_try_blockalign(BlockDriverState *bs, size_t size)
  {
      size_t align = bdrv_opt_mem_align(bs);
+    int page_size = sysconf(_SC_PAGESIZE);

      /* Ensure that NULL is never returned on success */
      assert(align > 0);
+    if (align < page_size) {
+        align = page_size;
+    }
      if (size == 0) {
          size = align;
      }

I am totally fine with both versions.

Regards,
     Den

P.S. A slightly improved version of the test is attached.

[-- Attachment #2: 1.c --]
[-- Type: text/x-csrc, Size: 675 bytes --]

#define _GNU_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <malloc.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    void *buf;
    int i = 0, align = atoi(argv[2]);

    do {
        buf = memalign(align, 4096);
        if (align >= 4096)
            break;
        if ((unsigned long)buf & 4095)
            break;
        i++;
    } while (1);
    printf("%d %p\n", i, buf);

    memset(buf, 0x11, 4096);

    for (i = 0; i < 100000; i++) {
        /* lseek() takes the offset first and the whence value second */
        lseek(fd, 4096, SEEK_CUR);
        write(fd, buf, 4096);
    }

    close(fd);
    return 0;
}

* Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign
  2015-01-30 18:39       ` Denis V. Lunev
@ 2015-01-30 19:48         ` Kevin Wolf
  2015-01-30 20:05           ` Denis V. Lunev
  0 siblings, 1 reply; 8+ messages in thread
From: Kevin Wolf @ 2015-01-30 19:48 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Paolo Bonzini, qemu-devel, Stefan Hajnoczi

On 30.01.2015 at 19:39, Denis V. Lunev wrote:
> Paolo, Kevin,
> 
> I have spent a bit more time digging into the issue and found some
> additional information. The same 5% difference between buffers
> aligned to 512 and to 4096 bytes is observed for the following
> devices/filesystems:
> 
> 1) ext4 with block size 1024 on an SSD with 512/512 logical/physical
>    sector size
> 2) ext4 with block size 4096 on an SSD with 512/512 logical/physical
>    sector size
> 3) ext4 with block size 4096 on a rotational disk (WDC WD20EZRX) with
>    512/4096 logical/physical sector size
> 4) with block size 4096 on an SSD with 512/512 logical/physical
>    sector size
> 
> This means that only the page size (4k) matters.
> 
> Guys, you propose quite different approaches. I can extend this patch
> to use sysconf(_SC_PAGESIZE) to detect the page size and drop the
> hardcoded 4096. This is not a problem. But you have different opinions
> about where to insert the check.
> 
> Could you please come to an agreement?

I agree that Paolo has made a good point. Using a bounce buffer in this
case is not what we want; it would very likely degrade performance
instead of improving it.

I'm not completely sure about the conclusion yet, but it might be that
what we need is separate min_mem_alignment (which is what causes usage
of a bounce buffer) and opt_mem_alignment (which is what is used when we
allocate a buffer anyway). In typical configurations, min would be 512
and opt 4096.

Kevin
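
A minimal sketch of the split described above, assuming a hypothetical
`struct mem_limits` that carries both values. The struct and the helper
names are illustrative only and are not existing QEMU code; only the two
field names come from the mail above. The minimum alignment decides
whether a caller-supplied buffer has to be bounced, while the optimal
alignment is applied only to buffers that are allocated anyway.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical container for the two limits; not a QEMU definition. */
struct mem_limits {
    size_t min_mem_alignment;  /* below this, the buffer must be bounced */
    size_t opt_mem_alignment;  /* preferred alignment for our own buffers */
};

/* A caller-supplied buffer needs a bounce buffer only when it violates
 * the minimum alignment, so 512-aligned guest requests pass through. */
static bool needs_bounce_buffer(const struct mem_limits *l, const void *buf)
{
    assert(l->min_mem_alignment > 0);
    return ((uintptr_t)buf % l->min_mem_alignment) != 0;
}

/* Buffers we allocate ourselves use the optimal alignment.
 * posix_memalign() requires a power-of-two multiple of sizeof(void *). */
static void *alloc_io_buffer(const struct mem_limits *l, size_t size)
{
    void *p = NULL;

    if (posix_memalign(&p, l->opt_mem_alignment, size) != 0) {
        return NULL;
    }
    return p;
}
```

With min = 512 and opt = 4096, a guest buffer aligned to 512 bytes is
used directly, and any buffer obtained from alloc_io_buffer() is page
aligned on systems with 4k pages.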


* Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign
  2015-01-30 19:48         ` Kevin Wolf
@ 2015-01-30 20:05           ` Denis V. Lunev
  0 siblings, 0 replies; 8+ messages in thread
From: Denis V. Lunev @ 2015-01-30 20:05 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel, Stefan Hajnoczi

On 30/01/15 22:48, Kevin Wolf wrote:
> On 30.01.2015 at 19:39, Denis V. Lunev wrote:
>> On 29/01/15 16:49, Denis V. Lunev wrote:
>>> On 29/01/15 16:18, Kevin Wolf wrote:
>>>> On 29.01.2015 at 11:50, Denis V. Lunev wrote:
>>>>> The following sequence
>>>>>      int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
>>>>>      for (i = 0; i < 100000; i++)
>>>>>              write(fd, buf, 4096);
>>>>> performs 5% better if buf is aligned to 4096 bytes rather than to
>>>>> 512 bytes on HDD with 512/4096 logical/physical sector size.
>>>>>
>>>>> The difference is quite reliable.
>>>>>
>>>>> On the other hand, we do not want at the moment to enforce bounce
>>>>> buffering if the guest request is aligned to 512 bytes. This patch
>>>>> forces page alignment only when we are really forced to perform a
>>>>> memory allocation.
>>>>>
>>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>>> CC: Paolo Bonzini <pbonzini@redhat.com>
>>>>> CC: Kevin Wolf <kwolf@redhat.com>
>>>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>>>> ---
>>>>>   block.c | 9 ++++++++-
>>>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/block.c b/block.c
>>>>> index d45e4dd..38cf73f 100644
>>>>> --- a/block.c
>>>>> +++ b/block.c
>>>>> @@ -5293,7 +5293,11 @@ void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
>>>>>     void *qemu_blockalign(BlockDriverState *bs, size_t size)
>>>>>   {
>>>>> -    return qemu_memalign(bdrv_opt_mem_align(bs), size);
>>>>> +    size_t align = bdrv_opt_mem_align(bs);
>>>>> +    if (align < 4096) {
>>>>> +        align = 4096;
>>>>> +    }
>>>>> +    return qemu_memalign(align, size);
>>>>>   }
>>>>>     void *qemu_blockalign0(BlockDriverState *bs, size_t size)
>>>>> @@ -5307,6 +5311,9 @@ void *qemu_try_blockalign(BlockDriverState *bs, size_t size)
>>>>>         /* Ensure that NULL is never returned on success */
>>>>>       assert(align > 0);
>>>>> +    if (align < 4096) {
>>>>> +        align = 4096;
>>>>> +    }
>>>>>       if (size == 0) {
>>>>>           size = align;
>>>>>       }
>>>> This is the wrong place to make this change. First you're duplicating
>>>> logic in the callers of bdrv_opt_mem_align() instead of making it return
>>>> the right thing in the first place.
>>> This was actually done in the first iteration. bdrv_opt_mem_align
>>> is called in three places:
>>>   qemu_blockalign
>>>   qemu_try_blockalign
>>>   bdrv_qiov_is_aligned
>>> Paolo says that he does not want to have bdrv_qiov_is_aligned affected
>>> to avoid extra bounce buffering.
>>>
>>> From my point of view this extra bounce buffering is better than an
>>> unaligned pointer during a write to the disk, as disks with 512/4096
>>> logical/physical sector sizes are mainstream now. Though I don't
>>> want to argue about this here.
>>> Normal guest operations result in page-aligned requests and this is not
>>> a problem at all. The number of 512-aligned requests from the guest side
>>> is quite negligible.
>>>> Second, you're arguing with numbers
>>>> from a simple test case for O_DIRECT on Linux, but you're changing the
>>>> alignment for everyone instead of just the raw-posix driver which is
>>>> responsible for accessing Linux files.
>>> This should not be a real problem. We are allocating memory for the
>>> buffer. A slightly stricter alignment is not a big overhead for any
>>> libc implementation, so this kludge will not produce any significant
>>> overhead.
>>>> Also, what's the real reason for the performance improvement? Having
>>>> page alignment? If so, actually querying the page size instead of
>>>> assuming 4k might be worth a thought.
>>>>
>>>> Kevin
>>> Most likely the problem comes from a read-modify-write pattern
>>> either in the kernel or in the disk. Actually, my experience says that it
>>> is a bad idea to supply a 512-byte-aligned buffer for O_DIRECT I/O.
>>> The ABI technically allows this, but in general it is much less tested.
>>>
>>> Yes, this synthetic test shows some difference here. With qemu-io the
>>> difference is also visible, but smaller:
>>>   qemu-img create -f qcow2 ./1.img 64G
>>>   qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
>>> performs 1% better.
>>>
>>> There is also a similar kludge here:
>>> size_t bdrv_opt_mem_align(BlockDriverState *bs)
>>> {
>>>     if (!bs || !bs->drv) {
>>>         /* 4k should be on the safe side */
>>>         return 4096;
>>>     }
>>>
>>>     return bs->bl.opt_mem_alignment;
>>> }
>>> which just uses the 4096 constant.
>>>
>>> Yes, I agree that querying the page size could be a good idea, but
>>> I do not know at the moment how to do that. Could you please share your
>>> opinion if you have one?
>>>
>>> Regards,
>>>     Den
>> Paolo, Kevin,
>>
>> I have spent a bit more time digging into the issue and found some
>> additional information. The same 5% difference between 512-byte and
>> 4096-byte buffer alignment is observed for the following
>> devices/filesystems:
>>
>> 1) ext4 with block size equal to 1024 over a 512/512 logical/physical
>>     sector size SSD disk
>> 2) ext4 with block size equal to 4096 over a 512/512 logical/physical
>>     sector size SSD disk
>> 3) ext4 with block size equal to 4096 over a 512/4096 logical/physical
>>     sector size rotational disk (WDC WD20EZRX)
>> 4) with block size equal to 4096 over a 512/512 logical/physical
>>     sector size SSD disk
>>
>> This means that only the page size (4k) matters.
>>
>> Guys, you propose quite different approaches. I can extend this patch
>> to use sysconf(_SC_PAGESIZE) to detect the page size and drop the
>> hardcoded 4096. This is not a problem. But you have different opinions
>> about the place to insert the check.
>>
>> Could you please come to an agreement?
> I agree that Paolo has made a good point. Using a bounce buffer in this
> case is not what we want; it would very likely degrade performance
> instead of improving it.
>
> I'm not completely sure about the conclusion yet, but it might be that
> what we need is separate min_mem_alignment (which is what causes usage
> of a bounce buffer) and opt_mem_alignment (which is what is used when we
> allocate a buffer anyway). In typical configurations, min would be 512
> and opt 4096.
>
> Kevin
OK, this sounds reasonable enough.

I'll send an updated version on Monday. Also I will try to check older
kernels to extend the coverage.
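
For reference, a rough sketch of how the hardcoded 4096 could be replaced
by a page size queried with sysconf(_SC_PAGESIZE), as mentioned earlier in
the thread. The helper name effective_alignment() is hypothetical and this
is not the updated patch; it only illustrates clamping the driver's
preferred alignment to at least one page, with 4k kept as a fallback in
case the query fails.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: return the larger of the driver's preferred
 * alignment and the system page size. */
static size_t effective_alignment(size_t driver_align)
{
    long page = sysconf(_SC_PAGESIZE);
    /* Fall back to 4k if the page size cannot be queried. */
    size_t page_size = page > 0 ? (size_t)page : 4096;

    /* Mirrors the existing check that the alignment is never zero. */
    assert(driver_align > 0);

    return driver_align < page_size ? page_size : driver_align;
}
```

A caller in the style of qemu_blockalign() would then pass
effective_alignment(bdrv_opt_mem_align(bs)) to the allocator instead of
comparing against a literal 4096.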

Thread overview: 8+ messages
2015-01-29 10:50 [Qemu-devel] [PATCH v2 0/1] block: enforce minimal 4096 alignment in qemu_blockalign Denis V. Lunev
2015-01-29 10:50 ` [Qemu-devel] [PATCH 1/1] " Denis V. Lunev
2015-01-29 10:58   ` Paolo Bonzini
2015-01-29 13:18   ` Kevin Wolf
2015-01-29 13:49     ` Denis V. Lunev
2015-01-30 18:39       ` Denis V. Lunev
2015-01-30 19:48         ` Kevin Wolf
2015-01-30 20:05           ` Denis V. Lunev
