From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:40442)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <den@parallels.com>) id 1YHHox-0008A6-6G
	for qemu-devel@nongnu.org; Fri, 30 Jan 2015 15:05:40 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <den@parallels.com>) id 1YHHos-0004EU-3p
	for qemu-devel@nongnu.org; Fri, 30 Jan 2015 15:05:39 -0500
Received: from mx2.parallels.com ([199.115.105.18]:46957)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <den@parallels.com>) id 1YHHor-0004EC-J3
	for qemu-devel@nongnu.org; Fri, 30 Jan 2015 15:05:33 -0500
Message-ID: <54CBE403.6070703@parallels.com>
Date: Fri, 30 Jan 2015 23:05:23 +0300
From: "Denis V. Lunev" <den@parallels.com>
MIME-Version: 1.0
References: <1422528659-3121-1-git-send-email-den@openvz.org>
	<1422528659-3121-2-git-send-email-den@openvz.org>
	<20150129131848.GA3950@noname.redhat.com>
	<54CA3A7A.8090208@openvz.org> <54CBCFCF.7090006@parallels.com>
	<20150130194817.GF24537@noname.redhat.com>
In-Reply-To: <20150130194817.GF24537@noname.redhat.com>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment
	in qemu_blockalign
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>, qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>

On 30/01/15 22:48, Kevin Wolf wrote:
> On 30.01.2015 at 19:39, Denis V. Lunev wrote:
>> On 29/01/15 16:49, Denis V. Lunev wrote:
>>> On 29/01/15 16:18, Kevin Wolf wrote:
>>>> On 29.01.2015 at 11:50, Denis V. Lunev wrote:
>>>>> The following sequence
>>>>>      int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
>>>>>      for (i = 0; i < 100000; i++)
>>>>>              write(fd, buf, 4096);
>>>>> performs 5% better if buf is aligned to 4096 bytes rather than to
>>>>> 512 bytes on an HDD with 512/4096 logical/physical sector size.
>>>>>
>>>>> The difference is quite reliable.
>>>>>
>>>>> On the other hand we do not want at the moment to enforce bounce
>>>>> buffering if guest request is aligned to 512 bytes. This patch
>>>>> forces page alignment when we are really forced to perform memory
>>>>> allocation.
>>>>>
>>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>>> CC: Paolo Bonzini <pbonzini@redhat.com>
>>>>> CC: Kevin Wolf <kwolf@redhat.com>
>>>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>>>> ---
>>>>>   block.c | 9 ++++++++-
>>>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/block.c b/block.c
>>>>> index d45e4dd..38cf73f 100644
>>>>> --- a/block.c
>>>>> +++ b/block.c
>>>>> @@ -5293,7 +5293,11 @@ void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
>>>>>     void *qemu_blockalign(BlockDriverState *bs, size_t size)
>>>>>   {
>>>>> -    return qemu_memalign(bdrv_opt_mem_align(bs), size);
>>>>> +    size_t align = bdrv_opt_mem_align(bs);
>>>>> +    if (align < 4096) {
>>>>> +        align = 4096;
>>>>> +    }
>>>>> +    return qemu_memalign(align, size);
>>>>>   }
>>>>>     void *qemu_blockalign0(BlockDriverState *bs, size_t size)
>>>>> @@ -5307,6 +5311,9 @@ void *qemu_try_blockalign(BlockDriverState *bs, size_t size)
>>>>>         /* Ensure that NULL is never returned on success */
>>>>>       assert(align > 0);
>>>>> +    if (align < 4096) {
>>>>> +        align = 4096;
>>>>> +    }
>>>>>       if (size == 0) {
>>>>>           size = align;
>>>>>       }
>>>> This is the wrong place to make this change. First you're duplicating
>>>> logic in the callers of bdrv_opt_mem_align() instead of making it return
>>>> the right thing in the first place.
>>> This was actually done in the first iteration. bdrv_opt_mem_align
>>> is called in three places:
>>>   qemu_blockalign
>>>   qemu_try_blockalign
>>>   bdrv_qiov_is_aligned
>>> Paolo says that he does not want to have bdrv_qiov_is_aligned affected
>>> to avoid extra bounce buffering.
>>>
>>> From my point of view this extra bounce buffering is better than an
>>> unaligned pointer during a write to the disk, as disks with 512/4096
>>> logical/physical sector sizes are mainstream now. Though I don't want
>>> to argue this point specifically. Normal guest operations result in
>>> page-aligned requests and this is not a problem at all. The number of
>>> 512-byte-aligned requests from the guest side is negligible.
>>>> Second, you're arguing with numbers
>>>> from a simple test case for O_DIRECT on Linux, but you're changing the
>>>> alignment for everyone instead of just the raw-posix driver which is
>>>> responsible for accessing Linux files.
>>> This should not be a real problem. We are allocating memory for the
>>> buffer. A slightly stricter alignment is not a big overhead for any
>>> libc implementation, so this kludge will not produce any significant
>>> overhead.
>>>> Also, what's the real reason for the performance improvement? Having
>>>> page alignment? If so, actually querying the page size instead of
>>>> assuming 4k might be worth a thought.
>>>>
>>>> Kevin
>>> Most likely the problem comes from a read-modify-write pattern
>>> either in the kernel or in the disk. Actually, my experience says that
>>> it is a bad idea to supply a 512-byte-aligned buffer for O_DIRECT IO.
>>> The ABI technically allows this, but in general it is much less tested.
>>>
>>> Yes, this synthetic test shows some difference here. With qemu-io
>>> the difference is also visible, but smaller:
>>>   qemu-img create -f qcow2 ./1.img 64G
>>>   qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
>>> performs 1% better.
>>>
>>> There is also a similar kludge here:
>>> size_t bdrv_opt_mem_align(BlockDriverState *bs)
>>> {
>>>     if (!bs || !bs->drv) {
>>>         /* 4k should be on the safe side */
>>>         return 4096;
>>>     }
>>>
>>>     return bs->bl.opt_mem_alignment;
>>> }
>>> which just uses the constant 4096.
>>>
>>> Yes, I agree that querying the page size could be a good idea, but
>>> I do not know at the moment how to do that. Could you please share
>>> your opinion if you have one?
>>>
>>> Regards,
>>>     Den
>> Paolo, Kevin,
>>
>> I have spent a bit more time digging into the issue and found some
>> additional information. The same 5% difference between 512-byte and
>> 4096-byte buffer alignment is observed for the following
>> devices/filesystems:
>>
>> 1) ext4 with block size equal to 1024 over a 512/512 logical/physical
>>     sector size SSD disk
>> 2) ext4 with block size equal to 4096 over a 512/512 logical/physical
>>     sector size SSD disk
>> 3) ext4 with block size equal to 4096 over a 512/4096 logical/physical
>>     sector size rotational disk (WDC WD20EZRX)
>> 4) with block size equal to 4096 over a 512/512 logical/physical
>>     sector size SSD disk
>>
>> This means that only page size (4k) matters.
>>
>> Guys, you propose quite different approaches. I can extend this patch
>> to use sysconf(_SC_PAGESIZE) to detect the page size and drop the
>> hardcoded 4096. This is not a problem. But you have different opinions
>> about the place to insert the check.
>>
>> Could you please come to an agreement?
> I agree that Paolo has made a good point. Using a bounce buffer in this
> case is not what we want, it would very likely degrade performance
> instead of improving it.
>
> I'm not completely sure about the conclusion yet, but it might be that
> what we need is separate min_mem_alignment (which is what causes usage
> of a bounce buffer) and opt_mem_alignment (which is what is used when we
> allocate a buffer anyway). In typical configurations, min would be 512
> and opt 4096.
>
> Kevin
ok, this sounds reasonable enough.
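
To make sure I understand the proposal, here is a minimal sketch of
the min/opt split. It is not the actual QEMU code: struct mem_limits,
buf_is_aligned() and alloc_aligned() are hypothetical names used only
for illustration, and posix_memalign() stands in for qemu_memalign().
The minimum decides whether a guest buffer needs bouncing, the optimum
is used only when we allocate a buffer ourselves.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two limits discussed above. */
struct mem_limits {
    size_t min_mem_alignment;  /* below this, a bounce buffer is required */
    size_t opt_mem_alignment;  /* used when we allocate a buffer ourselves */
};

/* A guest buffer can be used directly if it satisfies the minimum only. */
static bool buf_is_aligned(const struct mem_limits *l,
                           const void *buf, size_t len)
{
    assert(l->min_mem_alignment > 0);
    return ((uintptr_t)buf % l->min_mem_alignment) == 0 &&
           (len % l->min_mem_alignment) == 0;
}

/* Our own allocations use the optimal alignment; NULL on failure. */
static void *alloc_aligned(const struct mem_limits *l, size_t size)
{
    void *p = NULL;

    assert(l->opt_mem_alignment > 0);
    if (size == 0) {
        size = l->opt_mem_alignment;  /* never return NULL on success */
    }
    if (posix_memalign(&p, l->opt_mem_alignment, size) != 0) {
        return NULL;
    }
    return p;
}
```

With min = 512 and opt = 4096, a 512-byte-aligned guest request still
passes buf_is_aligned() and is not bounced, while every buffer from
alloc_aligned() is page aligned.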

I'll send an updated version on Monday. Also I will try to check older
kernels to extend the coverage.
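
For the page size part, something along these lines is what I have in
mind. This is only a sketch: clamp_to_page_size() is a hypothetical
helper name, and the fallback to 4096 when sysconf() fails is my
assumption, not something we have agreed on.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: raise `align` to at least the system page size. */
static size_t clamp_to_page_size(size_t align)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(align > 0);
    if (page <= 0) {
        page = 4096;  /* assumed fallback: the constant the patch hard-codes */
    }
    return align < (size_t)page ? (size_t)page : align;
}
```

A driver reporting 512 would then get the page size back, and a driver
reporting something larger than a page keeps its own value.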