From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:40442) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YHHox-0008A6-6G for qemu-devel@nongnu.org; Fri, 30 Jan 2015 15:05:40 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YHHos-0004EU-3p for qemu-devel@nongnu.org; Fri, 30 Jan 2015 15:05:39 -0500 Received: from mx2.parallels.com ([199.115.105.18]:46957) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YHHor-0004EC-J3 for qemu-devel@nongnu.org; Fri, 30 Jan 2015 15:05:33 -0500 Message-ID: <54CBE403.6070703@parallels.com> Date: Fri, 30 Jan 2015 23:05:23 +0300 From: "Denis V. Lunev" MIME-Version: 1.0 References: <1422528659-3121-1-git-send-email-den@openvz.org> <1422528659-3121-2-git-send-email-den@openvz.org> <20150129131848.GA3950@noname.redhat.com> <54CA3A7A.8090208@openvz.org> <54CBCFCF.7090006@parallels.com> <20150130194817.GF24537@noname.redhat.com> In-Reply-To: <20150130194817.GF24537@noname.redhat.com> Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Kevin Wolf Cc: Paolo Bonzini , qemu-devel@nongnu.org, Stefan Hajnoczi On 30/01/15 22:48, Kevin Wolf wrote: > Am 30.01.2015 um 19:39 hat Denis V. Lunev geschrieben: >> On 29/01/15 16:49, Denis V. Lunev wrote: >>> On 29/01/15 16:18, Kevin Wolf wrote: >>>> Am 29.01.2015 um 11:50 hat Denis V. Lunev geschrieben: >>>>> The following sequence >>>>> int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); >>>>> for (i = 0; i < 100000; i++) >>>>> write(fd, buf, 4096); >>>>> performs 5% better if buf is aligned to 4096 bytes rather then to >>>>> 512 bytes on HDD with 512/4096 logical/physical sector size. >>>>> >>>>> The difference is quite reliable. >>>>> >>>>> On the other hand we do not want at the moment to enforce bounce >>>>> buffering if guest request is aligned to 512 bytes. This patch >>>>> forces page alignment when we really forced to perform memory >>>>> allocation. >>>>> >>>>> Signed-off-by: Denis V. Lunev >>>>> CC: Paolo Bonzini >>>>> CC: Kevin Wolf >>>>> CC: Stefan Hajnoczi >>>>> --- >>>>> block.c | 9 ++++++++- >>>>> 1 file changed, 8 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/block.c b/block.c >>>>> index d45e4dd..38cf73f 100644 >>>>> --- a/block.c >>>>> +++ b/block.c >>>>> @@ -5293,7 +5293,11 @@ void >>>>> bdrv_set_guest_block_size(BlockDriverState *bs, int align) >>>>> void *qemu_blockalign(BlockDriverState *bs, size_t size) >>>>> { >>>>> - return qemu_memalign(bdrv_opt_mem_align(bs), size); >>>>> + size_t align = bdrv_opt_mem_align(bs); >>>>> + if (align < 4096) { >>>>> + align = 4096; >>>>> + } >>>>> + return qemu_memalign(align, size); >>>>> } >>>>> void *qemu_blockalign0(BlockDriverState *bs, size_t size) >>>>> @@ -5307,6 +5311,9 @@ void >>>>> *qemu_try_blockalign(BlockDriverState *bs, size_t size) >>>>> /* Ensure that NULL is never returned on success */ >>>>> assert(align > 0); >>>>> + if (align < 4096) { >>>>> + align = 4096; >>>>> + } >>>>> if (size == 0) { >>>>> size = align; >>>>> } >>>> This is the wrong place to make this change. First you're duplicating >>>> logic in the callers of bdrv_opt_mem_align() instead of making it return >>>> the right thing in the first place. >>> This has been actually done in the first iteration. bdrv_opt_mem_align >>> is called actually three times in: >>> qemu_blockalign >>> qemu_try_blockalign >>> bdrv_qiov_is_aligned >>> Paolo says that he does not want to have bdrv_qiov_is_aligned affected >>> to avoid extra bounce buffering. >>> >> >From my point of view this extra bounce buffering is better than >>> unaligned >>> pointer during write to the disk as 512/4096 logical/physical >>> sectors size >>> disks are mainstream now. Though I don't want to specially argue here. >>> Normal guest operations results in page aligned requests and this is not >>> a problem at all. The amount of 512 aligned requests from guest side is >>> quite negligible. >>>> Second, you're arguing with numbers >>> >from a simple test case for O_DIRECT on Linux, but you're changing the >>>> alignment for everyone instead of just the raw-posix driver which is >>>> responsible for accessing Linux files. >>> This should not be a real problem. We are allocation memory for the >>> buffer. A little bit stricter alignment is not a big overhead for >>> any libc >>> implementation thus this kludge will not produce any significant >>> overhead. >>>> Also, what's the real reason for the performance improvement? Having >>>> page alignment? If so, actually querying the page size instead of >>>> assuming 4k might be worth a thought. >>>> >>>> Kevin >>> Most likely the problem comes from the read-modify-write pattern >>> either in kernel or in disk. Actually my experience says that it is a >>> bad idea to supply 512 byte aligned buffer for O_DIRECT IO. >>> ABI technically allows this but in general it is much less tested. >>> >>> Yes, this synthetic test shows some difference here. In terms of >>> qemu-io the result is also visible, but less >>> qemu-img create -f qcow2 ./1.img 64G >>> qemu-io -n -c 'write -P 0xaa 0 1G' 1.img >>> performs 1% better. >>> >>> There is also similar kludge here >>> size_t bdrv_opt_mem_align(BlockDriverState *bs) >>> { >>> if (!bs || !bs->drv) { >>> /* 4k should be on the safe side */ >>> return 4096; >>> } >>> >>> return bs->bl.opt_mem_alignment; >>> } >>> which just uses 4096 constant. >>> >>> Yes, I could agree that queering page size could be a good idea, but >>> I do not know at the moment how to do that. Can you pls share your >>> opinion if you have any. >>> >>> Regards, >>> Den >> Paolo, Kevin, >> >> I have spent a bit more time digging the issue and found some >> additional information. The same 5% difference if the buffer is >> aligned to 512/4096 is observed for the following devices/filesystems >> >> 1) ext4 with block size equals to 1024 over 512/512 physical/logical >> sector size SSD disk >> 2) ext4 with block size equals to 4096 over 512/512 physical/logical >> sector size SSD disk >> 3) ext4 with block size equals to 4096 over 512/4096 physical/logical >> sector size rotational disk (WDC WD20EZRX) >> 4) with block size equals to 4096 over 512/512 physical/logical >> sector size SSD disk >> >> This means that only page size (4k) matters. >> >> Guys, you propose quite different approaches. I can extend this patch >> to use sysconf(_SC_PAGESIZE) to detect page size and drop hardcoded >> 4096. This is not a problem. But you have different opinion about >> the place to insert the check. >> >> Could you please come into agreement? > I agree that Paolo has made a good point. Using a bounce buffer in this > case is not what we want, it would very likely degrade performance > instead of improving it. > > I'm not completely sure about the conclusion yet, but it might be that > what we need is separate min_mem_alignment (which is what causes usage > of a bounce buffer) and opt_mem_alignment (which is what is used when we > allocate a buffer anyway). In typical configurations, min would be 512 > and opt 4096. > > Kevin ok, this sounds reasonable enough. I'll send an updated version on Monday. Also I will try to check older kernels to extend the coverage.