From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:49658) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YHGUH-0006GG-OX for qemu-devel@nongnu.org; Fri, 30 Jan 2015 13:40:15 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YHGUD-0007oe-J5 for qemu-devel@nongnu.org; Fri, 30 Jan 2015 13:40:13 -0500 Received: from mx2.parallels.com ([199.115.105.18]:35599) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YHGUD-0007Eu-BH for qemu-devel@nongnu.org; Fri, 30 Jan 2015 13:40:09 -0500 Message-ID: <54CBCFCF.7090006@parallels.com> Date: Fri, 30 Jan 2015 21:39:11 +0300 From: "Denis V. Lunev" MIME-Version: 1.0 References: <1422528659-3121-1-git-send-email-den@openvz.org> <1422528659-3121-2-git-send-email-den@openvz.org> <20150129131848.GA3950@noname.redhat.com> <54CA3A7A.8090208@openvz.org> In-Reply-To: <54CA3A7A.8090208@openvz.org> Content-Type: multipart/mixed; boundary="------------010909030408090301010202" Subject: Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Kevin Wolf , Paolo Bonzini Cc: qemu-devel@nongnu.org, Stefan Hajnoczi --------------010909030408090301010202 Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit On 29/01/15 16:49, Denis V. Lunev wrote: > On 29/01/15 16:18, Kevin Wolf wrote: >> Am 29.01.2015 um 11:50 hat Denis V. Lunev geschrieben: >>> The following sequence >>> int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); >>> for (i = 0; i < 100000; i++) >>> write(fd, buf, 4096); >>> performs 5% better if buf is aligned to 4096 bytes rather then to >>> 512 bytes on HDD with 512/4096 logical/physical sector size. >>> >>> The difference is quite reliable. >>> >>> On the other hand we do not want at the moment to enforce bounce >>> buffering if guest request is aligned to 512 bytes. This patch >>> forces page alignment when we really forced to perform memory >>> allocation. >>> >>> Signed-off-by: Denis V. Lunev >>> CC: Paolo Bonzini >>> CC: Kevin Wolf >>> CC: Stefan Hajnoczi >>> --- >>> block.c | 9 ++++++++- >>> 1 file changed, 8 insertions(+), 1 deletion(-) >>> >>> diff --git a/block.c b/block.c >>> index d45e4dd..38cf73f 100644 >>> --- a/block.c >>> +++ b/block.c >>> @@ -5293,7 +5293,11 @@ void >>> bdrv_set_guest_block_size(BlockDriverState *bs, int align) >>> void *qemu_blockalign(BlockDriverState *bs, size_t size) >>> { >>> - return qemu_memalign(bdrv_opt_mem_align(bs), size); >>> + size_t align = bdrv_opt_mem_align(bs); >>> + if (align < 4096) { >>> + align = 4096; >>> + } >>> + return qemu_memalign(align, size); >>> } >>> void *qemu_blockalign0(BlockDriverState *bs, size_t size) >>> @@ -5307,6 +5311,9 @@ void *qemu_try_blockalign(BlockDriverState >>> *bs, size_t size) >>> /* Ensure that NULL is never returned on success */ >>> assert(align > 0); >>> + if (align < 4096) { >>> + align = 4096; >>> + } >>> if (size == 0) { >>> size = align; >>> } >> This is the wrong place to make this change. First you're duplicating >> logic in the callers of bdrv_opt_mem_align() instead of making it return >> the right thing in the first place. > This has been actually done in the first iteration. bdrv_opt_mem_align > is called actually three times in: > qemu_blockalign > qemu_try_blockalign > bdrv_qiov_is_aligned > Paolo says that he does not want to have bdrv_qiov_is_aligned affected > to avoid extra bounce buffering. > > From my point of view this extra bounce buffering is better than > unaligned > pointer during write to the disk as 512/4096 logical/physical sectors > size > disks are mainstream now. Though I don't want to specially argue here. > Normal guest operations results in page aligned requests and this is not > a problem at all. The amount of 512 aligned requests from guest side is > quite negligible. >> Second, you're arguing with numbers >> from a simple test case for O_DIRECT on Linux, but you're changing the >> alignment for everyone instead of just the raw-posix driver which is >> responsible for accessing Linux files. > This should not be a real problem. We are allocation memory for the > buffer. A little bit stricter alignment is not a big overhead for any > libc > implementation thus this kludge will not produce any significant > overhead. >> Also, what's the real reason for the performance improvement? Having >> page alignment? If so, actually querying the page size instead of >> assuming 4k might be worth a thought. >> >> Kevin > Most likely the problem comes from the read-modify-write pattern > either in kernel or in disk. Actually my experience says that it is a > bad idea to supply 512 byte aligned buffer for O_DIRECT IO. > ABI technically allows this but in general it is much less tested. > > Yes, this synthetic test shows some difference here. In terms of > qemu-io the result is also visible, but less > qemu-img create -f qcow2 ./1.img 64G > qemu-io -n -c 'write -P 0xaa 0 1G' 1.img > performs 1% better. > > There is also similar kludge here > size_t bdrv_opt_mem_align(BlockDriverState *bs) > { > if (!bs || !bs->drv) { > /* 4k should be on the safe side */ > return 4096; > } > > return bs->bl.opt_mem_alignment; > } > which just uses 4096 constant. > > Yes, I could agree that queering page size could be a good idea, but > I do not know at the moment how to do that. Can you pls share your > opinion if you have any. > > Regards, > Den Paolo, Kevin, I have spent a bit more time digging the issue and found some additional information. The same 5% difference if the buffer is aligned to 512/4096 is observed for the following devices/filesystems 1) ext4 with block size equals to 1024 over 512/512 physical/logical sector size SSD disk 2) ext4 with block size equals to 4096 over 512/512 physical/logical sector size SSD disk 3) ext4 with block size equals to 4096 over 512/4096 physical/logical sector size rotational disk (WDC WD20EZRX) 4) with block size equals to 4096 over 512/512 physical/logical sector size SSD disk This means that only page size (4k) matters. Guys, you propose quite different approaches. I can extend this patch to use sysconf(_SC_PAGESIZE) to detect page size and drop hardcoded 4096. This is not a problem. But you have different opinion about the place to insert the check. Could you please come into agreement? Proper defines/configuration work to be done, I am trying to negotiate principal approach. Version 1) diff --git a/block.c b/block.c index d45e4dd..bc5d1e7 100644 --- a/block.c +++ b/block.c @@ -543,7 +543,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp) bs->bl.max_transfer_length = bs->file->bl.max_transfer_length; bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment; } else { - bs->bl.opt_mem_alignment = 512; + bs->bl.opt_mem_alignment = sysconf(_SC_PAGESIZE); } if (bs->backing_hd) { diff --git a/block/raw-posix.c b/block/raw-posix.c index ec38fee..d1b3388 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -266,7 +266,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) if (!s->buf_align) { size_t align; buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE); - for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) { + for (align = sysconf(_SC_PAGESIZE); align <= MAX_BLOCKSIZE; align <<= 1) { if (pread(fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) { s->buf_align = align; break; Version 2) diff --git a/block.c b/block.c index d45e4dd..e2bb3fd 100644 --- a/block.c +++ b/block.c @@ -5293,6 +5293,11 @@ void bdrv_set_guest_block_size(BlockDriverState *bs, int align) void *qemu_blockalign(BlockDriverState *bs, size_t size) { + int align = bdrv_opt_mem_align(bs); + int page_size = sysconf(_SC_PAGESIZE); + if (align < page_size) { + align = page_size; + } return qemu_memalign(bdrv_opt_mem_align(bs), size); } @@ -5304,9 +5309,13 @@ void *qemu_blockalign0(BlockDriverState *bs, size_t size) void *qemu_try_blockalign(BlockDriverState *bs, size_t size) { size_t align = bdrv_opt_mem_align(bs); + int page_size = sysconf(_SC_PAGESIZE); /* Ensure that NULL is never returned on success */ assert(align > 0); + if (align < page_size) { + align = page_size; + } if (size == 0) { size = align; } I am totally fine with both versions. Regards, Den P.S. A bit improved version of test is attached. --------------010909030408090301010202 Content-Type: text/x-csrc; name="1.c" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="1.c" #define _GNU_SOURCE #include #include #include #include #include #include int main(int argc, char *argv[]) { int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); void *buf; int i = 0, align = atoi(argv[2]); do { buf = memalign(align, 4096); if (align >= 4096) break; if ((unsigned long)buf & 4095) break; i++; } while (1); printf("%d %p\n", i, buf); memset(buf, 0x11, 4096); for (i = 0; i < 100000; i++) { lseek(fd, SEEK_CUR, 4096); write(fd, buf, 4096); } close(fd); return 0; } --------------010909030408090301010202--