Date: Tue, 12 May 2015 13:19:10 +0300
From: "Denis V. Lunev"
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH v5 0/2] block: enforce minimal 4096 alignment in qemu_blockalign
To: Stefan Hajnoczi
Cc: Dmitry Monakhov, Stefan Hajnoczi, qemu-devel@nongnu.org, qemu-block@nongnu.org, Paolo Bonzini

On 12/05/15 13:01, Stefan Hajnoczi wrote:
> On Mon, May 11, 2015 at 07:47:41PM +0300, Denis V. Lunev wrote:
>> On 11/05/15 19:07, Denis V. Lunev wrote:
>>> On 11/05/15 18:08, Stefan Hajnoczi wrote:
>>>> On Mon, May 04, 2015 at 04:42:22PM +0300, Denis V. Lunev wrote:
>>>>> The difference is quite reliable and the same 5%.
>>>>> qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
>>>>> for an image in qcow2 format is 1% faster.
>>>> I looked a little at the qemu-io invocation but am not clear why there
>>>> would be a measurable performance difference. Can you explain?
>>>>
>>>> What about real qemu-img or QEMU use cases?
>>>>
>>>> I'm okay with the patches themselves, but I don't really understand why
>>>> this code change is justified.
>>>>
>>>> Stefan
>>> There is a problem in the Linux kernel when the buffer is not aligned
>>> to the page size. Strictly speaking, the only hard requirement is
>>> alignment to 512 bytes (one physical sector).
>>>
>>> This comes into play in qemu-img and qemu-io, where buffers are
>>> allocated inside the application. QEMU itself is free of this problem,
>>> as the guest already sends page-aligned buffers.
>>>
>>> Below are results for qemu-img; they are exactly the same as for
>>> qemu-io.
>>>
>>> qemu-img create -f qcow2 1.img 64G
>>> qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
>>> time for i in `seq 1 30` ; do /home/den/src/qemu/qemu-img convert 1.img -t none -O raw 2.img ; rm -rf 2.img ; done
>>>
>>> ==== without patches ====
>>> real 2m6.287s
>>> user 0m1.322s
>>> sys  0m8.819s
>>>
>>> real 2m7.483s
>>> user 0m1.614s
>>> sys  0m9.096s
>>>
>>> ==== with patches ====
>>> real 1m59.715s
>>> user 0m1.453s
>>> sys  0m9.365s
>>>
>>> real 1m58.739s
>>> user 0m1.419s
>>> sys  0m8.530s
>>>
>>> I cannot say exactly where the difference comes from, but the problem
>>> stems from the fact that a real IO operation on the block device wants
>>>  a) a page-aligned buffer
>>>  b) a page-aligned offset
>>> This is how the buffer cache works in the kernel. With a userspace
>>> buffer that is not page-aligned, the kernel has to assemble each kernel
>>> page for IO from 2 userspace pages instead of one. Something is not
>>> optimal here, I presume. I can assume that when the buffer is aligned
>>> the user page could be sent immediately to the controller and no
>>> additional memory allocation is needed, though I don't know exactly.
>>>
>>> Regards,
>>> Den
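To make the quoted claim concrete, here is a tiny stand-alone illustration
(plain C, not QEMU code; the buffer addresses below are hypothetical) of
why a 512-byte-aligned but not page-aligned buffer forces every 4 KiB block
of a direct write to span two user pages:

/* Illustration only: counts how many user pages each 4 KiB block of a
 * write buffer spans, for a page-aligned buffer vs. one offset by 512
 * bytes. The address constants are made up; only the arithmetic matters. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define BLOCK     4096u

static unsigned pages_spanned(uintptr_t buf, unsigned block_idx)
{
    uintptr_t start = buf + (uintptr_t)block_idx * BLOCK;
    uintptr_t end   = start + BLOCK - 1;
    return (unsigned)(end / PAGE_SIZE - start / PAGE_SIZE + 1);
}

int main(void)
{
    uintptr_t aligned    = 0x7fac08000000;   /* page-aligned           */
    uintptr_t misaligned = 0x7fac08000200;   /* only 512-byte aligned  */

    printf("aligned buffer:    %u user page(s) per 4 KiB block\n",
           pages_spanned(aligned, 0));       /* prints 1 */
    printf("misaligned buffer: %u user page(s) per 4 KiB block\n",
           pages_spanned(misaligned, 0));    /* prints 2 */
    return 0;
}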
>> Here are the blktrace results on my host. Logs were collected using
>> sudo blktrace -d /dev/md0 -o - | blkparse -i -
>>
>> Test command:
>> /home/den/src/qemu/qemu-img convert 1.img -t none -O raw 2.img
>>
>> In general, the non-patched qemu-img IO pattern looks like this:
>> 9,0  11  1  0.000000000 11151  Q  WS 312737792 + 1023 [qemu-img]
>> 9,0  11  2  0.000007938 11151  Q  WS 312738815 + 8    [qemu-img]
>> 9,0  11  3  0.000030735 11151  Q  WS 312738823 + 1016 [qemu-img]
>> 9,0  11  4  0.000032482 11151  Q  WS 312739839 + 8    [qemu-img]
>> 9,0  11  5  0.000041379 11151  Q  WS 312739847 + 1016 [qemu-img]
>> 9,0  11  6  0.000042818 11151  Q  WS 312740863 + 8    [qemu-img]
>> 9,0  11  7  0.000051236 11151  Q  WS 312740871 + 1017 [qemu-img]
>> 9,0   5  1  0.169071519 11151  Q  WS 312741888 + 1023 [qemu-img]
>> 9,0   5  2  0.169075331 11151  Q  WS 312742911 + 8    [qemu-img]
>> 9,0   5  3  0.169085244 11151  Q  WS 312742919 + 1016 [qemu-img]
>> 9,0   5  4  0.169086786 11151  Q  WS 312743935 + 8    [qemu-img]
>> 9,0   5  5  0.169095740 11151  Q  WS 312743943 + 1016 [qemu-img]
>>
>> and the patched one:
>> 9,0   6  1  0.000000000 12422  Q  WS 314834944 + 1024 [qemu-img]
>> 9,0   6  2  0.000038527 12422  Q  WS 314835968 + 1024 [qemu-img]
>> 9,0   6  3  0.000072849 12422  Q  WS 314836992 + 1024 [qemu-img]
>> 9,0   6  4  0.000106276 12422  Q  WS 314838016 + 1024 [qemu-img]
>> 9,0   2  1  0.171038202 12422  Q  WS 314839040 + 1024 [qemu-img]
>> 9,0   2  2  0.171073156 12422  Q  WS 314840064 + 1024 [qemu-img]
>>
>> Thus the load on the disk is MUCH higher without the patch!
>>
>> The total number of lines (IO requests sent to the disk) is as follows:
>>
>> hades ~ $ wc -l *.blk
>>  3622 non-patched.blk
>>  2086 patched.blk
>>  5708 total
>> hades ~ $
>>
>> and this, from my point of view, explains everything! With aligned
>> buffers the number of IO requests is almost 2 times lower.
> The blktrace shows 512 KB I/Os. I think qemu-img convert uses 2 MB
> buffers by default. What syscalls is qemu-img making?
>
> I'm curious whether the kernel could be splitting up requests more
> efficiently. This would benefit all applications and not just qemu-img.
>
> Stefan

strace shows that there is only one syscall of real value in qemu-io here.
The case is really simple: it issues a SINGLE pwrite for the entire 64 MiB
operation in my test case.

hades /vol $ strace -f -e pwrite -e raw=write,pwrite qemu-io -n -c "write -P 0x11 0 64M" ./1.img
Process 19326 attached
[pid 19326] pwrite(0x6, 0x7fac07fff200, 0x4000000, 0x50000) = 0x4000000   <---- single 64 MiB write from userspace
wrote 67108864/67108864 bytes at offset 0
64 MiB, 1 ops; 0.2964 sec (215.863 MiB/sec and 3.3729 ops/sec)
[pid 19326] +++ exited with 0 +++
+++ exited with 0 +++
hades /vol $

while the blktrace of this operation looks like this (split!):

9,0   1  266   74.030359772 19326  Q  WS 473095 + 1016 [(null)]
9,0   1  267   74.030361546 19326  Q  WS 474111 + 8    [(null)]
9,0   1  268   74.030395522 19326  Q  WS 474119 + 1016 [(null)]
9,0   1  269   74.030397509 19326  Q  WS 475135 + 8    [(null)]

This means that, yes, the kernel is INEFFICIENT at performing direct IO
from a non-page-aligned address.
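The same splitting can be reproduced outside of qemu-io with a few lines of
plain C (a sketch, not part of the patch series; the file name and sizes
are arbitrary): issue the same O_DIRECT write once from a page-aligned
buffer and once from an address that is only 512-byte aligned, and compare
the two runs with blktrace.

/* Minimal reproducer sketch: the same O_DIRECT write from a page-aligned
 * and from a 512-byte-aligned source buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN (1024u * 1024u)   /* 1 MiB payload, a multiple of 512 */

int main(void)
{
    void *mem = NULL;
    int fd = open("test.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);

    if (fd < 0 || posix_memalign(&mem, 4096, LEN + 4096) != 0) {
        perror("setup");
        return 1;
    }
    memset(mem, 0xaa, LEN + 4096);

    /* page-aligned source buffer */
    if (pwrite(fd, mem, LEN, 0) != (ssize_t)LEN)
        perror("aligned pwrite");

    /* 512-byte aligned but NOT page-aligned: O_DIRECT still accepts it,
     * yet every kernel page of the request now spans two user pages */
    if (pwrite(fd, (char *)mem + 512, LEN, 0) != (ssize_t)LEN)
        perror("misaligned pwrite");

    free(mem);
    close(fd);
    return 0;
}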
For example, without direct IO the pattern is much better:

hades /vol $ strace -f -e pwrite -e raw=write,pwrite qemu-io -c "write -P 0x11 0 64M" ./1.img
Process 19333 attached
[pid 19333] pwrite(0x6, 0x7fa863fff010, 0x4000000, 0x50000) = 0x4000000   <--- the same 64 MiB write
wrote 67108864/67108864 bytes at offset 0
64 MiB, 1 ops; 0.4495 sec (142.366 MiB/sec and 2.2245 ops/sec)
[pid 19333] +++ exited with 0 +++
+++ exited with 0 +++
hades /vol $

The IO is still split, but it is split in a much more efficient way:

9,0  11  126  213.154002990 19333  Q  WS 471040 + 1024 [qemu-io]
9,0  11  127  213.154039500 19333  Q  WS 472064 + 1024 [qemu-io]
9,0  11  128  213.154073454 19333  Q  WS 473088 + 1024 [qemu-io]
9,0  11  129  213.154110079 19333  Q  WS 474112 + 1024 [qemu-io]

I have discussed this with my kernel colleagues and they agree that it is
a problem and should be fixed in the kernel, but there is no fix so far.

I think we should stay on the safe side and enforce page alignment for
bounce buffers, roughly as sketched at the end of this mail. This does not
add significant cost. As for other applications, I believe they do the
same with alignment; at least we do this in all our code.

Regards,
    Den
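For reference, the idea of enforcing a minimal 4096-byte alignment for
bounce buffers looks roughly like this in plain C (a sketch only, not the
actual qemu_blockalign() change; the function name, the MIN_ALIGN constant
and the 1 MiB example size are made up):

/* Sketch: round the allocation alignment up to at least the page size so
 * that bounce buffers never trigger the splitting shown above. */
#include <stdlib.h>

#define MIN_ALIGN 4096u   /* assumed minimum; the device-reported
                             alignment may already be larger */

static void *blockalign(size_t dev_align, size_t size)
{
    size_t align = dev_align > MIN_ALIGN ? dev_align : MIN_ALIGN;
    void *buf = NULL;

    if (posix_memalign(&buf, align, size) != 0) {
        return NULL;      /* caller handles allocation failure */
    }
    return buf;
}

int main(void)
{
    /* a device that reports only 512-byte alignment still gets a
     * page-aligned bounce buffer */
    void *bounce = blockalign(512, 1024 * 1024);

    free(bounce);
    return 0;
}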