Re: [Qemu-devel] [Qemu-block] [PATCH v5 0/2] block: enforce minimal 4096 alignment in qemu_blockalign

From: "Denis V. Lunev" <den@odin.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Dmitry Monakhov <dmonakhov@odin.com>,
	Stefan Hajnoczi <stefanha@gmail.com>,
	qemu-devel@nongnu.org, qemu-block@nongnu.org,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH v5 0/2] block: enforce minimal 4096 alignment in qemu_blockalign
Date: Tue, 12 May 2015 13:19:10 +0300
Message-ID: <5551D39E.1020902@odin.com>
In-Reply-To: <20150512100155.GB11497@stefanha-thinkpad.redhat.com>

On 12/05/15 13:01, Stefan Hajnoczi wrote:
> On Mon, May 11, 2015 at 07:47:41PM +0300, Denis V. Lunev wrote:
>> On 11/05/15 19:07, Denis V. Lunev wrote:
>>> On 11/05/15 18:08, Stefan Hajnoczi wrote:
>>>> On Mon, May 04, 2015 at 04:42:22PM +0300, Denis V. Lunev wrote:
>>>>> The difference is quite reliable and the same 5%.
>>>>>    qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
>>>>> for image in qcow2 format is 1% faster.
>>>> I looked a little at the qemu-io invocation but am not clear why there
>>>> would be a measurable performance difference.  Can you explain?
>>>>
>>>> What about real qemu-img or QEMU use cases?
>>>>
>>>> I'm okay with the patches themselves, but I don't really understand why
>>>> this code change is justified.
>>>>
>>>> Stefan
>>> There is a problem in the Linux kernel when the buffer
>>> is not aligned to the page size. Strictly speaking, the only
>>> hard requirement is alignment to 512 bytes (one physical sector).
>>>
>>> This comes into play in qemu-img and qemu-io,
>>> where buffers are allocated inside the application. QEMU
>>> is free of this problem because the guest already sends
>>> page-aligned buffers.
>>>
>>> You can see below results of qemu-img, they are exactly
>>> the same as for qemu-io.
>>>
>>> qemu-img create -f qcow2 1.img 64G
>>> qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
>>> time for i in `seq 1 30` ; do /home/den/src/qemu/qemu-img convert 1.img -t none -O raw 2.img ; rm -rf 2.img ; done
>>>
>>> ==== without patches ====:
>>> real    2m6.287s
>>> user    0m1.322s
>>> sys    0m8.819s
>>>
>>> real    2m7.483s
>>> user    0m1.614s
>>> sys    0m9.096s
>>>
>>> ==== with patches ====:
>>> real    1m59.715s
>>> user    0m1.453s
>>> sys    0m9.365s
>>>
>>> real    1m58.739s
>>> user    0m1.419s
>>> sys    0m8.530s
>>>
>>> I cannot say exactly where the difference comes from, but
>>> the problem stems from the fact that a real IO operation
>>> on the block device should be
>>>   a) page aligned for the buffer
>>>   b) page aligned for the offset
>>> This is how the buffer cache works in the kernel. With
>>> a non-aligned buffer in userspace, the kernel has to assemble
>>> each kernel page for IO from 2 userspace pages instead of one.
>>> Something is not optimal here, I presume. I can assume
>>> that the user page could be sent immediately to the
>>> controller if the buffer is aligned, so no additional memory
>>> allocation is needed. Though I don't know exactly.
>>>
>>> Regards,
>>>     Den
>> Here are results of blktrace on my host. Logs are collected using
>>    sudo blktrace -d /dev/md0 -o - | blkparse -i -
>>
>> Test command:
>> /home/den/src/qemu/qemu-img convert 1.img -t none -O raw 2.img
>>
>> In general, the IO pattern of the unpatched qemu-img looks like this:
>>    9,0   11        1     0.000000000 11151  Q  WS 312737792 + 1023 [qemu-img]
>>    9,0   11        2     0.000007938 11151  Q  WS 312738815 + 8 [qemu-img]
>>    9,0   11        3     0.000030735 11151  Q  WS 312738823 + 1016 [qemu-img]
>>    9,0   11        4     0.000032482 11151  Q  WS 312739839 + 8 [qemu-img]
>>    9,0   11        5     0.000041379 11151  Q  WS 312739847 + 1016 [qemu-img]
>>    9,0   11        6     0.000042818 11151  Q  WS 312740863 + 8 [qemu-img]
>>    9,0   11        7     0.000051236 11151  Q  WS 312740871 + 1017 [qemu-img]
>>    9,0    5        1     0.169071519 11151  Q  WS 312741888 + 1023 [qemu-img]
>>    9,0    5        2     0.169075331 11151  Q  WS 312742911 + 8 [qemu-img]
>>    9,0    5        3     0.169085244 11151  Q  WS 312742919 + 1016 [qemu-img]
>>    9,0    5        4     0.169086786 11151  Q  WS 312743935 + 8 [qemu-img]
>>    9,0    5        5     0.169095740 11151  Q  WS 312743943 + 1016 [qemu-img]
>>
>> and patched one:
>>    9,0    6        1     0.000000000 12422  Q  WS 314834944 + 1024 [qemu-img]
>>    9,0    6        2     0.000038527 12422  Q  WS 314835968 + 1024 [qemu-img]
>>    9,0    6        3     0.000072849 12422  Q  WS 314836992 + 1024 [qemu-img]
>>    9,0    6        4     0.000106276 12422  Q  WS 314838016 + 1024 [qemu-img]
>>    9,0    2        1     0.171038202 12422  Q  WS 314839040 + 1024 [qemu-img]
>>    9,0    2        2     0.171073156 12422  Q  WS 314840064 + 1024 [qemu-img]
>>
>> Thus the load to the disk is MUCH higher without the patch!
>>
>> The total number of lines (IO requests sent to the disks) is as follows:
>>
>> hades ~ $ wc -l *.blk
>>    3622 non-patched.blk
>>    2086 patched.blk
>>    5708 total
>> hades ~ $
>>
>> and this, from my point of view, explains everything! With aligned buffers
>> the number of IO requests is almost 2 times smaller.
> The blktrace shows 512 KB I/Os.  I think qemu-img convert uses 2 MB
> buffers by default.  What syscalls is qemu-img making?
>
> I'm curious whether the kernel could be splitting up requests more
> efficiently.  This would benefit all applications and not just qemu-img.
>
> Stefan
strace shows that qemu-io makes only one syscall that matters here.
The case is really simple: it uses pwrite and, importantly, it issues
a SINGLE pwrite for the entire operation in my test case (64 MiB in
the run below).

hades /vol $ strace -f -e pwrite -e raw=write,pwrite  qemu-io -n -c "write -P 0x11 0 64M" ./1.img
Process 19326 attached
[pid 19326] pwrite(0x6, 0x7fac07fff200, 0x4000000, 0x50000) = 0x4000000  <---- single 64 MiB write from userspace
wrote 67108864/67108864 bytes at offset 0
64 MiB, 1 ops; 0.2964 sec (215.863 MiB/sec and 3.3729 ops/sec)
[pid 19326] +++ exited with 0 +++
+++ exited with 0 +++
hades /vol $

while the blktrace of this operation looks like this (split!):

   9,0    1      266    74.030359772 19326  Q  WS 473095 + 1016 [(null)]
   9,0    1      267    74.030361546 19326  Q  WS 474111 + 8 [(null)]
   9,0    1      268    74.030395522 19326  Q  WS 474119 + 1016 [(null)]
   9,0    1      269    74.030397509 19326  Q  WS 475135 + 8 [(null)]
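
As a quick cross-check of the pwrite arguments above, the sketch below
tests the buffer pointer against 512-byte sector and 4096-byte page
boundaries. The 4096-byte page size is an assumption (typical for
x86-64), and alignment_report is a hypothetical helper written for
this illustration, not part of QEMU or strace.

```python
SECTOR_SIZE = 512   # minimum alignment required for O_DIRECT here
PAGE_SIZE = 4096    # assumption: 4 KiB pages, as on typical x86-64 hosts

# Values copied from the strace line of the direct IO (-n) run.
DIRECT_IO_BUF = 0x7fac07fff200
WRITE_LENGTH = 0x4000000
FILE_OFFSET = 0x50000


def alignment_report(addr):
    """Describe how an address sits relative to sector and page boundaries."""
    return {
        "sector_aligned": addr % SECTOR_SIZE == 0,
        "page_aligned": addr % PAGE_SIZE == 0,
        "offset_in_page": addr % PAGE_SIZE,
    }


report = alignment_report(DIRECT_IO_BUF)
print(report)
print("write length in MiB:", WRITE_LENGTH // (1024 * 1024))
print("file offset page aligned:", FILE_OFFSET % PAGE_SIZE == 0)
```

Under these assumptions the buffer is sector aligned but starts 512
bytes into a page, while the file offset itself is page aligned.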

This means that, yes, the kernel is INEFFICIENT when performing direct IO
from a buffer whose address is not page aligned. Without direct IO, for
example, the pattern is much better.

hades /vol $ strace -f -e pwrite -e raw=write,pwrite  qemu-io -c "write -P 0x11 0 64M" ./1.img
Process 19333 attached
[pid 19333] pwrite(0x6, 0x7fa863fff010, 0x4000000, 0x50000) = 0x4000000  <--- same 64 MiB write
wrote 67108864/67108864 bytes at offset 0
64 MiB, 1 ops; 0.4495 sec (142.366 MiB/sec and 2.2245 ops/sec)
[pid 19333] +++ exited with 0 +++
+++ exited with 0 +++
hades /vol $

The IO is still split, but in a much more efficient way.

   9,0   11      126   213.154002990 19333  Q  WS 471040 + 1024 [qemu-io]
   9,0   11      127   213.154039500 19333  Q  WS 472064 + 1024 [qemu-io]
   9,0   11      128   213.154073454 19333  Q  WS 473088 + 1024 [qemu-io]
   9,0   11      129   213.154110079 19333  Q  WS 474112 + 1024 [qemu-io]
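
To put numbers on the two patterns, here is a small sketch that parses
the blkparse lines quoted in this message and sums the request sizes.
It assumes the default blkparse column layout seen above (start sector
in the 8th field, sector count in the 10th) and 512-byte sectors;
parse_requests and summarize are hypothetical helpers, not blktrace
tools.

```python
SECTOR_SIZE = 512  # blkparse reports start and length in 512-byte sectors

# Direct IO from the unaligned buffer (first trace in this message).
UNALIGNED = """\
9,0    1      266    74.030359772 19326  Q  WS 473095 + 1016 [(null)]
9,0    1      267    74.030361546 19326  Q  WS 474111 + 8 [(null)]
9,0    1      268    74.030395522 19326  Q  WS 474119 + 1016 [(null)]
9,0    1      269    74.030397509 19326  Q  WS 475135 + 8 [(null)]
"""

# Buffered IO (second trace in this message).
ALIGNED = """\
9,0   11      126   213.154002990 19333  Q  WS 471040 + 1024 [qemu-io]
9,0   11      127   213.154039500 19333  Q  WS 472064 + 1024 [qemu-io]
9,0   11      128   213.154073454 19333  Q  WS 473088 + 1024 [qemu-io]
9,0   11      129   213.154110079 19333  Q  WS 474112 + 1024 [qemu-io]
"""


def parse_requests(trace):
    """Return (start_sector, sector_count) for every line of a trace."""
    requests = []
    for line in trace.splitlines():
        fields = line.split()
        # dev cpu seq time pid action rwbs sector '+' count [process]
        requests.append((int(fields[7]), int(fields[9])))
    return requests


def summarize(trace):
    """Count the requests and report total and average size in KiB."""
    requests = parse_requests(trace)
    total_kib = sum(count for _, count in requests) * SECTOR_SIZE // 1024
    return {
        "requests": len(requests),
        "total_kib": total_kib,
        "avg_kib": total_kib / len(requests),
    }


print("unaligned:", summarize(UNALIGNED))
print("aligned:  ", summarize(ALIGNED))
```

In these excerpts, four requests move 1 MiB in the unaligned case and
2 MiB in the aligned case, so the same amount of data costs twice as
many requests when each 512 KiB chunk is split into 1016 + 8 sectors.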

I have discussed this with my kernel colleagues and they agree that it
is a problem that should be fixed, though there is no fix so far.

I do think that we will stay on the safe side by enforcing page alignment
for bounce buffers. This does not bring significant cost. As for other
applications, I think that they do the same with alignment. At least we
do this in all our code.
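
For illustration only, this is roughly what "page-aligned buffer"
means from an application's point of view. The sketch is not the
qemu_blockalign implementation (that is C code in QEMU); it uses
Python's anonymous mmap, which returns page-aligned memory, and the
helper names alloc_page_aligned and buffer_address are hypothetical.

```python
import ctypes
import mmap


def alloc_page_aligned(size):
    """Allocate an anonymous buffer that starts on a page boundary.

    The size is rounded up to a whole number of pages.
    """
    page = mmap.PAGESIZE
    rounded = (size + page - 1) // page * page
    return mmap.mmap(-1, rounded)


def buffer_address(buf):
    """Return the address of the first byte of a writable buffer."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf))


buf = alloc_page_aligned(64 * 1024)
addr = buffer_address(buf)
print(hex(addr), "page aligned:", addr % mmap.PAGESIZE == 0)
```

A buffer obtained this way can be handed to os.pwrite() on a file
opened with O_DIRECT; whether that avoids the request splitting seen
above depends on the kernel in use.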

Regards,
     Den
