From: "Richard W.M. Jones" <rjones@redhat.com>
To: Nir Soffer <nsoffer@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
QEMU Developers <qemu-devel@nongnu.org>,
qemu-block <qemu-block@nongnu.org>,
Eric Blake <eblake@redhat.com>
Subject: Re: [Qemu-devel] [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
Date: Sat, 17 Nov 2018 21:13:56 +0000
Message-ID: <20181117211356.GG27120@redhat.com>
In-Reply-To: <CAMRbyytf-NU3QgYoqbTvNR1f-FfA9DxRzUb+-oZGx=34qFXi5g@mail.gmail.com>
On Sat, Nov 17, 2018 at 10:59:26PM +0200, Nir Soffer wrote:
> On Fri, Nov 16, 2018 at 5:26 PM Kevin Wolf <kwolf@redhat.com> wrote:
>
> > On 15.11.2018 at 23:27, Nir Soffer wrote:
> > > On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer <nsoffer@redhat.com> wrote:
> > >
> > > > On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer <nsoffer@redhat.com> wrote:
> > > >
> > > >> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > >>
> > > >>> On 07.11.2018 at 15:56, Nir Soffer wrote:
> > > >>> > On Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjones@redhat.com> wrote:
> > > >>> >
> > > >>> > > Another thing I tried was to change the NBD server (nbdkit) so that it
> > > >>> > > doesn't advertise zero support to the client:
> > > >>> > >
> > > >>> > > $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
> > > >>> > >     --run './qemu-img convert ./fedora-28.img -n $nbd'
> > > >>> > > $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
> > > >>> > > 2154 Write
> > > >>> > >
> > > >>> > > Not surprisingly no zero commands are issued. The size of the write
> > > >>> > > commands is very uneven -- it appears to send one command per block
> > > >>> > > of zeroes or data.
> > > >>> > >
> > > >>> > > Nir: If we could get information from imageio about whether zeroing is
> > > >>> > > implemented efficiently or not by the backend, we could change
> > > >>> > > virt-v2v / nbdkit to advertise this back to qemu.
> > > >>> >
> > > >>> > There is no way to detect the capability: ioctl(BLKZEROOUT) always
> > > >>> > succeeds, silently falling back to manual zeroing in the kernel.
> > > >>> >
> > > >>> > Even if we could, sending zeroes on the wire from qemu may be even
> > > >>> > slower, and it looks like qemu sends even more requests in this case
> > > >>> > (2154 vs ~1300).
> > > >>> >
> > > >>> > It looks like this optimization on the qemu side leads to worse
> > > >>> > performance, so it should not be enabled by default.
> > > >>>
> > > >>> Well, that's overgeneralising your case a bit. If the backend does
> > > >>> support efficient zero writes (which file systems, the most common case,
> > > >>> generally do), doing one big write_zeroes request at the start can
> > > >>> improve performance quite a bit.
> > > >>>
> > > >>> It seems the problem is that we can't really know whether the operation
> > > >>> will be efficient because the backends generally don't tell us. Maybe
> > > >>> NBD could introduce a flag for this, but in the general case it appears
> > > >>> to me that we'll have to have a command line option.
> > > >>>
> > > >>> However, I'm curious what your exact use case and the backend used in it
> > > >>> is? Can something be improved there to actually get efficient zero
> > > >>> writes and get even better performance than by just disabling the big
> > > >>> zero write?
> > > >>
> > > >>
> > > >> The backend is some NetApp storage connected via FC. I don't have
> > > >> more info on this. We get a zeroing rate of about 1G/s on this storage,
> > > >> which is quite slow compared with other storage we tested.
> > > >>
> > > >> One option we are checking now is whether this is the kernel's silent
> > > >> fallback to manual zeroing when the server advertises a wrong value of
> > > >> write_same_max_bytes.
> > > >>
> > > >
> > > > We eliminated this using blkdiscard. This is what we get with this
> > > > storage when zeroing a 100G LV:
> > > >
> > > > for i in 1 2 4 8 16 32; do
> > > >     time blkdiscard -z -p ${i}m /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade
> > > > done
> > > >
> > > > real 4m50.851s
> > > > user 0m0.065s
> > > > sys 0m1.482s
> > > >
> > > > real 4m30.504s
> > > > user 0m0.047s
> > > > sys 0m0.870s
> > > >
> > > > real 4m19.443s
> > > > user 0m0.029s
> > > > sys 0m0.508s
> > > >
> > > > real 4m13.016s
> > > > user 0m0.020s
> > > > sys 0m0.284s
> > > >
> > > > real 2m45.888s
> > > > user 0m0.011s
> > > > sys 0m0.162s
> > > >
> > > > real 2m10.153s
> > > > user 0m0.003s
> > > > sys 0m0.100s
> > > >
> > > > We are investigating why we get low throughput on this server, and will
> > > > also check several other servers.
> > > >
> > > >> Having a command line option to control this behavior sounds good. I don't
> > > >> have enough data to tell what should be the default, but I think the safe
> > > >> way would be to keep old behavior.
> > > >>
> > > >
> > > > We filed this bug:
> > > > https://bugzilla.redhat.com/1648622
> > > >
> > >
> > > More data from even slower storage - zeroing a 10G LV on Kaminario K2
> > >
> > > # time blkdiscard -z -p 32m /dev/test_vg/test_lv2
> > >
> > > real 50m12.425s
> > > user 0m0.018s
> > > sys 2m6.785s
> > >
> > > Maybe something is wrong with this storage, since we see this:
> > >
> > > # grep -s "" /sys/block/dm-29/queue/* | grep write_same_max_bytes
> > > /sys/block/dm-29/queue/write_same_max_bytes:512
> > >
> > > Since BLKZEROOUT always falls back silently to slow manual zeroing,
> > > maybe we can disable the aggressive pre-zero of the entire device
> > > for block devices, and keep this optimization for files when fallocate()
> > > is supported?
> >
> > I'm not sure what the detour through NBD changes, but qemu-img directly
> > on a block device doesn't use BLKZEROOUT first, but
> > FALLOC_FL_PUNCH_HOLE.
>
>
> Looking at block/file-posix.c (83c496599cc04926ecbc3e47a37debaa3e38b686)
> we don't use PUNCH_HOLE for block devices:
>
>     if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
>         return handle_aiocb_write_zeroes_block(aiocb);
>     }
>
> qemu uses BLKZEROOUT, which is not guaranteed to be fast on the storage side
> and, even worse, silently falls back to manual zeroing if the storage does
> not support WRITE_SAME.
>
> > Maybe we can add a flag that avoids anything that
> > could be slow, such as BLKZEROOUT, as a fallback (and also the slow
> > emulation that QEMU itself would do if all kernel calls fail).
> >
>
> But the issue here is not how qemu-img handles this case, but how an NBD
> server can handle it. NBD may support zeroing, but there is no way to tell
> if zeroing is going to be fast, since the backend writing zeros to storage
> has the same limits as qemu-img.
>
> So I think we need to fix the performance regression in 2.12 by enabling
> pre-zero of the entire disk only if FALLOC_FL_PUNCH_HOLE can be used
> and only if it can be used without a fallback to a slow zeroing method.
>
> Enabling this optimization for anything else requires changing the entire
> stack (storage, kernel, NBD protocol) to support reporting fast zero
> capability or to limit zeroing to fast operations.

I may be missing something here, but doesn't imageio know if the
backing block device starts out as all zeroes? If so, couldn't it
maintain a bitmap and simply ignore zero requests sent for unwritten
disk blocks?
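
To make the bitmap idea concrete, here is a rough sketch in Python. It is not
imageio code: the class names, the backend interface (write/zero methods) and
the 64K block size are all made up for illustration, and it only holds if the
device really starts out as all zeroes and nothing else writes to it behind
our back. A bytearray with one byte per block stands in for a packed bitmap.

```python
class RecordingBackend:
    """Hypothetical stand-in for the real storage; it only logs requests."""

    def __init__(self):
        self.calls = []

    def write(self, offset, data):
        self.calls.append(("write", offset, len(data)))

    def zero(self, offset, length):
        self.calls.append(("zero", offset, length))


class ZeroSkippingDevice:
    """Drops zero requests for blocks that were never written.

    Assumption: the backing device reads as all zeroes when we start.
    """

    def __init__(self, backend, size, block_size=64 * 1024):
        self.backend = backend
        self.block_size = block_size
        nblocks = (size + block_size - 1) // block_size
        # One byte per block for simplicity; a real bitmap would pack bits.
        self.written = bytearray(nblocks)

    def _blocks(self, offset, length):
        # Indexes of all blocks touched by [offset, offset + length).
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        return range(first, last + 1)

    def write(self, offset, data):
        if not data:
            return
        self.backend.write(offset, data)
        for b in self._blocks(offset, len(data)):
            self.written[b] = 1

    def zero(self, offset, length):
        if length <= 0:
            return
        end = offset + length
        bs = self.block_size
        run = None  # [start, stop] of the pending backend zero request
        for b in self._blocks(offset, length):
            lo, hi = max(offset, b * bs), min(end, (b + 1) * bs)
            if self.written[b]:
                # Merge adjacent written blocks into one backend request.
                if run is None:
                    run = [lo, hi]
                else:
                    run[1] = hi
                # A fully zeroed block reads as zeroes again, so forget it.
                # (Error handling is omitted in this sketch.)
                if hi - lo == bs:
                    self.written[b] = 0
            elif run is not None:
                self.backend.zero(run[0], run[1] - run[0])
                run = None
            # Unwritten blocks are skipped: they are already zero.
        if run is not None:
            self.backend.zero(run[0], run[1] - run[0])
```

With something like this in front of the storage, the big initial zero request
that qemu-img sends would never reach the device at all, and later zero
requests would only touch ranges that actually hold data.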
Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html