qemu-devel.nongnu.org archive mirror
From: Nir Soffer <nsoffer@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: QEMU Developers <qemu-devel@nongnu.org>,
	qemu-block <qemu-block@nongnu.org>
Subject: Re: [Qemu-devel] [PULL 5/7] file-posix: Support BDRV_REQ_NO_FALLBACK for zero writes
Date: Sat, 17 Aug 2019 20:45:13 +0300
Message-ID: <CAMRbyyvXUCXD+9r2gurCrXMn8Nz2GSYM4UjzofOFpZQXHzJVEQ@mail.gmail.com>
In-Reply-To: <20190815102947.GB7415@linux.fritz.box>

On Thu, Aug 15, 2019 at 1:29 PM Kevin Wolf <kwolf@redhat.com> wrote:

> On 15.08.2019 at 04:44, Eric Blake wrote:
> > On 3/26/19 10:51 AM, Kevin Wolf wrote:
> > > We know that the kernel implements a slow fallback code path for
> > > BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it.
> > > The other operations we call in the context of .bdrv_co_pwrite_zeroes
> > > should usually be quick, so no modification should be needed for them.
> > > If we ever notice that there are additional problematic cases, we can
> > > still make these conditional as well.
> >
> > Are there cases where fallocate(FALLOC_FL_ZERO_RANGE) falls back to slow
> > writes?  It may be fast on some file systems, but when used on a block
> > device, that may equally trigger slow fallbacks.  The man page is not
> > clear on that fact; I suspect that there may be cases in there that need
> > to be made conditional (it would be awesome if the kernel folks would
> > give us another FALLOC_ flag when we want to guarantee no fallback).
>
> The NO_FALLBACK changes were based on the Linux code rather than
> documentation because no interface is explicitly documented to forbid
> fallbacks.
>
> I think for file systems, we can generally assume that we don't get
> fallbacks because for file systems, just deallocating blocks is the
> easiest way to implement the function anyway. (Hm, or is it when we
> don't punch holes...?)
>
> And for block devices, we don't try FALLOC_FL_ZERO_RANGE because it also
> involves the same slow fallback as BLKZEROOUT. In other words,
> bdrv_co_pwrite_zeroes() with NO_FALLBACK, but without MAY_UNMAP, always
> fails on Linux block devices, and we fall back to emulation in user
> space.
>
> We would need a kernel interface that calls blkdev_issue_zeroout() with
> BLKDEV_ZERO_NOUNMAP | BLKDEV_ZERO_NOFALLBACK, but no such interface
> exists.
>
> When I talked to some file system people, they insisted that "efficient"
> or "fast" wasn't well-defined enough for them or something, so if we
> want to get a kernel change, maybe a new block device ioctl would be the
> most realistic thing.
>
> We do use FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE for MAY_UNMAP,
> which works for both file systems (I assume - each file system has a
> separate implementation) and block devices without slow fallbacks.
>
> qemu-img create sets MAY_UNMAP, so the case we are most interested in is
> covered with a fast implementation.
>
> > By the way, is there an easy setup to prove (maybe some qemu-img convert
> > command on a specially-prepared source image) whether the no fallback
> > flag makes a difference?  I'm about to cross-post a series of patches to
> > nbd/qemu/nbdkit/libnbd that adds a new NBD_CMD_FLAG_FAST_ZERO which fits
> > the bill of BDRV_REQ_NO_FALLBACK, but would like to include some
> > benchmark numbers in my cover letter if I can reproduce a setup where it
> > matters.
>
> Hm, the original case came from Nir, maybe he can suggest something.
>

The original case came from RHEL 7.{5,6}. The flow was:

qemu-img convert -> nbdkit rhv plugin -> imageio -> storage

nbdkit received an NBD_CMD_WRITE_ZEROES request and converted it to an
imageio ZERO request.

For block devices, imageio was trying:
1. fallocate(ZERO_RANGE) - fails
2. ioctl(BLKZEROOUT) - succeeds

See
https://github.com/oVirt/ovirt-imageio/blob/ca70170886b0c1fbeca8640b12bcf54f01a3fea0/common/ovirt_imageio_common/backends/file.py#L247
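
The order of the two attempts above can be sketched roughly as follows. This
is not the imageio code linked above, only an illustration: the function name
zero_block_device and the injected fallocate/blkzeroout callables are
hypothetical, and the set of errno values treated as "not supported" is an
assumption.

```python
import errno
import fcntl
import struct

FALLOC_FL_ZERO_RANGE = 0x10  # from <linux/falloc.h>
BLKZEROOUT = 0x127F          # _IO(0x12, 127) from <linux/fs.h>


def blkzeroout(fd, offset, length):
    """Zero a range of a block device; the kernel may fall back to slow writes."""
    # The ioctl argument is a uint64_t[2] holding {start, length}.
    fcntl.ioctl(fd, BLKZEROOUT, struct.pack("QQ", offset, length))


def zero_block_device(fd, offset, length, fallocate, blkzeroout=blkzeroout):
    """Try fallocate(ZERO_RANGE) first, then BLKZEROOUT.

    fallocate(fd, mode, offset, length) is an injected callable (hypothetical)
    that raises OSError on failure. Returns the name of the call that worked.
    """
    try:
        fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length)
        return "fallocate"
    except OSError as e:
        # Assumption: these errno values mean "not supported on this device".
        if e.errno not in (errno.ENODEV, errno.EOPNOTSUPP):
            raise
    blkzeroout(fd, offset, length)
    return "blkzeroout"
```

Note that nothing in this sketch can tell whether BLKZEROOUT will be fast or
slow; that is decided by the kernel and the storage.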

BLKZEROOUT can be fast (100 GiB/s) or slow (100 MiB/s), depending on the
server and on the allocation status of that area.

On our current storage (3PAR), if the device is fully allocated, for example:

    dd if=/dev/zero bs=8M of=/dev/vg/lv

then blkdiscard -z is slow (800 MiB/s).

But if you discard the device:

    blkdiscard /dev/vg/lv

blkdiscard -z becomes fast (100 GiB/s).

Previously we had XtremIO storage, which was able to zero 50 GiB/s
regardless of the allocation.
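
To compare the allocated and discarded cases on a given device, a small timing
helper along these lines may be enough. It is only a sketch: the name
measure_throughput and the injectable clock are hypothetical, and it times
whatever callable it is given, so the numbers include process startup if the
callable runs an external command.

```python
import time

GiB = 1024 ** 3


def measure_throughput(zero_fn, length, clock=time.monotonic):
    """Run zero_fn() once and return the zeroing rate in GiB/s.

    length is the number of bytes zero_fn is expected to zero; the helper
    trusts the caller on that and does not verify it.
    """
    start = clock()
    zero_fn()
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time too small to measure")
    return length / elapsed / GiB
```

For example, zero_fn could be
lambda: subprocess.run(["blkdiscard", "-z", "/dev/vg/lv"], check=True),
with length set to the size of the device. This destroys the data on the
device, like the commands above.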

> You'll definitely need a block device that doesn't support
> FALLOC_FL_PUNCH_HOLE,


Old kernels (CentOS 7) did not support this.

# uname -r
3.10.0-957.21.3.el7.x86_64

# strace -e trace=fallocate fallocate -l 100m /dev/loop0
fallocate(3, 0, 0, 104857600)           = -1 ENODEV (No such device)
fallocate: fallocate failed: No such device
+++ exited with 1 +++

# strace -e trace=fallocate fallocate -p -l 100m /dev/loop0
fallocate(3, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 0, 104857600) = -1 ENODEV (No such device)
fallocate: fallocate failed: No such device
+++ exited with 1 +++

# strace -e trace=fallocate fallocate -z -l 100m /dev/loop0
fallocate(3, FALLOC_FL_ZERO_RANGE, 0, 104857600) = -1 ENODEV (No such device)
fallocate: fallocate failed: No such device
+++ exited with 1 +++
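
The same three checks can be scripted without strace. This is a minimal
sketch, assuming 64-bit Linux (off_t passed as a 64-bit integer) and a libc
that exports fallocate(); the names probe_fallocate and MODES are
hypothetical. Like the commands above, it modifies the start of the target.

```python
import ctypes
import errno
import os

# Mode flags from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
FALLOC_FL_ZERO_RANGE = 0x10

MODES = {
    "allocate": 0,
    "punch_hole": FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
    "zero_range": FALLOC_FL_ZERO_RANGE,
}


def probe_fallocate(path, length=100 * 1024 * 1024):
    """Try each fallocate() mode on path; map mode name to "ok" or errno name.

    WARNING: this modifies the first `length` bytes of path.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    fallocate = getattr(libc, "fallocate", None)
    if fallocate is None:
        # No fallocate() symbol in this libc; report it as unimplemented.
        return {name: "ENOSYS" for name in MODES}
    # Assumption: off_t is 64 bits wide (64-bit Linux).
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                          ctypes.c_int64, ctypes.c_int64]
    fallocate.restype = ctypes.c_int

    results = {}
    fd = os.open(path, os.O_RDWR)
    try:
        for name, mode in MODES.items():
            ctypes.set_errno(0)
            if fallocate(fd, mode, 0, length) == 0:
                results[name] = "ok"
            else:
                err = ctypes.get_errno()
                results[name] = errno.errorcode.get(err, str(err))
    finally:
        os.close(fd)
    return results
```

On a kernel like the one above, probe_fallocate("/dev/loop0") would be
expected to report ENODEV for all three modes.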

> otherwise you can't trigger the fallback. My
> first thought was a loop device, but this actually does support the
> operation and passes it through to the underlying file system. So maybe
> if you know a file system that doesn't support it. Or if you have an old
> hard disk handy.

...

Nir


Thread overview: 3+ messages
     [not found] <20190326155157.3719-1-kwolf@redhat.com>
     [not found] ` <20190326155157.3719-6-kwolf@redhat.com>
2019-08-15  2:44   ` [Qemu-devel] [PULL 5/7] file-posix: Support BDRV_REQ_NO_FALLBACK for zero writes Eric Blake
2019-08-15 10:29     ` Kevin Wolf
2019-08-17 17:45       ` Nir Soffer [this message]
