qemu-devel.nongnu.org archive mirror
From: Richard Laager <rlaager@wiktel.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
Date: Tue, 13 Mar 2012 14:13:10 -0500
Message-ID: <1331665990.24052.42.camel@watermelon.coderich.net>
In-Reply-To: <4F5DEBCE.3040409@redhat.com>

On Mon, 2012-03-12 at 10:34 +0100, Paolo Bonzini wrote:
> > To be completely correct, I suggest the following behavior:
> >      1. Add a discard boolean option to the disk layer.
> >      2. If discard is not specified:
> >               * For files, detect a true/false value by comparing
> >                 stat.st_blocks != stat.st_size>>9.
> >               * For devices, assume a fixed value (true?).
> >      3. If discard is true, issue discards.
> >      4. If discard is false, do not issue discards.
> 
> The problem is, who will use this interface?

I'm a libvirt and virt-manager user; virt-manager already differentiates
between thin and thick provisioning. So I'm envisioning passing that
information to libvirt, which would save it in a config file and use
that to set discard=true vs. discard=false when using QEMU.
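
A minimal sketch of the file autodetection in step 2 above, assuming a
POSIX stat(). The function names are hypothetical, not existing QEMU
code. It also uses "<" against the rounded-up size rather than the "!="
comparison quoted above, on the assumption that a fully allocated file
may report a few extra blocks:

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical helper: true if fewer 512-byte blocks are allocated than
 * the file size would need, i.e. the file appears to contain holes. */
bool blocks_indicate_sparse(uint64_t st_blocks, uint64_t st_size)
{
    uint64_t needed = (st_size + 511) >> 9;  /* round up to 512-byte units */
    return st_blocks < needed;
}

/* Hypothetical default for an unspecified discard option on a file:
 * 1 = looks thin-provisioned, 0 = looks fully allocated,
 * -1 = stat() failed. */
int guess_discard_default(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        return -1;
    }
    return blocks_indicate_sparse((uint64_t)st.st_blocks,
                                  (uint64_t)st.st_size);
}
```

An explicit discard=true or discard=false from the management layer
would simply bypass this guess.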

On Mon, 2012-03-12 at 13:27 +0100, Paolo Bonzini wrote:
> Il 10/03/2012 19:02, Richard Laager ha scritto:
> > I propose adding the following behaviors in any event:
> >       * If a QEMU block device reports a discard_granularity > 0, it
> >         must be equal to 2^n (n >= 0), or QEMU's block core will change
> >         it to 0. (Non-power-of-two granularities are not likely to exist
> >         in the real world, and this assumption greatly simplifies
> >         ensuring correctness.)
> 
> Yeah, I was considering this to be simply a bug in the block device.
> 
> >       * For SCSI, report an unmap_granularity to the guest as follows:
> >       max(logical_block_size, discard_granularity) / logical_block_size
> 
> This is more or less already in place later in the series.

I didn't see it. Which patch number?
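
For reference, the two behaviors quoted above can be sketched as
follows. The function names are hypothetical, and I am assuming both
sizes are in bytes and that logical_block_size is nonzero:

```c
#include <stdint.h>

/* Keep a discard granularity only if it is 0 or a power of two;
 * anything else is reset to 0, as proposed for the block core. */
uint32_t sanitize_discard_granularity(uint32_t granularity)
{
    if (granularity != 0 && (granularity & (granularity - 1)) != 0) {
        return 0;
    }
    return granularity;
}

/* Unmap granularity to report to a SCSI guest, in logical blocks:
 * max(logical_block_size, discard_granularity) / logical_block_size.
 * The caller must pass a nonzero logical_block_size. */
uint32_t unmap_granularity_blocks(uint32_t logical_block_size,
                                  uint32_t discard_granularity)
{
    uint32_t g = sanitize_discard_granularity(discard_granularity);
    uint32_t larger = g > logical_block_size ? g : logical_block_size;

    return larger / logical_block_size;
}
```

With 512-byte logical blocks and a 4096-byte discard granularity, this
reports 8 blocks; with a granularity of 0 it reports 1.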

> > Note, I'm assuming fallocate() actually
> > guarantees that it zeros the data when punching holes.
> 
> It does, that's pretty much the definition of a hole.

Agreed. I verified this fact after sending that email. At the time, I
just wanted to be very clear on what I knew for sure vs. what I had not
yet verified.

> If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
> be done by skipping the zero write on known holes.
> 
> This could even be done at the block layer level using bdrv_is_allocated.

Would we want to make all write_zeroes operations check for and skip
holes, or is write_zeroes different from a discard in that it SHOULD/MUST
allocate space?
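
To make the hole-skipping idea concrete, here is a rough sketch of the
check using lseek() with SEEK_DATA. The function name is hypothetical,
and it deliberately answers "not a hole" whenever the kernel or
filesystem cannot tell us, so the zero write would still happen:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* True only if [offset, offset + len) is known to contain no data.
 * Note that lseek() moves the file position; callers using
 * pread()/pwrite() are unaffected by that. */
bool range_is_known_hole(int fd, off_t offset, off_t len)
{
#ifdef SEEK_DATA
    off_t next_data = lseek(fd, offset, SEEK_DATA);

    if (next_data == (off_t)-1) {
        /* ENXIO: no data between offset and EOF (this also covers an
         * offset at or past EOF). Any other error: assume data. */
        return errno == ENXIO;
    }
    /* The first data at or after offset lies beyond the range. */
    return next_data >= offset + len;
#else
    /* No SEEK_DATA at build time: conservatively report data. */
    (void)fd;
    (void)offset;
    (void)len;
    return false;
#endif
}
```

Whether write_zeroes may skip a range this reports as a hole depends on
the answer to the allocation question above.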

> > If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
> > advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
> > going to work. This would side step these problems. 
> 
> ... and introduce others when migrating if your datacenter doesn't have
> homogeneous kernel versions and/or filesystems. :(

I hadn't thought of the migration issues. Thanks for bringing that up.

Worst case, you end up doing a bunch of zero writing if and only if you
migrate from a host where discard zeroes data to one where it doesn't
(or one that doesn't do discard at all). But this only lasts until the
guest reboots, assuming we also add a behavior of re-probing on guest
reboot (or until it shuts down, if we don't or can't). As far as I can
see, this is unavoidable, though. And this is no worse than writing
zeros ALL of the time that fallocate() fails, which is the behavior of
your patch series, right?
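
As I understand the fallback being discussed, it amounts to something
like the sketch below. This is my own illustration, not code from the
patch series; zero_range is a hypothetical name, and I fall back on any
fallocate() failure rather than on specific errno values:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Make [offset, offset + len) read back as zeros. Try punching a hole
 * first; if that is unavailable or fails, write zeros explicitly.
 * Returns 0 on success, -1 if the explicit writes fail. */
int zero_range(int fd, off_t offset, off_t len)
{
    static const char zeros[65536];

#if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
    /* PUNCH_HOLE must be combined with KEEP_SIZE. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, len) == 0) {
        return 0;
    }
#endif
    /* Fallback: plain zero writes (no EINTR handling in this sketch). */
    while (len > 0) {
        size_t chunk = len < (off_t)sizeof(zeros) ? (size_t)len
                                                  : sizeof(zeros);
        ssize_t n = pwrite(fd, zeros, chunk, offset);

        if (n <= 0) {
            return -1;
        }
        offset += n;
        len -= n;
    }
    return 0;
}
```

The fallback path is where the extra zero writing after a migration
would come from.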

This might be another use case for a discard option on the disk. If some
but not all of one's hosts support discard, a system administrator might
want to set discard=0 to avoid this.

> Do you know if non-Linux operating systems have something similar to
> BLKDISCARDZEROES?

As far as I know, no. The SunOS one is only on Illumos (the open source
kernel forked from the now dead OpenSolaris) and only implemented for
ZFS zvols. So currently, it's roughly equivalent to fallocate() on Linux
in that it's happening at the filesystem level. (It doesn't actually
reach the platters yet. But even if it did, that's unrelated to the
guarantees provided by ZFS.) Thus, it always zeroes, so we could set
discard_zeroes_data = 1 unconditionally there. I should probably run that
by the Illumos developers, though, to ensure they're comfortable with
that ioctl() guaranteeing zeroing.

I haven't looked into the FreeBSD one as much yet. Worst case, we
unconditionally set discard_zeroes_data = 0.
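
A sketch of what the probing could look like with that conservative
default, using the Linux BLKDISCARDZEROES ioctl where the headers
provide it. The function name is hypothetical; on any other platform,
or on any error, it simply reports 0:

```c
#include <sys/ioctl.h>
#ifdef __linux__
#include <linux/fs.h>   /* BLKDISCARDZEROES, where available */
#endif

/* Returns 1 only if the device behind fd reports that discarded
 * blocks read back as zeros; returns 0 when the answer is no, the
 * ioctl fails, or the platform has no such query. */
int probe_discard_zeroes_data(int fd)
{
#ifdef BLKDISCARDZEROES
    unsigned int zeroes = 0;

    if (ioctl(fd, BLKDISCARDZEROES, &zeroes) == 0) {
        return zeroes != 0;
    }
#else
    (void)fd;
#endif
    return 0;
}
```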

-- 
Richard