From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([208.118.235.92]:51818) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1S7XAG-0008HE-Vq for qemu-devel@nongnu.org; Tue, 13 Mar 2012 15:13:46 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1S7XAB-0005fP-Oo for qemu-devel@nongnu.org; Tue, 13 Mar 2012 15:13:44 -0400 Received: from spam1.wiktel.com ([69.89.207.151]:55034) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1S7XAB-0005eu-Ig for qemu-devel@nongnu.org; Tue, 13 Mar 2012 15:13:39 -0400 From: Richard Laager In-Reply-To: <4F5DEBCE.3040409@redhat.com> References: <1331226917-6658-1-git-send-email-pbonzini@redhat.com> <1331226917-6658-7-git-send-email-pbonzini@redhat.com> <4F5A31B2.3050701@redhat.com> <4F5A46A1.4000508@redhat.com> <1331402560.8577.46.camel@watermelon.coderich.net> <4F5DEBCE.3040409@redhat.com> Content-Type: multipart/signed; micalg="pgp-sha1"; protocol="application/pgp-signature"; boundary="=-6aWKv6b8t3NEAnugntoM" Date: Tue, 13 Mar 2012 14:13:10 -0500 Message-ID: <1331665990.24052.42.camel@watermelon.coderich.net> Mime-Version: 1.0 Subject: Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Paolo Bonzini Cc: qemu-devel@nongnu.org --=-6aWKv6b8t3NEAnugntoM Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Mon, 2012-03-12 at 10:34 +0100, Paolo Bonzini wrote: > To be completely correct, I suggest the following behavior: > > 1. Add a discard boolean option to the disk layer. > > 2. If discard is not specified: > > * For files, detect a true/false value by comparing > > stat.st_blocks !=3D stat.st_size>>9. > > * For devices, assume a fixed value (true?). > > 3. If discard is true, issue discards. > > 4. If discard is false, do not issue discards. >=20 > The problem is, who will use this interface? I'm a libvirt and virt-manager user; virt-manager already differentiates between thin and thick provisioning. So I'm envisioning passing that information to libvirt, which would save it in a config file and use that to set discard=3Dtrue vs. discard=3Dfalse when using QEMU. On Mon, 2012-03-12 at 13:27 +0100, Paolo Bonzini wrote: > Il 10/03/2012 19:02, Richard Laager ha scritto: > > I propose adding the following behaviors in any event: > > * If a QEMU block device reports a discard_granularity > 0, it > > must be equal to 2^n (n >=3D 0), or QEMU's block core will chan= ge > > it to 0. (Non-power-of-two granularities are not likely to exis= t > > in the real world, and this assumption greatly simplifies > > ensuring correctness.) >=20 > Yeah, I was considering this to be simply a bug in the block device. >=20 > > * For SCSI, report an unmap_granularity to the guest as follows: > > max(logical_block_size, discard_granularity) / logical_block_size >=20 > This is more or less already in place later in the series. I didn't see it. Which patch number? > > Note, I'm assuming fallocate() actually > > guarantees that it zeros the data when punching holes. >=20 > It does, that's pretty much the definition of a hole. Agreed. I verified this fact after sending that email. At the time, I just wanted to be very clear on what I knew for sure vs. what I had not yet verified. > If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also > be done by skipping the zero write on known holes. >=20 > This could even be done at the block layer level using bdrv_is_allocated. Would we want to make all write_zeros operations check for and skip holes, or is write_zeros different from a discard in that it SHOULD/MUST allocate space? > > If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid > > advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is no= t > > going to work. This would side step these problems.=20 >=20 > ... and introduce others when migrating if your datacenter doesn't have > homogeneous kernel versions and/or filesystems. :( I hadn't thought of the migration issues. Thanks for bringing that up. Worst case, you end up doing a bunch of zero writing if and only if you migrate from a discard_zeros_data host to one that doesn't (or doesn't do discard at all). But this only lasts until the guest reboots (assuming we also add a behavior of re-probing on guest reboot--or until it shuts down if we don't or can't). As far as I can see, this is unavoidable, though. And this is no worse than writing zeros ALL of the time that fallocate() fails, which is the behavior of your patch series, right? This might be another use case for a discard option on the disk. If some but not all of one's hosts support discard, a system administrator might want to set discard=3D0 to avoid this. > Do you know if non-Linux operating systems have something similar to > BLKDISCARDZEROES? As far as I know, no. The SunOS one is only on Illumos (the open source kernel forked from the now dead OpenSolaris) and only implemented for ZFS zvols. So currently, it's roughly equivalent to fallocate() on Linux in that it's happening at the filesystem level. (It doesn't actually reach the platters yet. But even if it did, that's unrelated to the guarantees provided by ZFS.) Thus, it always zeros, so we could set discard_zeros_data =3D 1 unconditionally there. I should probably run that by the Illumos developers, though, to ensure they're comfortable with that ioctl() guaranteeing zeroing. I haven't looked into the FreeBSD one as much yet. Worst case, we unconditionally set discard_zeros_data =3D 0. --=20 Richard --=-6aWKv6b8t3NEAnugntoM Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iEYEABECAAYFAk9fnEEACgkQbfU6uV4fG8453gCg6nAZ2rhmqKCUVPhP+HYCuekn PusAoM0IEPu5kf3QGi6CSDYK+6yY6UMg =kvBP -----END PGP SIGNATURE----- --=-6aWKv6b8t3NEAnugntoM--