From: Eric Blake
Date: Thu, 31 Mar 2016 08:08:29 -0600
Subject: Re: [Qemu-devel] [PATCH v2 1/1] NBD proto: add WRITE_ZEROES extension
To: Alex Bligh, "Denis V. Lunev"
Cc: "nbd-general@lists.sourceforge.net", Kevin Wolf, "qemu-devel@nongnu.org", Pavel Borzenkov, "Stefan stefanha@redhat. com", Paolo Bonzini, Wouter Verhelst
Message-ID: <56FD2F5D.3070400@redhat.com>
In-Reply-To: <24E4A85C-254F-4324-A2F4-9DACA6037381@alex.org.uk>
References: <1459429325-16350-1-git-send-email-den@openvz.org> <24E4A85C-254F-4324-A2F4-9DACA6037381@alex.org.uk>

On 03/31/2016 07:53 AM, Alex Bligh wrote:
>
> On 31 Mar 2016, at 14:02, Denis V. Lunev wrote:
>
>> From: Pavel Borzenkov
>>
>> There exist some cases when a client knows that the data it is going to
>> write is all zeroes. Such cases include mirroring or backing up a device
>> implemented by a sparse file.
>
> Useful.
>
>> -- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`.
>> -  SHOULD be set to 1 if the client requires "Force Unit Access" mode of
>> -  operation. MUST NOT be set unless transmission flags included
>> -  `NBD_FLAG_SEND_FUA`.
>> +- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE` and
>> +  `NBD_CMD_WRITE_ZEROES` commands. SHOULD be set to 1 if the client requires
>> +  "Force Unit Access" mode of operation. MUST NOT be set unless transmission
>> +  flags included `NBD_FLAG_SEND_FUA`.
>
> Not your fault, but this should actually say "unless export flags
> included". Transmission flags would be the flags with the command.

No, we only just renamed 'export flags' to 'transmission flags', to
represent the 16 bits sent by the server at the end of handshake phase;
these are named 'NBD_FLAG_*'. We still use the term 'command flags'
(although maybe 'request flags' is better) for the 16 bits sent with
each request; these are named 'NBD_CMD_FLAG_*'. So Pavel's text is
correct as-is.

>
>> +- bit 1, `NBD_CMD_MAY_TRIM`; defined by the experimental `WRITE_ZEROES`
>> +  extension; see below.
>
> For consistency, probably useful to say here:
>
> MUST NOT be set unless the export flags include NBD_FLAG_SEND_WRITE_ZEROES.

Elsewhere, when defining an experimental extension, the forward
reference has been as sparse as possible; so this sentence (about the
transmission flags including NBD_FLAG_SEND_WRITE_ZEROES) should appear
only in the experimental section, if it is not already there.

>>
>> +### `WRITE_ZEROES` extension
>> +
>> +There exist some cases when a client knows that the data it is going to write
>> +is all zeroes. Such cases include mirroring or backing up a device implemented
>> +by a sparse file. With current NBD command set, the client has to issue
>> +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
>> +through the wire. The server has to write the data onto disk, effectively
>> +losing the sparseness.
>> +
>> +To remedy this, a `WRITE_ZEROES` extension is envisioned.
>> +This extension adds one new command and one new command flag.
>> +
>> +* `NBD_CMD_WRITE_ZEROES` (6)

Wouter recently pointed out that we explicitly do NOT want to repeat
constants in more than one location; define the value (6) above, where
you make the forward reference in the normative section, then keep the
experimental section referring to the command by name only. That is
especially useful if we end up renumbering things, because we have
multiple extension proposals in flight at the moment.

>> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
>> +    flags field, the server MAY use trimming to zero out the area, but it
>> +    MUST ensure that the data reads back as zero.
>> +
>
> Can you give an example of a situation where the client would not set this
> and it would be undesirable for the server to create a 'hole' using
> 'trim' type technology, even when the client doesn't specify it?

Yes, I can see situations where the client REQUIRES that the server
write actual zeroes, rather than trimming. The biggest reason is that in
an environment where storage can be oversubscribed (multiple sparse
files that nominally occupy more space than the underlying storage
contains), explicitly writing zeroes without punching a hole guarantees
that YOUR file has storage allocated to it (whereas if YOUR file is
trimmed, some other file can then use enough allocation to prevent you
from actually writing data in place of the hole). Of course, the client
can still achieve this by sticking with NBD_CMD_WRITE, but that requires
more network traffic.

However, having written that, I'm thinking we have the wrong sense for
the flag. I think it makes more sense to allow trim/hole-punching by
default (but ONLY when the server can guarantee that reads will still be
zeroes), and make the flag NBD_CMD_FLAG_NO_TRIM to explicitly specify
the cases where the server MUST NOT trim but must instead allocate and
write actual zeroes.
I suspect that explicit allocation requests are less common, and also
less efficient; so the default state of the flag should be geared
towards efficiency (both in the sense that punching holes can be faster
than writing zeroes, and that most people LIKE the storage savings of
sparse files).

> I suspect there are already some backends (e.g. ceph on qemu-nbd) which
> will effectively do a 'trim' if you write 4k of zeroes even under
> current circumstances.
>
> IE why not always permit trimming PROVIDED the data always reads back
> as zero? This would be far simpler.
>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
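
To make the two flag namespaces and the proposed inverted flag concrete,
here is a minimal sketch, not taken from the patch or from any existing
server. It assumes the NO_TRIM sense suggested above. The function name
write_zeroes_action, the zero_action enum, and the bit positions chosen
for NBD_FLAG_SEND_FUA, NBD_FLAG_SEND_WRITE_ZEROES and NBD_CMD_FLAG_NO_TRIM
are illustrative assumptions; only bit 0 for NBD_CMD_FLAG_FUA comes from
the quoted text.

```c
#include <stdint.h>

/* Transmission flags: the 16 bits sent by the server at the end of the
 * handshake phase (NBD_FLAG_*). Bit positions here are assumptions. */
#define NBD_FLAG_SEND_FUA          (1u << 3)
#define NBD_FLAG_SEND_WRITE_ZEROES (1u << 6)

/* Command flags: the 16 bits sent by the client with each request
 * (NBD_CMD_FLAG_*). */
#define NBD_CMD_FLAG_FUA     (1u << 0)  /* bit 0, per the quoted patch */
#define NBD_CMD_FLAG_NO_TRIM (1u << 1)  /* hypothetical inverted flag */

/* What a server could do with one NBD_CMD_WRITE_ZEROES request. */
typedef enum {
    ZERO_REJECT,        /* request uses something the server never advertised */
    ZERO_MAY_TRIM,      /* hole-punching allowed if reads still return zeroes */
    ZERO_MUST_ALLOCATE  /* server must allocate and write actual zeroes */
} zero_action;

zero_action write_zeroes_action(uint16_t transmission_flags,
                                uint16_t command_flags)
{
    /* The command is only valid if the server advertised the extension. */
    if (!(transmission_flags & NBD_FLAG_SEND_WRITE_ZEROES)) {
        return ZERO_REJECT;
    }
    /* FUA MUST NOT be set unless the transmission flags included
     * NBD_FLAG_SEND_FUA. */
    if ((command_flags & NBD_CMD_FLAG_FUA) &&
        !(transmission_flags & NBD_FLAG_SEND_FUA)) {
        return ZERO_REJECT;
    }
    /* Default is the efficient path; NO_TRIM opts in to real allocation. */
    if (command_flags & NBD_CMD_FLAG_NO_TRIM) {
        return ZERO_MUST_ALLOCATE;
    }
    return ZERO_MAY_TRIM;
}
```

With these assumptions, a request with no command flags against a server
that advertised NBD_FLAG_SEND_WRITE_ZEROES yields ZERO_MAY_TRIM, and a
client worried about oversubscribed storage sets NBD_CMD_FLAG_NO_TRIM to
get ZERO_MUST_ALLOCATE without sending the zero payload over the wire.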