[Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
@ 2016-02-17 18:10 Denis V. Lunev
  2016-02-17 20:58 ` Eric Blake
  2016-02-18  8:09 ` Alex Bligh
  0 siblings, 2 replies; 22+ messages in thread
From: Denis V. Lunev @ 2016-02-17 18:10 UTC (permalink / raw)
  Cc: nbd-general, Denis V. Lunev, qemu-devel

This patch proposes a new command to reduce the amount of data passed
through the wire when it is known that the data is all zeroes. This
functionality is generally useful for mirroring or backup operations.

Currently available NBD_CMD_TRIM command can not be used as the
specification explicitely says that "a client MUST NOT make any
assumptions about the contents of the export affected by this
[NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`"

Particular use case could be the following:

QEMU project uses own implementation of NBD server to transfer data
in between different instances of QEMU. Typically we tranfer VM virtual
disks over this channel. VM virtual disks are sparse and thus the
efficiency of backup and mirroring operations could be improved a lot.

Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 doc/proto.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/doc/proto.md b/doc/proto.md
index 43065b7..c94751a 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation:
   schedule I/O accesses as for a rotational medium
 - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
   `NBD_CMD_TRIM` commands
+- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
+  supports `NBD_CMD_WRITE_ZEROES` commands
 
 ##### Client flags
 
@@ -446,6 +448,11 @@ The following request types exist:
     about the contents of the export affected by this command, until
     overwriting it again with `NBD_CMD_WRITE`.
 
+* `NBD_CMD_WRITE_ZEROES` (6)
+
+    A request to write zeroes. The command is functional equivalent of
+    the NBD_WRITE_COMMAND but without payload sent through the channel.
+
 * Other requests
 
     Some third-party implementations may require additional protocol
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-17 18:10 [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command Denis V. Lunev
@ 2016-02-17 20:58 ` Eric Blake
  2016-02-18  4:46   ` Denis V. Lunev
                     ` (2 more replies)
  2016-02-18  8:09 ` Alex Bligh
  1 sibling, 3 replies; 22+ messages in thread
From: Eric Blake @ 2016-02-17 20:58 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: nbd-general, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2375 bytes --]

On 02/17/2016 11:10 AM, Denis V. Lunev wrote:
> This patch proposes a new command to reduce the amount of data passed
> through the wire when it is known that the data is all zeroes. This
> functionality is generally useful for mirroring or backup operations.
> 
> Currently available NBD_CMD_TRIM command can not be used as the
> specification explicitely says that "a client MUST NOT make any

s/explicitely/explicitly/

> assumptions about the contents of the export affected by this
> [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`"
> 
> Particular use case could be the following:
> 
> QEMU project uses own implementation of NBD server to transfer data
> in between different instances of QEMU. Typically we tranfer VM virtual

s/tranfer/transfer/

> disks over this channel. VM virtual disks are sparse and thus the
> efficiency of backup and mirroring operations could be improved a lot.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> ---
>  doc/proto.md | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 43065b7..c94751a 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation:
>    schedule I/O accesses as for a rotational medium
>  - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
>    `NBD_CMD_TRIM` commands
> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> +  supports `NBD_CMD_WRITE_ZEROES` commands
>  
>  ##### Client flags
>  
> @@ -446,6 +448,11 @@ The following request types exist:
>      about the contents of the export affected by this command, until
>      overwriting it again with `NBD_CMD_WRITE`.
>  
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    A request to write zeroes. The command is functional equivalent of
> +    the NBD_WRITE_COMMAND but without payload sent through the channel.

This lets us push holes during writes. Do we have the converse
operation, that is, an easy way to query if a block of data will read as
all zeroes, and therefore the client can bypass reading that portion of
the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-17 20:58 ` Eric Blake
@ 2016-02-18  4:46   ` Denis V. Lunev
  2016-02-18  8:30     ` Denis V. Lunev
  2016-02-18  9:18   ` Roman Kagan
  2016-02-18 12:14   ` [Qemu-devel] " Daniel P. Berrange
  2 siblings, 1 reply; 22+ messages in thread
From: Denis V. Lunev @ 2016-02-18  4:46 UTC (permalink / raw)
  To: Eric Blake; +Cc: nbd-general, qemu-devel

On 02/17/2016 11:58 PM, Eric Blake wrote:
> On 02/17/2016 11:10 AM, Denis V. Lunev wrote:
>> This patch proposes a new command to reduce the amount of data passed
>> through the wire when it is known that the data is all zeroes. This
>> functionality is generally useful for mirroring or backup operations.
>>
>> Currently available NBD_CMD_TRIM command can not be used as the
>> specification explicitely says that "a client MUST NOT make any
> s/explicitely/explicitly/
>
>> assumptions about the contents of the export affected by this
>> [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`"
>>
>> Particular use case could be the following:
>>
>> QEMU project uses own implementation of NBD server to transfer data
>> in between different instances of QEMU. Typically we tranfer VM virtual
> s/tranfer/transfer/
>
>> disks over this channel. VM virtual disks are sparse and thus the
>> efficiency of backup and mirroring operations could be improved a lot.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> ---
>>   doc/proto.md | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/doc/proto.md b/doc/proto.md
>> index 43065b7..c94751a 100644
>> --- a/doc/proto.md
>> +++ b/doc/proto.md
>> @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation:
>>     schedule I/O accesses as for a rotational medium
>>   - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
>>     `NBD_CMD_TRIM` commands
>> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
>> +  supports `NBD_CMD_WRITE_ZEROES` commands
>>   
>>   ##### Client flags
>>   
>> @@ -446,6 +448,11 @@ The following request types exist:
>>       about the contents of the export affected by this command, until
>>       overwriting it again with `NBD_CMD_WRITE`.
>>   
>> +* `NBD_CMD_WRITE_ZEROES` (6)
>> +
>> +    A request to write zeroes. The command is functional equivalent of
>> +    the NBD_WRITE_COMMAND but without payload sent through the channel.
> This lets us push holes during writes.
from my point this allows client to apply his policy. For QCOW2 output 
target the
client could skip the block. For RAW file he could decide whether to use 
UNMAP
and produce sparse file or use fallocate.
>   Do we have the converse
> operation, that is, an easy way to query if a block of data will read as
> all zeroes, and therefore the client can bypass reading that portion of
> the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))?
>
exactly!

static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
...
     ret = bdrv_get_block_status_above(source, NULL, sector_num,      
<------- query block state
                                       nb_sectors, &pnum, &file);
     if (ret < 0 || pnum < nb_sectors ||
             (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
         bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
                        mirror_read_complete, op);
     } else if (ret & BDRV_BLOCK_ZERO) {
         bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,  
<------ skip read op if allowed
                               s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
                               mirror_write_complete, op);
     } else {
         assert(!(ret & BDRV_BLOCK_DATA));
         bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
                          mirror_write_complete, op);
     }
     return delay_ns;

Actually I have tried early at day begins to add .bdrv_co_write_zeroes
callback to NBD and it just works as expected. The problem is that
callback can not be written using NDB_SEND_TRIM to conform with the
NBD spec. But in QEMU -> QEMU communication it just works.

http://lists.nongnu.org/archive/html/qemu-devel/2016-02/msg03810.html

Den

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-18  4:46   ` Denis V. Lunev
@ 2016-02-18  8:30     ` Denis V. Lunev
  0 siblings, 0 replies; 22+ messages in thread
From: Denis V. Lunev @ 2016-02-18  8:30 UTC (permalink / raw)
  To: Eric Blake; +Cc: nbd-general, qemu-devel

On 02/18/2016 07:46 AM, Denis V. Lunev wrote:
> On 02/17/2016 11:58 PM, Eric Blake wrote:
>> On 02/17/2016 11:10 AM, Denis V. Lunev wrote:
>>> This patch proposes a new command to reduce the amount of data passed
>>> through the wire when it is known that the data is all zeroes. This
>>> functionality is generally useful for mirroring or backup operations.
>>>
>>> Currently available NBD_CMD_TRIM command can not be used as the
>>> specification explicitely says that "a client MUST NOT make any
>> s/explicitely/explicitly/
>>
>>> assumptions about the contents of the export affected by this
>>> [NBD_CMD_TRIM] command, until overwriting it again with 
>>> `NBD_CMD_WRITE`"
>>>
>>> Particular use case could be the following:
>>>
>>> QEMU project uses own implementation of NBD server to transfer data
>>> in between different instances of QEMU. Typically we tranfer VM virtual
>> s/tranfer/transfer/
>>
>>> disks over this channel. VM virtual disks are sparse and thus the
>>> efficiency of backup and mirroring operations could be improved a lot.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> ---
>>>   doc/proto.md | 7 +++++++
>>>   1 file changed, 7 insertions(+)
>>>
>>> diff --git a/doc/proto.md b/doc/proto.md
>>> index 43065b7..c94751a 100644
>>> --- a/doc/proto.md
>>> +++ b/doc/proto.md
>>> @@ -241,6 +241,8 @@ immediately after the global flags field in 
>>> oldstyle negotiation:
>>>     schedule I/O accesses as for a rotational medium
>>>   - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server 
>>> supports
>>>     `NBD_CMD_TRIM` commands
>>> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the 
>>> server
>>> +  supports `NBD_CMD_WRITE_ZEROES` commands
>>>     ##### Client flags
>>>   @@ -446,6 +448,11 @@ The following request types exist:
>>>       about the contents of the export affected by this command, until
>>>       overwriting it again with `NBD_CMD_WRITE`.
>>>   +* `NBD_CMD_WRITE_ZEROES` (6)
>>> +
>>> +    A request to write zeroes. The command is functional equivalent of
>>> +    the NBD_WRITE_COMMAND but without payload sent through the 
>>> channel.
>> This lets us push holes during writes.
> from my point this allows client to apply his policy. For QCOW2 output 
> target the
s/client/server/

Sorry, have mistyped.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-17 20:58 ` Eric Blake
  2016-02-18  4:46   ` Denis V. Lunev
@ 2016-02-18  9:18   ` Roman Kagan
  2016-02-18 10:36     ` Denis V. Lunev
  2016-02-18 16:35     ` Eric Blake
  2016-02-18 12:14   ` [Qemu-devel] " Daniel P. Berrange
  2 siblings, 2 replies; 22+ messages in thread
From: Roman Kagan @ 2016-02-18  9:18 UTC (permalink / raw)
  To: Eric Blake; +Cc: nbd-general, Denis V. Lunev, qemu-devel

On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote:
> On 02/17/2016 11:10 AM, Denis V. Lunev wrote:
> > @@ -446,6 +448,11 @@ The following request types exist:
> >      about the contents of the export affected by this command, until
> >      overwriting it again with `NBD_CMD_WRITE`.
> >  
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +    A request to write zeroes. The command is functional equivalent of
> > +    the NBD_WRITE_COMMAND but without payload sent through the channel.
> 
> This lets us push holes during writes. Do we have the converse
> operation, that is, an easy way to query if a block of data will read as
> all zeroes, and therefore the client can bypass reading that portion of
> the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))?

The spec doesn't have anything like that.

OTOH, unlike the write case, where you have all the information and just
choose whether to send normal write or zero write, the extra round-trip
of a separate SEEK_HOLE/SEEK_DATA request may lead to actually degrading
the overall throughput.

Rather it may be a better idea to add something like sparse read where
the server would, instead of sending the full length of data in the
response payload, send a smarter variable-length package with a
scatter-gather list or a bitmap of used blocks in the beginning, and let
the client decode it and fill the gaps with zeros.

Roman.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-18  9:18   ` Roman Kagan
@ 2016-02-18 10:36     ` Denis V. Lunev
  2016-02-18 16:35     ` Eric Blake
  1 sibling, 0 replies; 22+ messages in thread
From: Denis V. Lunev @ 2016-02-18 10:36 UTC (permalink / raw)
  To: Roman Kagan, Eric Blake, nbd-general, qemu-devel

On 02/18/2016 12:18 PM, Roman Kagan wrote:
> On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote:
>> On 02/17/2016 11:10 AM, Denis V. Lunev wrote:
>>> @@ -446,6 +448,11 @@ The following request types exist:
>>>       about the contents of the export affected by this command, until
>>>       overwriting it again with `NBD_CMD_WRITE`.
>>>   
>>> +* `NBD_CMD_WRITE_ZEROES` (6)
>>> +
>>> +    A request to write zeroes. The command is functional equivalent of
>>> +    the NBD_WRITE_COMMAND but without payload sent through the channel.
>> This lets us push holes during writes. Do we have the converse
>> operation, that is, an easy way to query if a block of data will read as
>> all zeroes, and therefore the client can bypass reading that portion of
>> the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))?
> The spec doesn't have anything like that.
>
> OTOH, unlike the write case, where you have all the information and just
> choose whether to send normal write or zero write, the extra round-trip
> of a separate SEEK_HOLE/SEEK_DATA request may lead to actually degrading
> the overall throughput.
>
> Rather it may be a better idea to add something like sparse read where
> the server would, instead of sending the full length of data in the
> response payload, send a smarter variable-length package with a
> scatter-gather list or a bitmap of used blocks in the beginning, and let
> the client decode it and fill the gaps with zeros.
>
> Roman.
ah, I see.

This story is more difficult but also viable for backup dirty bitmap
reading. But this will make the protocol more complex and will
require more efforts at specification stage.

I'd better start with the current change, which is simple enough
and make changes in a right direction and after that continue
with READ2 or whatever command.

Den

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-18  9:18   ` Roman Kagan
  2016-02-18 10:36     ` Denis V. Lunev
@ 2016-02-18 16:35     ` Eric Blake
  2016-02-18 17:23       ` [Qemu-devel] SUMMARY: " Denis V. Lunev
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Blake @ 2016-02-18 16:35 UTC (permalink / raw)
  To: Roman Kagan, Denis V. Lunev, nbd-general, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2038 bytes --]

On 02/18/2016 02:18 AM, Roman Kagan wrote:
> On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote:
>> On 02/17/2016 11:10 AM, Denis V. Lunev wrote:
>>> @@ -446,6 +448,11 @@ The following request types exist:
>>>      about the contents of the export affected by this command, until
>>>      overwriting it again with `NBD_CMD_WRITE`.
>>>  
>>> +* `NBD_CMD_WRITE_ZEROES` (6)
>>> +
>>> +    A request to write zeroes. The command is functional equivalent of
>>> +    the NBD_WRITE_COMMAND but without payload sent through the channel.
>>
>> This lets us push holes during writes. Do we have the converse
>> operation, that is, an easy way to query if a block of data will read as
>> all zeroes, and therefore the client can bypass reading that portion of
>> the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))?
> 
> The spec doesn't have anything like that.
> 
> OTOH, unlike the write case, where you have all the information and just
> choose whether to send normal write or zero write, the extra round-trip
> of a separate SEEK_HOLE/SEEK_DATA request may lead to actually degrading
> the overall throughput.
> 
> Rather it may be a better idea to add something like sparse read where
> the server would, instead of sending the full length of data in the
> response payload, send a smarter variable-length package with a
> scatter-gather list or a bitmap of used blocks in the beginning, and let
> the client decode it and fill the gaps with zeros.

Sure, that would work too, and sounds nicer.  Either way, the point is
that we should strongly consider improving the NBD protocol to allow
more efficient handling of sparse files, in both the push and in the
pull direction.  Qemu already has a desire to use both directions of
improvements, but there are more programs, both clients and servers,
outside of qemu, that could benefit from such protocol improvements.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Qemu-devel] SUMMARY: Re: [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-18 16:35     ` Eric Blake
@ 2016-02-18 17:23       ` Denis V. Lunev
  2016-02-18 17:55         ` Eric Blake
                           ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Denis V. Lunev @ 2016-02-18 17:23 UTC (permalink / raw)
  To: Eric Blake, Roman Kagan, nbd-general, qemu-devel, Stefan Hajnoczi,
	Daniel P. Berrange, Vladimir Sementsov-Ogievskiy

On 02/18/2016 07:35 PM, Eric Blake wrote:
> On 02/18/2016 02:18 AM, Roman Kagan wrote:
>> On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote:
>>> On 02/17/2016 11:10 AM, Denis V. Lunev wrote:
>>>> @@ -446,6 +448,11 @@ The following request types exist:
>>>>       about the contents of the export affected by this command, until
>>>>       overwriting it again with `NBD_CMD_WRITE`.
>>>>   
>>>> +* `NBD_CMD_WRITE_ZEROES` (6)
>>>> +
>>>> +    A request to write zeroes. The command is functional equivalent of
>>>> +    the NBD_WRITE_COMMAND but without payload sent through the channel.
>>> This lets us push holes during writes. Do we have the converse
>>> operation, that is, an easy way to query if a block of data will read as
>>> all zeroes, and therefore the client can bypass reading that portion of
>>> the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))?
>> The spec doesn't have anything like that.
>>
>> OTOH, unlike the write case, where you have all the information and just
>> choose whether to send normal write or zero write, the extra round-trip
>> of a separate SEEK_HOLE/SEEK_DATA request may lead to actually degrading
>> the overall throughput.
>>
>> Rather it may be a better idea to add something like sparse read where
>> the server would, instead of sending the full length of data in the
>> response payload, send a smarter variable-length package with a
>> scatter-gather list or a bitmap of used blocks in the beginning, and let
>> the client decode it and fill the gaps with zeros.
> Sure, that would work too, and sounds nicer.  Either way, the point is
> that we should strongly consider improving the NBD protocol to allow
> more efficient handling of sparse files, in both the push and in the
> pull direction.  Qemu already has a desire to use both directions of
> improvements, but there are more programs, both clients and servers,
> outside of qemu, that could benefit from such protocol improvements.
>
OK

Here is a short summary of features which seems necessary from QEMU point of
view:
- ability to avoid sending zeroes during write operation. The proposal 
comes in
   the thread-starter letter
- ability to request block status (allocate/not allocated) from server. 
This seems
   interesting to preserve "sparseness" of the transferring data
- ability to skip zeroes during read operation, i.e. something like 
READ2 command
   which will return vector of chunks as a reply

All 3 features seem usable for generic NBD use-cases and not only for QEMU.

If there are no objections I'll sum this up and come with a 
specification draft.

Den

P.S. I have added here all parties which have participated in 
conversation in
        different threads on QEMU side.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] SUMMARY: Re: [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-18 17:23       ` [Qemu-devel] SUMMARY: " Denis V. Lunev
@ 2016-02-18 17:55         ` Eric Blake
  2016-02-18 19:29         ` [Qemu-devel] [Nbd] " Alex Bligh
  2016-02-19  7:12         ` [Qemu-devel] " Denis V. Lunev
  2 siblings, 0 replies; 22+ messages in thread
From: Eric Blake @ 2016-02-18 17:55 UTC (permalink / raw)
  To: Denis V. Lunev, Roman Kagan, nbd-general, qemu-devel,
	Stefan Hajnoczi, Daniel P. Berrange, Vladimir Sementsov-Ogievskiy,
	Fam Zheng

[-- Attachment #1: Type: text/plain, Size: 1910 bytes --]

On 02/18/2016 10:23 AM, Denis V. Lunev wrote:

> 
> Here is a short summary of features which seems necessary from QEMU
> point of
> view:
> - ability to avoid sending zeroes during write operation. The proposal
> comes in
>   the thread-starter letter
> - ability to request block status (allocate/not allocated) from server.
> This seems
>   interesting to preserve "sparseness" of the transferring data
> - ability to skip zeroes during read operation, i.e. something like
> READ2 command
>   which will return vector of chunks as a reply
> 
> All 3 features seem usable for generic NBD use-cases and not only for QEMU.

All three features must be negotiated as part of connection handshake.
And we want to ensure sane fallbacks:

Client - if the server does not support the features, we fall back to
writing explicit zeroes (and give up on sparseness), and to assuming the
entire image is non-sparse (can't query or read sparseness).

Server - if client requests write 0, optimize where underlying storage
allows it, but we can always fall back to explicitly writing 0s and
merely treating the protocol as a compression of what is sent over the
wire.  If client requests block status, but underlying storage doesn't
provide it, we can always fall back to claiming the entire image is
allocated.  If client requests RAED2 but underlying storage has no way
to detect holes, we can always fall back to sending a single vector
covering the entire read request (no compression).

> 
> If there are no objections I'll sum this up and come with a
> specification draft.

Good luck! I'm sure you'll get good reviews.

> 
> Den
> 
> P.S. I have added here all parties which have participated in
> conversation in
>        different threads on QEMU side.
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Nbd] SUMMARY: Re: [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-18 17:23       ` [Qemu-devel] SUMMARY: " Denis V. Lunev
  2016-02-18 17:55         ` Eric Blake
@ 2016-02-18 19:29         ` Alex Bligh
  2016-02-19  7:12         ` [Qemu-devel] " Denis V. Lunev
  2 siblings, 0 replies; 22+ messages in thread
From: Alex Bligh @ 2016-02-18 19:29 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: nbd-general@lists.sourceforge.net, Vladimir Sementsov-Ogievskiy,
	Alex Bligh, qemu-devel@nongnu.org, Roman Kagan,
	Stefan stefanha@redhat. com, Fam famz@redhat. com


On 18 Feb 2016, at 17:23, Denis V. Lunev <den@openvz.org> wrote:

> - ability to skip zeroes during read operation, i.e. something like 
> READ2 command
>   which will return vector of chunks as a reply

...

> If there are no objections I'll sum this up and come with a 
> specification draft.

If you are fixing READ2 to allow a vector based reply, please also
consider allowing an error to be part of that vector. The protocol
currently has an issue where it is non-obvious how to return
an error encountered midway through a long read where some
data has already been sent to the client (from memory
i.e. unless it's already been fixed).

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] SUMMARY: Re: [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-18 17:23       ` [Qemu-devel] SUMMARY: " Denis V. Lunev
  2016-02-18 17:55         ` Eric Blake
  2016-02-18 19:29         ` [Qemu-devel] [Nbd] " Alex Bligh
@ 2016-02-19  7:12         ` Denis V. Lunev
  2016-02-19  8:56           ` Vladimir Sementsov-Ogievskiy
  2016-02-19  9:11           ` Daniel P. Berrange
  2 siblings, 2 replies; 22+ messages in thread
From: Denis V. Lunev @ 2016-02-19  7:12 UTC (permalink / raw)
  To: Eric Blake, Roman Kagan, nbd-general, qemu-devel, Stefan Hajnoczi,
	Daniel P. Berrange, Vladimir Sementsov-Ogievskiy

On 02/18/2016 08:23 PM, Denis V. Lunev wrote:
> On 02/18/2016 07:35 PM, Eric Blake wrote:
>> On 02/18/2016 02:18 AM, Roman Kagan wrote:
>>> On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote:
>>>> On 02/17/2016 11:10 AM, Denis V. Lunev wrote:
>>>>> @@ -446,6 +448,11 @@ The following request types exist:
>>>>>       about the contents of the export affected by this command, 
>>>>> until
>>>>>       overwriting it again with `NBD_CMD_WRITE`.
>>>>>   +* `NBD_CMD_WRITE_ZEROES` (6)
>>>>> +
>>>>> +    A request to write zeroes. The command is functional 
>>>>> equivalent of
>>>>> +    the NBD_WRITE_COMMAND but without payload sent through the 
>>>>> channel.
>>>> This lets us push holes during writes. Do we have the converse
>>>> operation, that is, an easy way to query if a block of data will 
>>>> read as
>>>> all zeroes, and therefore the client can bypass reading that 
>>>> portion of
>>>> the disk (in other words, an equivalent to 
>>>> lseek(SEEK_HOLE/SEEK_DATA))?
>>> The spec doesn't have anything like that.
>>>
>>> OTOH, unlike the write case, where you have all the information and 
>>> just
>>> choose whether to send normal write or zero write, the extra round-trip
>>> of a separate SEEK_HOLE/SEEK_DATA request may lead to actually 
>>> degrading
>>> the overall throughput.
>>>
>>> Rather it may be a better idea to add something like sparse read where
>>> the server would, instead of sending the full length of data in the
>>> response payload, send a smarter variable-length package with a
>>> scatter-gather list or a bitmap of used blocks in the beginning, and 
>>> let
>>> the client decode it and fill the gaps with zeros.
>> Sure, that would work too, and sounds nicer.  Either way, the point is
>> that we should strongly consider improving the NBD protocol to allow
>> more efficient handling of sparse files, in both the push and in the
>> pull direction.  Qemu already has a desire to use both directions of
>> improvements, but there are more programs, both clients and servers,
>> outside of qemu, that could benefit from such protocol improvements.
>>
> OK
>
> Here is a short summary of features which seems necessary from QEMU 
> point of
> view:
> - ability to avoid sending zeroes during write operation. The proposal 
> comes in
>   the thread-starter letter
> - ability to request block status (allocate/not allocated) from 
> server. This seems
>   interesting to preserve "sparseness" of the transferring data
> - ability to skip zeroes during read operation, i.e. something like 
> READ2 command
>   which will return vector of chunks as a reply
>
> All 3 features seem usable for generic NBD use-cases and not only for 
> QEMU.
>
> If there are no objections I'll sum this up and come with a 
> specification draft.
>
> Den
>
> P.S. I have added here all parties which have participated in 
> conversation in
>        different threads on QEMU side.

interesting point from a verbal discussion with one of my friends.
Protocol level compression could eliminate the necessity to
think about zeroes in channel either from read or from write
point of views and will also reduce the amount of data to
transfer.

Den

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] SUMMARY: Re: [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-19  7:12         ` [Qemu-devel] " Denis V. Lunev
@ 2016-02-19  8:56           ` Vladimir Sementsov-Ogievskiy
  2016-02-19  9:11           ` Daniel P. Berrange
  1 sibling, 0 replies; 22+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-02-19  8:56 UTC (permalink / raw)
  To: Denis V. Lunev, Eric Blake, Roman Kagan, nbd-general, qemu-devel,
	Stefan Hajnoczi, Daniel P. Berrange, Fam Zheng

On 19.02.2016 10:12, Denis V. Lunev wrote:
> On 02/18/2016 08:23 PM, Denis V. Lunev wrote:
>> On 02/18/2016 07:35 PM, Eric Blake wrote:
>>> On 02/18/2016 02:18 AM, Roman Kagan wrote:
>>>> On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote:
>>>>> On 02/17/2016 11:10 AM, Denis V. Lunev wrote:
>>>>>> @@ -446,6 +448,11 @@ The following request types exist:
>>>>>>       about the contents of the export affected by this command, 
>>>>>> until
>>>>>>       overwriting it again with `NBD_CMD_WRITE`.
>>>>>>   +* `NBD_CMD_WRITE_ZEROES` (6)
>>>>>> +
>>>>>> +    A request to write zeroes. The command is functional 
>>>>>> equivalent of
>>>>>> +    the NBD_WRITE_COMMAND but without payload sent through the 
>>>>>> channel.
>>>>> This lets us push holes during writes. Do we have the converse
>>>>> operation, that is, an easy way to query if a block of data will 
>>>>> read as
>>>>> all zeroes, and therefore the client can bypass reading that 
>>>>> portion of
>>>>> the disk (in other words, an equivalent to 
>>>>> lseek(SEEK_HOLE/SEEK_DATA))?
>>>> The spec doesn't have anything like that.
>>>>
>>>> OTOH, unlike the write case, where you have all the information and 
>>>> just
>>>> choose whether to send normal write or zero write, the extra 
>>>> round-trip
>>>> of a separate SEEK_HOLE/SEEK_DATA request may lead to actually 
>>>> degrading
>>>> the overall throughput.
>>>>
>>>> Rather it may be a better idea to add something like sparse read where
>>>> the server would, instead of sending the full length of data in the
>>>> response payload, send a smarter variable-length package with a
>>>> scatter-gather list or a bitmap of used blocks in the beginning, 
>>>> and let
>>>> the client decode it and fill the gaps with zeros.
>>> Sure, that would work too, and sounds nicer.  Either way, the point is
>>> that we should strongly consider improving the NBD protocol to allow
>>> more efficient handling of sparse files, in both the push and in the
>>> pull direction.  Qemu already has a desire to use both directions of
>>> improvements, but there are more programs, both clients and servers,
>>> outside of qemu, that could benefit from such protocol improvements.
>>>
>> OK
>>
>> Here is a short summary of features which seems necessary from QEMU 
>> point of
>> view:
>> - ability to avoid sending zeroes during write operation. The 
>> proposal comes in
>>   the thread-starter letter
>> - ability to request block status (allocate/not allocated) from 
>> server. This seems
>>   interesting to preserve "sparseness" of the transferring data
>> - ability to skip zeroes during read operation, i.e. something like 
>> READ2 command
>>   which will return vector of chunks as a reply
>>
>> All 3 features seem usable for generic NBD use-cases and not only for 
>> QEMU.
>>
>> If there are no objections I'll sum this up and come with a 
>> specification draft.
>>
>> Den
>>
>> P.S. I have added here all parties which have participated in 
>> conversation in
>>        different threads on QEMU side.
>
> interesting point from a verbal discussion with one of my friends.
> Protocol level compression could eliminate the necessity to
> think about zeroes in channel either from read or from write
> point of views and will also reduce the amount of data to
> transfer.
>
> Den

Compression is worse than separate commands, because after decompression 
we will have to write or somehow test these zeroes again.

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] SUMMARY: Re: [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-19  7:12         ` [Qemu-devel] " Denis V. Lunev
  2016-02-19  8:56           ` Vladimir Sementsov-Ogievskiy
@ 2016-02-19  9:11           ` Daniel P. Berrange
  1 sibling, 0 replies; 22+ messages in thread
From: Daniel P. Berrange @ 2016-02-19  9:11 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: nbd-general, Vladimir Sementsov-Ogievskiy, Fam Zheng, qemu-devel,
	Roman Kagan, Stefan Hajnoczi

On Fri, Feb 19, 2016 at 10:12:09AM +0300, Denis V. Lunev wrote:
> On 02/18/2016 08:23 PM, Denis V. Lunev wrote:
> >On 02/18/2016 07:35 PM, Eric Blake wrote:
> >>On 02/18/2016 02:18 AM, Roman Kagan wrote:
> >>>On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote:
> >>>>On 02/17/2016 11:10 AM, Denis V. Lunev wrote:
> >>>>>@@ -446,6 +448,11 @@ The following request types exist:
> >>>>>      about the contents of the export affected by this command,
> >>>>>until
> >>>>>      overwriting it again with `NBD_CMD_WRITE`.
> >>>>>  +* `NBD_CMD_WRITE_ZEROES` (6)
> >>>>>+
> >>>>>+    A request to write zeroes. The command is functional
> >>>>>equivalent of
> >>>>>+    the NBD_WRITE_COMMAND but without payload sent through the
> >>>>>channel.
> >>>>This lets us push holes during writes. Do we have the converse
> >>>>operation, that is, an easy way to query if a block of data will
> >>>>read as
> >>>>all zeroes, and therefore the client can bypass reading that portion
> >>>>of
> >>>>the disk (in other words, an equivalent to
> >>>>lseek(SEEK_HOLE/SEEK_DATA))?
> >>>The spec doesn't have anything like that.
> >>>
> >>>OTOH, unlike the write case, where you have all the information and
> >>>just
> >>>choose whether to send normal write or zero write, the extra round-trip
> >>>of a separate SEEK_HOLE/SEEK_DATA request may lead to actually
> >>>degrading
> >>>the overall throughput.
> >>>
> >>>Rather it may be a better idea to add something like sparse read where
> >>>the server would, instead of sending the full length of data in the
> >>>response payload, send a smarter variable-length package with a
> >>>scatter-gather list or a bitmap of used blocks in the beginning, and
> >>>let
> >>>the client decode it and fill the gaps with zeros.
> >>Sure, that would work too, and sounds nicer.  Either way, the point is
> >>that we should strongly consider improving the NBD protocol to allow
> >>more efficient handling of sparse files, in both the push and in the
> >>pull direction.  Qemu already has a desire to use both directions of
> >>improvements, but there are more programs, both clients and servers,
> >>outside of qemu, that could benefit from such protocol improvements.
> >>
> >OK
> >
> >Here is a short summary of features which seems necessary from QEMU point
> >of
> >view:
> >- ability to avoid sending zeroes during write operation. The proposal
> >comes in
> >  the thread-starter letter
> >- ability to request block status (allocate/not allocated) from server.
> >This seems
> >  interesting to preserve "sparseness" of the transferring data
> >- ability to skip zeroes during read operation, i.e. something like READ2
> >command
> >  which will return vector of chunks as a reply
> >
> >All 3 features seem usable for generic NBD use-cases and not only for
> >QEMU.
> >
> >If there are no objections I'll sum this up and come with a specification
> >draft.
> >
> >Den
> >
> >P.S. I have added here all parties which have participated in conversation
> >in
> >       different threads on QEMU side.
> 
> interesting point from a verbal discussion with one of my friends.
> Protocol level compression could eliminate the necessity to
> think about zeroes in channel either from read or from write
> point of views and will also reduce the amount of data to
> transfer.

With compression you have thrown away information about sparseness which
you really want to have when writing out the file on the other end. It
forces you to do memcmp detection of zero regions after decompression
which is CPU intensive.

Compression is a fine as a concept, but it is not a replacement for
handling sparseness directly in the protocol.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-17 20:58 ` Eric Blake
  2016-02-18  4:46   ` Denis V. Lunev
  2016-02-18  9:18   ` Roman Kagan
@ 2016-02-18 12:14   ` Daniel P. Berrange
  2016-02-18 14:05     ` Denis V. Lunev
  2 siblings, 1 reply; 22+ messages in thread
From: Daniel P. Berrange @ 2016-02-18 12:14 UTC (permalink / raw)
  To: Eric Blake; +Cc: nbd-general, Denis V. Lunev, qemu-devel

On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote:
> On 02/17/2016 11:10 AM, Denis V. Lunev wrote:
> > This patch proposes a new command to reduce the amount of data passed
> > through the wire when it is known that the data is all zeroes. This
> > functionality is generally useful for mirroring or backup operations.
> > 
> > Currently available NBD_CMD_TRIM command can not be used as the
> > specification explicitely says that "a client MUST NOT make any
> 
> s/explicitely/explicitly/
> 
> > assumptions about the contents of the export affected by this
> > [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`"
> > 
> > Particular use case could be the following:
> > 
> > QEMU project uses own implementation of NBD server to transfer data
> > in between different instances of QEMU. Typically we tranfer VM virtual
> 
> s/tranfer/transfer/
> 
> > disks over this channel. VM virtual disks are sparse and thus the
> > efficiency of backup and mirroring operations could be improved a lot.
> > 
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > ---
> >  doc/proto.md | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/doc/proto.md b/doc/proto.md
> > index 43065b7..c94751a 100644
> > --- a/doc/proto.md
> > +++ b/doc/proto.md
> > @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation:
> >    schedule I/O accesses as for a rotational medium
> >  - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
> >    `NBD_CMD_TRIM` commands
> > +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> > +  supports `NBD_CMD_WRITE_ZEROES` commands
> >  
> >  ##### Client flags
> >  
> > @@ -446,6 +448,11 @@ The following request types exist:
> >      about the contents of the export affected by this command, until
> >      overwriting it again with `NBD_CMD_WRITE`.
> >  
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +    A request to write zeroes. The command is functional equivalent of
> > +    the NBD_WRITE_COMMAND but without payload sent through the channel.
> 
> This lets us push holes during writes. Do we have the converse
> operation, that is, an easy way to query if a block of data will read as
> all zeroes, and therefore the client can bypass reading that portion of
> the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))?

Stefan has suggested that we add a command to the NBD spec that
implements the SCSI Get LBA Status command. This lets clients
query the allocation bitmap for the device, which would serve
this purpose.

  https://lists.gnu.org/archive/html/qemu-devel/2016-02/msg03582.html

In that thread he talks about it being a way to serve up the dirty
bitmap for live backup scenario, but in regular usage it obviously
provides the normal allocation bitmap


Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-18 12:14   ` [Qemu-devel] " Daniel P. Berrange
@ 2016-02-18 14:05     ` Denis V. Lunev
  0 siblings, 0 replies; 22+ messages in thread
From: Denis V. Lunev @ 2016-02-18 14:05 UTC (permalink / raw)
  To: Daniel P. Berrange, Eric Blake; +Cc: nbd-general, qemu-devel

On 02/18/2016 03:14 PM, Daniel P. Berrange wrote:
> On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote:
>> On 02/17/2016 11:10 AM, Denis V. Lunev wrote:
>>> This patch proposes a new command to reduce the amount of data passed
>>> through the wire when it is known that the data is all zeroes. This
>>> functionality is generally useful for mirroring or backup operations.
>>>
>>> Currently available NBD_CMD_TRIM command can not be used as the
>>> specification explicitely says that "a client MUST NOT make any
>> s/explicitely/explicitly/
>>
>>> assumptions about the contents of the export affected by this
>>> [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`"
>>>
>>> Particular use case could be the following:
>>>
>>> QEMU project uses own implementation of NBD server to transfer data
>>> in between different instances of QEMU. Typically we tranfer VM virtual
>> s/tranfer/transfer/
>>
>>> disks over this channel. VM virtual disks are sparse and thus the
>>> efficiency of backup and mirroring operations could be improved a lot.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> ---
>>>   doc/proto.md | 7 +++++++
>>>   1 file changed, 7 insertions(+)
>>>
>>> diff --git a/doc/proto.md b/doc/proto.md
>>> index 43065b7..c94751a 100644
>>> --- a/doc/proto.md
>>> +++ b/doc/proto.md
>>> @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation:
>>>     schedule I/O accesses as for a rotational medium
>>>   - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
>>>     `NBD_CMD_TRIM` commands
>>> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
>>> +  supports `NBD_CMD_WRITE_ZEROES` commands
>>>   
>>>   ##### Client flags
>>>   
>>> @@ -446,6 +448,11 @@ The following request types exist:
>>>       about the contents of the export affected by this command, until
>>>       overwriting it again with `NBD_CMD_WRITE`.
>>>   
>>> +* `NBD_CMD_WRITE_ZEROES` (6)
>>> +
>>> +    A request to write zeroes. The command is functional equivalent of
>>> +    the NBD_WRITE_COMMAND but without payload sent through the channel.
>> This lets us push holes during writes. Do we have the converse
>> operation, that is, an easy way to query if a block of data will read as
>> all zeroes, and therefore the client can bypass reading that portion of
>> the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))?
> Stefan has suggested that we add a command to the NBD spec that
> implements the SCSI Get LBA Status command. This lets clients
> query the allocation bitmap for the device, which would serve
> this purpose.
>
>    https://lists.gnu.org/archive/html/qemu-devel/2016-02/msg03582.html
>
> In that thread he talks about it being a way to serve up the dirty
> bitmap for live backup scenario, but in regular usage it obviously
> provides the normal allocation bitmap
>
>
> Regards,
> Daniel
But in this case we should allow to query the information for
more than one block at once and also we will have to make
an agreement in between the client and server about
the granularity of the request or specify the granularity
along as the range in the call.

Den

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-17 18:10 [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command Denis V. Lunev
  2016-02-17 20:58 ` Eric Blake
@ 2016-02-18  8:09 ` Alex Bligh
  2016-02-18  8:34   ` Denis V. Lunev
  1 sibling, 1 reply; 22+ messages in thread
From: Alex Bligh @ 2016-02-18  8:09 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: nbd-general@lists.sourceforge.net, qemu-devel@nongnu.org,
	Alex Bligh


On 17 Feb 2016, at 18:10, Denis V. Lunev <den@openvz.org> wrote:

> Currently available NBD_CMD_TRIM command can not be used as the
> specification explicitely says that "a client MUST NOT make any
> assumptions about the contents of the export affected by this
> [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`"

Would a flag to NBD_CMD_TRIM that says "ensure the written
data is zeroed" not be an easier solution than adding another
very similar command?

Or (cough) changing the spec?

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-18  8:09 ` Alex Bligh
@ 2016-02-18  8:34   ` Denis V. Lunev
  2016-03-04  8:49     ` [Qemu-devel] [Nbd] " Wouter Verhelst
  0 siblings, 1 reply; 22+ messages in thread
From: Denis V. Lunev @ 2016-02-18  8:34 UTC (permalink / raw)
  To: Alex Bligh; +Cc: nbd-general@lists.sourceforge.net, qemu-devel@nongnu.org

On 02/18/2016 11:09 AM, Alex Bligh wrote:
> On 17 Feb 2016, at 18:10, Denis V. Lunev <den@openvz.org> wrote:
>
>> Currently available NBD_CMD_TRIM command can not be used as the
>> specification explicitely says that "a client MUST NOT make any
>> assumptions about the contents of the export affected by this
>> [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`"
> Would a flag to NBD_CMD_TRIM that says "ensure the written
> data is zeroed" not be an easier solution than adding another
> very similar command?
>
> Or (cough) changing the spec?
>
from the point of the receiver the situation (from my POW) could
be different. Let us assume that we are writing to the plain
file.

There are 2 type of queries:
- pls make the target sparse, i.e. perform FALLOC_FL_PUNCH_HOLE
   and there is no problem that the operation could not be performed,
   this is a hint;
- pls write the following amount of zeroes in either way (even calling
   write directly), i.e. ensure that the data is zeroed and the space on
   the file system is allocated for that.

The difference comes from the situation when sparse files results
in a serious performance impact if written by small pieces. In this
case the data of one block of the virtual disk can be placed to
very different blocks of the physical disk seriously impacting
subsequent sequential reading.

Thus I propose to make the situation distinct for these commands.
Though you proposal to add NBD_FLAG_SEND_TRIM_ZEROED to export
flags is also great. Though this IMHO is a different story.

Den

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Nbd] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-02-18  8:34   ` Denis V. Lunev
@ 2016-03-04  8:49     ` Wouter Verhelst
  2016-03-04  9:54       ` Kevin Wolf
  0 siblings, 1 reply; 22+ messages in thread
From: Wouter Verhelst @ 2016-03-04  8:49 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: nbd-general@lists.sourceforge.net, qemu-devel@nongnu.org,
	Alex Bligh

Hi folks,

(sorry about the lateness of this reply, was busy for the last few weeks)

On Thu, Feb 18, 2016 at 11:34:04AM +0300, Denis V. Lunev wrote:
> On 02/18/2016 11:09 AM, Alex Bligh wrote:
> > On 17 Feb 2016, at 18:10, Denis V. Lunev <den@openvz.org> wrote:
> >
> >> Currently available NBD_CMD_TRIM command can not be used as the
> >> specification explicitely says that "a client MUST NOT make any
> >> assumptions about the contents of the export affected by this
> >> [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`"
> > Would a flag to NBD_CMD_TRIM that says "ensure the written
> > data is zeroed" not be an easier solution than adding another
> > very similar command?
> >
> > Or (cough) changing the spec?
> >
> from the point of the receiver the situation (from my POW) could
> be different. Let us assume that we are writing to the plain
> file.
> 
> There are 2 type of queries:
> - pls make the target sparse, i.e. perform FALLOC_FL_PUNCH_HOLE
>    and there is no problem that the operation could not be performed,
>    this is a hint;

This is what NBD_CMD_TRIM does, currently.

The reason this is a hint, is that there is no guarantee that the
underlying operating system or storage even supports
FALLOC_FL_PUNCH_HOLE (or similar). We could have made NBD_CMD_TRIM fail
with a "not possible on this export" kind of error in that case, but it
was chosen not to do that (for reasons I don't remember; maybe we just
didn't consider this enough).

This could be remedied if the client could somehow ask what the result
of a TRIM command would be; i.e., if the server has support for
FALLOC_FL_PUNCH_HOLE, it could set a flag which would let the client
know that NBD_CMD_TRIM will zero out bytes. If the server doesn't set
that flag and the client requires zeroes, it could then just issue a
WRITE command, followed (maybe) by a TRIM for the same region (which
would be less optimal, but have the same result with older servers)

> - pls write the following amount of zeroes in either way (even calling
>    write directly), i.e. ensure that the data is zeroed and the space on
>    the file system is allocated for that.

IOW, you *don't* want to have a sparse file in that case? Or do I
misunderstand things here?

I'm not entirely sure I understand the problem here...

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Nbd] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-03-04  8:49     ` [Qemu-devel] [Nbd] " Wouter Verhelst
@ 2016-03-04  9:54       ` Kevin Wolf
  2016-03-04 14:03         ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Kevin Wolf @ 2016-03-04  9:54 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general@lists.sourceforge.net, Denis V. Lunev,
	qemu-devel@nongnu.org, Alex Bligh

Am 04.03.2016 um 09:49 hat Wouter Verhelst geschrieben:
> Hi folks,
> 
> (sorry about the lateness of this reply, was busy for the last few weeks)
> 
> On Thu, Feb 18, 2016 at 11:34:04AM +0300, Denis V. Lunev wrote:
> > On 02/18/2016 11:09 AM, Alex Bligh wrote:
> > > On 17 Feb 2016, at 18:10, Denis V. Lunev <den@openvz.org> wrote:
> > >
> > >> Currently available NBD_CMD_TRIM command can not be used as the
> > >> specification explicitely says that "a client MUST NOT make any
> > >> assumptions about the contents of the export affected by this
> > >> [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`"
> > > Would a flag to NBD_CMD_TRIM that says "ensure the written
> > > data is zeroed" not be an easier solution than adding another
> > > very similar command?
> > >
> > > Or (cough) changing the spec?
> > >
> > from the point of the receiver the situation (from my POW) could
> > be different. Let us assume that we are writing to the plain
> > file.
> > 
> > There are 2 type of queries:
> > - pls make the target sparse, i.e. perform FALLOC_FL_PUNCH_HOLE
> >    and there is no problem that the operation could not be performed,
> >    this is a hint;
> 
> This is what NBD_CMD_TRIM does, currently.
> 
> The reason this is a hint, is that there is no guarantee that the
> underlying operating system or storage even supports
> FALLOC_FL_PUNCH_HOLE (or similar). We could have made NBD_CMD_TRIM fail
> with a "not possible on this export" kind of error in that case, but it
> was chosen not to do that (for reasons I don't remember; maybe we just
> didn't consider this enough).
> 
> This could be remedied if the client could somehow ask what the result
> of a TRIM command would be; i.e., if the server has support for
> FALLOC_FL_PUNCH_HOLE, it could set a flag which would let the client
> know that NBD_CMD_TRIM will zero out bytes. If the server doesn't set
> that flag and the client requires zeroes, it could then just issue a
> WRITE command, followed (maybe) by a TRIM for the same region (which
> would be less optimal, but have the same result with older servers)

NBD_CMD_TRIM covers the case "I don't need this data any more, you can
throw it away", and I think treating that purely as a hint is perfectly
fine.

> > - pls write the following amount of zeroes in either way (even calling
> >    write directly), i.e. ensure that the data is zeroed and the space on
> >    the file system is allocated for that.
> 
> IOW, you *don't* want to have a sparse file in that case? Or do I
> misunderstand things here?

I think what we're looking for is more like "zero out this area, feel
free to use whatever method is most efficient to achieve that".

So if the server knows that the backing store supports an efficient way
to write zeros (e.g. FALLOC_FL_ZERO_RANGE), it will use that. Otherwise,
if TRIM works and we know that the result is zeroed space instead of
undefined contents, the server is free to use it. And if even that
fails, it just falls back to an explicit write of a zeroed buffer.

If we want, we can give the client a little more control about whether
or not discarding in the process is allowed (or maybe even preferred).
qemu's interface for writing zeros has a BDRV_REQ_MAY_UNMAP flag, for
example.

Kevin

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Nbd] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-03-04  9:54       ` Kevin Wolf
@ 2016-03-04 14:03         ` Paolo Bonzini
  2016-03-06 10:28           ` Wouter Verhelst
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2016-03-04 14:03 UTC (permalink / raw)
  To: Kevin Wolf, Wouter Verhelst
  Cc: nbd-general@lists.sourceforge.net, Denis V. Lunev,
	qemu-devel@nongnu.org, Alex Bligh

On 04/03/2016 10:54, Kevin Wolf wrote:
>>> - pls write the following amount of zeroes in either way (even calling
>>> > >    write directly), i.e. ensure that the data is zeroed and the space on
>>> > >    the file system is allocated for that.
>> > 
>> > IOW, you *don't* want to have a sparse file in that case? Or do I
>> > misunderstand things here?
> I think what we're looking for is more like "zero out this area, feel
> free to use whatever method is most efficient to achieve that".
> 
> So if the server knows that the backing store supports an efficient way
> to write zeros (e.g. FALLOC_FL_ZERO_RANGE), it will use that. Otherwise,
> if TRIM works and we know that the result is zeroed space instead of
> undefined contents, the server is free to use it. And if even that
> fails, it just falls back to an explicit write of a zeroed buffer.
> 
> If we want, we can give the client a little more control about whether
> or not discarding in the process is allowed (or maybe even preferred).
> qemu's interface for writing zeros has a BDRV_REQ_MAY_UNMAP flag, for
> example.

NBD-wise, I think the TRIM command is good as it is, and
NBD_CMD_WRITE_ZEROES should be added like Den is doing.

It also makes sense to use trimming to implement NBD_CMD_WRITE_ZEROES,
but it should be explicitly requested by the user.  For this, my
suggestion is that NBD_CMD_WRITE_ZEROES should have an NBD_FLAG_TRY_TRIM
flag in bit 16.  If specified, the backend can use a zero-writing
mechanism that trims, _but_ it must ensure that the bytes read as zero.
 If it cannot ensure that, it must not trim and it should instead do a
full write.  This is similar to the SCSI command WRITE SAME (when the
command payload is all zeroes).  Like Kevin said, it also happens to map
nicely to the QEMU block device layer.

Paolo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Nbd] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-03-04 14:03         ` Paolo Bonzini
@ 2016-03-06 10:28           ` Wouter Verhelst
  2016-03-06 18:54             ` Denis V. Lunev
  0 siblings, 1 reply; 22+ messages in thread
From: Wouter Verhelst @ 2016-03-06 10:28 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, nbd-general@lists.sourceforge.net,
	qemu-devel@nongnu.org, Denis V. Lunev

Hi all,

On Fri, Mar 04, 2016 at 03:03:26PM +0100, Paolo Bonzini wrote:
> NBD-wise, I think the TRIM command is good as it is, and
> NBD_CMD_WRITE_ZEROES should be added like Den is doing.
> 
> It also makes sense to use trimming to implement NBD_CMD_WRITE_ZEROES,
> but it should be explicitly requested by the user.  For this, my
> suggestion is that NBD_CMD_WRITE_ZEROES should have an
> NBD_FLAG_TRY_TRIM flag in bit 16.  If specified, the backend can use a
> zero-writing mechanism that trims, _but_ it must ensure that the bytes
> read as zero.  If it cannot ensure that, it must not trim and it
> should instead do a full write.  This is similar to the SCSI command
> WRITE SAME (when the command payload is all zeroes).  Like Kevin said,
> it also happens to map nicely to the QEMU block device layer.

That seems like a sensible approach, yes.

Would one of you who feels strongly about this be willing to write up a
proposed spec for that? I could do it, but since I'm not 100% sure I
understand all the specific requirements, I'm uncomfortable doing so.

If that doesn't raise any obvious issues that I'm aware of, I'd be happy
to add it to the "experimental" section of the proto.md file.

(speaking of which, I notice that the STARTTLS patches got merged into
qemu, so I've moved the description of that to the main body and out of
the "experimental" part of that document)

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Nbd] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command
  2016-03-06 10:28           ` Wouter Verhelst
@ 2016-03-06 18:54             ` Denis V. Lunev
  0 siblings, 0 replies; 22+ messages in thread
From: Denis V. Lunev @ 2016-03-06 18:54 UTC (permalink / raw)
  To: Wouter Verhelst, Paolo Bonzini
  Cc: Kevin Wolf, nbd-general@lists.sourceforge.net,
	qemu-devel@nongnu.org

On 03/06/2016 01:28 PM, Wouter Verhelst wrote:
> Hi all,
>
> On Fri, Mar 04, 2016 at 03:03:26PM +0100, Paolo Bonzini wrote:
>> NBD-wise, I think the TRIM command is good as it is, and
>> NBD_CMD_WRITE_ZEROES should be added like Den is doing.
>>
>> It also makes sense to use trimming to implement NBD_CMD_WRITE_ZEROES,
>> but it should be explicitly requested by the user.  For this, my
>> suggestion is that NBD_CMD_WRITE_ZEROES should have an
>> NBD_FLAG_TRY_TRIM flag in bit 16.  If specified, the backend can use a
>> zero-writing mechanism that trims, _but_ it must ensure that the bytes
>> read as zero.  If it cannot ensure that, it must not trim and it
>> should instead do a full write.  This is similar to the SCSI command
>> WRITE SAME (when the command payload is all zeroes).  Like Kevin said,
>> it also happens to map nicely to the QEMU block device layer.
> That seems like a sensible approach, yes.
>
> Would one of you who feels strongly about this be willing to write up a
> proposed spec for that? I could do it, but since I'm not 100% sure I
> understand all the specific requirements, I'm uncomfortable doing so.
>
> If that doesn't raise any obvious issues that I'm aware of, I'd be happy
> to add it to the "experimental" section of the proto.md file.
>
> (speaking of which, I notice that the STARTTLS patches got merged into
> qemu, so I've moved the description of that to the main body and out of
> the "experimental" part of that document)
>
I am going to send next RFC next Wednesday.
Just waited other opinions not from QEMU side.

Den

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-03-06 18:54 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-02-17 18:10 [Qemu-devel] [RFC 1/1] nbd (specification): add NBD_CMD_WRITE_ZEROES command Denis V. Lunev
2016-02-17 20:58 ` Eric Blake
2016-02-18  4:46   ` Denis V. Lunev
2016-02-18  8:30     ` Denis V. Lunev
2016-02-18  9:18   ` Roman Kagan
2016-02-18 10:36     ` Denis V. Lunev
2016-02-18 16:35     ` Eric Blake
2016-02-18 17:23       ` [Qemu-devel] SUMMARY: " Denis V. Lunev
2016-02-18 17:55         ` Eric Blake
2016-02-18 19:29         ` [Qemu-devel] [Nbd] " Alex Bligh
2016-02-19  7:12         ` [Qemu-devel] " Denis V. Lunev
2016-02-19  8:56           ` Vladimir Sementsov-Ogievskiy
2016-02-19  9:11           ` Daniel P. Berrange
2016-02-18 12:14   ` [Qemu-devel] " Daniel P. Berrange
2016-02-18 14:05     ` Denis V. Lunev
2016-02-18  8:09 ` Alex Bligh
2016-02-18  8:34   ` Denis V. Lunev
2016-03-04  8:49     ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-04  9:54       ` Kevin Wolf
2016-03-04 14:03         ` Paolo Bonzini
2016-03-06 10:28           ` Wouter Verhelst
2016-03-06 18:54             ` Denis V. Lunev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).