From: "Denis V. Lunev"
To: Eric Blake , Alex Bligh
Cc: "nbd-general@lists.sourceforge.net" , Kevin Wolf , "qemu-devel@nongnu.org" , Pavel Borzenkov , "Stefan stefanha@redhat.com" , Wouter Verhelst , Paolo Bonzini
Date: Mon, 4 Apr 2016 22:54:02 +0300
Subject: Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
Message-ID: <5702C65A.7040101@openvz.org>
In-Reply-To: <5702C1AB.8020601@redhat.com>
References: <1459787950-15286-1-git-send-email-eblake@redhat.com> <7AD0DCB1-1868-4AAD-A03D-C976A728DD75@alex.org.uk> <5702C1AB.8020601@redhat.com>

On 04/04/2016 10:34 PM, Eric Blake wrote:
> On 04/04/2016 12:06 PM, Alex Bligh wrote:
>> On 4 Apr 2016, at 17:39, Eric Blake wrote:
>>
>>> +  This command is meant to operate in tandem with other (non-NBD)
>>> +  channels to the server.  Generally, a "dirty" block is a block
>>> +  that has been written to by someone, but the exact meaning of "has
>>> +  been written" is left to the implementation.  For example, a
>>> +  virtual machine monitor could provide a (non-NBD) command to start
>>> +  tracking blocks written by the virtual machine.
>>> +  A backup client
>>> +  can then connect to an NBD server provided by the virtual machine
>>> +  monitor and use `NBD_CMD_BLOCK_STATUS` with the
>>> +  `NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
>>> +  blocks that the virtual machine has changed.
>>> +
>>> +  An implementation that doesn't track the "dirtiness" state of
>>> +  blocks MUST either fail this command with `EINVAL`, or mark all
>>> +  blocks as dirty in the descriptor that it returns.  Upon receiving
>>> +  an `NBD_CMD_BLOCK_STATUS` command with the flag
>>> +  `NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
>>> +  status of the device, where the status field of each descriptor is
>>> +  determined by the following bit:
>>> +
>>> +  - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
>>> +    portion of the file that is still clean because it has not
>>> +    been written; if clear, the block represents a portion of the
>>> +    file that is dirty, or where the server could not otherwise
>>> +    determine its status.
>>
>> A couple of questions:
>>
>> 1. I am not sure that the block dirtiness and the zero/allocation/hole
>>    thing always have the same natural blocksize. It's pretty easy to
>>    imagine a server whose natural blocksize is a disk sector (and can
>>    therefore report presence of zeroes to that resolution) but where
>>    'dirtiness' was maintained independently at a less fine-grained
>>    level. Maybe that suggests 2 commands would be useful.
>
> In fact, qemu does just that with qcow2 images - the user can request a
> dirtiness granularity that is much larger than cluster granularity
> (where clusters are the current limitation on reporting holes, but where
> Kevin Wolf has an idea about a potential qcow2 extension that would even
> let us report holes at a sector granularity).
>
> Nothing requires the two uses to report at the same granularity.
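To make the quoted draft semantics concrete, here is a rough client-side sketch in Python. It assumes each reply descriptor is a (length, status) pair, as the draft's "status field of each descriptor" wording suggests; the function name is illustrative, not from the spec. It picks out the extents a backup client would copy, treating a clear `NBD_STATE_CLEAN` bit as "dirty or unknown" per the text above:

```python
# Sketch: walk NBD_CMD_BLOCK_STATUS reply descriptors and yield the
# dirty extents a backup client would need to copy.  Descriptor layout
# (length, status) is assumed from the draft extension text.

NBD_STATE_CLEAN = 1 << 2  # bit 2, as defined in the quoted draft


def dirty_extents(request_offset, descriptors):
    """Yield (offset, length) for every descriptor lacking NBD_STATE_CLEAN."""
    offset = request_offset
    for length, status in descriptors:
        if not (status & NBD_STATE_CLEAN):
            # Clear bit means dirty, or the server could not determine
            # the status -- either way the client must copy it.
            yield (offset, length)
        offset += length


# Example: 64 KiB request starting at offset 0; the middle 16 KiB is dirty.
descs = [(16384, NBD_STATE_CLEAN), (16384, 0), (32768, NBD_STATE_CLEAN)]
print(list(dirty_extents(0, descs)))  # -> [(16384, 16384)]
```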
> The
> NBD_REPLY_TYPE_BLOCK_STATUS reply allows the server to divide into
> descriptors as it sees fit (so it could report holes at a 4k
> granularity, but dirtiness only at a 64k granularity) - all that
> matters is that when all the descriptors have been sent, they total up
> to the length of the original client request.  So by itself,
> granularity does not require another command.

Exactly!

>> 2. Given the communication is out of band, how is it realistically
>>    possible to sync this backup? You'll ask for all the dirty blocks,
>>    but whilst the command is being executed (as well as immediately
>>    after the reply) further blocks may be dirtied. So your reply
>>    always overestimates what is clean (probably the wrong way around).
>>    Furthermore, the next time you do a 'backup', you don't know
>>    whether the blocks were dirty as they were dirty on the previous
>>    backup, or because they were dirty on this backup.
>
> You are correct that as a one-way operation, querying dirtiness is not
> very useful if there is not a way to mark something clean, or if
> something else can be dirtying things in parallel.  But that doesn't
> mean the command is not useful - if the NBD server is exporting a file
> as read-only, where nothing else can be dirtying it in parallel, then a
> single pass over the dirty information is sufficient to learn what
> portions of the file to copy out.
>
> At this point, I was just trying to rebase the proposal as originally
> made by Denis and Pavel; perhaps they will have more insight on how they
> envisioned using the command, or on whether we should try harder to make
> this more of a two-way protocol (where the client can tell the server
> when to mark something as clean, or when to start tracking whether
> something is dirty).

For now, and for QEMU, we want this command to expose the accumulated
dirtiness of the block device as collected by the server. Yes, this
requires external coordination.
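The granularity point above boils down to a single invariant: however the server chooses to slice the range into descriptors, their lengths must sum to the length of the original request. A small sketch, again assuming (length, status) descriptor pairs and illustrative names:

```python
# Sketch of the invariant described above: any descriptor granularity is
# valid as long as the descriptor lengths sum to the request length.
# Descriptors are assumed to be (length, status) pairs.


def reply_is_well_formed(request_length, descriptors):
    """Check that descriptors exactly cover the requested range."""
    return sum(length for length, _status in descriptors) == request_length


# A 128 KiB request answered at 64 KiB dirty-tracking granularity...
dirty_reply = [(65536, 0), (65536, 1 << 2)]
# ...and the same request answered at 4 KiB hole-reporting granularity.
hole_reply = [(4096, s % 2) for s in range(32)]

print(reply_is_well_formed(131072, dirty_reply))      # -> True
print(reply_is_well_formed(131072, hole_reply))       # -> True
print(reply_is_well_formed(131072, dirty_reply[:1]))  # -> False (short reply)
```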
Maybe this could be part of the protocol, but QEMU will not use that
part of the protocol.

Speaking of dirtiness, we would soon come to the fact that we can have
several dirtiness states for different lines of incremental backups.
This complexity is hidden inside QEMU and it would be very difficult to
publish and reuse it.

>> If I was designing a backup protocol (off the top of my head) I'd
>> make all commands return a monotonic 64 bit counter of the number of
>> writes to the disk since some arbitrary time, and provide a 'GETDIRTY'
>> command that returned all blocks with a monotonic counter greater than
>> that. That way I could precisely get the writes that were executed
>> since any particular read. You'd allow it to be 'slack' and include
>> things in that list that might not have changed (i.e. false positives)
>> but not false negatives.
>
> Yes, that might work as an implementation - but there's the question of
> whether other implementations would also work.  We want the protocol to
> describe the concept, and not be too heavily tied to one particular
> implementation.
>
> The documentation is also trying to be very straightforward that asking
> about dirtiness requires out-of-band coordination, and that a server can
> just blindly report everything as dirty if there is no better thing to
> report.  So anyone actually making use of this command already has to be
> aware of the out-of-band coordination needed to make it useful.

Yes, and this approach is perfect. If there is no information about
dirtiness, we should report everything as dirty. Though this information
could be type-specific.
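For reference, Alex's off-the-top-of-his-head counter scheme can be sketched as follows. This is purely illustrative of the proposal, not part of any spec or of QEMU: every write stamps the block with a monotonically increasing counter, and a hypothetical GETDIRTY(since) returns every block written after 'since'. False positives are permitted; false negatives are not:

```python
# Sketch of the monotonic-counter proposal quoted above (illustrative
# names; not part of the NBD spec or of QEMU).  Each write bumps a
# global counter and stamps the written block with it; GETDIRTY(since)
# returns all blocks whose last write is newer than 'since'.


class MonotonicDirtyTracker:
    def __init__(self, num_blocks):
        self.counter = 0                # monotonic write counter
        self.stamp = [0] * num_blocks   # last-write counter per block

    def write(self, block):
        self.counter += 1
        self.stamp[block] = self.counter
        return self.counter

    def get_dirty(self, since):
        """Blocks whose last write has a counter greater than 'since'."""
        return [b for b, c in enumerate(self.stamp) if c > since]


t = MonotonicDirtyTracker(8)
t.write(1); t.write(3)
mark = t.counter          # counter the backup client read at backup time
t.write(3); t.write(5)    # writes landing after the backup started
print(t.get_dirty(mark))  # -> [3, 5]
```

This directly answers the "was it dirty on the previous backup or this one?" ambiguity Alex raises: the client just remembers the counter value from its last pass.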