From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:55080)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <paolo.bonzini@gmail.com>) id 1S74MG-0001uh-02
	for qemu-devel@nongnu.org; Mon, 12 Mar 2012 08:28:36 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <paolo.bonzini@gmail.com>) id 1S74ME-0001ud-1F
	for qemu-devel@nongnu.org; Mon, 12 Mar 2012 08:28:11 -0400
Received: from mail-gx0-f173.google.com ([209.85.161.173]:54318)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <paolo.bonzini@gmail.com>) id 1S74MD-0001uT-Qi
	for qemu-devel@nongnu.org; Mon, 12 Mar 2012 08:28:09 -0400
Received: by ggnj2 with SMTP id j2so2952403ggn.4
	for <qemu-devel@nongnu.org>; Mon, 12 Mar 2012 05:28:07 -0700 (PDT)
Sender: Paolo Bonzini <paolo.bonzini@gmail.com>
Message-ID: <4F5DEBCE.3040409@redhat.com>
Date: Mon, 12 Mar 2012 13:27:58 +0100
From: Paolo Bonzini <pbonzini@redhat.com>
MIME-Version: 1.0
References: <1331226917-6658-1-git-send-email-pbonzini@redhat.com>
	<1331226917-6658-7-git-send-email-pbonzini@redhat.com>
	<4F5A31B2.3050701@redhat.com> <4F5A46A1.4000508@redhat.com>
	<1331402560.8577.46.camel@watermelon.coderich.net>
In-Reply-To: <1331402560.8577.46.camel@watermelon.coderich.net>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co,
 aio}_discard for write_zeroes operations
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Richard Laager <rlaager@wiktel.com>
Cc: Kevin Wolf <kwolf@redhat.com>, qemu-devel@nongnu.org

On 10/03/2012 19:02, Richard Laager wrote:
> I propose adding the following behaviors in any event:
>       * If a QEMU block device reports a discard_granularity > 0, it
>         must be equal to 2^n (n >= 0), or QEMU's block core will change
>         it to 0. (Non-power-of-two granularities are not likely to exist
>         in the real world, and this assumption greatly simplifies
>         ensuring correctness.)

Yeah, I was considering this to be simply a bug in the block device.
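
For illustration only, the check proposed above could look roughly like
this in the block core. The helper name is hypothetical and is not part
of this series:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the series): accept a driver-reported
 * discard granularity only if it is a power of two.  Anything else is
 * treated as "no discard support" and mapped to 0.
 */
static uint32_t sanitize_discard_granularity(uint32_t granularity)
{
    /* A power of two has exactly one bit set, so g & (g - 1) == 0. */
    if (granularity != 0 && (granularity & (granularity - 1)) != 0) {
        return 0;
    }
    return granularity;
}
```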

>       * For SCSI, report an unmap_granularity to the guest as follows:
>       max(logical_block_size, discard_granularity) / logical_block_size

This is more or less already in place later in the series.
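
To make the quoted formula concrete, here is a minimal sketch. The
function name is made up for this mail and this is not the code from the
series; it only assumes that both values are expressed in bytes:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: unmap granularity in logical blocks, i.e.
 * max(logical_block_size, discard_granularity) / logical_block_size.
 */
static uint32_t unmap_granularity_blocks(uint32_t logical_block_size,
                                         uint32_t discard_granularity)
{
    uint32_t bytes;

    assert(logical_block_size > 0);
    bytes = discard_granularity > logical_block_size
          ? discard_granularity : logical_block_size;
    return bytes / logical_block_size;
}
```

With 512-byte logical blocks and a 4096-byte discard granularity this
yields 8 blocks; a granularity of 0 or one smaller than the logical
block size yields 1.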

> As a design concept, instead of guaranteeing that 512B zeroing discards
> are supported, I think the QEMU block layer should instead guarantee
> aligned discards to QEMU block devices, emulating any misaligned
> discards (or portions thereof) by writing zeroes if (and only if)
> discard_zeros_data is set.

Yes, this can be done of course.  This series does not include it yet.
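
As a rough sketch of what that emulation would need, a request can be
split into a misaligned head, an aligned middle and a misaligned tail.
The struct and function below are hypothetical, assume a power-of-two
granularity, and assume offset + len does not overflow:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: all offsets and lengths are in bytes. */
typedef struct DiscardSplit {
    uint64_t head_offset, head_len;   /* misaligned head: write zeroes */
    uint64_t mid_offset, mid_len;     /* aligned middle: real discard */
    uint64_t tail_offset, tail_len;   /* misaligned tail: write zeroes */
} DiscardSplit;

static DiscardSplit split_discard(uint64_t offset, uint64_t len,
                                  uint64_t granularity)
{
    DiscardSplit s = {0};
    uint64_t end = offset + len;
    uint64_t mid_start, mid_end;

    /* Only power-of-two granularities are handled here. */
    assert(granularity != 0 && (granularity & (granularity - 1)) == 0);

    mid_start = (offset + granularity - 1) & ~(granularity - 1); /* round up */
    mid_end = end & ~(granularity - 1);                          /* round down */

    if (mid_start >= mid_end) {
        /* No whole aligned chunk inside the request: it is all "head". */
        s.head_offset = offset;
        s.head_len = len;
        s.mid_offset = s.tail_offset = end;
        return s;
    }

    s.head_offset = offset;
    s.head_len = mid_start - offset;
    s.mid_offset = mid_start;
    s.mid_len = mid_end - mid_start;
    s.tail_offset = mid_end;
    s.tail_len = end - mid_end;
    return s;
}
```

The head and tail would be written as zeroes only when
discard_zeros_data is set; otherwise they can simply be dropped, since a
discard is advisory.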

> This leaves one remaining issue: In raw-posix.c, for files (i.e. not
> devices), I assume you're going to advertise discard_granularity=1 and
> discard_zeros_data=1 when compiled with support for
> fallocate(FALLOC_FL_PUNCH_HOLE). Note, I'm assuming fallocate() actually
> guarantees that it zeros the data when punching holes.

It does, that's pretty much the definition of a hole.

> If the guest does a big discard (think mkfs) and fallocate() returns
> EOPNOTSUPP, you'll have to zero essentially the whole virtual disk,
> which, as you noted, will also allocate it (unless you explicitly check
> for holes). This is bad. It can be avoided by not advertising
> discard_zeros_data, but as you noted, that's unfortunate.

If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, the
allocation can also be avoided by skipping the zero write on known holes.

This could even be done at the block layer level using bdrv_is_allocated.
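
A minimal sketch of the file-level variant, assuming Linux with
SEEK_DATA/SEEK_HOLE and a regular file opened read-write. The function
name is hypothetical; on filesystems without real hole reporting the
kernel treats the whole file as data, so this degrades to a plain zero
write:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: make [offset, offset + len) read as zeroes while
 * writing only to ranges that lseek() reports as data.  Holes already
 * read as zeroes, so they are skipped and stay unallocated.
 * Returns 0 on success, -errno on failure.
 */
static int zero_range_skipping_holes(int fd, off_t offset, off_t len)
{
    static const char zeroes[65536];
    off_t end = offset + len;
    off_t pos = offset;

    while (pos < end) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0) {
            /* ENXIO: no data at or after pos, the rest is a hole. */
            return errno == ENXIO ? 0 : -errno;
        }
        if (data >= end) {
            break;
        }

        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0) {
            return -errno;
        }
        if (hole > end) {
            hole = end;
        }

        /* Overwrite the data extent [data, hole) with zeroes. */
        for (pos = data; pos < hole; ) {
            size_t chunk = (size_t)(hole - pos) < sizeof(zeroes)
                         ? (size_t)(hole - pos) : sizeof(zeroes);
            ssize_t n = pwrite(fd, zeroes, chunk, pos);
            if (n < 0) {
                if (errno == EINTR) {
                    continue;
                }
                return -errno;
            }
            if (n == 0) {
                return -EIO;
            }
            pos += n;
        }
    }
    return 0;
}
```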

> If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
> advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
> going to work. This would sidestep these problems.

... and introduce others when migrating if your datacenter doesn't have
homogeneous kernel versions and/or filesystems. :(

> You said it wasn't
> possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing
> by extending the file by one byte and then punching that:
>         char buf = 0;
>         fstat(s->fd, &st);
>         pwrite(s->fd, &buf, 1, st.st_size);
>         has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                                  st.st_size, 1);
>         ftruncate(s->fd, st.st_size);

Nice trick. :)   Yes, that could work.
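
A slightly more defensive version of the same idea, as a sketch only.
The function name is hypothetical, and it assumes Linux, a regular file
opened read-write, and nobody else appending to the file while the probe
runs:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical probe: returns 1 if FALLOC_FL_PUNCH_HOLE works on fd,
 * 0 if fallocate() rejects it, and -errno if the probe could not run.
 */
static int probe_punch_hole(int fd)
{
    struct stat st;
    char buf = 0;
    ssize_t n;
    int supported;

    if (fstat(fd, &st) < 0) {
        return -errno;
    }

    /* Append one scratch byte at the old end of file. */
    n = pwrite(fd, &buf, 1, st.st_size);
    if (n != 1) {
        return n < 0 ? -errno : -EIO;
    }

    /* Try to punch exactly that byte. */
    supported = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          st.st_size, 1) == 0;

    /* Restore the original size whether or not the punch worked. */
    if (ftruncate(fd, st.st_size) < 0) {
        return -errno;
    }
    return supported;
}
```

The result only describes the host kernel and filesystem the file
currently lives on, so the migration caveat above still applies.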

Do you know if non-Linux operating systems have something similar to
BLKDISCARDZEROES?

Paolo