From: Paolo Bonzini
Date: Mon, 12 Mar 2012 13:27:58 +0100
Message-ID: <4F5DEBCE.3040409@redhat.com>
References: <1331226917-6658-1-git-send-email-pbonzini@redhat.com> <1331226917-6658-7-git-send-email-pbonzini@redhat.com> <4F5A31B2.3050701@redhat.com> <4F5A46A1.4000508@redhat.com> <1331402560.8577.46.camel@watermelon.coderich.net>
In-Reply-To: <1331402560.8577.46.camel@watermelon.coderich.net>
Subject: Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
To: Richard Laager
Cc: Kevin Wolf, qemu-devel@nongnu.org

On 10/03/2012 19:02, Richard Laager wrote:
> I propose adding the following behaviors in any event:
>       * If a QEMU block device reports a discard_granularity > 0, it
>         must be equal to 2^n (n >= 0), or QEMU's block core will change
>         it to 0. (Non-power-of-two granularities are not likely to exist
>         in the real world, and this assumption greatly simplifies
>         ensuring correctness.)

Yeah, I was considering this to be simply a bug in the block device.

>       * For SCSI, report an unmap_granularity to the guest as follows:
>         max(logical_block_size, discard_granularity) / logical_block_size

This is more or less already in place later in the series.

> As a design concept, instead of guaranteeing that 512B zero'ing discards
> are supported, I think the QEMU block layer should instead guarantee
> aligned discards to QEMU block devices, emulating any misaligned
> discards (or portions thereof) by writing zeroes if (and only if)
> discard_zeros_data is set.

Yes, this can be done of course.  This series does not include it yet.

> This leaves one remaining issue: In raw-posix.c, for files (i.e. not
> devices), I assume you're going to advertise discard_granularity=1 and
> discard_zeros_data=1 when compiled with support for
> fallocate(FALLOC_FL_PUNCH_HOLE). Note, I'm assuming fallocate() actually
> guarantees that it zeros the data when punching holes.

It does, that's pretty much the definition of a hole.

> If the guest does a big discard (think mkfs) and fallocate() returns
> EOPNOTSUPP, you'll have to zero essentially the whole virtual disk,
> which, as you noted, will also allocate it (unless you explicitly check
> for holes). This is bad. It can be avoided by not advertising
> discard_zeros_data, but as you noted, that's unfortunate.

If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
be done by skipping the zero write on known holes.

This could even be done at the block layer level using bdrv_is_allocated.

> If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
> advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
> going to work. This would side step these problems.

... and introduce others when migrating if your datacenter doesn't have
homogeneous kernel versions and/or filesystems. :(

> You said it wasn't
> possible to probe for FALLOC_FL_PUNCH_HOLE.
> Have you considered probing
> by extending the file by one byte and then punching that:
>         char buf = 0;
>         fstat(s->fd, &st);
>         pwrite(s->fd, &buf, 1, st.st_size + 1);
>         has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                                  st.st_size + 1, 1);
>         ftruncate(s->fd, st.st_size);

Nice trick. :)  Yes, that could work.

Do you know if non-Linux operating systems have something similar to
BLKDISCARDZEROES?

Paolo
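
For reference, the probe quoted above could be wrapped with error checking
roughly as follows. This is only a sketch under assumptions:
probe_punch_hole() is a hypothetical helper name, not an existing QEMU
function; it assumes a Linux build where FALLOC_FL_PUNCH_HOLE is defined by
<fcntl.h> or <linux/falloc.h>; and it assumes fd refers to a regular file
opened for writing.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes fallocate() in <fcntl.h> on glibc */
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/falloc.h>       /* FALLOC_FL_* for older glibc headers */
#endif

/*
 * Hypothetical helper (not a QEMU API).
 * Returns 1 if punching a hole worked, 0 if it did not, and -1 if the
 * probe itself could not be set up or cleaned up.
 */
static int probe_punch_hole(int fd)
{
#if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
    struct stat st;
    char buf = 0;
    off_t probe_off;
    int supported;

    if (fstat(fd, &st) < 0) {
        return -1;
    }
    probe_off = st.st_size + 1;     /* same offset as the quoted snippet */

    /* Temporarily extend the file so there is a byte to punch. */
    if (pwrite(fd, &buf, 1, probe_off) != 1) {
        return -1;
    }

    supported = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          probe_off, 1) == 0;

    /* Restore the original size whether or not the punch succeeded. */
    if (ftruncate(fd, st.st_size) < 0) {
        return -1;
    }
    return supported;
#else
    (void)fd;
    return 0;                       /* no FALLOC_FL_PUNCH_HOLE at build time */
#endif
}
```

The result only describes the filesystem the image lives on at probe time.
As noted above, migrating to a host with a different kernel or filesystem
can invalidate it, so a caller would still need a fallback for a later
EOPNOTSUPP.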