All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kevin Wolf <kwolf@redhat.com>
To: Eric Blake <eblake@redhat.com>
Cc: qemu-block@nongnu.org, pbonzini@redhat.com, famz@redhat.com,
	qemu-devel@nongnu.org, mreitz@redhat.com,
	Nir Soffer <nsoffer@redhat.com>
Subject: Re: [Qemu-devel] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices
Date: Thu, 26 Jul 2018 17:33:56 +0200	[thread overview]
Message-ID: <20180726153356.GH4215@localhost.localdomain> (raw)
In-Reply-To: <6cdde651-acdc-e6eb-63fc-e7efe8136710@redhat.com>

Am 26.07.2018 um 17:23 hat Eric Blake geschrieben:
> On 07/26/2018 10:06 AM, Kevin Wolf wrote:
> 
> > > > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > > > +    ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > > > +                       aiocb->aio_offset, aiocb->aio_nbytes);
> > > 
> > > Umm, doesn't this have to use FALLOC_FL_ZERO_RANGE? FALLOC_FL_PUNCH_HOLE
> > > deallocs, but is not required to write zeroes.
> > 
> > Yes, it is. See the man page:
> > 
> >      Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> >      2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> >      range starting at offset and continuing for len bytes. Within the
> >      specified range, partial filesystem blocks are zeroed, and whole
> >      filesystem blocks are removed from the file. After a successful
> >      call, subsequent reads from this range will return zeroes.
> 
> That's true for file-system fds, but not for block device fds.

It is true for block device fds, too. Look at fs/block_dev.c,
specifically blkdev_fallocate():

        switch (mode) {
        case FALLOC_FL_ZERO_RANGE:
        case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE: 
                error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
                                            GFP_KERNEL, BLKDEV_ZERO_NOUNMAP);
                break;
        case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE: 
                error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
                                             GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK);
                break;
        case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
                error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
                                             GFP_KERNEL, 0);
                break;
        default:
                return -EOPNOTSUPP;
        }

> As pointed out by Nir,
> 
> > https://patchwork.kernel.org/patch/9903757/
> Which says, among other things:
> 
> >> Do we also know that the blocks were discarded as we do with
> >> BLKDISCARD ?
> >
> > There never was a way to know for sure.
> >
> > ATA DSM TRIM and SCSI UNMAP are hints by definition. We attempted to
> > bend their semantics towards getting predictable behavior but ultimately
> > failed. Too many corner cases.
> >
> >> As I mentioned before. We relied on discard_zeroes_data in mkfs.ext4
> >> to make sure that inode tables are zeroed after discard.
> >
> > The point is that you shouldn't have an if (discard_zeroes_data)
> > conditional in the first place.
> >
> >  - If you need to dellocate a block range and you don't care about its
> >    contents in the future, use BLKDISCARD / FL_PUNCH_HOLE.
> >
> >  - If you need to zero a block range, use BLKZEROOUT / FL_ZERO_RANGE.
> 
> PUNCH_HOLE deallocates; but can only guarantee a read back of zero on file
> systems.

As far as I know, the comment you quoted is accurate for BLKDISCARD and
BLKZEROOUT, but not for the fallocate() flags.

> Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a new flag
> not present/documented on Fedora 28. I wonder if it helps, too.
> 
> > 
> > FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.
> 
> I thought the opposite: FALLOC_FL_ZERO_RANGE guarantees that you read back
> 0, using whatever is most efficient under the hood (in the case of block
> devices, unmapping that reliably reads back as zero is favored).

See the code I quoted above, FALLOC_FL_ZERO_RANGE calls
blkdev_issue_zeroout() with BLKDEV_ZERO_NOUNMAP internally.

Kevin

  reply	other threads:[~2018-07-26 15:34 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-26 11:33 [Qemu-devel] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices Kevin Wolf
2018-07-26 14:28 ` Eric Blake
2018-07-26 15:06   ` [Qemu-devel] [Qemu-block] " Nir Soffer
2018-07-26 15:06   ` [Qemu-devel] " Kevin Wolf
2018-07-26 15:23     ` Eric Blake
2018-07-26 15:33       ` Kevin Wolf [this message]
2018-07-26 16:10         ` Eric Blake
2018-07-26 16:46           ` Nir Soffer
2018-07-26 16:58             ` Kevin Wolf
2018-07-26 17:34               ` Nir Soffer
2018-07-26 18:00                 ` Eric Blake
2018-07-26 18:09                 ` Kevin Wolf
2018-07-26 18:24       ` Eric Blake
2018-07-26 16:22 ` [Qemu-devel] [Qemu-block] " Nir Soffer
2018-07-26 16:41   ` Kevin Wolf
2018-07-26 16:54     ` Nir Soffer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180726153356.GH4215@localhost.localdomain \
    --to=kwolf@redhat.com \
    --cc=eblake@redhat.com \
    --cc=famz@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=nsoffer@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.