Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Christoph Hellwig <hch@infradead.org>,
	qemu-block@nongnu.org, Peter Lieven <pl@kamp.de>,
	qemu-devel@nongnu.org, Max Reitz <mreitz@redhat.com>,
	linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Maxim Levitsky <mlevitsk@redhat.com>
Subject: Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
Date: Thu, 12 Nov 2020 14:08:38 -0800	[thread overview]
Message-ID: <20201112220838.GD9699@magnolia> (raw)
In-Reply-To: <20201112120056.GC27697@quack2.suse.cz>

On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote:
> On Thu 12-11-20 12:19:51, Jan Kara wrote:
> > [added some relevant people and lists to CC]
> > 
> > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > clone of "starship_production"
> > > 
> > > The git-publish destroyed the cover letter:
> > > 
> > > For the reference this is for bz #1872633
> > > 
> > > The issue is that current kernel code that implements 'fallocate'
> > > on kernel block devices roughly works like that:
> > > 
> > > 1. Flush the page cache on the range that is about to be discarded.
> > > 2. Issue the discard and wait for it to finish.
> > >    (as far as I can see the discard doesn't go through the
> > >    page cache).
> > > 
> > > 3. Check if the page cache is dirty for this range,
> > >    if it is dirty (meaning that someone wrote to it meanwhile)
> > >    return -EBUSY.
> > > 
> > > This means that if qemu (or qemu-img) issues a write, and then
> > > discard to the area that shares a page, -EBUSY can be returned by
> > > the kernel.
> > 
> > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> > EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> > kernel so we need to fix this.
> > 
> > > On the other hand, for example, the ext4 implementation of discard
> > > doesn't seem to be affected. It does take a lock on the inode to avoid
> > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> > 
> > Well, filesystem hole punching is somewhat different beast than block device
> > discard (at least implementation wise).
> > 
> > > Doing fsync and retrying is seems to resolve this issue, but it might be
> > > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > > code that flushes the page cache in (1) doesn't do this correctly ?
> > > 
> > > It also can be racy unless special means are done to block IO from happening
> > > from qemu during this fsync.
> > > 
> > > This patch series contains two patches:
> > > 
> > > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > > technically enough to fail back to plain write in this case, but seems wrong.
> > > 
> > > And the second patch adds an optimization to qemu-img to avoid such a
> > > fragmented write/discard in the first place.
> > > 
> > > Both patches make the reproducer work for this particular bugzilla,
> > > but I don't think they are enough.
> > > 
> > > What do you think?
> > 
> > So if the EBUSY error happens because something happened to the page cache
> > outside of discarded range (like you describe above), that is a kernel bug
> > than needs to get fixed. EBUSY should really mean - someone wrote to the
> > discarded range while discard was running and userspace app has to deal
> > with that depending on what it aims to do...
> 
> So I was looking what it would take to fix this inside the kernel. The
> problem is that invalidate_inode_pages2_range() is working on page
> granularity and it is non-trivial to extend it to work on byte granularity
> since we don't support something like "try to reclaim part of a page". But
> I'm also somewhat wondering why we use invalidate_inode_pages2_range() here
> instead of truncate_inode_pages_range() again? I mean the EBUSY detection
> cannot be reliable anyway and userspace has no way of knowing whether a
> write happened before discard or after it so just discarding data is fine
> from this point of view. Darrick?

Hmmm, I think I overlooked the fact that we can do buffered writes into
a block device's pagecache without taking any of the usual locks that
have to be held for filesystem files.  This is essentially a race
between a not-page-aligned fallocate and a buffered write to a different
sector that is mapped by a page that caches part of the fallocate range.

So yes, Jan is right that we need to use truncate_bdev_range instead of
invalidate_inode_pages2_range because the former will zero the sub-page
ranges on either end of the fallocate request instead of returning
-EBUSY because someone dirtied a part of a page that wasn't involved in
the fallocate operation.

I /probably/ just copy-pasta'd that invalidation call from directio
without thinking hard enough about it, sorry about that. :(

--D

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

next prev parent reply	other threads:[~2020-11-12 22:29 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-11 15:39 [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
2020-11-16 14:48   ` Kevin Wolf
2021-01-07 12:44     ` Maxim Levitsky
2020-11-11 15:39 ` [PATCH 2/2] qemu-img: align next status sector on destination alignment Maxim Levitsky
2020-11-12 12:40   ` Peter Lieven
2020-11-12 13:45     ` Eric Blake
2020-11-12 15:04       ` Maxim Levitsky
2021-03-18  9:57         ` Maxim Levitsky
2020-11-11 15:44 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-12 11:19   ` Jan Kara
2020-11-12 12:00     ` Jan Kara
2020-11-12 22:08       ` Darrick J. Wong [this message]
2020-12-07 17:23         ` Maxim Levitsky
2020-11-12 15:38     ` Maxim Levitsky
2020-11-13 10:07       ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201112220838.GD9699@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=axboe@kernel.dk \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=kwolf@redhat.com \
    --cc=linux-block@vger.kernel.org \
    --cc=mlevitsk@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=pl@kamp.de \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).