From: Filipe Manana <fdmanana@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
linux-btrfs@vger.kernel.org, djwong@kernel.org,
hch@infradead.org, cluster-devel@redhat.com, agruenba@redhat.com,
josef@toxicpanda.com, Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL
Date: Wed, 2 Mar 2022 10:17:01 +0000 [thread overview]
Message-ID: <Yh9EHfl3sYJHeo3T@debian9.Home> (raw)
In-Reply-To: <20220228223830.GR59715@dread.disaster.area>
On Tue, Mar 01, 2022 at 09:38:30AM +1100, Dave Chinner wrote:
> On Mon, Feb 28, 2022 at 02:32:03PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Some users recently reported that MariaDB was getting a read corruption
> > when using io_uring on top of btrfs. This started to happen in 5.16,
> > after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
> > during direct IO reads and writes"). That changed btrfs to use the new
> > iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
> > iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
> > corresponds to a memory mapped file region. That type of scenario is
> > exercised by test case generic/647 from fstests, and it also affected
> > gfs2, which, besides btrfs, is the only user of IOMAP_DIO_PARTIAL.
> >
> > For this MariaDB scenario, we attempt to read 16K from file offset X
> > using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
> > with a size of 4K, and what happens is the following:
> >
> > 1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
> >
> > 2) iomap creates a struct iomap_dio object, its reference count is
> > initialized to 1 and its ->size field is initialized to 0;
> >
> > 3) iomap calls btrfs_iomap_begin() with file offset X, which finds the
>
> You mean btrfs_dio_iomap_begin()?
Yes, correct.
>
> > first 4K extent, and setups an iomap for this extent consisting of
> > a single page;
>
> So we have IOCB_NOWAIT, which means btrfs_dio_iomap_begin() is being
> passed IOMAP_NOWAIT and so knows it is being asked
> to map an extent for an IO that is on a non-blocking path.
>
> btrfs_dio_iomap_begin() doesn't appear to support NOWAIT semantics
> at all - it will block doing writeback IO, memory allocation, extent
> locking, transaction reservations, extent allocation, etc....
We do have some checks for NOWAIT before getting into btrfs_dio_iomap_begin(),
but they are only for the write path, and they are incomplete. Some are a bit
tricky to deal with, but yes, there's several cases that are either missing
or need to be improved.
>
> That, to me, looks like the root cause of the problem here -
> btrfs_dio_iomap_begin() is not guaranteeing non-blocking atomic IO
> semantics for IOCB_NOWAIT IO.
>
> In the case above, given that the extent lookup only found a 4kB
> extent, we know that it doesn't span the entire requested IO range.
> We also known that we cannot tell if we'll block on subsequent
> mappings of the IO range, and hence no guarantee can be given that
> IOCB_NOWAIT IO will not block when it is too late to back out with a
> -EAGAIN error.
>
> Hence this whole set of problems could be avoided if
> btrfs_dio_iomap_begin() returns -EAGAIN if it can't map the entire
> IO into a single extent without blocking when IOMAP_NOWAIT is set?
> That's exactly what XFS does in xfs_direct_iomap_write_begin():
>
> /*
> * NOWAIT and OVERWRITE I/O needs to span the entire requested I/O with
> * a single map so that we avoid partial IO failures due to the rest of
> * the I/O range not covered by this map triggering an EAGAIN condition
> * when it is subsequently mapped and aborting the I/O.
> */
> if (flags & (IOMAP_NOWAIT | IOMAP_OVERWRITE_ONLY)) {
> error = -EAGAIN;
> if (!imap_spans_range(&imap, offset_fsb, end_fsb))
> goto out_unlock;
> }
>
> Basically, I'm thinking that IOMAP_NOWAIT and IOMAP_DIO_PARTIAL
> should be exclusive functionality - if you are doing IOMAP_NOWAIT
> then the entire IO must succeed without blocking, and if it doesn't
> then we return -EAGAIN and the caller retries without IOCB_NOWAIT
> set and so then we run with IOMAP_DIO_PARTIAL semantics in a thread
> that can actually block....
Indeed, I had not considered that, that is simple and effective, plus
it can be done exclusively in btrfs code, no need to change iomap.
>
> .....
>
> > 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
> > and return 4K (the amount of io done) to iomap_dio_complete_work();
> >
> > 12) iomap_dio_complete_work() calls the iocb completion callback,
> > iocb->ki_complete() with a second argument value of 4K (total io
> > done) and the iocb with the adjust ki_pos of X + 4K. This results
> > in completing the read request for io_uring, leaving it with a
> > result of 4K bytes read, and only the first page of the buffer
> > filled in, while the remaining 3 pages, corresponding to the other
> > 3 extents, were not filled;
> >
> > 13) For the application, the result is unexpected because if we ask
> > to read N bytes, it expects to get N bytes read as long as those
> > N bytes don't cross the EOF (i_size).
>
> Yeah, that's exactly the sort of problem we were having with XFS
> with partial DIO completions due to needing multiple iomap iteration
> loops to complete a single IO. Hence IOMAP_NOWAIT now triggers the
> above range check and aborts before we start...
Interesting.
>
> >
> > So fix this by making __iomap_dio_rw() assign true to the boolean variable
> > 'wait_for_completion' when we have IOMAP_DIO_PARTIAL set, we did some
> > progress for a read and we have not crossed the EOF boundary. Do this even
> > if the read has IOCB_NOWAIT set, as it's the only way to avoid providing
> > an unexpected result to an application.
>
> That's highly specific and ultimately will be fragile, IMO. I'd much
> prefer that *_iomap_begin_write() implementations simply follow
> IOMAP_NOWAIT requirements to ensure that any DIO that needs multiple
> mappings if punted to a context that can block...
Yes, agreed.
Thanks for your feedback Dave, it provided a really good insight into this
problem (and others).
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
next prev parent reply other threads:[~2022-03-02 10:17 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-28 14:32 [PATCH] iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL fdmanana
2022-02-28 22:38 ` Dave Chinner
2022-03-02 10:17 ` Filipe Manana [this message]
2022-03-02 13:03 ` Andreas Gruenbacher
2022-03-02 23:12 ` Dave Chinner
2022-03-01 16:42 ` Darrick J. Wong
2022-03-02 10:30 ` Filipe Manana
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yh9EHfl3sYJHeo3T@debian9.Home \
--to=fdmanana@kernel.org \
--cc=agruenba@redhat.com \
--cc=cluster-devel@redhat.com \
--cc=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=fdmanana@suse.com \
--cc=hch@infradead.org \
--cc=josef@toxicpanda.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).