linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-btrfs@vger.kernel.org, djwong@kernel.org,
	hch@infradead.org, cluster-devel@redhat.com, agruenba@redhat.com,
	josef@toxicpanda.com, Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL
Date: Wed, 2 Mar 2022 10:17:01 +0000	[thread overview]
Message-ID: <Yh9EHfl3sYJHeo3T@debian9.Home> (raw)
In-Reply-To: <20220228223830.GR59715@dread.disaster.area>

On Tue, Mar 01, 2022 at 09:38:30AM +1100, Dave Chinner wrote:
> On Mon, Feb 28, 2022 at 02:32:03PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > Some users recently reported that MariaDB was getting a read corruption
> > when using io_uring on top of btrfs. This started to happen in 5.16,
> > after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
> > during direct IO reads and writes"). That changed btrfs to use the new
> > iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
> > iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
> > corresponds to a memory mapped file region. That type of scenario is
> > exercised by test case generic/647 from fstests, and it also affected
> > gfs2, which, besides btrfs, is the only user of IOMAP_DIO_PARTIAL.
> > 
> > For this MariaDB scenario, we attempt to read 16K from file offset X
> > using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
> > with a size of 4K, and what happens is the following:
> > 
> > 1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
> > 
> > 2) iomap creates a struct iomap_dio object, its reference count is
> >    initialized to 1 and its ->size field is initialized to 0;
> > 
> > 3) iomap calls btrfs_iomap_begin() with file offset X, which finds the
> 
> You mean btrfs_dio_iomap_begin()?

Yes, correct.

> 
> >    first 4K extent, and setups an iomap for this extent consisting of
> >    a single page;
> 
> So we have IOCB_NOWAIT, which means btrfs_dio_iomap_begin() is being
> passed IOMAP_NOWAIT and so knows it is being asked
> to map an extent for an IO that is on a non-blocking path.
> 
> btrfs_dio_iomap_begin() doesn't appear to support NOWAIT semantics
> at all - it will block doing writeback IO, memory allocation, extent
> locking, transaction reservations, extent allocation, etc....

We do have some checks for NOWAIT before getting into btrfs_dio_iomap_begin(),
but they are only for the write path, and they are incomplete. Some are a bit
tricky to deal with, but yes, there's several cases that are either missing
or need to be improved.

> 
> That, to me, looks like the root cause of the problem here -
> btrfs_dio_iomap_begin() is not guaranteeing non-blocking atomic IO
> semantics for IOCB_NOWAIT IO.
> 
> In the case above, given that the extent lookup only found a 4kB
> extent, we know that it doesn't span the entire requested IO range.
> We also known that we cannot tell if we'll block on subsequent
> mappings of the IO range, and hence no guarantee can be given that
> IOCB_NOWAIT IO will not block when it is too late to back out with a
> -EAGAIN error.
> 
> Hence this whole set of problems could be avoided if
> btrfs_dio_iomap_begin() returns -EAGAIN if it can't map the entire
> IO into a single extent without blocking when IOMAP_NOWAIT is set?
> That's exactly what XFS does in xfs_direct_iomap_write_begin():
> 
>         /*
>          * NOWAIT and OVERWRITE I/O needs to span the entire requested I/O with
>          * a single map so that we avoid partial IO failures due to the rest of
>          * the I/O range not covered by this map triggering an EAGAIN condition
>          * when it is subsequently mapped and aborting the I/O.
>          */
>         if (flags & (IOMAP_NOWAIT | IOMAP_OVERWRITE_ONLY)) {
>                 error = -EAGAIN;
>                 if (!imap_spans_range(&imap, offset_fsb, end_fsb))
>                         goto out_unlock;
>         }
> 
> Basically, I'm thinking that IOMAP_NOWAIT and IOMAP_DIO_PARTIAL
> should be exclusive functionality - if you are doing IOMAP_NOWAIT
> then the entire IO must succeed without blocking, and if it doesn't
> then we return -EAGAIN and the caller retries without IOCB_NOWAIT
> set and so then we run with IOMAP_DIO_PARTIAL semantics in a thread
> that can actually block....

Indeed, I had not considered that, that is simple and effective, plus
it can be done exclusively in btrfs code, no need to change iomap.

> 
> .....
> 
> > 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
> >     and return 4K (the amount of io done) to iomap_dio_complete_work();
> > 
> > 12) iomap_dio_complete_work() calls the iocb completion callback,
> >     iocb->ki_complete() with a second argument value of 4K (total io
> >     done) and the iocb with the adjust ki_pos of X + 4K. This results
> >     in completing the read request for io_uring, leaving it with a
> >     result of 4K bytes read, and only the first page of the buffer
> >     filled in, while the remaining 3 pages, corresponding to the other
> >     3 extents, were not filled;
> > 
> > 13) For the application, the result is unexpected because if we ask
> >     to read N bytes, it expects to get N bytes read as long as those
> >     N bytes don't cross the EOF (i_size).
> 
> Yeah, that's exactly the sort of problem we were having with XFS
> with partial DIO completions due to needing multiple iomap iteration
> loops to complete a single IO. Hence IOMAP_NOWAIT now triggers the
> above range check and aborts before we start...

Interesting.

> 
> > 
> > So fix this by making __iomap_dio_rw() assign true to the boolean variable
> > 'wait_for_completion' when we have IOMAP_DIO_PARTIAL set, we did some
> > progress for a read and we have not crossed the EOF boundary. Do this even
> > if the read has IOCB_NOWAIT set, as it's the only way to avoid providing
> > an unexpected result to an application.
> 
> That's highly specific and ultimately will be fragile, IMO. I'd much
> prefer that *_iomap_begin_write() implementations simply follow
> IOMAP_NOWAIT requirements to ensure that any DIO that needs multiple
> mappings if punted to a context that can block...

Yes, agreed.

Thanks for your feedback Dave, it provided a really good insight into this
problem (and others).

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

  reply	other threads:[~2022-03-02 10:17 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-28 14:32 [PATCH] iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL fdmanana
2022-02-28 22:38 ` Dave Chinner
2022-03-02 10:17   ` Filipe Manana [this message]
2022-03-02 13:03     ` Andreas Gruenbacher
2022-03-02 23:12       ` Dave Chinner
2022-03-01 16:42 ` Darrick J. Wong
2022-03-02 10:30   ` Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yh9EHfl3sYJHeo3T@debian9.Home \
    --to=fdmanana@kernel.org \
    --cc=agruenba@redhat.com \
    --cc=cluster-devel@redhat.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=fdmanana@suse.com \
    --cc=hch@infradead.org \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).