From: Josef Bacik <josef@toxicpanda.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com,
Dylan Yudaken <dylany@fb.com>, Jens Axboe <axboe@kernel.dk>,
agruenba@redhat.com
Subject: Re: [PATCH] btrfs: handle DIO read faults properly
Date: Fri, 12 Aug 2022 11:39:23 -0400 [thread overview]
Message-ID: <YvZ0K53zVv21mNVs@localhost.localdomain> (raw)
In-Reply-To: <20220812090208.GA2373742@falcondesktop>
On Fri, Aug 12, 2022 at 10:02:08AM +0100, Filipe Manana wrote:
> On Thu, Aug 11, 2022 at 05:06:11PM -0400, Josef Bacik wrote:
> > Dylan reported a problem where he had an io_uring test that was returning
> > short DIO reads with ee5b46a353af ("btrfs: increase direct io read size
> > limit to 256 sectors") applied. This turned out to be a red herring,
> > this simply increases the size of the reads we'll attempt to do in one
> > go. The root of the problem is that because we're now trying to read in
> > more into our buffer, we're more likely to hit a page fault while trying
> > to read into the buffer.
> >
> > Because we pass IOMAP_DIO_PARTIAL into iomap if we get an -EFAULT we'll
> > simply return 0, expecting that we'll do the fault and then try again.
> > However since we're only checking for a ret > 0 || ret == -EFAULT we
> > return a short read.
>
> I find this explanation to be terse. There's a lot of non-obvious details missing.
>
> So, at __iomap_dio_rw() we have this:
>
> if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
> if (!(iocb->ki_flags & IOCB_NOWAIT))
> wait_for_completion = true;
> ret = 0;
> }
>
> And shortly after, in the same function, we have:
>
> if (!atomic_dec_and_test(&dio->ref)) {
> if (!wait_for_completion)
> return ERR_PTR(-EIOCBQUEUED);
>
> for (;;) {
> set_current_state(TASK_UNINTERRUPTIBLE);
> if (!READ_ONCE(dio->submit.waiter))
> break;
>
> blk_io_schedule();
> }
> __set_current_state(TASK_RUNNING);
> }
>
> If the short read happens with io_uring, it means the initial request had
> IOCB_NOWAIT set. With the IOCB_NOWAIT call to btrfs, we were able to satisfy
> part of the read, submit one or more bios, and then we got -EFAULT when trying
> to faultin a page from the iovector.
>
> This makes the first if-statement set 'ret' to 0 and 'wait_for_completion'
> stays with a false value. Then later on, because 'wait_for_completion' is
> false, -EIOCBQUEUED is returned to btrfs, which returns it and propagates
> it back to io_uring.
>
> Because the read was not entirely satisfied, io_uring fallbacks to a blocking
> context, i.e. IOCB_NOWAIT is not set. We get back to btrfs and __iomap_dio_rw(),
> we submit one or more bios for some parts of remaining range, but then we get
> -EFAULT again. Then in that first if statement at __iomap_dio_rw(), we set 'ret'
> to 0 and 'wait_for_completion' to true, since dio->size is > 0 (we have submitted
> one more bios, made some progress) and IOCB_NOWAIT is not set.
>
> This results in 0 being returned to btrfs_direct_read(), which returns 0 back
> to btrfs_file_read_iter(). And there we call filemap_read() to try to read
> what's left using buffered IO, passing it a value of 0 for its 'already_read'
> argument. Then there we get some error, presumably a page fault again, and
> return it back to io_uring.
>
> And that's how user space gets the partial read?
> I don't see how else it can happen, so I'd like you to confirm that and update the
> changelog with a lot more details how user space gets a partial read.
>
Sure I'll include these details, I left them out because it's relatively
convoluted (as you found out) and really the important detail is we get a 0 back
because we tried to page fault and it got turned into 0 from -EFAULT.
>
> > Fix this by checking for a 0 read from iomap as
> > well. I've left the EFAULT case because it appears like we can still
> > get this in the case of a page fault from a direct read from an inline
> > extent.
>
> The EFAULT case is not related to inline extents.
> When we find an inline extent, we always fallback to buffered IO.
> btrfs_iomap_begin() will return -ENOTBLK or -EAGAIN depending on NOWAIT, so
> btrfs_direct_read() will never get -EFAULT because of an inline extent.
>
> To get -EFAULT it means we are trying to read a non-inline, non-compressed
> extent and then failed to faultin the very first page of the iovector, so in
> that first if-statement at __iomap_dio_rw(), 'ret' remains with the value
> of -EFAULT because dio->size is 0 (no progress made at all) and inline
> extents always start at file offset 0 (so no progress may have happened).
>
Sorry I thought I saw IOMAP_INLINE for our inline stuff. We can get it for
IOMAP_HOLE where we try to zero the buffer which can fault, and then we have a
dio->size == 0 && ret == -EFAULT and that'll get propagated up to us, so we need
to preserve the -EFAULT check. I'll update the changelog with this as well.
Thanks,
Josef
next prev parent reply other threads:[~2022-08-12 15:39 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-11 21:06 [PATCH] btrfs: handle DIO read faults properly Josef Bacik
2022-08-12 9:02 ` Filipe Manana
2022-08-12 15:39 ` Josef Bacik [this message]
2022-08-19 15:55 ` Josef Bacik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YvZ0K53zVv21mNVs@localhost.localdomain \
--to=josef@toxicpanda.com \
--cc=agruenba@redhat.com \
--cc=axboe@kernel.dk \
--cc=dylany@fb.com \
--cc=fdmanana@kernel.org \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox