Linux Btrfs filesystem development
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@kernel.org>
To: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com,
	Dylan Yudaken <dylany@fb.com>, Jens Axboe <axboe@kernel.dk>,
	agruenba@redhat.com
Subject: Re: [PATCH] btrfs: handle DIO read faults properly
Date: Fri, 12 Aug 2022 10:02:08 +0100	[thread overview]
Message-ID: <20220812090208.GA2373742@falcondesktop> (raw)
In-Reply-To: <552156d49d65ab7d635554b697252fdbfb8f93b0.1660251962.git.josef@toxicpanda.com>

On Thu, Aug 11, 2022 at 05:06:11PM -0400, Josef Bacik wrote:
> Dylan reported a problem where he had an io_uring test that was returning
> short DIO reads with ee5b46a353af ("btrfs: increase direct io read size
> limit to 256 sectors") applied.  This turned out to be a red herring,
> this simply increases the size of the reads we'll attempt to do in one
> go.  The root of the problem is that because we're now trying to read in
> more into our buffer, we're more likely to hit a page fault while trying
> to read into the buffer.
> 
> Because we pass IOMAP_DIO_PARTIAL into iomap if we get an -EFAULT we'll
> simply return 0, expecting that we'll do the fault and then try again.
> However since we're only checking for a ret > 0 || ret == -EFAULT we
> return a short read.

I find this explanation to be terse. There's a lot of non-obvious details missing.

So, at __iomap_dio_rw() we have this:

    if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
        if (!(iocb->ki_flags & IOCB_NOWAIT))
            wait_for_completion = true;
        ret = 0;
    }

And shortly after, in the same function, we have:

    if (!atomic_dec_and_test(&dio->ref)) {
        if (!wait_for_completion)
            return ERR_PTR(-EIOCBQUEUED);

        for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (!READ_ONCE(dio->submit.waiter))
                break;

            blk_io_schedule();
        }
        __set_current_state(TASK_RUNNING);
    }

If the short read happens with io_uring, it means the initial request had
IOCB_NOWAIT set. With the IOCB_NOWAIT call to btrfs, we were able to satisfy
part of the read, submit one or more bios, and then we got -EFAULT when trying
to faultin a page from the iovector.

This makes the first if-statement set 'ret' to 0 and 'wait_for_completion'
stays with a false value. Then later on, because 'wait_for_completion' is
false, -EIOCBQUEUED is returned to btrfs, which returns it and propagates
it back to io_uring.

Because the read was not entirely satisfied, io_uring fallbacks to a blocking
context, i.e. IOCB_NOWAIT is not set. We get back to btrfs and __iomap_dio_rw(),
we submit one or more bios for some parts of remaining range, but then we get
-EFAULT again. Then in that first if statement at __iomap_dio_rw(), we set 'ret'
to 0 and 'wait_for_completion' to true, since dio->size is > 0 (we have submitted
one more bios, made some progress) and IOCB_NOWAIT is not set.

This results in 0 being returned to btrfs_direct_read(), which returns 0 back
to btrfs_file_read_iter(). And there we call filemap_read() to try to read
what's left using buffered IO, passing it a value of 0 for its 'already_read'
argument. Then there we get some error, presumably a page fault again, and
return it back to io_uring.

And that's how user space gets the partial read?
I don't see how else it can happen, so I'd like you to confirm that and update the
changelog with a lot more details how user space gets a partial read.


> Fix this by checking for a 0 read from iomap as
> well.  I've left the EFAULT case because it appears like we can still
> get this in the case of a page fault from a direct read from an inline
> extent.

The EFAULT case is not related to inline extents.
When we find an inline extent, we always fallback to buffered IO.
btrfs_iomap_begin() will return -ENOTBLK or -EAGAIN depending on NOWAIT, so
btrfs_direct_read() will never get -EFAULT because of an inline extent.

To get -EFAULT it means we are trying to read a non-inline, non-compressed
extent and then failed to faultin the very first page of the iovector, so in
that first if-statement at __iomap_dio_rw(), 'ret' remains with the value
of -EFAULT because dio->size is 0 (no progress made at all) and inline
extents always start at file offset 0 (so no progress may have happened).

I'm also seeing that gfs2 also doesn't retry if it gets 0 from iomap
(at gfs2_file_direct_read()), so adding Andreas to CC.

Thanks.

> 
> Jens tested this patch with their testcase and it fixed the problem.
> 
> Reported-by: Dylan Yudaken <dylany@fb.com>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index c9649b7b2f18..a61085ac59d6 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3776,7 +3776,7 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
>  	if (ret > 0)
>  		read = ret;
>  
> -	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
> +	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret >= 0)) {
>  		const size_t left = iov_iter_count(to);
>  
>  		if (left == prev_left) {
> -- 
> 2.26.3
> 

  reply	other threads:[~2022-08-12  9:02 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-11 21:06 [PATCH] btrfs: handle DIO read faults properly Josef Bacik
2022-08-12  9:02 ` Filipe Manana [this message]
2022-08-12 15:39   ` Josef Bacik
2022-08-19 15:55   ` Josef Bacik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220812090208.GA2373742@falcondesktop \
    --to=fdmanana@kernel.org \
    --cc=agruenba@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=dylany@fb.com \
    --cc=josef@toxicpanda.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox