* Re: [PATCH] btrfs: don't allow large NOWAIT direct reads
2022-08-19 15:53 [PATCH] btrfs: don't allow large NOWAIT direct reads Josef Bacik
@ 2022-08-19 16:50 ` Filipe Manana
2022-08-19 22:11 ` Qu Wenruo
2022-08-22 15:58 ` David Sterba
2 siblings, 0 replies; 4+ messages in thread
From: Filipe Manana @ 2022-08-19 16:50 UTC (permalink / raw)
To: Josef Bacik; +Cc: kernel-team, linux-btrfs, Dylan Yudaken
On Fri, Aug 19, 2022 at 11:53:39AM -0400, Josef Bacik wrote:
> Dylan and Jens reported a problem where they had an io_uring test that
> was returning short reads, and bisected it to ee5b46a353af ("btrfs:
> increase direct io read size limit to 256 sectors").
>
> The root cause is that their test was doing larger reads via io_uring
> with NOWAIT and async. This triggered a page fault during the direct
> read; however, the first page worked just fine, and thus we submitted a
> 4k read for a larger iocb.
>
> Btrfs allows for partial IOs in this case specifically because we don't
> allow page faults, so we'll submit whatever IO we can, come back and
> fault in the rest of the range, and try to do the remaining IO.
>
> However, for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
> partial dio is done, which is incorrect. In the sync case we can exit
> the iomap code, submit more IOs, and return with the amount of IO we
> were able to complete successfully.
>
> We were always doing short reads in this case, but for NOWAIT we were
> saved by the fact that we were limiting direct reads to sectorsize, and
> for reads larger than that we would return EAGAIN.
>
> Fix the regression by simply returning EAGAIN in the NOWAIT case for
> larger reads; that way io_uring can retry, issue the larger IO, and have
> the fault logic handle everything properly.
>
> This still leaves the AIO short read case, but that existed before this
> change. The proper fix would be to handle partial iocb completions, but
> that's a lot of work; for now, deal with the regression in the most
> straightforward way possible.
>
> Reported-by: Dylan Yudaken <dylany@fb.com>
> Fixes: ee5b46a353af ("btrfs: increase direct io read size limit to 256 sectors")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Looks good, thanks.
> ---
> fs/btrfs/inode.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5101111c5557..b39673e49732 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7694,6 +7694,20 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> const u64 data_alloc_len = length;
> bool unlock_extents = false;
>
> + /*
> + * We could potentially fault if we have a buffer > PAGE_SIZE, and if
> + * we're NOWAIT we may submit a bio for a partial range and return
> + * EIOCBQUEUED, which would result in an errant short read.
> + *
> + * The best way to handle this would be to allow for partial completions
> + * of iocb's, so we could submit the partial bio, return and fault in
> + * the rest of the pages, and then submit the io for the rest of the
> + * range. However we don't have that currently, so simply return
> + * -EAGAIN at this point so that the normal path is used.
> + */
> + if (!write && (flags & IOMAP_NOWAIT) && length > PAGE_SIZE)
> + return -EAGAIN;
> +
> /*
> * Cap the size of reads to that usually seen in buffered I/O as we need
> * to allocate a contiguous array for the checksums.
> --
> 2.26.3
>
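[Editor's note: the submit-what-we-can, fault-in, retry loop the commit message describes can be modeled in a small userspace sketch. This is toy code, not the kernel's: PG and the present[] array are stand-ins for PAGE_SIZE and for which destination pages are currently faulted in.]

```c
#include <stddef.h>
#include <string.h>

#define PG 4096u	/* toy page size, standing in for PAGE_SIZE */

/*
 * Copy src into dst a page at a time, stopping at the first destination
 * page that is not "faulted in" (present[i] == 0).  This mimics the
 * NOWAIT direct-IO path, which must not take page faults and therefore
 * can only do partial IO.
 */
size_t copy_nofault(char *dst, const char *src, size_t len,
		    const unsigned char *present)
{
	size_t done = 0;

	while (done < len) {
		size_t chunk = len - done < PG ? len - done : PG;

		if (!present[done / PG])
			break;	/* would fault: stop with a partial IO */
		memcpy(dst + done, src + done, chunk);
		done += chunk;
	}
	return done;
}

/*
 * What the sync-kiocb path can do per the commit message: on a partial
 * copy, "fault in" the missing page and retry the remainder, so the
 * caller eventually sees the full length rather than a short read.
 */
size_t copy_sync(char *dst, const char *src, size_t len,
		 unsigned char *present)
{
	size_t done = 0;

	while (done < len) {
		done += copy_nofault(dst + done, src + done, len - done,
				     present + done / PG);
		if (done < len)
			present[done / PG] = 1;	/* fault the page in */
	}
	return done;
}
```

The async (!is_sync_kiocb) path has no equivalent of copy_sync()'s retry loop, since ->ki_complete() fires after the first partial chunk, which is why the patch falls back to -EAGAIN instead.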
* Re: [PATCH] btrfs: don't allow large NOWAIT direct reads
2022-08-19 15:53 [PATCH] btrfs: don't allow large NOWAIT direct reads Josef Bacik
2022-08-19 16:50 ` Filipe Manana
@ 2022-08-19 22:11 ` Qu Wenruo
2022-08-22 15:58 ` David Sterba
2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2022-08-19 22:11 UTC (permalink / raw)
To: Josef Bacik, kernel-team, linux-btrfs; +Cc: Dylan Yudaken
On 2022/8/19 23:53, Josef Bacik wrote:
> Dylan and Jens reported a problem where they had an io_uring test that
> was returning short reads, and bisected it to ee5b46a353af ("btrfs:
> increase direct io read size limit to 256 sectors").
>
> The root cause is that their test was doing larger reads via io_uring
> with NOWAIT and async. This triggered a page fault during the direct
> read; however, the first page worked just fine, and thus we submitted a
> 4k read for a larger iocb.
>
> Btrfs allows for partial IOs in this case specifically because we don't
> allow page faults, so we'll submit whatever IO we can, come back and
> fault in the rest of the range, and try to do the remaining IO.
>
> However, for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
> partial dio is done, which is incorrect. In the sync case we can exit
> the iomap code, submit more IOs, and return with the amount of IO we
> were able to complete successfully.
>
> We were always doing short reads in this case, but for NOWAIT we were
> saved by the fact that we were limiting direct reads to sectorsize, and
> for reads larger than that we would return EAGAIN.
>
> Fix the regression by simply returning EAGAIN in the NOWAIT case for
> larger reads; that way io_uring can retry, issue the larger IO, and have
> the fault logic handle everything properly.
>
> This still leaves the AIO short read case, but that existed before this
> change. The proper fix would be to handle partial iocb completions, but
> that's a lot of work; for now, deal with the regression in the most
> straightforward way possible.
>
> Reported-by: Dylan Yudaken <dylany@fb.com>
> Fixes: ee5b46a353af ("btrfs: increase direct io read size limit to 256 sectors")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/btrfs/inode.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5101111c5557..b39673e49732 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7694,6 +7694,20 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> const u64 data_alloc_len = length;
> bool unlock_extents = false;
>
> + /*
> + * We could potentially fault if we have a buffer > PAGE_SIZE, and if
> + * we're NOWAIT we may submit a bio for a partial range and return
> + * EIOCBQUEUED, which would result in an errant short read.
> + *
> + * The best way to handle this would be to allow for partial completions
> + * of iocb's, so we could submit the partial bio, return and fault in
> + * the rest of the pages, and then submit the io for the rest of the
> + * range. However we don't have that currently, so simply return
> + * -EAGAIN at this point so that the normal path is used.
> + */
> + if (!write && (flags & IOMAP_NOWAIT) && length > PAGE_SIZE)
Since PAGE_SIZE shows up here, I wonder how this would handle subpage
cases.
I'm not familiar with page fault handling, but since it really works in
page units, I guess it should be fine?
Thanks,
Qu
> + return -EAGAIN;
> +
> /*
> * Cap the size of reads to that usually seen in buffered I/O as we need
> * to allocate a contiguous array for the checksums.
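[Editor's note on the subpage question: the guard keys off PAGE_SIZE rather than sectorsize because faulting in the user buffer happens at page granularity, so the check should behave the same on subpage setups. A standalone sketch of the added condition follows; the constants are local stand-ins, not the kernel's.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for kernel values; 4 KiB pages assumed. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_IOMAP_NOWAIT	0x1u

/*
 * Sketch of the condition the patch adds to btrfs_dio_iomap_begin():
 * a NOWAIT direct *read* larger than one page is bounced back with
 * -EAGAIN so the caller retries on the normal, faulting path.  Writes
 * and blocking reads are unaffected.
 */
bool nowait_dio_read_too_large(bool write, unsigned int flags,
			       uint64_t length)
{
	return !write && (flags & SKETCH_IOMAP_NOWAIT) &&
	       length > SKETCH_PAGE_SIZE;
}
```

Note the threshold is a buffer-length check, independent of the filesystem sector size, so a subpage (sectorsize < PAGE_SIZE) setup hits the same cutoff.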
* Re: [PATCH] btrfs: don't allow large NOWAIT direct reads
2022-08-19 15:53 [PATCH] btrfs: don't allow large NOWAIT direct reads Josef Bacik
2022-08-19 16:50 ` Filipe Manana
2022-08-19 22:11 ` Qu Wenruo
@ 2022-08-22 15:58 ` David Sterba
2 siblings, 0 replies; 4+ messages in thread
From: David Sterba @ 2022-08-22 15:58 UTC (permalink / raw)
To: Josef Bacik; +Cc: kernel-team, linux-btrfs, Dylan Yudaken
On Fri, Aug 19, 2022 at 11:53:39AM -0400, Josef Bacik wrote:
> Dylan and Jens reported a problem where they had an io_uring test that
> was returning short reads, and bisected it to ee5b46a353af ("btrfs:
> increase direct io read size limit to 256 sectors").
>
> The root cause is that their test was doing larger reads via io_uring
> with NOWAIT and async. This triggered a page fault during the direct
> read; however, the first page worked just fine, and thus we submitted a
> 4k read for a larger iocb.
>
> Btrfs allows for partial IOs in this case specifically because we don't
> allow page faults, so we'll submit whatever IO we can, come back and
> fault in the rest of the range, and try to do the remaining IO.
>
> However, for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
> partial dio is done, which is incorrect. In the sync case we can exit
> the iomap code, submit more IOs, and return with the amount of IO we
> were able to complete successfully.
>
> We were always doing short reads in this case, but for NOWAIT we were
> saved by the fact that we were limiting direct reads to sectorsize, and
> for reads larger than that we would return EAGAIN.
>
> Fix the regression by simply returning EAGAIN in the NOWAIT case for
> larger reads; that way io_uring can retry, issue the larger IO, and have
> the fault logic handle everything properly.
>
> This still leaves the AIO short read case, but that existed before this
> change. The proper fix would be to handle partial iocb completions, but
> that's a lot of work; for now, deal with the regression in the most
> straightforward way possible.
>
> Reported-by: Dylan Yudaken <dylany@fb.com>
> Fixes: ee5b46a353af ("btrfs: increase direct io read size limit to 256 sectors")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Added to misc-next, thanks.