[PATCH] btrfs: don't allow large NOWAIT direct reads

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

* [PATCH] btrfs: don't allow large NOWAIT direct reads
@ 2022-08-19 15:53 Josef Bacik
  2022-08-19 16:50 ` Filipe Manana
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Josef Bacik @ 2022-08-19 15:53 UTC (permalink / raw)
  To: kernel-team, linux-btrfs; +Cc: Dylan Yudaken

Dylan and Jens reported a problem where they had an io_uring test that
was returning short reads, and bisected it to ee5b46a353af ("btrfs:
increase direct io read size limit to 256 sectors").

The root cause is their test was doing larger reads via io_uring with
NOWAIT and async.  This was triggering a page fault during the direct
read, however the first page was able to work just fine and thus we
submitted a 4k read for a larger iocb.

Btrfs allows for partial IO's in this case specifically because we don't
allow page faults, and thus we'll attempt to do any io that we can,
submit what we could, come back and fault in the rest of the range and
try to do the remaining IO.

However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
partial dio is done, which is incorrect.  In the sync case we can exit
the iomap code, submit more io's, and return with the amount of IO we
were able to complete successfully.

We were always doing short reads in this case, but for NOWAIT we were
getting saved by the fact that we were limiting direct reads to
sectorsize, and if we were larger than that we would return EAGAIN.

Fix the regression by simply returning EAGAIN in the NOWAIT case with
larger reads, that way io_uring can retry and get the larger IO and have
the fault logic handle everything properly.

This still leaves the AIO short read case, but that existed before this
change.  The way to properly fix this would be to handle partial iocb
completions, but that's a lot of work, for now deal with the regression
in the most straightforward way possible.

Reported-by: Dylan Yudaken <dylany@fb.com>
Fixes: ee5b46a353af ("btrfs: increase direct io read size limit to 256 sectors")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5101111c5557..b39673e49732 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7694,6 +7694,20 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	const u64 data_alloc_len = length;
 	bool unlock_extents = false;

+	/*
+	 * We could potentially fault if we have a buffer > PAGE_SIZE, and if
+	 * we're NOWAIT we may submit a bio for a partial range and return
+	 * EIOCBQUEUED, which would result in an errant short read.
+	 *
+	 * The best way to handle this would be to allow for partial completions
+	 * of iocb's, so we could submit the partial bio, return and fault in
+	 * the rest of the pages, and then submit the io for the rest of the
+	 * range.  However we don't have that currently, so simply return
+	 * -EAGAIN at this point so that the normal path is used.
+	 */
+	if (!write && (flags & IOMAP_NOWAIT) && length > PAGE_SIZE)
+		return -EAGAIN;
+
 	/*
 	 * Cap the size of reads to that usually seen in buffered I/O as we need
 	 * to allocate a contiguous array for the checksums.
-- 
2.26.3

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: don't allow large NOWAIT direct reads
  2022-08-19 15:53 [PATCH] btrfs: don't allow large NOWAIT direct reads Josef Bacik
@ 2022-08-19 16:50 ` Filipe Manana
  2022-08-19 22:11 ` Qu Wenruo
  2022-08-22 15:58 ` David Sterba
  2 siblings, 0 replies; 4+ messages in thread
From: Filipe Manana @ 2022-08-19 16:50 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs, Dylan Yudaken

On Fri, Aug 19, 2022 at 11:53:39AM -0400, Josef Bacik wrote:
> Dylan and Jens reported a problem where they had an io_uring test that
> was returning short reads, and bisected it to ee5b46a353af ("btrfs:
> increase direct io read size limit to 256 sectors").
> 
> The root cause is their test was doing larger reads via io_uring with
> NOWAIT and async.  This was triggering a page fault during the direct
> read, however the first page was able to work just fine and thus we
> submitted a 4k read for a larger iocb.
> 
> Btrfs allows for partial IO's in this case specifically because we don't
> allow page faults, and thus we'll attempt to do any io that we can,
> submit what we could, come back and fault in the rest of the range and
> try to do the remaining IO.
> 
> However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
> partial dio is done, which is incorrect.  In the sync case we can exit
> the iomap code, submit more io's, and return with the amount of IO we
> were able to complete successfully.
> 
> We were always doing short reads in this case, but for NOWAIT we were
> getting saved by the fact that we were limiting direct reads to
> sectorsize, and if we were larger than that we would return EAGAIN.
> 
> Fix the regression by simply returning EAGAIN in the NOWAIT case with
> larger reads, that way io_uring can retry and get the larger IO and have
> the fault logic handle everything properly.
> 
> This still leaves the AIO short read case, but that existed before this
> change.  The way to properly fix this would be to handle partial iocb
> completions, but that's a lot of work, for now deal with the regression
> in the most straightforward way possible.
> 
> Reported-by: Dylan Yudaken <dylany@fb.com>
> Fixes: ee5b46a353af ("btrfs: increase direct io read size limit to 256 sectors")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/inode.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5101111c5557..b39673e49732 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7694,6 +7694,20 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>  	const u64 data_alloc_len = length;
>  	bool unlock_extents = false;
>  
> +	/*
> +	 * We could potentially fault if we have a buffer > PAGE_SIZE, and if
> +	 * we're NOWAIT we may submit a bio for a partial range and return
> +	 * EIOCBQUEUED, which would result in an errant short read.
> +	 *
> +	 * The best way to handle this would be to allow for partial completions
> +	 * of iocb's, so we could submit the partial bio, return and fault in
> +	 * the rest of the pages, and then submit the io for the rest of the
> +	 * range.  However we don't have that currently, so simply return
> +	 * -EAGAIN at this point so that the normal path is used.
> +	 */
> +	if (!write && (flags & IOMAP_NOWAIT) && length > PAGE_SIZE)
> +		return -EAGAIN;
> +
>  	/*
>  	 * Cap the size of reads to that usually seen in buffered I/O as we need
>  	 * to allocate a contiguous array for the checksums.
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: don't allow large NOWAIT direct reads
  2022-08-19 15:53 [PATCH] btrfs: don't allow large NOWAIT direct reads Josef Bacik
  2022-08-19 16:50 ` Filipe Manana
@ 2022-08-19 22:11 ` Qu Wenruo
  2022-08-22 15:58 ` David Sterba
  2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2022-08-19 22:11 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-btrfs; +Cc: Dylan Yudaken



On 2022/8/19 23:53, Josef Bacik wrote:
> Dylan and Jens reported a problem where they had an io_uring test that
> was returning short reads, and bisected it to ee5b46a353af ("btrfs:
> increase direct io read size limit to 256 sectors").
>
> The root cause is their test was doing larger reads via io_uring with
> NOWAIT and async.  This was triggering a page fault during the direct
> read, however the first page was able to work just fine and thus we
> submitted a 4k read for a larger iocb.
>
> Btrfs allows for partial IO's in this case specifically because we don't
> allow page faults, and thus we'll attempt to do any io that we can,
> submit what we could, come back and fault in the rest of the range and
> try to do the remaining IO.
>
> However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
> partial dio is done, which is incorrect.  In the sync case we can exit
> the iomap code, submit more io's, and return with the amount of IO we
> were able to complete successfully.
>
> We were always doing short reads in this case, but for NOWAIT we were
> getting saved by the fact that we were limiting direct reads to
> sectorsize, and if we were larger than that we would return EAGAIN.
>
> Fix the regression by simply returning EAGAIN in the NOWAIT case with
> larger reads, that way io_uring can retry and get the larger IO and have
> the fault logic handle everything properly.
>
> This still leaves the AIO short read case, but that existed before this
> change.  The way to properly fix this would be to handle partial iocb
> completions, but that's a lot of work, for now deal with the regression
> in the most straightforward way possible.
>
> Reported-by: Dylan Yudaken <dylany@fb.com>
> Fixes: ee5b46a353af ("btrfs: increase direct io read size limit to 256 sectors")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>   fs/btrfs/inode.c | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5101111c5557..b39673e49732 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7694,6 +7694,20 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>   	const u64 data_alloc_len = length;
>   	bool unlock_extents = false;
>
> +	/*
> +	 * We could potentially fault if we have a buffer > PAGE_SIZE, and if
> +	 * we're NOWAIT we may submit a bio for a partial range and return
> +	 * EIOCBQUEUED, which would result in an errant short read.
> +	 *
> +	 * The best way to handle this would be to allow for partial completions
> +	 * of iocb's, so we could submit the partial bio, return and fault in
> +	 * the rest of the pages, and then submit the io for the rest of the
> +	 * range.  However we don't have that currently, so simply return
> +	 * -EAGAIN at this point so that the normal path is used.
> +	 */
> +	if (!write && (flags & IOMAP_NOWAIT) && length > PAGE_SIZE)

Since there comes PAGE_SIZE, this let me wonder how would this handle
subpage cases.

Not familiar with page fault handling, but I guess since it's really
working in page unit, thus it should be fine I guess?

Thanks,
Qu
> +		return -EAGAIN;
> +
>   	/*
>   	 * Cap the size of reads to that usually seen in buffered I/O as we need
>   	 * to allocate a contiguous array for the checksums.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: don't allow large NOWAIT direct reads
  2022-08-19 15:53 [PATCH] btrfs: don't allow large NOWAIT direct reads Josef Bacik
  2022-08-19 16:50 ` Filipe Manana
  2022-08-19 22:11 ` Qu Wenruo
@ 2022-08-22 15:58 ` David Sterba
  2 siblings, 0 replies; 4+ messages in thread
From: David Sterba @ 2022-08-22 15:58 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs, Dylan Yudaken

On Fri, Aug 19, 2022 at 11:53:39AM -0400, Josef Bacik wrote:
> Dylan and Jens reported a problem where they had an io_uring test that
> was returning short reads, and bisected it to ee5b46a353af ("btrfs:
> increase direct io read size limit to 256 sectors").
> 
> The root cause is their test was doing larger reads via io_uring with
> NOWAIT and async.  This was triggering a page fault during the direct
> read, however the first page was able to work just fine and thus we
> submitted a 4k read for a larger iocb.
> 
> Btrfs allows for partial IO's in this case specifically because we don't
> allow page faults, and thus we'll attempt to do any io that we can,
> submit what we could, come back and fault in the rest of the range and
> try to do the remaining IO.
> 
> However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
> partial dio is done, which is incorrect.  In the sync case we can exit
> the iomap code, submit more io's, and return with the amount of IO we
> were able to complete successfully.
> 
> We were always doing short reads in this case, but for NOWAIT we were
> getting saved by the fact that we were limiting direct reads to
> sectorsize, and if we were larger than that we would return EAGAIN.
> 
> Fix the regression by simply returning EAGAIN in the NOWAIT case with
> larger reads, that way io_uring can retry and get the larger IO and have
> the fault logic handle everything properly.
> 
> This still leaves the AIO short read case, but that existed before this
> change.  The way to properly fix this would be to handle partial iocb
> completions, but that's a lot of work, for now deal with the regression
> in the most straightforward way possible.
> 
> Reported-by: Dylan Yudaken <dylany@fb.com>
> Fixes: ee5b46a353af ("btrfs: increase direct io read size limit to 256 sectors")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-08-22 16:04 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-08-19 15:53 [PATCH] btrfs: don't allow large NOWAIT direct reads Josef Bacik
2022-08-19 16:50 ` Filipe Manana
2022-08-19 22:11 ` Qu Wenruo
2022-08-22 15:58 ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox