* [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
@ 2023-10-26 14:08 Pankaj Raghav
2023-10-26 22:10 ` Dave Chinner
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-26 14:08 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: willy, djwong, mcgrof, hch, da.gomez, gost.dev, david,
Pankaj Raghav
From: Pankaj Raghav <p.raghav@samsung.com>
iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
< fs block size. iomap_dio_zero() has an implicit assumption that fs block
size < page_size. This is true for most filesystems at the moment.
If the block size > page size (large block sizes)[1], this will send the
contents of the page next to the zero page (as len > PAGE_SIZE) to the
underlying block device, causing FS corruption.
iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.
Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
[1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
---
I had initially planned on sending this as a part of LBS patches but I
think this can go as a standalone patch as it is a generic fix to iomap.
@Dave Chinner: this fixes the corruption issue you were seeing in
generic/091 for bs=64k in [2].
[2] https://lore.kernel.org/lkml/ZQfbHloBUpDh+zCg@dread.disaster.area/
fs/iomap/direct-io.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..04f6c5548136 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
struct page *page = ZERO_PAGE(0);
struct bio *bio;
- bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+ WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+ bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+ REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
GFP_KERNEL);
+
bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
- __bio_add_page(bio, page, len, 0);
+ while (len) {
+ unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+ __bio_add_page(bio, page, io_len, 0);
+ len -= io_len;
+ }
iomap_dio_submit_bio(iter, dio, bio, pos);
}
base-commit: 05d3ef8bba77c1b5f98d941d8b2d4aeab8118ef1
--
2.40.1
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-26 14:08 [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav
@ 2023-10-26 22:10 ` Dave Chinner
2023-10-27 7:53 ` Pankaj Raghav
2023-10-26 23:20 ` Luis Chamberlain
` (2 subsequent siblings)
3 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2023-10-26 22:10 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, mcgrof, hch, da.gomez,
gost.dev, Pankaj Raghav
On Thu, Oct 26, 2023 at 04:08:32PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size (Large block sizes)[1], this will send the
> contents of the page next to zero page(as len > PAGE_SIZE) to the
> underlying block device, causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
>
> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>
> [1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
> ---
> I had initially planned on sending this as a part of LBS patches but I
> think this can go as a standalone patch as it is a generic fix to iomap.
>
> @Dave chinner This fixes the corruption issue you were seeing in
> generic/091 for bs=64k in [2]
>
> [2] https://lore.kernel.org/lkml/ZQfbHloBUpDh+zCg@dread.disaster.area/
>
> fs/iomap/direct-io.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..04f6c5548136 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> struct page *page = ZERO_PAGE(0);
> struct bio *bio;
>
> - bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> + WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
How can that happen here? Max fsb size will be 64kB for the
foreseeable future, the bio can hold 256 pages so it will have a
minimum size capability of 1MB.
FWIW, as a general observation, I think this is the wrong place to
be checking that a filesystem block is larger than can be fit in a
single bio. There's going to be problems all over the place if we
can't do fsb sized IO in a single bio. i.e. I think this sort of
validation should be performed during filesystem mount, not
sporadically checked with WARN_ON() checks in random places in the
IO path...
> +
> + bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> + REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> GFP_KERNEL);
> +
> bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> bio->bi_private = dio;
> bio->bi_end_io = iomap_dio_bio_end_io;
>
> - __bio_add_page(bio, page, len, 0);
> + while (len) {
> + unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +
> + __bio_add_page(bio, page, io_len, 0);
> + len -= io_len;
> + }
> iomap_dio_submit_bio(iter, dio, bio, pos);
/me wonders if we should have a set of ZERO_FOLIO()s that contain a
folio of each possible size. Then we just pull the ZERO_FOLIO of the
correct size and use __bio_add_folio(). i.e. no need for
looping over the bio to repeatedly add the ZERO_PAGE, etc, and the
code is then identical for all cases of page size vs FSB size.
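i.e. something like this - purely a sketch of the idea, none of these
names exist in the kernel today:

	/* one statically allocated zero folio per order, set up at boot */
	static struct folio *zero_folios[MAX_PAGECACHE_ORDER + 1];

	#define ZERO_FOLIO(order)	(zero_folios[(order)])

and iomap_dio_zero() would then just add ZERO_FOLIO(get_order(len)) to
the bio as a single folio, whatever the FSB vs page size relationship.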
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-26 14:08 [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav
2023-10-26 22:10 ` Dave Chinner
@ 2023-10-26 23:20 ` Luis Chamberlain
2023-10-27 5:18 ` Christoph Hellwig
2023-10-27 22:10 ` Dave Chinner
3 siblings, 0 replies; 15+ messages in thread
From: Luis Chamberlain @ 2023-10-26 23:20 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, hch, da.gomez, gost.dev,
david, Pankaj Raghav
On Thu, Oct 26, 2023 at 04:08:32PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size (Large block sizes)[1], this will send the
> contents of the page next to zero page(as len > PAGE_SIZE) to the
> underlying block device, causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
>
> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
Nice!
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>
> [1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
This URL just needs to go above the Fixes tag.
Luis
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-26 14:08 [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav
2023-10-26 22:10 ` Dave Chinner
2023-10-26 23:20 ` Luis Chamberlain
@ 2023-10-27 5:18 ` Christoph Hellwig
2023-10-27 8:03 ` Pankaj Raghav
2023-10-27 22:10 ` Dave Chinner
3 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2023-10-27 5:18 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, mcgrof, hch, da.gomez,
gost.dev, david, Pankaj Raghav
>
> - __bio_add_page(bio, page, len, 0);
> + while (len) {
> + unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +
> + __bio_add_page(bio, page, io_len, 0);
> + len -= io_len;
> + }
Maybe out of self-interest, but shouldn't we replace ZERO_PAGE with a
sufficiently larger ZERO_FOLIO? Right now I have a case where I have
to have a zero padding of up to MAX_PAGECACHE_ORDER minus block size,
so having a MAX_PAGECACHE_ORDER folio would have been really helpful
for me, but I suspect there are many other such cases as well.
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-26 22:10 ` Dave Chinner
@ 2023-10-27 7:53 ` Pankaj Raghav
2023-10-27 22:07 ` Dave Chinner
0 siblings, 1 reply; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-27 7:53 UTC (permalink / raw)
To: Dave Chinner, Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, mcgrof, hch, da.gomez,
gost.dev
>> - bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>> + WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
>
> How can that happen here? Max fsb size will be 64kB for the
> foreseeable future, the bio can hold 256 pages so it will have a
> minimum size capability of 1MB.
>
I just added it as a pathological check. This will not trigger
anytime in the near future.
> FWIW, as a general observation, I think this is the wrong place to
> be checking that a filesystem block is larger than can be fit in a
> single bio. There's going to be problems all over the place if we
> can't do fsb sized IO in a single bio. i.e. I think this sort of
> validation should be performed during filesystem mount, not
> sporadically checked with WARN_ON() checks in random places in the
> IO path...
>
I agree that it should be a check at a higher level.
As iomap can be used by any filesystem, the check on the FSB limitation
should be part of iomap, right? I don't see any explicit documentation/comment
that states iomap's limitations on FSB, etc.
>>
>> - __bio_add_page(bio, page, len, 0);
>> + while (len) {
>> + unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
>> +
>> + __bio_add_page(bio, page, io_len, 0);
>> + len -= io_len;
>> + }
>> iomap_dio_submit_bio(iter, dio, bio, pos);
>
> /me wonders if we should have a set of ZERO_FOLIO()s that contain a
> folio of each possible size. Then we just pull the ZERO_FOLIO of the
> correct size and use __bio_add_folio(). i.e. no need for
> looping over the bio to repeatedly add the ZERO_PAGE, etc, and the
> code is then identical for all cases of page size vs FSB size.
>
I was exactly looking for ZERO_FOLIOs. Instead of having a ZERO_PAGE, can
we reserve a ZERO_HUGE_PAGE so that it can be used directly by
bio_add_folio_nofail()?
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 5:18 ` Christoph Hellwig
@ 2023-10-27 8:03 ` Pankaj Raghav
2023-10-27 10:47 ` Matthew Wilcox
0 siblings, 1 reply; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-27 8:03 UTC (permalink / raw)
To: Christoph Hellwig, Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, mcgrof, da.gomez,
gost.dev, david
On 27/10/2023 07:18, Christoph Hellwig wrote:
>>
>> - __bio_add_page(bio, page, len, 0);
>> + while (len) {
>> + unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
>> +
>> + __bio_add_page(bio, page, io_len, 0);
>> + len -= io_len;
>> + }
>
> Maybe out of self-interest, but shouldn't we replace ZERO_PAGE with a
> sufficiently larger ZERO_FOLIO? Right now I have a case where I have
> to have a zero padding of up to MAX_PAGECACHE_ORDER minus block size,
> so having a MAX_PAGECACHE_ORDER folio would have been really helpful
> for me, but I suspect there are many other such cases as well.
I think that would definitely be useful.
I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
Probably there are more places which could use a ZERO_FOLIO directly
instead of iterating with ZERO_PAGE.
Chinner also had a similar comment. It would be nice if we can reserve
a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
one folio to the bio.
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 8:03 ` Pankaj Raghav
@ 2023-10-27 10:47 ` Matthew Wilcox
2023-10-27 15:41 ` Pankaj Raghav
0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2023-10-27 10:47 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-fsdevel,
djwong, mcgrof, da.gomez, gost.dev, david
On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
> Probably there are more places which could use a ZERO_FOLIO directly
> instead of iterating with ZERO_PAGE.
>
> Chinner also had a similar comment. It would be nice if we can reserve
> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
> one folio to the bio.
i'm on holiday atm. start looking at mm_get_huge_zero_page()
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 10:47 ` Matthew Wilcox
@ 2023-10-27 15:41 ` Pankaj Raghav
2023-10-27 22:59 ` Matthew Wilcox
2023-10-28 13:17 ` Hannes Reinecke
0 siblings, 2 replies; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-27 15:41 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-fsdevel,
djwong, mcgrof, da.gomez, gost.dev, david
On 27/10/2023 12:47, Matthew Wilcox wrote:
> On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
>> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
>> Probably there are more places which could use a ZERO_FOLIO directly
>> instead of iterating with ZERO_PAGE.
>>
>> Chinner also had a similar comment. It would be nice if we can reserve
>> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
>> one folio to the bio.
>
> i'm on holiday atm. start looking at mm_get_huge_zero_page()
Thanks for the pointer. I made a rough version of how it might
look if I use that API:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..6ae21bd16dbe 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -236,17 +236,43 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
loff_t pos, unsigned len)
{
struct inode *inode = file_inode(dio->iocb->ki_filp);
- struct page *page = ZERO_PAGE(0);
+ struct page *page = NULL;
+ bool huge_page = false;
+ bool fallback = false;
struct bio *bio;
- bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+ if (len > PAGE_SIZE) {
+ page = mm_get_huge_zero_page(current->mm);
+ if (likely(page))
+ huge_page = true;
+ }
+
+ if (!huge_page)
+ page = ZERO_PAGE(0);
+
+ fallback = ((len > PAGE_SIZE) && !huge_page);
+
+ bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+ REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
GFP_KERNEL);
+
bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
- __bio_add_page(bio, page, len, 0);
+ if (!fallback) {
+ bio_add_folio_nofail(bio, page_folio(page), len, 0);
+ } else {
+ while (len) {
+ unsigned int io_len =
+ min_t(unsigned int, len, PAGE_SIZE);
+
+ __bio_add_page(bio, page, io_len, 0);
+ len -= io_len;
+ }
+ }
+
iomap_dio_submit_bio(iter, dio, bio, pos);
}
The only issue with mm_get_huge_zero_page() is that it can fail, and we need
to fall back to iteration if it cannot allocate a huge folio.
PS: Enjoy your holidays :)
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 7:53 ` Pankaj Raghav
@ 2023-10-27 22:07 ` Dave Chinner
0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2023-10-27 22:07 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Pankaj Raghav, linux-xfs, linux-fsdevel, willy, djwong, mcgrof,
hch, da.gomez, gost.dev
On Fri, Oct 27, 2023 at 09:53:19AM +0200, Pankaj Raghav wrote:
> >> - bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> >> + WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> >
> > How can that happen here? Max fsb size will be 64kB for the
> > foreseeable future, the bio can hold 256 pages so it will have a
> > minimum size capability of 1MB.
> >
>
> I just added it as a pathological check. This will not trigger
> anytime in the near future.
Yeah, exactly my point - it should never happen, it's a fundamental
developer stuff-up bug, not a runtime bug, and so shouldn't be
checked at runtime on every bio....
> > FWIW, as a general observation, I think this is the wrong place to
> > be checking that a filesystem block is larger than can be fit in a
> > single bio. There's going to be problems all over the place if we
> > can't do fsb sized IO in a single bio. i.e. I think this sort of
> > validation should be performed during filesystem mount, not
> > sporadically checked with WARN_ON() checks in random places in the
> > IO path...
> >
>
> I agree that it should be a check at a higher level.
>
> As iomap can be used by any filesystem, the check on FSB limitation
> should be a part iomap right? I don't see any explicit document/comment
> that states the iomap limitations on FSB, etc.
No, it should be part of the filesystem that *uses the bio
infrastructure*.
We already do that with the page cache - filesystems check at mount
time that the FSB is <= PAGE_SIZE and reject the mount if this check
fails because the page cache simply cannot handle this situation.
This is no different: if we are going to allow FSB > PAGE_SIZE, then
we need to ensure somewhere, even as a BUILD_BUG_ON(), that the max
supported FSB the filesystem has is smaller than what we can pack in a
bio.
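e.g. for XFS it could be as simple as this somewhere in the mount/init
path, using the existing XFS_MAX_BLOCKSIZE define (a sketch only - where
exactly it lives is up to the filesystem):

	/* the largest block size we support must always fit in one bio */
	BUILD_BUG_ON(XFS_MAX_BLOCKSIZE > BIO_MAX_VECS * PAGE_SIZE);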
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-26 14:08 [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav
` (2 preceding siblings ...)
2023-10-27 5:18 ` Christoph Hellwig
@ 2023-10-27 22:10 ` Dave Chinner
2023-10-28 17:20 ` Pankaj Raghav
3 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2023-10-27 22:10 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, mcgrof, hch, da.gomez,
gost.dev, Pankaj Raghav
On Thu, Oct 26, 2023 at 04:08:32PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size (Large block sizes)[1], this will send the
> contents of the page next to zero page(as len > PAGE_SIZE) to the
> underlying block device, causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
>
> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
I forgot to mention - this fixes tag is completely bogus. That's
just a commit that moves the code from one file to another and
there's no actual functional change at all.
Further, this isn't a change that "fixes" a bug or regression - it
is a change to support new functionality that doesn't yet exist
upstream, and so there is no bug in the existing kernels for it to
fix....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 15:41 ` Pankaj Raghav
@ 2023-10-27 22:59 ` Matthew Wilcox
2023-10-28 19:57 ` Pankaj Raghav
2023-10-28 13:17 ` Hannes Reinecke
1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2023-10-27 22:59 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-fsdevel,
djwong, mcgrof, da.gomez, gost.dev, david
On Fri, Oct 27, 2023 at 05:41:10PM +0200, Pankaj Raghav wrote:
> On 27/10/2023 12:47, Matthew Wilcox wrote:
> > On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
> >> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
> >> Probably there are more places which could use a ZERO_FOLIO directly
> >> instead of iterating with ZERO_PAGE.
> >>
> >> Chinner also had a similar comment. It would be nice if we can reserve
> >> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
> >> one folio to the bio.
> >
> > i'm on holiday atm. start looking at mm_get_huge_zero_page()
>
> Thanks for the pointer. I made a rough version of how it might
> look like if I use that API:
useful thing to do. i think this shows we need a new folio api wrapping
it. happy to do that when i'm back, or you can have a crack at it.
your point about it possibly failing is correct. so i think we need an
api which definitely returns a folio, but it might be of arbitrary
order.
> bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> bio->bi_private = dio;
> bio->bi_end_io = iomap_dio_bio_end_io;
>
> - __bio_add_page(bio, page, len, 0);
> + if (!fallback) {
> + bio_add_folio_nofail(bio, page_folio(page), len, 0);
> + } else {
> + while (len) {
> + unsigned int io_len =
> + min_t(unsigned int, len, PAGE_SIZE);
> +
> + __bio_add_page(bio, page, io_len, 0);
> + len -= io_len;
> + }
> + }
> +
> iomap_dio_submit_bio(iter, dio, bio, pos);
then this can look something like:
while (len) {
size_t size = min(len, folio_size(folio));
__bio_add_folio(bio, folio, size, 0);
len -= size;
}
> PS: Enjoy your holidays :)
cheers ;-)
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 15:41 ` Pankaj Raghav
2023-10-27 22:59 ` Matthew Wilcox
@ 2023-10-28 13:17 ` Hannes Reinecke
2023-10-28 16:57 ` Pankaj Raghav
1 sibling, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2023-10-28 13:17 UTC (permalink / raw)
To: Pankaj Raghav, Matthew Wilcox
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-fsdevel,
djwong, mcgrof, da.gomez, gost.dev, david
On 10/27/23 17:41, Pankaj Raghav wrote:
> On 27/10/2023 12:47, Matthew Wilcox wrote:
>> On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
>>> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
>>> Probably there are more places which could use a ZERO_FOLIO directly
>>> instead of iterating with ZERO_PAGE.
>>>
>>> Chinner also had a similar comment. It would be nice if we can reserve
>>> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
>>> one folio to the bio.
>>
>> i'm on holiday atm. start looking at mm_get_huge_zero_page()
>
> Thanks for the pointer. I made a rough version of how it might
> look like if I use that API:
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..6ae21bd16dbe 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -236,17 +236,43 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> loff_t pos, unsigned len)
> {
> struct inode *inode = file_inode(dio->iocb->ki_filp);
> - struct page *page = ZERO_PAGE(0);
> + struct page *page = NULL;
> + bool huge_page = false;
> + bool fallback = false;
> struct bio *bio;
>
> - bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> + if (len > PAGE_SIZE) {
> + page = mm_get_huge_zero_page(current->mm);
> + if (likely(page))
> + huge_page = true;
> + }
> +
> + if (!huge_page)
> + page = ZERO_PAGE(0);
> +
> + fallback = ((len > PAGE_SIZE) && !huge_page);
> +
That is pointless.
Bios can handle pages larger than PAGE_SIZE.
> + bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> + REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
Similar here. Just allocate space for a single folio.
> fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> GFP_KERNEL);
> +
> bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> bio->bi_private = dio;
> bio->bi_end_io = iomap_dio_bio_end_io;
>
> - __bio_add_page(bio, page, len, 0);
> + if (!fallback) {
> + bio_add_folio_nofail(bio, page_folio(page), len, 0);
> + } else {
> + while (len) {
> + unsigned int io_len =
> + min_t(unsigned int, len, PAGE_SIZE);
> +
> + __bio_add_page(bio, page, io_len, 0);
> + len -= io_len;
> + }
> + }
See above. Kill the 'fallback' case.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-28 13:17 ` Hannes Reinecke
@ 2023-10-28 16:57 ` Pankaj Raghav
0 siblings, 0 replies; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-28 16:57 UTC (permalink / raw)
To: Hannes Reinecke, Matthew Wilcox
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-fsdevel,
djwong, mcgrof, da.gomez, gost.dev, david
>> - bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>> + if (len > PAGE_SIZE) {
>> + page = mm_get_huge_zero_page(current->mm);
>> + if (likely(page))
>> + huge_page = true;
>> + }
>> +
>> + if (!huge_page)
>> + page = ZERO_PAGE(0);
>> +
>> + fallback = ((len > PAGE_SIZE) && !huge_page);
>> +
> That is pointless.
> Bios can handle pages larger than PAGE_SIZE.
Yes, I know BIOs can handle large pages. But the point here is that if we fail
to allocate a huge zero page that can cover the complete large FSB (> page size),
then we need to use the statically allocated ZERO_PAGE (see the original patch)
at multiple offsets covering the range.
Unless we have an API that can return a zero folio of arbitrary order (see also the
reply from Willy), we can't use a bio with one vec for LBS support.
--
Pankaj
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 22:10 ` Dave Chinner
@ 2023-10-28 17:20 ` Pankaj Raghav
0 siblings, 0 replies; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-28 17:20 UTC (permalink / raw)
To: Dave Chinner, Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, mcgrof, hch, da.gomez,
gost.dev
>> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
>
> I forgot to mention - this fixes tag is completely bogus. That's
> just a commit that moves the code from one file to another and
> there's no actual functional change at all.
>
> Further, this isn't a change that "fixes" a bug or regression - it
> is a change to support new functionality that doesnt' yet exist
> upstream, and so there is no bug in the existing kernels for it to
> fix....
I agree. I will remove the Fixes tag.
I added it because the iomap infra does not explicitly state that it does not
support the LBS use case. But I agree that it is a change to support
a new feature, and it shouldn't go in as a fix.
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 22:59 ` Matthew Wilcox
@ 2023-10-28 19:57 ` Pankaj Raghav
0 siblings, 0 replies; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-28 19:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-fsdevel,
djwong, mcgrof, da.gomez, gost.dev, david
>> Thanks for the pointer. I made a rough version of how it might
>> look like if I use that API:
>
> useful thing to do. i think this shows we need a new folio api wrapping
> it. happy to do that when i'm back, or you can have a crack at it.
>
A folio wrapper for mm_get_huge_zero_page()?
I tried to look at all the callers of mm_get_huge_zero_page() and I don't
see them doing the folio <-> page dance. Of course, if we end up using
mm_get_huge_zero_page() in iomap, then making a folio API might be
worth it!
> your point about it possibly failing is correct. so i think we need an
> api which definitely returns a folio, but it might be of arbitrary
> order.
>
That would be really nice, given that many parts of the kernel could
use such an API to avoid iterating with ZERO_PAGE().
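Something like this, perhaps (just a sketch - the helper name is made up
and the refcounting/lifetime of the huge zero page is glossed over here):

	static struct folio *zero_folio(void)
	{
		struct page *page = mm_get_huge_zero_page(current->mm);

		/* never fail: fall back to the single zero page */
		if (!page)
			page = ZERO_PAGE(0);
		return page_folio(page);
	}

The caller could then just loop with folio_size(), as in your snippet
below.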
>> +
>> iomap_dio_submit_bio(iter, dio, bio, pos);
>
> then this can look something like:
>
> while (len) {
> size_t size = min(len, folio_size(folio));
>
> __bio_add_folio(bio, folio, size, 0);
> len -= size;
> }
>
Ah, this looks good!
>> PS: Enjoy your holidays :)
>
> cheers ;-)