* [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
@ 2023-10-26 14:08 Pankaj Raghav
2023-10-26 22:10 ` Dave Chinner
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-26 14:08 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: willy, djwong, mcgrof, hch, da.gomez, gost.dev, david,
Pankaj Raghav
From: Pankaj Raghav <p.raghav@samsung.com>
iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
< fs block size. iomap_dio_zero() has an implicit assumption that fs block
size < page_size. This is true for most filesystems at the moment.
If the block size > page size (large block sizes)[1], this will send the
contents of the page next to the zero page (as len > PAGE_SIZE) to the
underlying block device, causing FS corruption.
iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.
Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
[1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
---
I had initially planned on sending this as a part of LBS patches but I
think this can go as a standalone patch as it is a generic fix to iomap.
@Dave Chinner: this fixes the corruption issue you were seeing in
generic/091 for bs=64k in [2].
[2] https://lore.kernel.org/lkml/ZQfbHloBUpDh+zCg@dread.disaster.area/
fs/iomap/direct-io.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..04f6c5548136 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
struct page *page = ZERO_PAGE(0);
struct bio *bio;
- bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+ WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+ bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+ REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
GFP_KERNEL);
+
bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
- __bio_add_page(bio, page, len, 0);
+ while (len) {
+ unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+ __bio_add_page(bio, page, io_len, 0);
+ len -= io_len;
+ }
iomap_dio_submit_bio(iter, dio, bio, pos);
}
base-commit: 05d3ef8bba77c1b5f98d941d8b2d4aeab8118ef1
--
2.40.1
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-26 14:08 [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav
@ 2023-10-26 22:10 ` Dave Chinner
2023-10-27 7:53 ` Pankaj Raghav
2023-10-26 23:20 ` Luis Chamberlain
` (2 subsequent siblings)
3 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2023-10-26 22:10 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, mcgrof, hch, da.gomez,
gost.dev, Pankaj Raghav
On Thu, Oct 26, 2023 at 04:08:32PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size (Large block sizes)[1], this will send the
> contents of the page next to zero page(as len > PAGE_SIZE) to the
> underlying block device, causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
>
> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>
> [1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
> ---
> I had initially planned on sending this as a part of LBS patches but I
> think this can go as a standalone patch as it is a generic fix to iomap.
>
> @Dave chinner This fixes the corruption issue you were seeing in
> generic/091 for bs=64k in [2]
>
> [2] https://lore.kernel.org/lkml/ZQfbHloBUpDh+zCg@dread.disaster.area/
>
> fs/iomap/direct-io.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..04f6c5548136 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> struct page *page = ZERO_PAGE(0);
> struct bio *bio;
>
> - bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> + WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
How can that happen here? Max fsb size will be 64kB for the
foreseeable future, the bio can hold 256 pages so it will have a
minimum size capability of 1MB.
FWIW, as a general observation, I think this is the wrong place to
be checking that a filesystem block is larger than can be fit in a
single bio. There's going to be problems all over the place if we
can't do fsb sized IO in a single bio. i.e. I think this sort of
validation should be performed during filesystem mount, not
sporadically checked with WARN_ON() checks in random places in the
IO path...
> +
> + bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> + REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> GFP_KERNEL);
> +
> bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> bio->bi_private = dio;
> bio->bi_end_io = iomap_dio_bio_end_io;
>
> - __bio_add_page(bio, page, len, 0);
> + while (len) {
> + unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +
> + __bio_add_page(bio, page, io_len, 0);
> + len -= io_len;
> + }
> iomap_dio_submit_bio(iter, dio, bio, pos);
/me wonders if we should have a set of ZERO_FOLIO()s that contain a
folio of each possible size. Then we just pull the ZERO_FOLIO of the
correct size and use __bio_add_folio(). i.e. no need for
looping over the bio to repeatedly add the ZERO_PAGE, etc, and the
code is then identical for all cases of page size vs FSB size.
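i.e. something like this - purely a sketch of the idea, none of these
names exist in the kernel today:

	/* one statically allocated zero folio per order, set up at boot */
	static struct folio *zero_folios[MAX_PAGECACHE_ORDER + 1];

	#define ZERO_FOLIO(order)	(zero_folios[(order)])

and iomap_dio_zero() would then just add ZERO_FOLIO(get_order(len)) to
the bio as a single folio, whatever the FSB vs page size relationship.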
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-26 14:08 [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav
2023-10-26 22:10 ` Dave Chinner
@ 2023-10-26 23:20 ` Luis Chamberlain
2023-10-27 5:18 ` Christoph Hellwig
2023-10-27 22:10 ` Dave Chinner
3 siblings, 0 replies; 15+ messages in thread
From: Luis Chamberlain @ 2023-10-26 23:20 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, hch, da.gomez, gost.dev,
david, Pankaj Raghav
On Thu, Oct 26, 2023 at 04:08:32PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size (Large block sizes)[1], this will send the
> contents of the page next to zero page(as len > PAGE_SIZE) to the
> underlying block device, causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
>
> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
Nice!
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>
> [1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
This URL just needs to go above the Fixes tag.
Luis
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-26 14:08 [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav
2023-10-26 22:10 ` Dave Chinner
2023-10-26 23:20 ` Luis Chamberlain
@ 2023-10-27 5:18 ` Christoph Hellwig
2023-10-27 8:03 ` Pankaj Raghav
2023-10-27 22:10 ` Dave Chinner
3 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2023-10-27 5:18 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, mcgrof, hch, da.gomez,
gost.dev, david, Pankaj Raghav
>
> - __bio_add_page(bio, page, len, 0);
> + while (len) {
> + unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +
> + __bio_add_page(bio, page, io_len, 0);
> + len -= io_len;
> + }
Maybe out of self-interest, but shouldn't we replace ZERO_PAGE with a
sufficiently larger ZERO_FOLIO? Right now I have a case where I have
to have a zero padding of up to MAX_PAGECACHE_ORDER minus block size,
so having a MAX_PAGECACHE_ORDER folio would have been really helpful
for me, but I suspect there are many other such cases as well.
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-26 22:10 ` Dave Chinner
@ 2023-10-27 7:53 ` Pankaj Raghav
2023-10-27 22:07 ` Dave Chinner
0 siblings, 1 reply; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-27 7:53 UTC (permalink / raw)
To: Dave Chinner, Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, mcgrof, hch, da.gomez,
gost.dev
>> - bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>> + WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
>
> How can that happen here? Max fsb size will be 64kB for the
> foreseeable future, the bio can hold 256 pages so it will have a
> minimum size capability of 1MB.
>
I just added it as a pathological check. This will not trigger
anytime in the near future.
> FWIW, as a general observation, I think this is the wrong place to
> be checking that a filesystem block is larger than can be fit in a
> single bio. There's going to be problems all over the place if we
> can't do fsb sized IO in a single bio. i.e. I think this sort of
> validation should be performed during filesystem mount, not
> sporadically checked with WARN_ON() checks in random places in the
> IO path...
>
I agree that it should be a check at a higher level.
As iomap can be used by any filesystem, the check on the FSB limitation
should be part of iomap, right? I don't see any explicit documentation/comment
that states iomap's limitations on FSB, etc.
>>
>> - __bio_add_page(bio, page, len, 0);
>> + while (len) {
>> + unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
>> +
>> + __bio_add_page(bio, page, io_len, 0);
>> + len -= io_len;
>> + }
>> iomap_dio_submit_bio(iter, dio, bio, pos);
>
> /me wonders if we should have a set of ZERO_FOLIO()s that contain a
> folio of each possible size. Then we just pull the ZERO_FOLIO of the
> correct size and use __bio_add_folio(). i.e. no need for
> looping over the bio to repeatedly add the ZERO_PAGE, etc, and the
> code is then identical for all cases of page size vs FSB size.
>
I was exactly looking for ZERO_FOLIOs. Instead of having a ZERO_PAGE, can
we reserve a ZERO_HUGE_PAGE so that it can be used directly by
bio_add_folio_nofail()?
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 5:18 ` Christoph Hellwig
@ 2023-10-27 8:03 ` Pankaj Raghav
2023-10-27 10:47 ` Matthew Wilcox
0 siblings, 1 reply; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-27 8:03 UTC (permalink / raw)
To: Christoph Hellwig, Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, mcgrof, da.gomez,
gost.dev, david
On 27/10/2023 07:18, Christoph Hellwig wrote:
>>
>> - __bio_add_page(bio, page, len, 0);
>> + while (len) {
>> + unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
>> +
>> + __bio_add_page(bio, page, io_len, 0);
>> + len -= io_len;
>> + }
>
> Maybe out of self-interest, but shouldn't we replace ZERO_PAGE with a
> sufficiently larger ZERO_FOLIO? Right now I have a case where I have
> to have a zero padding of up to MAX_PAGECACHE_ORDER minus block size,
> so having a MAX_PAGECACHE_ORDER folio would have been really helpful
> for me, but I suspect there are many other such cases as well.
I think that would definitely be useful.
I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
Probably there are more places which could use a ZERO_FOLIO directly
instead of iterating with ZERO_PAGE.
Chinner also had a similar comment. It would be nice if we can reserve
a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
one folio to the bio.
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 8:03 ` Pankaj Raghav
@ 2023-10-27 10:47 ` Matthew Wilcox
2023-10-27 15:41 ` Pankaj Raghav
0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2023-10-27 10:47 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-fsdevel,
djwong, mcgrof, da.gomez, gost.dev, david
On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
> Probably there are more places which could use a ZERO_FOLIO directly
> instead of iterating with ZERO_PAGE.
>
> Chinner also had a similar comment. It would be nice if we can reserve
> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
> one folio to the bio.
i'm on holiday atm. start looking at mm_get_huge_zero_page()
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 10:47 ` Matthew Wilcox
@ 2023-10-27 15:41 ` Pankaj Raghav
2023-10-27 22:59 ` Matthew Wilcox
2023-10-28 13:17 ` Hannes Reinecke
0 siblings, 2 replies; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-27 15:41 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-fsdevel,
djwong, mcgrof, da.gomez, gost.dev, david
On 27/10/2023 12:47, Matthew Wilcox wrote:
> On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
>> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
>> Probably there are more places which could use a ZERO_FOLIO directly
>> instead of iterating with ZERO_PAGE.
>>
>> Chinner also had a similar comment. It would be nice if we can reserve
>> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
>> one folio to the bio.
>
> i'm on holiday atm. start looking at mm_get_huge_zero_page()
Thanks for the pointer. I made a rough version of how it might
look if I use that API:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..6ae21bd16dbe 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -236,17 +236,43 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
loff_t pos, unsigned len)
{
struct inode *inode = file_inode(dio->iocb->ki_filp);
- struct page *page = ZERO_PAGE(0);
+ struct page *page = NULL;
+ bool huge_page = false;
+ bool fallback = false;
struct bio *bio;
- bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+ if (len > PAGE_SIZE) {
+ page = mm_get_huge_zero_page(current->mm);
+ if (likely(page))
+ huge_page = true;
+ }
+
+ if (!huge_page)
+ page = ZERO_PAGE(0);
+
+ fallback = ((len > PAGE_SIZE) && !huge_page);
+
+ bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+ REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
GFP_KERNEL);
+
bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
- __bio_add_page(bio, page, len, 0);
+ if (!fallback) {
+ bio_add_folio_nofail(bio, page_folio(page), len, 0);
+ } else {
+ while (len) {
+ unsigned int io_len =
+ min_t(unsigned int, len, PAGE_SIZE);
+
+ __bio_add_page(bio, page, io_len, 0);
+ len -= io_len;
+ }
+ }
+
iomap_dio_submit_bio(iter, dio, bio, pos);
}
The only issue with mm_get_huge_zero_page() is that it can fail, and we need
to fall back to iteration if it cannot allocate a huge folio.
PS: Enjoy your holidays :)
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 7:53 ` Pankaj Raghav
@ 2023-10-27 22:07 ` Dave Chinner
0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2023-10-27 22:07 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Pankaj Raghav, linux-xfs, linux-fsdevel, willy, djwong, mcgrof,
hch, da.gomez, gost.dev
On Fri, Oct 27, 2023 at 09:53:19AM +0200, Pankaj Raghav wrote:
> >> - bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> >> + WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> >
> > How can that happen here? Max fsb size will be 64kB for the
> > foreseeable future, the bio can hold 256 pages so it will have a
> > minimum size capability of 1MB.
> >
>
> I just added it as a pathological check. This will not trigger
> anytime in the near future.
Yeah, exactly my point - it should never happen, it's a fundamental
developer stuff-up bug, not a runtime bug, and so shouldn't be
checked at runtime on every bio....
> > FWIW, as a general observation, I think this is the wrong place to
> > be checking that a filesystem block is larger than can be fit in a
> > single bio. There's going to be problems all over the place if we
> > can't do fsb sized IO in a single bio. i.e. I think this sort of
> > validation should be performed during filesystem mount, not
> > sporadically checked with WARN_ON() checks in random places in the
> > IO path...
> >
>
> I agree that it should be a check at a higher level.
>
> As iomap can be used by any filesystem, the check on FSB limitation
> should be a part iomap right? I don't see any explicit document/comment
> that states the iomap limitations on FSB, etc.
No, it should be part of the filesystem that *uses the bio
infrastructure*.
We already do that with the page cache - filesystems check at mount
time that the FSB is <= PAGE_SIZE and reject the mount if this check
fails because the page cache simply cannot handle this situation.
This is no different: if we are going to allow FSB > PAGE_SIZE, then
we need to ensure somewhere, even as a BUILD_BUG_ON(), that the max
supported FSB the filesystem has is smaller than what we can pack in a
bio.
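e.g. for XFS it could be as simple as this somewhere in the mount/init
path, using the existing XFS_MAX_BLOCKSIZE define (a sketch only - where
exactly it lives is up to the filesystem):

	/* the largest block size we support must always fit in one bio */
	BUILD_BUG_ON(XFS_MAX_BLOCKSIZE > BIO_MAX_VECS * PAGE_SIZE);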
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-26 14:08 [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav
` (2 preceding siblings ...)
2023-10-27 5:18 ` Christoph Hellwig
@ 2023-10-27 22:10 ` Dave Chinner
2023-10-28 17:20 ` Pankaj Raghav
3 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2023-10-27 22:10 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, mcgrof, hch, da.gomez,
gost.dev, Pankaj Raghav
On Thu, Oct 26, 2023 at 04:08:32PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size (Large block sizes)[1], this will send the
> contents of the page next to zero page(as len > PAGE_SIZE) to the
> underlying block device, causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
>
> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
I forgot to mention - this fixes tag is completely bogus. That's
just a commit that moves the code from one file to another and
there's no actual functional change at all.
Further, this isn't a change that "fixes" a bug or regression - it
is a change to support new functionality that doesn't yet exist
upstream, and so there is no bug in the existing kernels for it to
fix....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 15:41 ` Pankaj Raghav
@ 2023-10-27 22:59 ` Matthew Wilcox
2023-10-28 19:57 ` Pankaj Raghav
2023-10-28 13:17 ` Hannes Reinecke
1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2023-10-27 22:59 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-fsdevel,
djwong, mcgrof, da.gomez, gost.dev, david
On Fri, Oct 27, 2023 at 05:41:10PM +0200, Pankaj Raghav wrote:
> On 27/10/2023 12:47, Matthew Wilcox wrote:
> > On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
> >> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
> >> Probably there are more places which could use a ZERO_FOLIO directly
> >> instead of iterating with ZERO_PAGE.
> >>
> >> Chinner also had a similar comment. It would be nice if we can reserve
> >> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
> >> one folio to the bio.
> >
> > i'm on holiday atm. start looking at mm_get_huge_zero_page()
>
> Thanks for the pointer. I made a rough version of how it might
> look like if I use that API:
useful thing to do. i think this shows we need a new folio api wrapping
it. happy to do that when i'm back, or you can have a crack at it.
your point about it possibly failing is correct. so i think we need an
api which definitely returns a folio, but it might be of arbitrary
order.
> bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> bio->bi_private = dio;
> bio->bi_end_io = iomap_dio_bio_end_io;
>
> - __bio_add_page(bio, page, len, 0);
> + if (!fallback) {
> + bio_add_folio_nofail(bio, page_folio(page), len, 0);
> + } else {
> + while (len) {
> + unsigned int io_len =
> + min_t(unsigned int, len, PAGE_SIZE);
> +
> + __bio_add_page(bio, page, io_len, 0);
> + len -= io_len;
> + }
> + }
> +
> iomap_dio_submit_bio(iter, dio, bio, pos);
then this can look something like:
while (len) {
size_t size = min(len, folio_size(folio));
__bio_add_folio(bio, folio, size, 0);
len -= size;
}
> PS: Enjoy your holidays :)
cheers ;-)
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 15:41 ` Pankaj Raghav
2023-10-27 22:59 ` Matthew Wilcox
@ 2023-10-28 13:17 ` Hannes Reinecke
2023-10-28 16:57 ` Pankaj Raghav
1 sibling, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2023-10-28 13:17 UTC (permalink / raw)
To: Pankaj Raghav, Matthew Wilcox
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-fsdevel,
djwong, mcgrof, da.gomez, gost.dev, david
On 10/27/23 17:41, Pankaj Raghav wrote:
> On 27/10/2023 12:47, Matthew Wilcox wrote:
>> On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
>>> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
>>> Probably there are more places which could use a ZERO_FOLIO directly
>>> instead of iterating with ZERO_PAGE.
>>>
>>> Chinner also had a similar comment. It would be nice if we can reserve
>>> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
>>> one folio to the bio.
>>
>> i'm on holiday atm. start looking at mm_get_huge_zero_page()
>
> Thanks for the pointer. I made a rough version of how it might
> look like if I use that API:
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..6ae21bd16dbe 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -236,17 +236,43 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> loff_t pos, unsigned len)
> {
> struct inode *inode = file_inode(dio->iocb->ki_filp);
> - struct page *page = ZERO_PAGE(0);
> + struct page *page = NULL;
> + bool huge_page = false;
> + bool fallback = false;
> struct bio *bio;
>
> - bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> + if (len > PAGE_SIZE) {
> + page = mm_get_huge_zero_page(current->mm);
> + if (likely(page))
> + huge_page = true;
> + }
> +
> + if (!huge_page)
> + page = ZERO_PAGE(0);
> +
> + fallback = ((len > PAGE_SIZE) && !huge_page);
> +
That is pointless.
Bios can handle pages larger than PAGE_SIZE.
> + bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> + REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
Similar here. Just allocate space for a single folio.
> fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> GFP_KERNEL);
> +
> bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> bio->bi_private = dio;
> bio->bi_end_io = iomap_dio_bio_end_io;
>
> - __bio_add_page(bio, page, len, 0);
> + if (!fallback) {
> + bio_add_folio_nofail(bio, page_folio(page), len, 0);
> + } else {
> + while (len) {
> + unsigned int io_len =
> + min_t(unsigned int, len, PAGE_SIZE);
> +
> + __bio_add_page(bio, page, io_len, 0);
> + len -= io_len;
> + }
> + }
See above. Kill the 'fallback' case.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-28 13:17 ` Hannes Reinecke
@ 2023-10-28 16:57 ` Pankaj Raghav
0 siblings, 0 replies; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-28 16:57 UTC (permalink / raw)
To: Hannes Reinecke, Matthew Wilcox
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-fsdevel,
djwong, mcgrof, da.gomez, gost.dev, david
>> - bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>> + if (len > PAGE_SIZE) {
>> + page = mm_get_huge_zero_page(current->mm);
>> + if (likely(page))
>> + huge_page = true;
>> + }
>> +
>> + if (!huge_page)
>> + page = ZERO_PAGE(0);
>> +
>> + fallback = ((len > PAGE_SIZE) && !huge_page);
>> +
> That is pointless.
> Bios can handle pages larger than PAGE_SIZE.
Yes, I know BIOs can handle large pages. But the point here is that if we fail
to allocate a huge zero page that can cover the complete large FSB (> page size),
then we need to use the statically allocated ZERO_PAGE (see the original patch)
at multiple offsets covering the range.
Unless we have an API that can return a zero folio of arbitrary order (see also the
reply from Willy), we can't use a bio with one vec for LBS support.
--
Pankaj
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 22:10 ` Dave Chinner
@ 2023-10-28 17:20 ` Pankaj Raghav
0 siblings, 0 replies; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-28 17:20 UTC (permalink / raw)
To: Dave Chinner, Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, willy, djwong, mcgrof, hch, da.gomez,
gost.dev
>> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
>
> I forgot to mention - this fixes tag is completely bogus. That's
> just a commit that moves the code from one file to another and
> there's no actual functional change at all.
>
> Further, this isn't a change that "fixes" a bug or regression - it
> is a change to support new functionality that doesnt' yet exist
> upstream, and so there is no bug in the existing kernels for it to
> fix....
I agree. I will remove the Fixes tag.
I added it because the iomap infra does not explicitly state that it does not
support the LBS use case. But I agree that it is a change to support
a new feature, and it shouldn't go in as a fix.
* Re: [PATCH] iomap: fix iomap_dio_zero() for fs bs > system page size
2023-10-27 22:59 ` Matthew Wilcox
@ 2023-10-28 19:57 ` Pankaj Raghav
0 siblings, 0 replies; 15+ messages in thread
From: Pankaj Raghav @ 2023-10-28 19:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-fsdevel,
djwong, mcgrof, da.gomez, gost.dev, david
>> Thanks for the pointer. I made a rough version of how it might
>> look like if I use that API:
>
> useful thing to do. i think this shows we need a new folio api wrapping
> it. happy to do that when i'm back, or you can have a crack at it.
>
A folio wrapper for mm_get_huge_zero_page()?
I tried to look at all the callers of mm_get_huge_zero_page() and I don't
see them doing the folio <-> page dance. Of course, if we end up using
mm_get_huge_zero_page() in iomap, then making a folio API might be
worth it!
> your point about it possibly failing is correct. so i think we need an
> api which definitely returns a folio, but it might be of arbitrary
> order.
>
That would be really nice, given that many parts of the kernel could
use such an API to avoid iterating with ZERO_PAGE().
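Something like this, perhaps (just a sketch - the helper name is made up
and the refcounting/lifetime of the huge zero page is glossed over here):

	static struct folio *zero_folio(void)
	{
		struct page *page = mm_get_huge_zero_page(current->mm);

		/* never fail: fall back to the single zero page */
		if (!page)
			page = ZERO_PAGE(0);
		return page_folio(page);
	}

The caller could then just loop with folio_size(), as in your snippet
below.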
>> +
>> iomap_dio_submit_bio(iter, dio, bio, pos);
>
> then this can look something like:
>
> while (len) {
> size_t size = min(len, folio_size(folio));
>
> __bio_add_folio(bio, folio, size, 0);
> len -= size;
> }
>
Ah, this looks good!
>> PS: Enjoy your holidays :)
>
> cheers ;-)