From: Dave Chinner <david@fromorbit.com>
To: "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com>
Cc: chandan.babu@oracle.com, akpm@linux-foundation.org,
brauner@kernel.org, willy@infradead.org, djwong@kernel.org,
linux-kernel@vger.kernel.org, hare@suse.de,
john.g.garry@oracle.com, gost.dev@samsung.com,
yang@os.amperecomputing.com, p.raghav@samsung.com,
cl@os.amperecomputing.com, linux-xfs@vger.kernel.org, hch@lst.de,
mcgrof@kernel.org, linux-mm@kvack.org,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v6 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
Date: Mon, 3 Jun 2024 09:22:34 +1000
Message-ID: <Zlz+upnpESvduk7L@dread.disaster.area>
In-Reply-To: <20240529134509.120826-8-kernel@pankajraghav.com>
On Wed, May 29, 2024 at 03:45:05PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size, this will send the contents of the page
> next to the zero page (as len > PAGE_SIZE) to the underlying block device,
> causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>
> After discussing a bit in LSFMM about this, it was clear that using a
> PMD sized zero folio might not be a good idea[0], especially in platforms
> with 64k base page size, the huge zero folio can be as high as
> 512M just for zeroing small block sizes in the direct IO path.
>
> The idea to use iomap_init to allocate 64k zero buffer was suggested by
> Dave Chinner as it gives decent tradeoff between memory usage and efficiency.
>
> This is a good enough solution for now as moving beyond 64k block size
> in XFS might take a while. We can work on a more generic solution in the
> future to offer different sized zero folio that can go beyond 64k.
>
> [0] https://lore.kernel.org/linux-fsdevel/ZkdcAsENj2mBHh91@casper.infradead.org/
>
> fs/internal.h | 8 ++++++++
> fs/iomap/buffered-io.c | 5 +++++
> fs/iomap/direct-io.c | 9 +++++++--
> 3 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/fs/internal.h b/fs/internal.h
> index 84f371193f74..18eedbb82c50 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -35,6 +35,14 @@ static inline void bdev_cache_init(void)
> int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
> get_block_t *get_block, const struct iomap *iomap);
>
> +/*
> + * iomap/buffered-io.c
> + */
> +
> +#define ZERO_FSB_SIZE (65536)
> +#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
> +extern struct page *zero_fs_block;
This is really iomap direct IO private stuff. It should not be visible
anywhere else...
> +
> /*
> * char_dev.c
> */
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index c5802a459334..2c0149c827cd 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -42,6 +42,7 @@ struct iomap_folio_state {
> };
>
> static struct bio_set iomap_ioend_bioset;
> +struct page *zero_fs_block;
>
> static inline bool ifs_is_fully_uptodate(struct folio *folio,
> struct iomap_folio_state *ifs)
> @@ -1998,6 +1999,10 @@ EXPORT_SYMBOL_GPL(iomap_writepages);
>
> static int __init iomap_init(void)
> {
> + zero_fs_block = alloc_pages(GFP_KERNEL | __GFP_ZERO, ZERO_FSB_ORDER);
> + if (!zero_fs_block)
> + return -ENOMEM;
> +
> return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
> offsetof(struct iomap_ioend, io_bio),
> BIOSET_NEED_BVECS);
Just create an iomap_dio_init() function in iomap/direct-io.c
and call that from here. Then everything can be private to
iomap/direct-io.c...
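
A minimal userspace sketch of that layout, for illustration only and not
the actual kernel change: the zero buffer is a file-scope static, one
init function allocates it, and callers only reach it through a helper
that refuses lengths larger than the buffer. Here calloc() stands in for
alloc_pages(GFP_KERNEL | __GFP_ZERO, ...), and dio_init() and
dio_zero_source() are hypothetical names, not existing iomap functions.

```c
/*
 * Userspace model only: calloc() stands in for the kernel page
 * allocator, and every dio_* name below is hypothetical.
 */
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define ZERO_FSB_SIZE 65536u	/* 64k, as in the quoted patch */

/* File-scope static: not visible outside this translation unit. */
static unsigned char *zero_fs_block;

/* Meant to be called once from the subsystem's init path. */
int dio_init(void)
{
	zero_fs_block = calloc(1, ZERO_FSB_SIZE);
	return zero_fs_block ? 0 : -ENOMEM;
}

/*
 * Return a pointer to at least len zero bytes, or NULL if the buffer
 * is not set up or len exceeds it, so a caller can never read past
 * the zeroed region.
 */
const unsigned char *dio_zero_source(size_t len)
{
	if (!zero_fs_block || len > ZERO_FSB_SIZE)
		return NULL;
	return zero_fs_block;
}
```

With this shape nothing needs to be declared in a shared header: the
generic init code only has to call the one init function, and the
length check covers the len > PAGE_SIZE case described in the commit
message.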
-Dave.
--
Dave Chinner
david@fromorbit.com