linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 00/12] btrfs reads through iomap
@ 2024-10-04 20:04 Goldwyn Rodrigues
  2024-10-04 20:04 ` [PATCH 01/12] iomap: check if folio size is equal to FS block size Goldwyn Rodrigues
                   ` (12 more replies)
  0 siblings, 13 replies; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

These patches implement btrfs buffered reads using the iomap code.
The final goal here is to hand all folio handling over to iomap. This is
the first installment, to be followed by writes, writeback, and
eventually subpage support.

The crux of the changes is:
 * Since btrfs uses its own bio handling structure, the btrfs bioset is
passed to the iomap code in order to create a bio with the folios.
 * For compressed extents:
   - IOMAP_ENCODED, which behaves like IOMAP_MAPPED except that the
     iomap is supplied back to submit_io() so the filesystem can
     identify whether the extent is encoded.
   - The iomap code needs to keep track of the previous iomap struct
     in order to pass each bio to submit_io() with its corresponding
     iomap.
   - A change in the iomap struct forces creation of a new bio.

I have tested this patchset against a full run of xfstests for btrfs.

The current code does not support subpage access through iomap. Subpage
support will hopefully be incorporated once the write and writeback
paths go through.

Code is available at https://github.com/goldwynr/linux/tree/buffered-iomap

Goldwyn Rodrigues (12):
  iomap: check if folio size is equal to FS block size
  iomap: Introduce iomap_read_folio_ops
  iomap: add bioset in iomap_read_folio_ops for filesystems to use own
    bioset
  iomap: include iomap_read_end_io() in header
  iomap: Introduce IOMAP_ENCODED
  iomap: Introduce read_inline() function hook
  btrfs: btrfs_em_to_iomap() to convert em to iomap
  btrfs: iomap_begin() for buffered reads
  btrfs: define btrfs_iomap_read_folio_ops
  btrfs: define btrfs_iomap_folio_ops
  btrfs: add read_inline for folio operations for read() calls
  btrfs: switch to iomap for buffered reads

 block/fops.c           |   4 +-
 fs/btrfs/bio.c         |   2 +-
 fs/btrfs/bio.h         |   1 +
 fs/btrfs/extent_io.c   | 131 +++++++++++++++++++++++++++++++++++++++++
 fs/erofs/data.c        |   4 +-
 fs/gfs2/aops.c         |   4 +-
 fs/iomap/buffered-io.c | 117 +++++++++++++++++++++++++-----------
 fs/xfs/xfs_aops.c      |   4 +-
 fs/zonefs/file.c       |   4 +-
 include/linux/iomap.h  |  37 +++++++++++-
 10 files changed, 260 insertions(+), 48 deletions(-)

-- 
2.46.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 01/12] iomap: check if folio size is equal to FS block size
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
@ 2024-10-04 20:04 ` Goldwyn Rodrigues
  2024-10-05  2:17   ` Matthew Wilcox
  2024-10-08  9:40   ` Christoph Hellwig
  2024-10-04 20:04 ` [PATCH 02/12] iomap: Introduce iomap_read_folio_ops Goldwyn Rodrigues
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Filesystems such as btrfs use folio->private so that they receive a
callback while releasing folios. Check whether the folio spans more
than one filesystem block before interpreting folio->private as an
iomap_folio_state; if the folio covers a single block, folio->private
may belong to the filesystem.

Hopefully this can be removed once all of the btrfs code has moved to
iomap and btrfs uses iomap's subpage support.
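The guard this patch adds can be modeled in userspace as below: iomap
only trusts folio->private to hold its per-folio state when the folio
spans more than one filesystem block. This is a sketch with
hypothetical names (folio_state_model, blocks_per_folio), not the
kernel code itself.

```c
#include <stddef.h>

/* Model of i_blocks_per_folio(): FS blocks covered by a folio. */
static unsigned int blocks_per_folio(size_t folio_size, unsigned int blkbits)
{
	return (unsigned int)(folio_size >> blkbits);
}

/*
 * Model of the new folio_state() helper: return @private as the iomap
 * per-folio state only when the folio is multi-block; otherwise the
 * filesystem (e.g. btrfs) may own folio->private, so report NULL.
 */
static void *folio_state_model(void *private, size_t folio_size,
			       unsigned int blkbits)
{
	if (blocks_per_folio(folio_size, blkbits) <= 1)
		return NULL;
	return private;
}
```

With a 4K block size (blkbits = 12), a 4K folio yields NULL while a
16K folio yields the stored pointer.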

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap/buffered-io.c | 41 ++++++++++++++++++++++++++++-------------
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86ac..ad656f501f38 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -43,6 +43,21 @@ struct iomap_folio_state {
 
 static struct bio_set iomap_ioend_bioset;
 
+static inline struct iomap_folio_state *folio_state(struct folio *folio)
+{
+	struct inode *inode;
+
+	if (folio->mapping && folio->mapping->host)
+		inode = folio->mapping->host;
+	else
+		return folio->private;
+
+	if (i_blocks_per_folio(inode, folio) <= 1)
+		return NULL;
+
+	return folio->private;
+}
+
 static inline bool ifs_is_fully_uptodate(struct folio *folio,
 		struct iomap_folio_state *ifs)
 {
@@ -72,7 +87,7 @@ static bool ifs_set_range_uptodate(struct folio *folio,
 static void iomap_set_range_uptodate(struct folio *folio, size_t off,
 		size_t len)
 {
-	struct iomap_folio_state *ifs = folio->private;
+	struct iomap_folio_state *ifs = folio_state(folio);
 	unsigned long flags;
 	bool uptodate = true;
 
@@ -123,7 +138,7 @@ static unsigned ifs_find_dirty_range(struct folio *folio,
 static unsigned iomap_find_dirty_range(struct folio *folio, u64 *range_start,
 		u64 range_end)
 {
-	struct iomap_folio_state *ifs = folio->private;
+	struct iomap_folio_state *ifs = folio_state(folio);
 
 	if (*range_start >= range_end)
 		return 0;
@@ -150,7 +165,7 @@ static void ifs_clear_range_dirty(struct folio *folio,
 
 static void iomap_clear_range_dirty(struct folio *folio, size_t off, size_t len)
 {
-	struct iomap_folio_state *ifs = folio->private;
+	struct iomap_folio_state *ifs = folio_state(folio);
 
 	if (ifs)
 		ifs_clear_range_dirty(folio, ifs, off, len);
@@ -173,7 +188,7 @@ static void ifs_set_range_dirty(struct folio *folio,
 
 static void iomap_set_range_dirty(struct folio *folio, size_t off, size_t len)
 {
-	struct iomap_folio_state *ifs = folio->private;
+	struct iomap_folio_state *ifs = folio_state(folio);
 
 	if (ifs)
 		ifs_set_range_dirty(folio, ifs, off, len);
@@ -182,7 +197,7 @@ static void iomap_set_range_dirty(struct folio *folio, size_t off, size_t len)
 static struct iomap_folio_state *ifs_alloc(struct inode *inode,
 		struct folio *folio, unsigned int flags)
 {
-	struct iomap_folio_state *ifs = folio->private;
+	struct iomap_folio_state *ifs = folio_state(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 	gfp_t gfp;
 
@@ -234,7 +249,7 @@ static void ifs_free(struct folio *folio)
 static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 		loff_t *pos, loff_t length, size_t *offp, size_t *lenp)
 {
-	struct iomap_folio_state *ifs = folio->private;
+	struct iomap_folio_state *ifs = folio_state(folio);
 	loff_t orig_pos = *pos;
 	loff_t isize = i_size_read(inode);
 	unsigned block_bits = inode->i_blkbits;
@@ -292,7 +307,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 static void iomap_finish_folio_read(struct folio *folio, size_t off,
 		size_t len, int error)
 {
-	struct iomap_folio_state *ifs = folio->private;
+	struct iomap_folio_state *ifs = folio_state(folio);
 	bool uptodate = !error;
 	bool finished = true;
 
@@ -568,7 +583,7 @@ EXPORT_SYMBOL_GPL(iomap_readahead);
  */
 bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 {
-	struct iomap_folio_state *ifs = folio->private;
+	struct iomap_folio_state *ifs = folio_state(folio);
 	struct inode *inode = folio->mapping->host;
 	unsigned first, last, i;
 
@@ -1062,7 +1077,7 @@ static int iomap_write_delalloc_ifs_punch(struct inode *inode,
 	 * but not dirty. In that case it is necessary to punch
 	 * out such blocks to avoid leaking any delalloc blocks.
 	 */
-	ifs = folio->private;
+	ifs = folio_state(folio);
 	if (!ifs)
 		return ret;
 
@@ -1522,7 +1537,7 @@ EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
 static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
 		size_t len)
 {
-	struct iomap_folio_state *ifs = folio->private;
+	struct iomap_folio_state *ifs = folio_state(folio);
 
 	WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !ifs);
 	WARN_ON_ONCE(ifs && atomic_read(&ifs->write_bytes_pending) <= 0);
@@ -1768,7 +1783,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct folio *folio,
 		struct inode *inode, loff_t pos, unsigned len)
 {
-	struct iomap_folio_state *ifs = folio->private;
+	struct iomap_folio_state *ifs = folio_state(folio);
 	size_t poff = offset_in_folio(folio, pos);
 	int error;
 
@@ -1807,7 +1822,7 @@ static int iomap_writepage_map_blocks(struct iomap_writepage_ctx *wpc,
 
 		map_len = min_t(u64, dirty_len,
 			wpc->iomap.offset + wpc->iomap.length - pos);
-		WARN_ON_ONCE(!folio->private && map_len < dirty_len);
+		WARN_ON_ONCE(!folio_state(folio) && map_len < dirty_len);
 
 		switch (wpc->iomap.type) {
 		case IOMAP_INLINE:
@@ -1902,7 +1917,7 @@ static bool iomap_writepage_handle_eof(struct folio *folio, struct inode *inode,
 static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct folio *folio)
 {
-	struct iomap_folio_state *ifs = folio->private;
+	struct iomap_folio_state *ifs = folio_state(folio);
 	struct inode *inode = folio->mapping->host;
 	u64 pos = folio_pos(folio);
 	u64 end_pos = pos + folio_size(folio);
-- 
2.46.1



* [PATCH 02/12] iomap: Introduce iomap_read_folio_ops
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
  2024-10-04 20:04 ` [PATCH 01/12] iomap: check if folio size is equal to FS block size Goldwyn Rodrigues
@ 2024-10-04 20:04 ` Goldwyn Rodrigues
  2024-10-05  2:20   ` Matthew Wilcox
  2024-10-04 20:04 ` [PATCH 03/12] iomap: add bioset in iomap_read_folio_ops for filesystems to use own bioset Goldwyn Rodrigues
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

iomap_read_folio_ops provides additional functions to allocate or
submit the bio. Filesystems such as btrfs perform additional operations
on bios, such as verifying data checksums. Creating a bio submission
hook allows the filesystem to process and verify the bio.
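The dispatch this patch adds at each submission site can be sketched in
userspace as follows: call the filesystem's submit_io hook when one is
installed, otherwise fall back to plain submission. All names here
(submit_read_bio, fs_submit_io, submitter_for) are illustrative stand-ins,
not kernel symbols.

```c
#include <stddef.h>

struct bio { int submitted_by; };	/* 0 = generic path, 1 = fs hook */
struct inode;

struct iomap_read_folio_ops {
	void (*submit_io)(struct inode *inode, struct bio *bio);
};

static void submit_bio_generic(struct bio *bio) { bio->submitted_by = 0; }

/* The dispatch pattern added at each bio submission site. */
static void submit_read_bio(const struct iomap_read_folio_ops *ops,
			    struct inode *inode, struct bio *bio)
{
	if (ops && ops->submit_io)
		ops->submit_io(inode, bio);
	else
		submit_bio_generic(bio);
}

/* Example hook a filesystem might install, e.g. to verify checksums. */
static void fs_submit_io(struct inode *inode, struct bio *bio)
{
	(void)inode;
	bio->submitted_by = 1;
}

/* Test helper: report which path submitted the bio for the given ops. */
static int submitter_for(const struct iomap_read_folio_ops *ops)
{
	struct bio bio = { -1 };

	submit_read_bio(ops, NULL, &bio);
	return bio.submitted_by;
}

static const struct iomap_read_folio_ops hook_ops = {
	.submit_io = fs_submit_io,
};
```

Passing NULL ops (as the converted callers in block, erofs, gfs2, xfs
and zonefs do) keeps the existing generic behavior.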

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 block/fops.c           |  4 ++--
 fs/erofs/data.c        |  4 ++--
 fs/gfs2/aops.c         |  4 ++--
 fs/iomap/buffered-io.c | 31 ++++++++++++++++++++++++-------
 fs/xfs/xfs_aops.c      |  4 ++--
 fs/zonefs/file.c       |  4 ++--
 include/linux/iomap.h  | 14 ++++++++++++--
 7 files changed, 46 insertions(+), 19 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 9825c1713a49..41cbd8e5db11 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -483,12 +483,12 @@ const struct address_space_operations def_blk_aops = {
 #else /* CONFIG_BUFFER_HEAD */
 static int blkdev_read_folio(struct file *file, struct folio *folio)
 {
-	return iomap_read_folio(folio, &blkdev_iomap_ops);
+	return iomap_read_folio(folio, &blkdev_iomap_ops, NULL);
 }
 
 static void blkdev_readahead(struct readahead_control *rac)
 {
-	iomap_readahead(rac, &blkdev_iomap_ops);
+	iomap_readahead(rac, &blkdev_iomap_ops, NULL);
 }
 
 static int blkdev_map_blocks(struct iomap_writepage_ctx *wpc,
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 1b7eba38ba1e..335ce8be5894 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -347,12 +347,12 @@ int erofs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
  */
 static int erofs_read_folio(struct file *file, struct folio *folio)
 {
-	return iomap_read_folio(folio, &erofs_iomap_ops);
+	return iomap_read_folio(folio, &erofs_iomap_ops, NULL);
 }
 
 static void erofs_readahead(struct readahead_control *rac)
 {
-	return iomap_readahead(rac, &erofs_iomap_ops);
+	return iomap_readahead(rac, &erofs_iomap_ops, NULL);
 }
 
 static sector_t erofs_bmap(struct address_space *mapping, sector_t block)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 10d5acd3f742..3cc66b640cb9 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -451,7 +451,7 @@ static int gfs2_read_folio(struct file *file, struct folio *folio)
 
 	if (!gfs2_is_jdata(ip) ||
 	    (i_blocksize(inode) == PAGE_SIZE && !folio_buffers(folio))) {
-		error = iomap_read_folio(folio, &gfs2_iomap_ops);
+		error = iomap_read_folio(folio, &gfs2_iomap_ops, NULL);
 	} else if (gfs2_is_stuffed(ip)) {
 		error = stuffed_read_folio(ip, folio);
 	} else {
@@ -526,7 +526,7 @@ static void gfs2_readahead(struct readahead_control *rac)
 	else if (gfs2_is_jdata(ip))
 		mpage_readahead(rac, gfs2_block_map);
 	else
-		iomap_readahead(rac, &gfs2_iomap_ops);
+		iomap_readahead(rac, &gfs2_iomap_ops, NULL);
 }
 
 /**
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ad656f501f38..71370bbe466c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -341,6 +341,7 @@ struct iomap_readpage_ctx {
 	bool			cur_folio_in_bio;
 	struct bio		*bio;
 	struct readahead_control *rac;
+	const struct iomap_read_folio_ops *ops;
 };
 
 /**
@@ -424,8 +425,12 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		gfp_t orig_gfp = gfp;
 		unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
 
-		if (ctx->bio)
-			submit_bio(ctx->bio);
+		if (ctx->bio) {
+			if (ctx->ops && ctx->ops->submit_io)
+				ctx->ops->submit_io(iter->inode, ctx->bio);
+			else
+				submit_bio(ctx->bio);
+		}
 
 		if (ctx->rac) /* same as readahead_gfp_mask */
 			gfp |= __GFP_NORETRY | __GFP_NOWARN;
@@ -475,7 +480,8 @@ static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
 	return done;
 }
 
-int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
+int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops,
+		const struct iomap_read_folio_ops *read_folio_ops)
 {
 	struct iomap_iter iter = {
 		.inode		= folio->mapping->host,
@@ -484,6 +490,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
 	};
 	struct iomap_readpage_ctx ctx = {
 		.cur_folio	= folio,
+		.ops		= read_folio_ops,
 	};
 	int ret;
 
@@ -493,7 +500,10 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
 		iter.processed = iomap_read_folio_iter(&iter, &ctx);
 
 	if (ctx.bio) {
-		submit_bio(ctx.bio);
+		if (ctx.ops->submit_io)
+			ctx.ops->submit_io(iter.inode, ctx.bio);
+		else
+			submit_bio(ctx.bio);
 		WARN_ON_ONCE(!ctx.cur_folio_in_bio);
 	} else {
 		WARN_ON_ONCE(ctx.cur_folio_in_bio);
@@ -538,6 +548,7 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
  * iomap_readahead - Attempt to read pages from a file.
  * @rac: Describes the pages to be read.
  * @ops: The operations vector for the filesystem.
+ * @read_folio_ops: Function hooks for filesystems for special bio submissions
  *
  * This function is for filesystems to call to implement their readahead
  * address_space operation.
@@ -549,7 +560,8 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
  * function is called with memalloc_nofs set, so allocations will not cause
  * the filesystem to be reentered.
  */
-void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
+void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops,
+		const struct iomap_read_folio_ops *read_folio_ops)
 {
 	struct iomap_iter iter = {
 		.inode	= rac->mapping->host,
@@ -558,6 +570,7 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
 	};
 	struct iomap_readpage_ctx ctx = {
 		.rac	= rac,
+		.ops	= read_folio_ops,
 	};
 
 	trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
@@ -565,8 +578,12 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
 	while (iomap_iter(&iter, ops) > 0)
 		iter.processed = iomap_readahead_iter(&iter, &ctx);
 
-	if (ctx.bio)
-		submit_bio(ctx.bio);
+	if (ctx.bio) {
+		if (ctx.ops && ctx.ops->submit_io)
+			ctx.ops->submit_io(iter.inode, ctx.bio);
+		else
+			submit_bio(ctx.bio);
+	}
 	if (ctx.cur_folio) {
 		if (!ctx.cur_folio_in_bio)
 			folio_unlock(ctx.cur_folio);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 6dead20338e2..9a3221d738bd 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -517,14 +517,14 @@ xfs_vm_read_folio(
 	struct file		*unused,
 	struct folio		*folio)
 {
-	return iomap_read_folio(folio, &xfs_read_iomap_ops);
+	return iomap_read_folio(folio, &xfs_read_iomap_ops, NULL);
 }
 
 STATIC void
 xfs_vm_readahead(
 	struct readahead_control	*rac)
 {
-	iomap_readahead(rac, &xfs_read_iomap_ops);
+	iomap_readahead(rac, &xfs_read_iomap_ops, NULL);
 }
 
 static int
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 3b103715acc9..bf8db1ddd761 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -112,12 +112,12 @@ static const struct iomap_ops zonefs_write_iomap_ops = {
 
 static int zonefs_read_folio(struct file *unused, struct folio *folio)
 {
-	return iomap_read_folio(folio, &zonefs_read_iomap_ops);
+	return iomap_read_folio(folio, &zonefs_read_iomap_ops, NULL);
 }
 
 static void zonefs_readahead(struct readahead_control *rac)
 {
-	iomap_readahead(rac, &zonefs_read_iomap_ops);
+	iomap_readahead(rac, &zonefs_read_iomap_ops, NULL);
 }
 
 /*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d..5b775213920e 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -256,14 +256,24 @@ static inline const struct iomap *iomap_iter_srcmap(const struct iomap_iter *i)
 	return &i->iomap;
 }
 
+struct iomap_read_folio_ops {
+	/*
+	 * Optional, allows the filesystem to perform a custom submission of
+	 * bio, such as csum calculations or multi-device bio split
+	 */
+	void (*submit_io)(struct inode *inode, struct bio *bio);
+};
+
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
 int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 		struct iomap *iomap, loff_t pos, loff_t length, ssize_t written,
 		int (*punch)(struct inode *inode, loff_t pos, loff_t length));
 
-int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
-void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
+int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops,
+		const struct iomap_read_folio_ops *);
+void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops,
+		const struct iomap_read_folio_ops *);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
-- 
2.46.1



* [PATCH 03/12] iomap: add bioset in iomap_read_folio_ops for filesystems to use own bioset
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
  2024-10-04 20:04 ` [PATCH 01/12] iomap: check if folio size is equal to FS block size Goldwyn Rodrigues
  2024-10-04 20:04 ` [PATCH 02/12] iomap: Introduce iomap_read_folio_ops Goldwyn Rodrigues
@ 2024-10-04 20:04 ` Goldwyn Rodrigues
  2024-10-08  9:45   ` Christoph Hellwig
  2024-10-04 20:04 ` [PATCH 04/12] iomap: include iomap_read_end_io() in header Goldwyn Rodrigues
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Allocate the bio from the bioset provided in iomap_read_folio_ops.
If no bioset is provided, fs_bio_set, the standard bioset for
filesystems, is used.
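The selection logic can be modeled like this: prefer the bio_set the
filesystem supplied in its read_folio_ops, else fall back to the
generic one. This is a userspace sketch; bio_set, fs_bio_set and
pick_bioset are simplified stand-ins for the kernel objects.

```c
#include <stddef.h>

struct bio_set { const char *name; };

/* Stand-in for the kernel's generic fs_bio_set. */
static struct bio_set fs_bio_set = { "fs_bio_set" };

struct iomap_read_folio_ops {
	struct bio_set *bio_set;	/* optional, fs-provided */
};

/* The fallback chain the patch adds before bio_alloc_bioset(). */
static struct bio_set *pick_bioset(const struct iomap_read_folio_ops *ops)
{
	if (ops && ops->bio_set)
		return ops->bio_set;
	return &fs_bio_set;
}

/* Example: a filesystem registering its own bioset, as btrfs does. */
static struct bio_set btrfs_like_bioset = { "btrfs_bioset" };
static const struct iomap_read_folio_ops btrfs_like_ops = {
	.bio_set = &btrfs_like_bioset,
};
```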

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap/buffered-io.c | 20 ++++++++++++++------
 include/linux/iomap.h  |  6 ++++++
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 71370bbe466c..d007b4a8307c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -421,6 +421,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 	if (!ctx->bio ||
 	    bio_end_sector(ctx->bio) != sector ||
 	    !bio_add_folio(ctx->bio, folio, plen, poff)) {
+		struct bio_set *bioset;
 		gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
 		gfp_t orig_gfp = gfp;
 		unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
@@ -434,17 +435,24 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 
 		if (ctx->rac) /* same as readahead_gfp_mask */
 			gfp |= __GFP_NORETRY | __GFP_NOWARN;
-		ctx->bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs),
-				     REQ_OP_READ, gfp);
+
+		if (ctx->ops && ctx->ops->bio_set)
+			bioset = ctx->ops->bio_set;
+		else
+			bioset = &fs_bio_set;
+
+		ctx->bio = bio_alloc_bioset(iomap->bdev, bio_max_segs(nr_vecs),
+				REQ_OP_READ, gfp, bioset);
+
 		/*
 		 * If the bio_alloc fails, try it again for a single page to
 		 * avoid having to deal with partial page reads.  This emulates
 		 * what do_mpage_read_folio does.
 		 */
-		if (!ctx->bio) {
-			ctx->bio = bio_alloc(iomap->bdev, 1, REQ_OP_READ,
-					     orig_gfp);
-		}
+		if (!ctx->bio)
+			ctx->bio = bio_alloc_bioset(iomap->bdev, 1, REQ_OP_READ,
+					orig_gfp, bioset);
+
 		if (ctx->rac)
 			ctx->bio->bi_opf |= REQ_RAHEAD;
 		ctx->bio->bi_iter.bi_sector = sector;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 5b775213920e..f876d16353c6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -262,6 +262,12 @@ struct iomap_read_folio_ops {
 	 * bio, such as csum calculations or multi-device bio split
 	 */
 	void (*submit_io)(struct inode *inode, struct bio *bio);
+
+	/*
+	 * Optional, allows filesystem to specify own bio_set, so new bio's
+	 * can be allocated from the provided bio_set.
+	 */
+	struct bio_set *bio_set;
 };
 
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
-- 
2.46.1



* [PATCH 04/12] iomap: include iomap_read_end_io() in header
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
                   ` (2 preceding siblings ...)
  2024-10-04 20:04 ` [PATCH 03/12] iomap: add bioset in iomap_read_folio_ops for filesystems to use own bioset Goldwyn Rodrigues
@ 2024-10-04 20:04 ` Goldwyn Rodrigues
  2024-10-07 17:02   ` Darrick J. Wong
  2024-10-04 20:04 ` [PATCH 05/12] iomap: Introduce IOMAP_ENCODED Goldwyn Rodrigues
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

iomap_read_end_io() will be used by btrfs after it has completed the
reads, to hand control back to iomap to finish reads on all folios.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap/buffered-io.c | 2 +-
 include/linux/iomap.h  | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d007b4a8307c..0e682ff84e4a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -326,7 +326,7 @@ static void iomap_finish_folio_read(struct folio *folio, size_t off,
 		folio_end_read(folio, uptodate);
 }
 
-static void iomap_read_end_io(struct bio *bio)
+void iomap_read_end_io(struct bio *bio)
 {
 	int error = blk_status_to_errno(bio->bi_status);
 	struct folio_iter fi;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f876d16353c6..7b757bea8455 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -280,6 +280,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops,
 		const struct iomap_read_folio_ops *);
 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops,
 		const struct iomap_read_folio_ops *);
+void iomap_read_end_io(struct bio *bio);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
-- 
2.46.1



* [PATCH 05/12] iomap: Introduce IOMAP_ENCODED
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
                   ` (3 preceding siblings ...)
  2024-10-04 20:04 ` [PATCH 04/12] iomap: include iomap_read_end_io() in header Goldwyn Rodrigues
@ 2024-10-04 20:04 ` Goldwyn Rodrigues
  2024-10-08  9:48   ` Christoph Hellwig
  2024-10-04 20:04 ` [PATCH 06/12] iomap: Introduce read_inline() function hook Goldwyn Rodrigues
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

An encoded extent must be read completely. Build the bio just as a
regular bio and let the filesystem deal with the rest of the extent.
A new bio must be created whenever a new iomap is returned.
The filesystem must be informed that the bio being read is encoded,
and the offset at which the encoded extent starts; so pass the iomap
associated with the bio when calling submit_io(). Save the previous
iomap (associated with the bio being submitted) in prev in order to
submit it when the iomap changes.
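The two rules this patch introduces can be sketched in userspace: for
an IOMAP_ENCODED extent, iomap_sector() yields the extent start
(iomap->addr) regardless of the position being read, and the current
bio must be submitted whenever the encoded iomap differs from the one
the bio was built against. The struct below is a simplified stand-in,
and the model uses an equality test on type rather than the patch's
bitwise check.

```c
#include <stdint.h>
#include <stdbool.h>

#define SECTOR_SHIFT	9
#define IOMAP_MAPPED	2
#define IOMAP_ENCODED	5

struct iomap {
	uint64_t addr;		/* disk byte address of the extent */
	int64_t  offset;	/* file offset the extent maps */
	unsigned type;
};

/* Model of iomap_sector(): encoded extents always read from the start. */
static uint64_t iomap_sector(const struct iomap *iomap, int64_t pos)
{
	if (iomap->type == IOMAP_ENCODED)
		return iomap->addr;
	return (iomap->addr + (uint64_t)(pos - iomap->offset)) >> SECTOR_SHIFT;
}

/* True when the current bio must be submitted and a fresh one started. */
static bool need_new_bio(const struct iomap *cur, const struct iomap *prev)
{
	return cur->type == IOMAP_ENCODED && cur->offset != prev->offset;
}
```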

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap/buffered-io.c | 17 ++++++++++-------
 include/linux/iomap.h  | 11 +++++++++--
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0e682ff84e4a..4c734899a8e5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -378,12 +378,13 @@ static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 
-	return srcmap->type != IOMAP_MAPPED ||
+	return (srcmap->type != IOMAP_MAPPED &&
+			srcmap->type != IOMAP_ENCODED) ||
 		(srcmap->flags & IOMAP_F_NEW) ||
 		pos >= i_size_read(iter->inode);
 }
 
-static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
+static loff_t iomap_readpage_iter(struct iomap_iter *iter,
 		struct iomap_readpage_ctx *ctx, loff_t offset)
 {
 	const struct iomap *iomap = &iter->iomap;
@@ -419,6 +420,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 
 	sector = iomap_sector(iomap, pos);
 	if (!ctx->bio ||
+	    (iomap->type & IOMAP_ENCODED && iomap->offset != iter->prev.offset) ||
 	    bio_end_sector(ctx->bio) != sector ||
 	    !bio_add_folio(ctx->bio, folio, plen, poff)) {
 		struct bio_set *bioset;
@@ -428,10 +430,11 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 
 		if (ctx->bio) {
 			if (ctx->ops && ctx->ops->submit_io)
-				ctx->ops->submit_io(iter->inode, ctx->bio);
+				ctx->ops->submit_io(iter->inode, ctx->bio, &iter->prev);
 			else
 				submit_bio(ctx->bio);
 		}
+		iter->prev = iter->iomap;
 
 		if (ctx->rac) /* same as readahead_gfp_mask */
 			gfp |= __GFP_NORETRY | __GFP_NOWARN;
@@ -470,7 +473,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 	return pos - orig_pos + plen;
 }
 
-static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
+static loff_t iomap_read_folio_iter(struct iomap_iter *iter,
 		struct iomap_readpage_ctx *ctx)
 {
 	struct folio *folio = ctx->cur_folio;
@@ -509,7 +512,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops,
 
 	if (ctx.bio) {
 		if (ctx.ops->submit_io)
-			ctx.ops->submit_io(iter.inode, ctx.bio);
+			ctx.ops->submit_io(iter.inode, ctx.bio, &iter.prev);
 		else
 			submit_bio(ctx.bio);
 		WARN_ON_ONCE(!ctx.cur_folio_in_bio);
@@ -527,7 +530,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops,
 }
 EXPORT_SYMBOL_GPL(iomap_read_folio);
 
-static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
+static loff_t iomap_readahead_iter(struct iomap_iter *iter,
 		struct iomap_readpage_ctx *ctx)
 {
 	loff_t length = iomap_length(iter);
@@ -588,7 +591,7 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops,
 
 	if (ctx.bio) {
 		if (ctx.ops && ctx.ops->submit_io)
-			ctx.ops->submit_io(iter.inode, ctx.bio);
+			ctx.ops->submit_io(iter.inode, ctx.bio, &iter.prev);
 		else
 			submit_bio(ctx.bio);
 	}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7b757bea8455..a5cf00a01f23 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -30,6 +30,7 @@ struct vm_fault;
 #define IOMAP_MAPPED	2	/* blocks allocated at @addr */
 #define IOMAP_UNWRITTEN	3	/* blocks allocated at @addr in unwritten state */
 #define IOMAP_INLINE	4	/* data inline in the inode */
+#define IOMAP_ENCODED	5	/* data encoded, R/W whole extent */
 
 /*
  * Flags reported by the file system from iomap_begin:
@@ -107,6 +108,8 @@ struct iomap {
 
 static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)
 {
+	if (iomap->type & IOMAP_ENCODED)
+		return iomap->addr;
 	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
 }
 
@@ -217,9 +220,13 @@ struct iomap_iter {
 	loff_t pos;
 	u64 len;
 	s64 processed;
+	unsigned type;
 	unsigned flags;
 	struct iomap iomap;
-	struct iomap srcmap;
+	union {
+		struct iomap srcmap;
+		struct iomap prev;
+	};
 	void *private;
 };
 
@@ -261,7 +268,7 @@ struct iomap_read_folio_ops {
 	 * Optional, allows the filesystem to perform a custom submission of
 	 * bio, such as csum calculations or multi-device bio split
 	 */
-	void (*submit_io)(struct inode *inode, struct bio *bio);
+	void (*submit_io)(struct inode *inode, struct bio *bio, const struct iomap *iomap);
 
 	/*
 	 * Optional, allows filesystem to specify own bio_set, so new bio's
-- 
2.46.1



* [PATCH 06/12] iomap: Introduce read_inline() function hook
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
                   ` (4 preceding siblings ...)
  2024-10-04 20:04 ` [PATCH 05/12] iomap: Introduce IOMAP_ENCODED Goldwyn Rodrigues
@ 2024-10-04 20:04 ` Goldwyn Rodrigues
  2024-10-05  2:30   ` Matthew Wilcox
  2024-10-04 20:04 ` [PATCH 07/12] btrfs: btrfs_em_to_iomap() to convert em to iomap Goldwyn Rodrigues
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Introduce a read_inline() function hook for reading inline extents.
This is needed for filesystems such as btrfs, which may compress the
data in inline extents.

This is added in struct iomap_folio_ops, since the folio is available
at this point.
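The fallback this patch sets up can be modeled in userspace as below:
call the filesystem's read_inline hook when present (btrfs would
decompress into the folio), otherwise copy inline_data straight into
the folio as iomap does today. Byte buffers stand in for folios, and
all names here (inline_iomap, fake_decompress, inline_read_yields) are
illustrative.

```c
#include <string.h>
#include <stddef.h>

struct inline_iomap {
	const char *inline_data;
	size_t size;
	/* optional fs hook, e.g. decompression */
	int (*read_inline)(const struct inline_iomap *im, char *folio);
};

/* Model of the dispatch added to iomap_read_inline_data(). */
static int read_inline_data(const struct inline_iomap *im, char *folio)
{
	if (im->read_inline)
		return im->read_inline(im, folio);	/* fs-specific path */
	memcpy(folio, im->inline_data, im->size);	/* plain copy path */
	return 0;
}

/* Example hook: "decompression" modeled as upper-casing ASCII letters. */
static int fake_decompress(const struct inline_iomap *im, char *folio)
{
	for (size_t i = 0; i < im->size; i++)
		folio[i] = (char)(im->inline_data[i] & ~0x20);
	return 0;
}

/* Test helper: 1 if reading @im into a fresh folio yields @expect. */
static int inline_read_yields(const struct inline_iomap *im,
			      const char *expect)
{
	char folio[64] = { 0 };

	if (read_inline_data(im, folio))
		return 0;
	return memcmp(folio, expect, im->size) == 0;
}
```

Only when the hook (or the copy) succeeds does the caller mark the
folio range uptodate, matching the ret handling in the diff below.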

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap/buffered-io.c | 12 +++++++++---
 include/linux/iomap.h  |  7 +++++++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4c734899a8e5..ef805730125a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -359,6 +359,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	const struct iomap *iomap = iomap_iter_srcmap(iter);
 	size_t size = i_size_read(iter->inode) - iomap->offset;
 	size_t offset = offset_in_folio(folio, iomap->offset);
+	int ret = 0;
 
 	if (folio_test_uptodate(folio))
 		return 0;
@@ -368,9 +369,14 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (offset > 0)
 		ifs_alloc(iter->inode, folio, iter->flags);
 
-	folio_fill_tail(folio, offset, iomap->inline_data, size);
-	iomap_set_range_uptodate(folio, offset, folio_size(folio) - offset);
-	return 0;
+	if (iomap->folio_ops && iomap->folio_ops->read_inline)
+		ret = iomap->folio_ops->read_inline(iomap, folio);
+	else
+		folio_fill_tail(folio, offset, iomap->inline_data, size);
+
+	if (!ret)
+		iomap_set_range_uptodate(folio, offset, folio_size(folio) - offset);
+	return ret;
 }
 
 static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index a5cf00a01f23..82dabc0369cd 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -163,6 +163,13 @@ struct iomap_folio_ops {
 	 * locked by the iomap code.
 	 */
 	bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
+
+	/*
+	 * Custom read_inline function for filesystem such as btrfs
+	 * that may store data in compressed form.
+	 */
+
+	int (*read_inline)(const struct iomap *iomap, struct folio *folio);
 };
 
 /*
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 07/12] btrfs: btrfs_em_to_iomap() to convert em to iomap
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
                   ` (5 preceding siblings ...)
  2024-10-04 20:04 ` [PATCH 06/12] iomap: Introduce read_inline() function hook Goldwyn Rodrigues
@ 2024-10-04 20:04 ` Goldwyn Rodrigues
  2024-10-04 20:04 ` [PATCH 08/12] btrfs: iomap_begin() for buffered reads Goldwyn Rodrigues
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs_em_to_iomap() converts an extent map into the struct iomap passed
as an argument. It makes sure the information is expressed in multiples
of the sectorsize.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/extent_io.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cb0a39370d30..7f40c2b8bfb8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -14,6 +14,7 @@
 #include <linux/pagevec.h>
 #include <linux/prefetch.h>
 #include <linux/fsverity.h>
+#include <linux/iomap.h>
 #include "extent_io.h"
 #include "extent-io-tree.h"
 #include "extent_map.h"
@@ -900,6 +901,35 @@ void clear_folio_extent_mapped(struct folio *folio)
 	folio_detach_private(folio);
 }
 
+static void btrfs_em_to_iomap(struct inode *inode,
+		struct extent_map *em, struct iomap *iomap,
+		loff_t sector_pos, bool write)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+	if (!btrfs_is_data_reloc_root(BTRFS_I(inode)->root) &&
+	    em->flags & EXTENT_FLAG_PINNED) {
+		iomap->type = IOMAP_UNWRITTEN;
+		iomap->addr = extent_map_block_start(em);
+	} else if (em->disk_bytenr == EXTENT_MAP_INLINE) {
+		iomap->addr = IOMAP_NULL_ADDR;
+		iomap->type = IOMAP_INLINE;
+	} else if (em->disk_bytenr == EXTENT_MAP_HOLE ||
+			(!write && (em->flags & EXTENT_FLAG_PREALLOC))) {
+		iomap->addr = IOMAP_NULL_ADDR;
+		iomap->type = IOMAP_HOLE;
+	} else if (extent_map_is_compressed(em)) {
+		iomap->type = IOMAP_ENCODED;
+		iomap->addr = em->disk_bytenr;
+	} else {
+		iomap->addr = extent_map_block_start(em);
+		iomap->type = IOMAP_MAPPED;
+	}
+	iomap->offset = em->start;
+	iomap->bdev = fs_info->fs_devices->latest_dev->bdev;
+	iomap->length = em->len;
+}
+
 static struct extent_map *__get_extent_map(struct inode *inode,
 					   struct folio *folio, u64 start,
 					   u64 len, struct extent_map **em_cached)
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 08/12] btrfs: iomap_begin() for buffered reads
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
                   ` (6 preceding siblings ...)
  2024-10-04 20:04 ` [PATCH 07/12] btrfs: btrfs_em_to_iomap() to convert em to iomap Goldwyn Rodrigues
@ 2024-10-04 20:04 ` Goldwyn Rodrigues
  2024-10-04 20:04 ` [PATCH 09/12] btrfs: define btrfs_iomap_read_folio_ops Goldwyn Rodrigues
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs_read_iomap_begin() fetches the extent at the file position passed
in and converts the resulting extent map to an iomap. The iomap code
uses the sector offset from the iomap structure to create the bios.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/extent_io.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7f40c2b8bfb8..6ef2fa802c30 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -962,6 +962,34 @@ static struct extent_map *__get_extent_map(struct inode *inode,
 
 	return em;
 }
+
+static int btrfs_read_iomap_begin(struct inode *inode, loff_t pos,
+		loff_t length, unsigned int flags, struct iomap *iomap,
+		struct iomap *srcmap)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct extent_map *em;
+	struct extent_state *cached_state = NULL;
+	u64 start = round_down(pos, fs_info->sectorsize);
+	u64 end = round_up(pos + length, fs_info->sectorsize) - 1;
+
+	btrfs_lock_and_flush_ordered_range(BTRFS_I(inode), start, end, &cached_state);
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, start, end - start + 1);
+	unlock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
+	if (IS_ERR(em))
+		return PTR_ERR(em);
+
+	btrfs_em_to_iomap(inode, em, iomap, start, false);
+	free_extent_map(em);
+
+	return 0;
+}
+
+static const struct iomap_ops btrfs_buffered_read_iomap_ops = {
+	.iomap_begin = btrfs_read_iomap_begin,
+};
+
+
 /*
  * basic readpage implementation.  Locked extent state structs are inserted
  * into the tree that are removed when the IO is done (by the end_io
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 09/12] btrfs: define btrfs_iomap_read_folio_ops
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
                   ` (7 preceding siblings ...)
  2024-10-04 20:04 ` [PATCH 08/12] btrfs: iomap_begin() for buffered reads Goldwyn Rodrigues
@ 2024-10-04 20:04 ` Goldwyn Rodrigues
  2024-10-04 20:04 ` [PATCH 10/12] btrfs: define btrfs_iomap_folio_ops Goldwyn Rodrigues
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Use btrfs_bioset so that iomap allocates a btrfs_bio for the bios to be
submitted for btrfs.

Set the file_offset of the bbio from the first folio in the bio.

For compressed/encoded reads, call btrfs_submit_compressed_read();
otherwise, call btrfs_submit_bbio().

After the read is complete, call iomap_read_end_io() to finish the reads
on the folios.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/bio.c       |  2 +-
 fs/btrfs/bio.h       |  1 +
 fs/btrfs/extent_io.c | 29 +++++++++++++++++++++++++++++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 056f8a171bba..9d448235b8bd 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -15,7 +15,7 @@
 #include "file-item.h"
 #include "raid-stripe-tree.h"
 
-static struct bio_set btrfs_bioset;
+struct bio_set btrfs_bioset;
 static struct bio_set btrfs_clone_bioset;
 static struct bio_set btrfs_repair_bioset;
 static mempool_t btrfs_failed_bio_pool;
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index e48612340745..687a8361202a 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -15,6 +15,7 @@
 struct btrfs_bio;
 struct btrfs_fs_info;
 struct btrfs_inode;
+extern struct bio_set btrfs_bioset;
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6ef2fa802c30..43418b6d4824 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -985,10 +985,39 @@ static int btrfs_read_iomap_begin(struct inode *inode, loff_t pos,
 	return 0;
 }
 
+static void btrfs_read_endio(struct btrfs_bio *bbio)
+{
+       iomap_read_end_io(&bbio->bio);
+}
+
+static void btrfs_read_submit_io(struct inode *inode, struct bio *bio,
+				 const struct iomap *iomap)
+{
+	struct btrfs_bio *bbio = btrfs_bio(bio);
+	struct folio_iter fi;
+
+	btrfs_bio_init(bbio, btrfs_sb(inode->i_sb), btrfs_read_endio, NULL);
+	bbio->inode = BTRFS_I(inode);
+
+	bio_first_folio(&fi, bio, 0);
+	bbio->file_offset = folio_pos(fi.folio);
+
+	if (iomap->type & IOMAP_ENCODED) {
+		bbio->bio.bi_iter.bi_sector = iomap->addr >> SECTOR_SHIFT;
+		btrfs_submit_compressed_read(bbio);
+	} else {
+		btrfs_submit_bbio(bbio, 0);
+	}
+}
+
 static const struct iomap_ops btrfs_buffered_read_iomap_ops = {
 	.iomap_begin = btrfs_read_iomap_begin,
 };
 
+static const struct iomap_read_folio_ops btrfs_iomap_read_folio_ops = {
+	.submit_io      = btrfs_read_submit_io,
+	.bio_set        = &btrfs_bioset,
+};
 
 /*
  * basic readpage implementation.  Locked extent state structs are inserted
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 10/12] btrfs: define btrfs_iomap_folio_ops
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
                   ` (8 preceding siblings ...)
  2024-10-04 20:04 ` [PATCH 09/12] btrfs: define btrfs_iomap_read_folio_ops Goldwyn Rodrigues
@ 2024-10-04 20:04 ` Goldwyn Rodrigues
  2024-10-04 20:04 ` [PATCH 11/12] btrfs: add read_inline for folio operations for read() calls Goldwyn Rodrigues
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

The put_folio() hook uses set_folio_extent_mapped() to set
folio->private to EXTENT_PAGE_PRIVATE if it is not already set.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/extent_io.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 43418b6d4824..ee0d37388441 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -901,6 +901,18 @@ void clear_folio_extent_mapped(struct folio *folio)
 	folio_detach_private(folio);
 }
 
+static void btrfs_put_folio(struct inode *inode, loff_t pos,
+		unsigned copied, struct folio *folio)
+{
+	set_folio_extent_mapped(folio);
+	folio_unlock(folio);
+	folio_put(folio);
+}
+
+static const struct iomap_folio_ops btrfs_iomap_folio_ops = {
+	.put_folio = btrfs_put_folio,
+};
+
 static void btrfs_em_to_iomap(struct inode *inode,
 		struct extent_map *em, struct iomap *iomap,
 		loff_t sector_pos, bool write)
@@ -928,6 +940,7 @@ static void btrfs_em_to_iomap(struct inode *inode,
 	iomap->offset = em->start;
 	iomap->bdev = fs_info->fs_devices->latest_dev->bdev;
 	iomap->length = em->len;
+	iomap->folio_ops =  &btrfs_iomap_folio_ops;
 }
 
 static struct extent_map *__get_extent_map(struct inode *inode,
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 11/12] btrfs: add read_inline for folio operations for read() calls
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
                   ` (9 preceding siblings ...)
  2024-10-04 20:04 ` [PATCH 10/12] btrfs: define btrfs_iomap_folio_ops Goldwyn Rodrigues
@ 2024-10-04 20:04 ` Goldwyn Rodrigues
  2024-10-04 20:04 ` [PATCH 12/12] btrfs: switch to iomap for buffered reads Goldwyn Rodrigues
  2024-10-08  9:39 ` [PATCH 00/12] btrfs reads through iomap Christoph Hellwig
  12 siblings, 0 replies; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Use the iomap read_inline() hook to fill data into the folio passed in.
Simply call btrfs_get_extent() again, because this time the
read_inline() function has the folio available.

Comment:
Another way to do this would be to pass the btrfs_path all the way down
to the read_inline() function. That was getting too complicated with
respect to cleanup, but it would reduce the number of operations needed
to fetch an inline extent.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/extent_io.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ee0d37388441..01408bc5b04e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -909,8 +909,26 @@ static void btrfs_put_folio(struct inode *inode, loff_t pos,
 	folio_put(folio);
 }
 
+static int btrfs_iomap_read_inline(const struct iomap *iomap, struct folio *folio)
+{
+	int ret = 0;
+	struct inode *inode = folio->mapping->host;
+	struct extent_map *em;
+	struct extent_state *cached_state = NULL;
+
+	lock_extent(&BTRFS_I(inode)->io_tree, 0, folio_size(folio) - 1, &cached_state);
+	em = btrfs_get_extent(BTRFS_I(inode), folio, 0, folio_size(folio));
+	unlock_extent(&BTRFS_I(inode)->io_tree, 0, folio_size(folio) - 1, &cached_state);
+	if (IS_ERR(em))
+		ret = PTR_ERR(em);
+	free_extent_map(em);
+	return ret;
+}
+
+
 static const struct iomap_folio_ops btrfs_iomap_folio_ops = {
 	.put_folio = btrfs_put_folio,
+	.read_inline = btrfs_iomap_read_inline,
 };
 
 static void btrfs_em_to_iomap(struct inode *inode,
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 12/12] btrfs: switch to iomap for buffered reads
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
                   ` (10 preceding siblings ...)
  2024-10-04 20:04 ` [PATCH 11/12] btrfs: add read_inline for folio operations for read() calls Goldwyn Rodrigues
@ 2024-10-04 20:04 ` Goldwyn Rodrigues
  2024-10-08  9:39 ` [PATCH 00/12] btrfs reads through iomap Christoph Hellwig
  12 siblings, 0 replies; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-04 20:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

For buffered reads, call iomap_readahead() and iomap_read_folio().

This is limited to non-subpage calls.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/extent_io.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 01408bc5b04e..cfe771cebb36 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1205,9 +1205,15 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 int btrfs_read_folio(struct file *file, struct folio *folio)
 {
 	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
+	struct inode *inode = folio->mapping->host;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em_cached = NULL;
 	int ret;
 
+	if (!btrfs_is_subpage(fs_info, inode->i_mapping))
+		return iomap_read_folio(folio, &btrfs_buffered_read_iomap_ops,
+				&btrfs_iomap_read_folio_ops);
+
 	ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
 	free_extent_map(em_cached);
 
@@ -2449,10 +2455,17 @@ int btrfs_writepages(struct address_space *mapping, struct writeback_control *wb
 void btrfs_readahead(struct readahead_control *rac)
 {
 	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
+	struct inode *inode = rac->mapping->host;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct folio *folio;
 	struct extent_map *em_cached = NULL;
 	u64 prev_em_start = (u64)-1;
 
+	if (!btrfs_is_subpage(fs_info, rac->mapping)) {
+		iomap_readahead(rac, &btrfs_buffered_read_iomap_ops, &btrfs_iomap_read_folio_ops);
+		return;
+	}
+
 	while ((folio = readahead_folio(rac)) != NULL)
 		btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
 
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 01/12] iomap: check if folio size is equal to FS block size
  2024-10-04 20:04 ` [PATCH 01/12] iomap: check if folio size is equal to FS block size Goldwyn Rodrigues
@ 2024-10-05  2:17   ` Matthew Wilcox
  2024-10-07 16:57     ` Darrick J. Wong
  2024-10-08  9:40   ` Christoph Hellwig
  1 sibling, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2024-10-05  2:17 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-kernel, linux-fsdevel, Goldwyn Rodrigues

On Fri, Oct 04, 2024 at 04:04:28PM -0400, Goldwyn Rodrigues wrote:
> Filesystems such as BTRFS use folio->private so that they receive a
> callback while releasing folios. Add check if folio size is same as
> filesystem block size while evaluating iomap_folio_state from
> folio->private.
> 
> I am hoping this will be removed when all of btrfs code has moved to
> iomap and BTRFS uses iomap's subpage.

This seems like a terrible explanation for why you need this patch.

As I understand it, what you're really doing is saying that iomap only
uses folio->private for block size < folio size.  So if you add this
hack, iomap won't look at folio->private for block size == folio size
and that means that btrfs can continue to use it.
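For illustration, the hack being described presumably amounts to
something like the following sketch (helper and field names are
illustrative, not taken from the actual patch; kernel context, not
standalone code):

```c
/* Hypothetical sketch of the patch 01 check under discussion: only
 * treat folio->private as an iomap_folio_state when the filesystem
 * block size is smaller than the folio, so that for block size ==
 * folio size btrfs can keep using folio->private for itself. */
static struct iomap_folio_state *ifs_from_folio(struct folio *folio)
{
	if (folio_size(folio) == i_blocksize(folio->mapping->host))
		return NULL;	/* btrfs may own folio->private here */
	return folio->private;
}
```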

I don't think this is a good way to start the conversion.  I appreciate
that it's a long, complex procedure, and you can't do the whole
conversion in a single patchset.

Also, please stop calling this "subpage".  That's btrfs terminology,
it's confusing as hell, and it should be deleted from your brain.

But I don't understand why you need it at all.  btrfs doesn't attach
private data to folios unless block size < page size.  Which is precisely
the case that you're not using.  So it seems like you could just drop
this patch and everything would still work.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/12] iomap: Introduce iomap_read_folio_ops
  2024-10-04 20:04 ` [PATCH 02/12] iomap: Introduce iomap_read_folio_ops Goldwyn Rodrigues
@ 2024-10-05  2:20   ` Matthew Wilcox
  2024-10-07 17:01     ` Darrick J. Wong
  2024-10-08  9:43     ` Christoph Hellwig
  0 siblings, 2 replies; 36+ messages in thread
From: Matthew Wilcox @ 2024-10-05  2:20 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-kernel, linux-fsdevel, Goldwyn Rodrigues

On Fri, Oct 04, 2024 at 04:04:29PM -0400, Goldwyn Rodrigues wrote:
> iomap_read_folio_ops provide additional functions to allocate or submit
> the bio. Filesystems such as btrfs have additional operations with bios
> such as verifying data checksums. Creating a bio submission hook allows
> the filesystem to process and verify the bio.

But surely you're going to need something similar for writeback too?
So why go to all this trouble to add a new kind of ops instead of making
it part of iomap_ops or iomap_folio_ops?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/12] iomap: Introduce read_inline() function hook
  2024-10-04 20:04 ` [PATCH 06/12] iomap: Introduce read_inline() function hook Goldwyn Rodrigues
@ 2024-10-05  2:30   ` Matthew Wilcox
  2024-10-07 17:47     ` Darrick J. Wong
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2024-10-05  2:30 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-kernel, linux-fsdevel, Goldwyn Rodrigues

On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote:
> Introduce read_inline() function hook for reading inline extents. This
> is performed for filesystems such as btrfs which may compress the data
> in the inline extents.

This feels like an attempt to work around "iomap doesn't support
compressed extents" by keeping the decompression in the filesystem,
instead of extending iomap to support compressed extents itself.
I'd certainly prefer iomap to support compressed extents, but maybe I'm
in a minority here.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 01/12] iomap: check if folio size is equal to FS block size
  2024-10-05  2:17   ` Matthew Wilcox
@ 2024-10-07 16:57     ` Darrick J. Wong
  2024-10-10 17:59       ` Goldwyn Rodrigues
  0 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-10-07 16:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Goldwyn Rodrigues, linux-kernel, linux-fsdevel, Goldwyn Rodrigues

On Sat, Oct 05, 2024 at 03:17:47AM +0100, Matthew Wilcox wrote:
> On Fri, Oct 04, 2024 at 04:04:28PM -0400, Goldwyn Rodrigues wrote:
> > Filesystems such as BTRFS use folio->private so that they receive a
> > callback while releasing folios. Add check if folio size is same as
> > filesystem block size while evaluating iomap_folio_state from
> > folio->private.
> > 
> > I am hoping this will be removed when all of btrfs code has moved to
> > iomap and BTRFS uses iomap's subpage.
> 
> This seems like a terrible explanation for why you need this patch.
> 
> As I understand it, what you're really doing is saying that iomap only
> uses folio->private for block size < folio size.  So if you add this
> hack, iomap won't look at folio->private for block size == folio size
> and that means that btrfs can continue to use it.
> 
> I don't think this is a good way to start the conversion.  I appreciate
> that it's a long, complex procedure, and you can't do the whole
> conversion in a single patchset.
> 
> Also, please stop calling this "subpage".  That's btrfs terminology,
> it's confusing as hell, and it should be deleted from your brain.

I've long wondered if 'subpage' is shorthand for 'subpage blocksize'?
If so then the term makes sense to me as a fs developer, but I can also
see how it might not make sense to anyone from the mm side of things.

Wait, is a btrfs sector the same as what ext4/xfs call a fs block?

> But I don't understand why you need it at all.  btrfs doesn't attach
> private data to folios unless block size < page size.  Which is precisely
> the case that you're not using.  So it seems like you could just drop
> this patch and everything would still work.

I was also wondering this.  Given that the end of struct btrfs_subpage
is an uptodate/dirty/ordered bitmap, maybe iomap_folio_ops should grow a
method to allocate a struct iomap_folio_state object, and then you could
embed one in the btrfs subpage object and provide that custom allocation
function?

(and yes that makes for an ugly mess of pointer math crud to have two
VLAs inside struct btrfs_subpage, so this might be too ugly to live in
practice)
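A rough shape of that suggestion, purely as a sketch (hook name and
struct layout are hypothetical; kernel context, not standalone code):

```c
/* Hypothetical: let the filesystem allocate the iomap_folio_state so
 * it can embed it inside its own per-folio structure. */
struct iomap_folio_ops {
	/* ... existing hooks ... */
	struct iomap_folio_state *(*alloc_folio_state)(struct inode *inode,
						       struct folio *folio,
						       gfp_t gfp);
};

struct btrfs_subpage {
	/* ... existing btrfs fields ... */
	struct iomap_folio_state ifs;	/* must come last: it ends in a
					 * flexible bitmap array, which
					 * collides with btrfs_subpage's
					 * own trailing bitmap (the "two
					 * VLAs" problem noted above) */
};
```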

--D

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/12] iomap: Introduce iomap_read_folio_ops
  2024-10-05  2:20   ` Matthew Wilcox
@ 2024-10-07 17:01     ` Darrick J. Wong
  2024-10-08  9:43     ` Christoph Hellwig
  1 sibling, 0 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-10-07 17:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Goldwyn Rodrigues, linux-kernel, linux-fsdevel, Goldwyn Rodrigues

On Sat, Oct 05, 2024 at 03:20:06AM +0100, Matthew Wilcox wrote:
> On Fri, Oct 04, 2024 at 04:04:29PM -0400, Goldwyn Rodrigues wrote:
> > iomap_read_folio_ops provide additional functions to allocate or submit
> > the bio. Filesystems such as btrfs have additional operations with bios
> > such as verifying data checksums. Creating a bio submission hook allows
> > the filesystem to process and verify the bio.
> 
> But surely you're going to need something similar for writeback too?
> So why go to all this trouble to add a new kind of ops instead of making
> it part of iomap_ops or iomap_folio_ops?

iomap_folio_ops, and maybe it's time to rename it iomap_pagecache_ops.

I almost wonder if we should have this instead:

struct iomap_pagecache_ops {
	struct iomap_ops ops;

	/* folio management */
	struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
			unsigned len);
	void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
			struct folio *folio);

	/* mapping revalidation */
	bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);

	/* writeback */
	int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
			  loff_t offset, unsigned len);

	int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
	void (*discard_folio)(struct folio *folio, loff_t pos);
};

and then we change the buffered-io.c functions to take a (const struct
iomap_pagecache_ops*) instead of iomap_ops+iomap_folio_ops, or
iomap_ops+iomap_writeback_ops.

Same embedding suggestion for iomap_dio_ops.

--D

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/12] iomap: include iomap_read_end_io() in header
  2024-10-04 20:04 ` [PATCH 04/12] iomap: include iomap_read_end_io() in header Goldwyn Rodrigues
@ 2024-10-07 17:02   ` Darrick J. Wong
  2024-10-10 18:12     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-10-07 17:02 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-kernel, linux-fsdevel, Goldwyn Rodrigues

On Fri, Oct 04, 2024 at 04:04:31PM -0400, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> iomap_read_end_io() will be used by BTRFS after it has completed the
> reads, to hand control back to iomap to finish the reads on all folios.

That probably needs EXPORT_SYMBOL_GPL if btrfs is going to use it,
right?

--D

> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/iomap/buffered-io.c | 2 +-
>  include/linux/iomap.h  | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d007b4a8307c..0e682ff84e4a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -326,7 +326,7 @@ static void iomap_finish_folio_read(struct folio *folio, size_t off,
>  		folio_end_read(folio, uptodate);
>  }
>  
> -static void iomap_read_end_io(struct bio *bio)
> +void iomap_read_end_io(struct bio *bio)
>  {
>  	int error = blk_status_to_errno(bio->bi_status);
>  	struct folio_iter fi;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index f876d16353c6..7b757bea8455 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -280,6 +280,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops,
>  		const struct iomap_read_folio_ops *);
>  void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops,
>  		const struct iomap_read_folio_ops *);
> +void iomap_read_end_io(struct bio *bio);
>  bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
> -- 
> 2.46.1
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/12] iomap: Introduce read_inline() function hook
  2024-10-05  2:30   ` Matthew Wilcox
@ 2024-10-07 17:47     ` Darrick J. Wong
  2024-10-10 18:10       ` Goldwyn Rodrigues
  0 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-10-07 17:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Goldwyn Rodrigues, linux-kernel, linux-fsdevel, Goldwyn Rodrigues

On Sat, Oct 05, 2024 at 03:30:53AM +0100, Matthew Wilcox wrote:
> On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote:
> > Introduce read_inline() function hook for reading inline extents. This
> > is performed for filesystems such as btrfs which may compress the data
> > in the inline extents.

Why don't you set iomap->inline_data to the uncompressed buffer, let
readahead copy it to the pagecache, and free it in ->iomap_end?
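That alternative could look roughly like the following sketch
(illustrative only; function bodies are elided and names assumed,
kernel context rather than standalone code):

```c
/* Hypothetical: decompress the inline extent into a temporary buffer
 * in ->iomap_begin, let iomap copy it into the pagecache via
 * iomap->inline_data, and free the buffer in ->iomap_end. */
static int btrfs_read_iomap_begin_inline(struct btrfs_fs_info *fs_info,
					 struct extent_map *em,
					 struct iomap *iomap)
{
	void *buf = kmalloc(fs_info->sectorsize, GFP_NOFS);

	if (!buf)
		return -ENOMEM;
	/* decompress the inline extent into buf here */
	iomap->type = IOMAP_INLINE;
	iomap->inline_data = buf;
	iomap->private = buf;	/* remember it for ->iomap_end */
	return 0;
}

static int btrfs_read_iomap_end(struct inode *inode, loff_t pos,
				loff_t length, ssize_t written,
				unsigned flags, struct iomap *iomap)
{
	kfree(iomap->private);
	return 0;
}
```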

> This feels like an attempt to work around "iomap doesn't support
> compressed extents" by keeping the decompression in the filesystem,
> instead of extending iomap to support compressed extents itself.
> I'd certainly prefer iomap to support compressed extents, but maybe I'm
> in a minority here.

I'm not an expert on fs compression, but I get the impression that most
filesystems handle reads by allocating some folios, reading off the disk
into those folios, and decompressing into the pagecache folios.  It
might be kind of amusing to try to hoist that into the vfs/iomap at some
point, but I think the next problem you'd run into is that fscrypt has
similar requirements, since it's also a data transformation step.
fsverity I think is less complicated because it only needs to read the
pagecache contents at the very end to check it against the merkle tree.

That, I think, is why this btrfs iomap port want to override submit_bio,
right?  So that it can allocate a separate set of folios, create a
second bio around that, submit the second bio, decode what it read off
the disk into the folios of the first bio, then "complete" the first bio
so that iomap only has to update the pagecache state and doesn't have to
know about the encoding magic?

And then, having established that beachhead, porting btrfs fscrypt is
a simple matter of adding more transformation steps to the ioend
processing of the second bio (aka the one that actually calls the disk),
right?  And I think all that processing stuff is more or less already in
place for the existing read path, so it should be trivial (ha!) to
call it in an iomap context instead of straight from btrfs.
iomap_folio_state notwithstanding, of course.

Hmm.  I'll have to give some thought to what would the ideal iomap data
transformation pipeline look like?

Though I already have a sneaking suspicion that will morph into "If I
wanted to add {crypt,verity,compression} to xfs how would I do that?" ->
"How would I design a pipeline to handle all three to avoid bouncing
pages between workqueue threads like ext4 does?" -> "Oh gosh now I have
a totally different design than any of the existing implementations." ->
"Well, crumbs. :("

I'll start that by asking: Hey btrfs developers, what do you like and
hate about the current way that btrfs handles fscrypt, compression, and
fsverity?  Assuming that you can set all three on a file, right?

--D

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/12] btrfs reads through iomap
  2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
                   ` (11 preceding siblings ...)
  2024-10-04 20:04 ` [PATCH 12/12] btrfs: switch to iomap for buffered reads Goldwyn Rodrigues
@ 2024-10-08  9:39 ` Christoph Hellwig
  12 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-10-08  9:39 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-kernel, linux-fsdevel, Goldwyn Rodrigues

Please also add the linux-xfs list per the MAINTAINERS file,
I only noticed this now because I'm a little behind on my fsdevel
mailbox.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 01/12] iomap: check if folio size is equal to FS block size
  2024-10-04 20:04 ` [PATCH 01/12] iomap: check if folio size is equal to FS block size Goldwyn Rodrigues
  2024-10-05  2:17   ` Matthew Wilcox
@ 2024-10-08  9:40   ` Christoph Hellwig
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-10-08  9:40 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-kernel, linux-fsdevel, Goldwyn Rodrigues

On Fri, Oct 04, 2024 at 04:04:28PM -0400, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Filesystems such as BTRFS use folio->private so that they receive a
> callback while releasing folios. Add check if folio size is same as
> filesystem block size while evaluating iomap_folio_state from
> folio->private.
> 
> I am hoping this will be removed when all of btrfs code has moved to
> iomap and BTRFS uses iomap's subpage.

iomap owns folio->private, so please make sure this isn't
needed before moving out of RFC state.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/12] iomap: Introduce iomap_read_folio_ops
  2024-10-05  2:20   ` Matthew Wilcox
  2024-10-07 17:01     ` Darrick J. Wong
@ 2024-10-08  9:43     ` Christoph Hellwig
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-10-08  9:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Goldwyn Rodrigues, linux-kernel, linux-fsdevel, Goldwyn Rodrigues

On Sat, Oct 05, 2024 at 03:20:06AM +0100, Matthew Wilcox wrote:
> On Fri, Oct 04, 2024 at 04:04:29PM -0400, Goldwyn Rodrigues wrote:
> > iomap_read_folio_ops provide additional functions to allocate or submit
> > the bio. Filesystems such as btrfs have additional operations with bios
> > such as verifying data checksums. Creating a bio submission hook allows
> > the filesystem to process and verify the bio.
> 
> But surely you're going to need something similar for writeback too?
> So why go to all this trouble to add a new kind of ops instead of making
> it part of iomap_ops or iomap_folio_ops?

We really should not add anything to iomap_ops.  That's just the
iteration that should not even know about pages.  In fact I hope
to eventually get back and replace it with a single iterator that
could use direct calls for the fast path as per your RFC from years
ago.

iomap_folio_ops is entirely specific to the buffered write path.

I'm honestly not sure what the point is of merging structures specific
to buffered read, buffered write (and suggested later in the thread
buffered writeback) when they have very little overlap.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/12] iomap: add bioset in iomap_read_folio_ops for filesystems to use own bioset
  2024-10-04 20:04 ` [PATCH 03/12] iomap: add bioset in iomap_read_folio_ops for filesystems to use own bioset Goldwyn Rodrigues
@ 2024-10-08  9:45   ` Christoph Hellwig
  2024-10-10 17:51     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2024-10-08  9:45 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-kernel, linux-fsdevel, Goldwyn Rodrigues

>  	    !bio_add_folio(ctx->bio, folio, plen, poff)) {
> +		struct bio_set *bioset;
>  		gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
>  		gfp_t orig_gfp = gfp;
>  		unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);

Nit: I try to keep variables just declared and not initialized after
those initialized at declaration time.

> +
> +		if (ctx->ops && ctx->ops->bio_set)
> +			bioset = ctx->ops->bio_set;
> +		else
> +			bioset = &fs_bio_set;
> +
> +		ctx->bio = bio_alloc_bioset(iomap->bdev, bio_max_segs(nr_vecs),
> +				REQ_OP_READ, gfp, bioset);
> +

But it would be nice to move this logic into a helper, similar to what
is done in the direct I/O code.  That should probably include
picking the gfp flags from the ctx.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 05/12] iomap: Introduce IOMAP_ENCODED
  2024-10-04 20:04 ` [PATCH 05/12] iomap: Introduce IOMAP_ENCODED Goldwyn Rodrigues
@ 2024-10-08  9:48   ` Christoph Hellwig
  2024-10-10  9:42     ` Christoph Hellwig
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2024-10-08  9:48 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-kernel, linux-fsdevel, Goldwyn Rodrigues

On Fri, Oct 04, 2024 at 04:04:32PM -0400, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> An encoded extent must be read completely. Build the bio just like a
> regular bio and let the filesystem deal with the rest of the extent.
> A new bio must be created if a new iomap is returned.
> The filesystem must be informed that the bio to be read is
> encoded and the offset from which the encoded extent starts. So, pass
> the iomap associated with the bio while calling submit_io. Save the
> previous iomap (associated with the bio being submitted) in prev in
> order to submit when the iomap changes.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/iomap/buffered-io.c | 17 ++++++++++-------
>  include/linux/iomap.h  | 11 +++++++++--
>  2 files changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 0e682ff84e4a..4c734899a8e5 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -378,12 +378,13 @@ static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
>  {
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
>  
> -	return srcmap->type != IOMAP_MAPPED ||
> +	return (srcmap->type != IOMAP_MAPPED &&
> +			srcmap->type != IOMAP_ENCODED) ||
>  		(srcmap->flags & IOMAP_F_NEW) ||
>  		pos >= i_size_read(iter->inode);

Non-standard indentation for the new line.  But at this point the
condition is becoming complex enough that it is best split into
multiple if statements anyway.


>  	sector = iomap_sector(iomap, pos);
>  	if (!ctx->bio ||
> +	    (iomap->type & IOMAP_ENCODED && iomap->offset != iter->prev.offset) ||

Overly long line.  And this could really use a comment as well.

> -static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
> +static loff_t iomap_readahead_iter(struct iomap_iter *iter,
>  		struct iomap_readpage_ctx *ctx)

Why is the iter de-constified?

In general I'm not a huge fan of the encoded magic here, but I'll
need to take a closer look at the caller if I can come up with
something better.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 05/12] iomap: Introduce IOMAP_ENCODED
  2024-10-08  9:48   ` Christoph Hellwig
@ 2024-10-10  9:42     ` Christoph Hellwig
  2024-10-10 17:50       ` Goldwyn Rodrigues
  2024-10-15  3:43       ` Ritesh Harjani
  0 siblings, 2 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-10-10  9:42 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-kernel, linux-fsdevel, Goldwyn Rodrigues, Ritesh Harjani

On Tue, Oct 08, 2024 at 02:48:43AM -0700, Christoph Hellwig wrote:
> In general I'm not a huge fan of the encoded magic here, but I'll
> need to take a closer look at the caller if I can come up with
> something better.

I looked a bit more at the code.  I'm not entirely sure I fully
understand it yet, but:

I think most of the read side special casing would be handled by
always submitting the bio at the end of an iomap.  Ritesh was
looking into that for supporting ext2-like file systems that
read indirect block ondemand, but I think it actually is fundamentally
the right thing to do anyway.

For the write we plan to add a new IOMAP_BOUNDARY flag to prevent
merges as part of the XFS RT group series, and I think this should
also solve your problems with keeping one bio per iomap?  The
current patch is here:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=91b5d7a52dab63732aee451bba0db315ae9bd09b


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 05/12] iomap: Introduce IOMAP_ENCODED
  2024-10-10  9:42     ` Christoph Hellwig
@ 2024-10-10 17:50       ` Goldwyn Rodrigues
  2024-10-15  3:43       ` Ritesh Harjani
  1 sibling, 0 replies; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-10 17:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, Ritesh Harjani

On  2:42 10/10, Christoph Hellwig wrote:
> On Tue, Oct 08, 2024 at 02:48:43AM -0700, Christoph Hellwig wrote:
> > In general I'm not a huge fan of the encoded magic here, but I'll
> > need to take a closer look at the caller if I can come up with
> > something better.
> 
> I looked a bit more at the code.  I'm not entirely sure I fully
> understand it yet, but:
> 
> I think most of the read side special casing would be handled by
> always submitting the bio at the end of an iomap.  Ritesh was
> looking into that for supporting ext2-like file systems that
> read indirect block ondemand, but I think it actually is fundamentally
> the right thing to do anyway.
> 
> For the write we plan to add a new IOMAP_BOUNDARY flag to prevent
> merges as part of the XFS RT group series, and I think this should
> also solve your problems with keeping one bio per iomap?  The
> current patch is here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=91b5d7a52dab63732aee451bba0db315ae9bd09b
> 

Yes, this is helpful but it will solve only one of the three
requirements.

First, compressed and uncompressed extents must be dealt with
differently because of the additional step of decompressing the bio
corresponding to the extent. So iomap needs to inform btrfs that the
bio submitted (through the newly created function submit_io()) is
compressed and needs to be read and decompressed.

Second, btrfs sets bi_sector to the start of the extent map, and the
assignment cannot go through iomap_sector(). Since compressed extents
are smaller than regular ones, iomap_sector() would usually overrun
beyond the end of the compressed extent. So, in this patchset, I am
setting this in ->submit_io()


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/12] iomap: add bioset in iomap_read_folio_ops for filesystems to use own bioset
  2024-10-08  9:45   ` Christoph Hellwig
@ 2024-10-10 17:51     ` Goldwyn Rodrigues
  0 siblings, 0 replies; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-10 17:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel

On  2:45 08/10, Christoph Hellwig wrote:
> >  	    !bio_add_folio(ctx->bio, folio, plen, poff)) {
> > +		struct bio_set *bioset;
> >  		gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
> >  		gfp_t orig_gfp = gfp;
> >  		unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
> 
> Nit: I try to keep variables just declared and not initialized after
> those initialized at declaration time.
> 
> > +
> > +		if (ctx->ops && ctx->ops->bio_set)
> > +			bioset = ctx->ops->bio_set;
> > +		else
> > +			bioset = &fs_bio_set;
> > +
> > +		ctx->bio = bio_alloc_bioset(iomap->bdev, bio_max_segs(nr_vecs),
> > +				REQ_OP_READ, gfp, bioset);
> > +
> 
> But it would be nice to move this logic into a helper, similar to what
> is done in the direct I/O code.  That should probably include
> picking the gfp flags from the ctx.
> 

Agree. I will put this in the next version.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 01/12] iomap: check if folio size is equal to FS block size
  2024-10-07 16:57     ` Darrick J. Wong
@ 2024-10-10 17:59       ` Goldwyn Rodrigues
  0 siblings, 0 replies; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-10 17:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Matthew Wilcox, linux-kernel, linux-fsdevel

On  9:57 07/10, Darrick J. Wong wrote:
> On Sat, Oct 05, 2024 at 03:17:47AM +0100, Matthew Wilcox wrote:
> > On Fri, Oct 04, 2024 at 04:04:28PM -0400, Goldwyn Rodrigues wrote:
> > > Filesystems such as BTRFS use folio->private so that they receive a
> > > callback while releasing folios. Add check if folio size is same as
> > > filesystem block size while evaluating iomap_folio_state from
> > > folio->private.
> > > 
> > > I am hoping this will be removed when all of btrfs code has moved to
> > > iomap and BTRFS uses iomap's subpage.
> > 
> > This seems like a terrible explanation for why you need this patch.
> > 
> > As I understand it, what you're really doing is saying that iomap only
> > uses folio->private for block size < folio size.  So if you add this
> > hack, iomap won't look at folio->private for block size == folio size
> > and that means that btrfs can continue to use it.
> > 
> > I don't think this is a good way to start the conversion.  I appreciate
> > that it's a long, complex procedure, and you can't do the whole
> > conversion in a single patchset.
> > 
> > Also, please stop calling this "subpage".  That's btrfs terminology,
> > it's confusing as hell, and it should be deleted from your brain.
> 
> I've long wondered if 'subpage' is shorthand for 'subpage blocksize'?
> If so then the term makes sense to me as a fs developer, but I can also
> see how it might not make sense to anyone from the mm side of things.

Yes, it is subpage blocksize.

> 
> Wait, is a btrfs sector the same as what ext4/xfs call a fs block?

Yup, fs_info->sectorsize.

> 
> > But I don't understand why you need it at all.  btrfs doesn't attach
> > private data to folios unless block size < page size.  Which is precisely
> > the case that you're not using.  So it seems like you could just drop
> > this patch and everything would still work.
> 
> I was also wondering this.  Given that the end of struct btrfs_subpage
> is an uptodate/dirty/ordered bitmap, maybe iomap_folio_ops should grow a
> method to allocate a struct iomap_folio_state object, and then you could
> embed one in the btrfs subpage object and provide that custom allocation
> function?
> 
> (and yes that makes for an ugly mess of pointer math crud to have two
> VLAs inside struct btrfs_subpage, so this might be too ugly to live in
> practice)
> 

btrfs does use folio->private and writes out EXTENT_FOLIO_PRIVATE. This
is not ideal, but btrfs requires it to get a callback from mm before
folios are released; refer to set_folio_extent_mapped(). BTRFS does this
for every folio (on filesystems without a subpage blocksize). Perhaps
there is a better way to do this?

Ideally, after the move to iomap, we should not require btrfs_subpage
structures, and most (if not all) folio handling will be done by
iomap, but that is still a long way off.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/12] iomap: Introduce read_inline() function hook
  2024-10-07 17:47     ` Darrick J. Wong
@ 2024-10-10 18:10       ` Goldwyn Rodrigues
  2024-10-11  0:43         ` Dave Chinner
  0 siblings, 1 reply; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-10 18:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Matthew Wilcox, linux-kernel, linux-fsdevel

On 10:47 07/10, Darrick J. Wong wrote:
> On Sat, Oct 05, 2024 at 03:30:53AM +0100, Matthew Wilcox wrote:
> > On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote:
> > > Introduce read_inline() function hook for reading inline extents. This
> > > is performed for filesystems such as btrfs which may compress the data
> > > in the inline extents.
> 
> Why don't you set iomap->inline_data to the uncompressed buffer, let
> readahead copy it to the pagecache, and free it in ->iomap_end?

This will increase the number of copies. BTRFS uncompresses directly
into pagecache. Yes, this is an option but at the cost of efficiency.

> 
> > This feels like an attempt to work around "iomap doesn't support
> > compressed extents" by keeping the decompression in the filesystem,
> > instead of extending iomap to support compressed extents itself.
> > I'd certainly prefer iomap to support compressed extents, but maybe I'm
> > in a minority here.
> 
> I'm not an expert on fs compression, but I get the impression that most
> filesystems handle reads by allocating some folios, reading off the disk
> into those folios, and decompressing into the pagecache folios.  It
> might be kind of amusing to try to hoist that into the vfs/iomap at some
> point, but I think the next problem you'd run into is that fscrypt has
> similar requirements, since it's also a data transformation step.
> fsverity I think is less complicated because it only needs to read the
> pagecache contents at the very end to check it against the merkle tree.
> 
> That, I think, is why this btrfs iomap port want to override submit_bio,
> right?  So that it can allocate a separate set of folios, create a
> second bio around that, submit the second bio, decode what it read off
> the disk into the folios of the first bio, then "complete" the first bio
> so that iomap only has to update the pagecache state and doesn't have to
> know about the encoding magic?

Yes, but that is not the only reason. BTRFS also calculates and checks
block checksums for data read during bio completions.

> 
> And then, having established that beachhead, porting btrfs fscrypt is
> a simple matter of adding more transformation steps to the ioend
> processing of the second bio (aka the one that actually calls the disk),
> right?  And I think all that processing stuff is more or less already in
> place for the existing read path, so it should be trivial (ha!) to
> call it in an iomap context instead of straight from btrfs.
> iomap_folio_state notwithstanding, of course.
> 
> Hmm.  I'll have to give some thought to what would the ideal iomap data
> transformation pipeline look like?

The order of transformations makes all the difference, and I am not
sure everyone involved can agree that all transformations
should be done in one particular order.

FWIW, checksums are performed on what is read/written on disk. So
for writes, compression happens before checksums.

> 
> Though I already have a sneaking suspicion that will morph into "If I
> wanted to add {crypt,verity,compression} to xfs how would I do that?" ->
> "How would I design a pipeine to handle all three to avoid bouncing
> pages between workqueue threads like ext4 does?" -> "Oh gosh now I have
> a totally different design than any of the existing implementations." ->
> "Well, crumbs. :("
> 
> I'll start that by asking: Hey btrfs developers, what do you like and
> hate about the current way that btrfs handles fscrypt, compression, and
> fsverity?  Assuming that you can set all three on a file, right?

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/12] iomap: include iomap_read_end_io() in header
  2024-10-07 17:02   ` Darrick J. Wong
@ 2024-10-10 18:12     ` Goldwyn Rodrigues
  0 siblings, 0 replies; 36+ messages in thread
From: Goldwyn Rodrigues @ 2024-10-10 18:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-kernel, linux-fsdevel

On 10:02 07/10, Darrick J. Wong wrote:
> On Fri, Oct 04, 2024 at 04:04:31PM -0400, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > iomap_read_end_io() will be used by BTRFS after it has completed the
> > reads to hand control back to iomap to finish reads on all folios.
> 
> That probably needs EXPORT_SYMBOL_GPL if btrfs is going to use it,
> right?

Yes! I got a warning from kernel test robot as well!

Note to self: Stop building all modules in the kernel for testing.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/12] iomap: Introduce read_inline() function hook
  2024-10-10 18:10       ` Goldwyn Rodrigues
@ 2024-10-11  0:43         ` Dave Chinner
  2024-10-11  3:28           ` Gao Xiang
  0 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2024-10-11  0:43 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Darrick J. Wong, Matthew Wilcox, linux-kernel, linux-fsdevel

On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote:
> On 10:47 07/10, Darrick J. Wong wrote:
> > On Sat, Oct 05, 2024 at 03:30:53AM +0100, Matthew Wilcox wrote:
> > > On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote:
> > > > Introduce read_inline() function hook for reading inline extents. This
> > > > is performed for filesystems such as btrfs which may compress the data
> > > > in the inline extents.
> > 
> > Why don't you set iomap->inline_data to the uncompressed buffer, let
> > readahead copy it to the pagecache, and free it in ->iomap_end?
> 
> This will increase the number of copies. BTRFS uncompresses directly
> into pagecache. Yes, this is an option but at the cost of efficiency.

Is that such a problem? The process of decompression means the data
is already hot in cache, so there isn't a huge extra penalty for
moving this inline data a second time...

> > > This feels like an attempt to work around "iomap doesn't support
> > > compressed extents" by keeping the decompression in the filesystem,
> > > instead of extending iomap to support compressed extents itself.
> > > I'd certainly prefer iomap to support compressed extents, but maybe I'm
> > > in a minority here.
> > 
> > I'm not an expert on fs compression, but I get the impression that most
> > filesystems handle reads by allocating some folios, reading off the disk
> > into those folios, and decompressing into the pagecache folios.  It
> > might be kind of amusing to try to hoist that into the vfs/iomap at some
> > point, but I think the next problem you'd run into is that fscrypt has
> > similar requirements, since it's also a data transformation step.
> > fsverity I think is less complicated because it only needs to read the
> > pagecache contents at the very end to check it against the merkle tree.
> > 
> > That, I think, is why this btrfs iomap port want to override submit_bio,
> > right?  So that it can allocate a separate set of folios, create a
> > second bio around that, submit the second bio, decode what it read off
> > the disk into the folios of the first bio, then "complete" the first bio
> > so that iomap only has to update the pagecache state and doesn't have to
> > know about the encoding magic?
> 
> Yes, but that is not the only reason. BTRFS also calculates and checks
> block checksums for data read during bio completions.

This is no different to doing fsverity checks at read BIO io
completion. We should be using the same mechanism in iomap for
fsverity IO completion verification as filesystems do for data
checksum verification because, conceptually speaking, they are the
same operation.

> > And then, having established that beachhead, porting btrfs fscrypt is
> > a simple matter of adding more transformation steps to the ioend
> > processing of the second bio (aka the one that actually calls the disk),
> > right?  And I think all that processing stuff is more or less already in
> > place for the existing read path, so it should be trivial (ha!) to
> > call it in an iomap context instead of straight from btrfs.
> > iomap_folio_state notwithstanding, of course.
> > 
> > Hmm.  I'll have to give some thought to what would the ideal iomap data
> > transformation pipeline look like?
> 
> The order of transformation would make all the difference, and I am not
> sure if everyone involved can come to a conclusion that all
> transformations should be done in a particular decided order.

I think there is only one viable order of data transformations
to/from disk. That's because....

> FWIW, checksums are performed on what is read/written on disk. So
> for writes, compression happens before checksums.

.... there is specific ordering needed.

For writes, the ordering is:

	1. pre-write data compression - requires data copy
	2. pre-write data encryption - requires data copy
	3. pre-write data checksums - data read only
	4. write the data
	5. post-write metadata updates

We cannot usefully perform compression after encryption -
random data doesn't compress - and the checksum must match what is
written to disk, so it has to come after all other transformations
have been done.

For reads, the order is:

	1. read the data
	2. verify the data checksum
	3. decrypt the data - requires data copy
	4. decompress the data - requires data copy
	5. place the plain text data in the page cache

To do 1) we need memory buffers allocated, 2) runs directly out of
them. If no other transformations are required, then we can read the
data directly into the folios in the page cache.

However, if we have to decrypt and/or decompress the data, then
we need multiple sets of bounce buffers - one for the data being
read and one for the decrypted data. The data can be decompressed
directly into the page cache.

If there is no compression, the decryption should be done directly
into the page cache.

If there is neither decryption nor compression, then the read IO should
be done directly into the page cache and the checksum verification
done by reading out of the page cache.

IOWs, each step of the data read path needs a "buffer" that
is a set of folios which may or may not be the final page cache
location.

The other consideration here is we may want to be able to cache
these data transformation layers when we are doing rapid RMW
operations. e.g. on encrypted data, a RMW will need to allocate
two sets of bounce buffers - one for the read, another for the
write. If the encrypted data is also hosted in the page cache (at
some high offset beyond EOF) then we don't need to allocate the
bounce buffer during write....

> > Though I already have a sneaking suspicion that will morph into "If I
> > wanted to add {crypt,verity,compression} to xfs how would I do that?" ->
> > "How would I design a pipeine to handle all three to avoid bouncing
> > pages between workqueue threads like ext4 does?" -> "Oh gosh now I have
> > a totally different design than any of the existing implementations." ->
> > "Well, crumbs. :("

I've actually been thinking about how to do data CRCs, encryption
and compression with XFS through iomap. I've even started writing a
design doc, based on feedback from the first fsverity implementation
attempt.

Andrey and I are essentially working towards a "direct mapped remote
xattr" model for fsverity merkle tree data. fsverity reads and
writes are marshalled through the page cache at a filesystem
determined offset beyond EOF (same model as ext4, et al), but the
data is mapped into remote xattr blocks. This requires fixing the
mistake of putting metadata headers into xattr remote blocks by
moving the CRC to the remote xattr name structure such that remote
xattrs only contain data again.

The read/write ops that are passed to iomap for such IO are fsverity
implementation specific that understand that the high offset is to
be mapped directly to a filesystem block sized remote xattr extent
and the IO is done directly on that block. The remote xattr block
CRC can be retrieved when it is mapped and validated on read IO
completion by the iomap code before passing the data to fsverity.

The reason for using xattrs in this way is that it provides a
generic mechanism for storing multiple sets of optional related
and/or size-transformed data within a file, and within the single
page cache address space the inode provides.

For example, it allows XFS to implement native data checksums. For
this, we need to store 2 x 32bit CRCs per filesystem block. i.e. we
need the old CRC and the new CRC so we can write/log the CRC update
before we write the data, and if we crash before the data hits the
disk we still know what the original CRC was and so can validate
that the filesystem block contains entirely either the old or the
new data when it is next read. i.e. one of the two CRC values should
always be valid.

By using direct mapping into the page cache for CRCs, we don't need
to continually refetch the checksum data. It gets read in once, and
a large read will continue to pull CRC data from the cached folio as
we walk through the data we read from disk.  We can fetch the CRC
data concurrently with the data itself, and IO completion processing
isn't signalled until both have been brought into cache.

For writes, we can calculate the new CRCs sequentially and flush the
cached CRC page back to the xattr when we reach the end of it. The
data write bio that was built as we CRC'd the data can then be
flushed once the xattr data has reached stable storage. There's a
bit more latency to writeback operations here, but nothing is ever
free...

Compression is where using xattrs gets interesting - the xattrs can
have a fixed "offset" they belong to, but can store variable sized
data records for that offset.

If we say we have a 64kB compression block size, we can store the
compressed data for a 64k block entirely in a remote xattr even if
compression fails (i.e. we can store the raw data, not the expanded
"compressed" data). The remote xattr can store any amount of smaller
data, and we map the compressed data directly into the page cache at
a high offset. Then decompression can run on the high offset pages
with the destination being some other page cache offset....

On the write side, compression can be done directly into the high
offset page cache range for that 64kb offset range, then we can
map that to a remote xattr block and write the xattr. The xattr
naturally handles variable size blocks.

If we overwrite the compressed data (e.g. a 4kB RMW cycle in a 64kB
block), then we essentially overwrite the compressed data in the
page cache at writeback, extending the folio set for that record
if it grows. Then on flush of the compressed record, we atomically
replace the remote xattr we already have so we are guaranteed that
we'll see either the old or the new compressed data at that point in
time. The new remote xattr block is CRC'd by the xattr code, so if
we are using compression-only, we don't actually need separate
on-disk data CRC xattrs...

Encryption/decryption doesn't require alternate data storage, so
that just requires reading the encrypted data into a high page cache
offset. However, if we are compressing then encrypting, then after
the encryption step we'd need to store it via the compression
method, not the uncompressed method. Hence there are some
implementation details needed to be worked through here.

-----

Overall, I'm looking an iomap mechanism that stages every part of
the transformation stack in the page cache itself. We have address
space to burn because we don't support file sizes larger than 8EiB.
That leaves a full 8EiB of address space available for hosting
non-user-visible ephemeral IO path data states.

How to break up that address space for different data transformation
uses is an open question. Files with transformed data would be
limited to the size of the address space reserved by iomap for
transformed data, so we want it to be large.

XFS can't use more than 2^32 -extents- for xattrs on an inode at
this point in time. That means >4 billion alternate data records can
be stored, not that 256TB is the maximum offset that can be
supported.

Hence I'm tending towards at least 1PB per alternate address
range, which would limit encrypted and/or compressed files to this
size. That leaves over 8000 alternate address ranges iomap can
support, but I suspect that we'll want fewer, larger ranges in
preference to more, smaller ranges....

> > I'll start that by asking: Hey btrfs developers, what do you like and
> > hate about the current way that btrfs handles fscrypt, compression, and
> > fsverity?  Assuming that you can set all three on a file, right?

I'm definitely assuming that all 3 can be set at once, as well as
low level filesystem data CRCs underneath all the layered "VFS"
stuff.

However, what layers we actually need at any point in time can be
malleable. For XFS, compression would naturally be CRC'd by the
xattr storage so it wouldn't need that layer at all. If fsverity is
enabled and fs admin tooling understands fsverity, then we probably
don't need fs checksums for fsverity files.

so there definitely needs to be some flexibility in the stack to
allow filesystems to elide transformation layers that are
redundant...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/12] iomap: Introduce read_inline() function hook
  2024-10-11  0:43         ` Dave Chinner
@ 2024-10-11  3:28           ` Gao Xiang
  2024-10-11  4:52             ` Dave Chinner
  0 siblings, 1 reply; 36+ messages in thread
From: Gao Xiang @ 2024-10-11  3:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Matthew Wilcox, linux-kernel, linux-fsdevel,
	Goldwyn Rodrigues

Hi Dave,

On 2024/10/11 08:43, Dave Chinner wrote:
> On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote:

...

> 
> .... there is specific ordering needed.
> 
> For writes, the ordering is:
> 
> 	1. pre-write data compression - requires data copy
> 	2. pre-write data encryption - requires data copy
> 	3. pre-write data checksums - data read only
> 	4. write the data
> 	5. post-write metadata updates
> 
> We cannot usefully perform compression after encryption -
> random data doesn't compress - and the checksum must match what is
> written to disk, so it has to come after all other transformations
> have been done.
> 
> For reads, the order is:
> 
> 	1. read the data
> 	2. verify the data checksum
> 	3. decrypt the data - requires data copy
> 	4. decompress the data - requires data copy
> 	5. place the plain text data in the page cache

Just some random notes for reference: currently fsverity builds the
merkle tree over the plain text, but from the perspective of on-disk
data security/authentication and integrity, the order you mentioned
sounds saner to me, or:

	1. read the data
	2. pre-verify the encoded data checksum (optional,
	   in the dm-verity style)
	3. decrypt the data - requires data copy
	4. decompress the data - requires data copy
	5. post-verify the decoded (plain text) checksum (optional)
	6. place the plain text data in the page cache

Steps 2 and 5 may apply to different use cases, though.
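The read-side ordering above can be sketched in userspace C with trivial stand-in transforms (the XOR "cipher" and the rolling checksum are illustrative only, not any real filesystem's algorithms; decompression is elided):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins: a rolling checksum and an XOR "cipher".
 * A real filesystem would plug in its own crypto/compression. */
static uint32_t checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + buf[i];
	return sum;
}

static void xor_cipher(uint8_t *buf, size_t len, uint8_t key)
{
	for (size_t i = 0; i < len; i++)
		buf[i] ^= key;
}

/*
 * Read path in the order listed above: 2. pre-verify the encoded
 * checksum, 3. decrypt, (4. decompress, elided), 5. post-verify the
 * plain-text checksum, 6. the data is ready for the page cache.
 */
static int read_pipeline(uint8_t *data, size_t len, uint8_t key,
			 uint32_t encoded_csum, uint32_t plain_csum)
{
	if (checksum(data, len) != encoded_csum)	/* step 2 */
		return -1;
	xor_cipher(data, len, key);			/* step 3 */
	if (checksum(data, len) != plain_csum)		/* step 5 */
		return -1;
	return 0;					/* step 6 */
}
```

Both verification steps are optional in the list above; a filesystem would drop whichever check it does not need.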

> 

...

> 
> Compression is where using xattrs gets interesting - the xattrs can
> have a fixed "offset" they belong to, but can store variable sized
> data records for that offset.
> 
> If we say we have a 64kB compression block size, we can store the
> compressed data for a 64k block entirely in a remote xattr even if
> compression fails (i.e. we can store the raw data, not the expanded
> "compressed" data). The remote xattr can store any amount of smaller
> data, and we map the compressed data directly into the page cache at
> a high offset. Then decompression can run on the high offset pages
> with the destination being some other page cache offset....

but compressed data itself can also be referenced multiple times
(reflink-like), so EROFS currently uses a separate pseudo inode
indexed by physical addresses for such cases.
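The variable-size remote-xattr record idea can be sketched as a hypothetical on-disk layout for one compression block, with a raw fallback when compression does not shrink the data (the struct and field names are made up for illustration, not XFS's actual format):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record layout for one compression block stored in a
 * remote xattr: a variable-length payload, with a raw fallback when
 * compression fails to save space. Not XFS's actual format. */
struct compressed_rec {
	uint32_t raw_len;	/* logical size of the block */
	uint32_t stored_len;	/* <= raw_len; == raw_len means raw */
	uint8_t data[];		/* stored_len payload bytes follow */
};

static struct compressed_rec *make_rec(const uint8_t *raw, uint32_t raw_len,
				       const uint8_t *comp, uint32_t comp_len)
{
	/* Store the compressed output only if it actually shrank. */
	int use_comp = comp_len < raw_len;
	uint32_t stored = use_comp ? comp_len : raw_len;
	struct compressed_rec *rec = malloc(sizeof(*rec) + stored);

	if (!rec)
		return NULL;
	rec->raw_len = raw_len;
	rec->stored_len = stored;
	memcpy(rec->data, use_comp ? comp : raw, stored);
	return rec;
}
```

Because the xattr record is variable-length anyway, the raw fallback costs nothing extra in the format; `stored_len == raw_len` is the "compression failed" marker.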

> 
> On the write side, compression can be done directly into the high
> offset page cache range for that 64kb offset range, then we can
> map that to a remote xattr block and write the xattr. The xattr
> naturally handles variable size blocks.

Also, unlike plain text, each compression filesystem may keep its
encoded data in a different form (e.g. filesystems could add headers
or trailers to the on-disk compressed data, or add more information
to the extent metadata) for its own needs.  So, unlike the current
uniform plain-text path, a single encoder/decoder may not be
flexible enough.

Thanks,
Gao Xiang


* Re: [PATCH 06/12] iomap: Introduce read_inline() function hook
  2024-10-11  3:28           ` Gao Xiang
@ 2024-10-11  4:52             ` Dave Chinner
  2024-10-11  5:19               ` Gao Xiang
  0 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2024-10-11  4:52 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Darrick J. Wong, Matthew Wilcox, linux-kernel, linux-fsdevel,
	Goldwyn Rodrigues

[FYI, your email got classified as spam by gmail...]

On Fri, Oct 11, 2024 at 11:28:42AM +0800, Gao Xiang wrote:
> Hi Dave,
> 
> On 2024/10/11 08:43, Dave Chinner wrote:
> > On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote:
> 
> ...
> 
> > 
> > .... there is specific ordering needed.
> > 
> > For writes, the ordering is:
> > 
> > 	1. pre-write data compression - requires data copy
> > 	2. pre-write data encryption - requires data copy
> > 	3. pre-write data checksums - data read only
> > 	4. write the data
> > 	5. post-write metadata updates
> > 
> > We cannot usefully perform compression after encryption -
> > random data doesn't compress - and the checksum must match what is
> > written to disk, so it has to come after all other transformations
> > have been done.
> > 
> > For reads, the order is:
> > 
> > 	1. read the data
> > 	2. verify the data checksum
> > 	3. decrypt the data - requires data copy
> > 	4. decompress the data - requires data copy
> > 	5. place the plain text data in the page cache
> 
> Just some random notes for reference: currently fsverity builds the
> merkle tree over the plain text,

Well, that is specifically an existing implementation detail -
the fsverity core does not care what data is asked to measure as long
as it is the same data that it is asked to verify.

Hence a filesystem could ask fsverity to measure compressed,
encrypted data, and as long as the filesystem also asks fsverity to
measure the same compressed, encrypted data as it is read from disk
it will work as expected.

We could do this quite easily - hand the compressed data record
to fsverity one fsblock sized chunk at a time, and treat the empty
regions between the end of the compressed record and the offset
of the start of the next compressed record as a hole....

So, yeah, I think that fsverity can be placed at the "verify data
on disk" layer successfully rather than at the "verify plain text"
layer without actually impacting its functionality.
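The per-fsblock measurement scheme described above can be sketched like so: feed the variable-length record to the measurement hash one fsblock at a time, zero-filling the region between the end of the record and the next fsblock boundary as if it were a hole (FNV-1a stands in for the real verity hash):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FSBLOCK 4096

/* FNV-1a stands in for the real per-block verity hash. */
static uint64_t fnv1a(uint64_t h, const uint8_t *p, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/*
 * Measure a variable-length compressed record one fsblock at a time;
 * the tail past the end of the record, up to the next fsblock
 * boundary, is treated as a hole (all zeroes).
 */
static uint64_t measure_record(const uint8_t *rec, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL;
	uint8_t block[FSBLOCK];

	for (size_t off = 0; off < len; off += FSBLOCK) {
		size_t n = len - off < FSBLOCK ? len - off : FSBLOCK;

		memset(block, 0, sizeof(block));	/* hole-fill */
		memcpy(block, rec + off, n);
		h = fnv1a(h, block, FSBLOCK);
	}
	return h;
}
```

The hole-filled tail means a short record measures the same as the same bytes zero-padded out to the next block boundary, so the on-disk gap between compressed records does not perturb the digest.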

....
> > Compression is where using xattrs gets interesting - the xattrs can
> > have a fixed "offset" they belong to, but can store variable sized
> > data records for that offset.
> > 
> > If we say we have a 64kB compression block size, we can store the
> > compressed data for a 64k block entirely in a remote xattr even if
> > compression fails (i.e. we can store the raw data, not the expanded
> > "compressed" data). The remote xattr can store any amount of smaller
> > data, and we map the compressed data directly into the page cache at
> > a high offset. Then decompression can run on the high offset pages
> > with the destination being some other page cache offset....
> 
> but compressed data itself can also be referenced multiple times
> (reflink-like), so EROFS currently uses a separate pseudo inode
> indexed by physical addresses for such cases.

Sure, but handling shared data extents and breaking of shared
mappings on write is not an iomap/page cache problem - that's a
problem the filesystem block mapping operations that iomap calls
need to handle.

EROFS uses a separate pseudo inode so that it can share page cache
as well as shared blocks on disk. I don't think that compression
changes that at all - the page cache contents for all those blocks
are still going to be identical...

As for the case of shared compressed data extents in XFS, I think
that shared status just needs a shared bit added to the remote
xattr extent record header. Nothing else will really have to change,
because xattr record overwrites are naturally copy-on-write. Hence
updating a record will always break sharing, and the "shared bit"
simply propagates into the block freeing operation to indicate a
refcount update for the blocks being freed is needed. I don't see
supporting FICLONE on compressed inodes as a major issue.

> > On the write side, compression can be done directly into the high
> > offset page cache range for that 64kb offset range, then we can
> > map that to a remote xattr block and write the xattr. The xattr
> > naturally handles variable size blocks.
> 
> Also, unlike plain text, each compression filesystem may keep its
> encoded data in a different form (e.g. filesystems could add headers
> or trailers to the on-disk compressed data, or add more information
> to the extent metadata) for its own needs.

Sure, but that's something that the filesystem can add when encoding
the data into the page cache. iomap treats the contents of the page
cache as entirely opaque - how "transformed" data is encoded in the
destination folios is completely up to the filesystem doing the
transformation. All iomap needs to care about is the offset and
length of the opaque transformed data the filesystem needs to reside
in the cache to perform the transformation.

i.e. The example I gave above for XFS compression doesn't need
metadata in the page cache data because it is held in an externally
indexed xattr btree record. That's an XFS compression implementation
detail, not an iomap concern.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 06/12] iomap: Introduce read_inline() function hook
  2024-10-11  4:52             ` Dave Chinner
@ 2024-10-11  5:19               ` Gao Xiang
  0 siblings, 0 replies; 36+ messages in thread
From: Gao Xiang @ 2024-10-11  5:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Matthew Wilcox, linux-kernel, linux-fsdevel,
	Goldwyn Rodrigues



On 2024/10/11 12:52, Dave Chinner wrote:
> [FYI, your email got classified as spam by gmail...]

(I know.. yet that is the only permitted way to send email at work..)

> 
> On Fri, Oct 11, 2024 at 11:28:42AM +0800, Gao Xiang wrote:
>> Hi Dave,
>>
>> On 2024/10/11 08:43, Dave Chinner wrote:
>>> On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote:
>>
>> ...
>>
>>>
>>> .... there is specific ordering needed.
>>>
>>> For writes, the ordering is:
>>>
>>> 	1. pre-write data compression - requires data copy
>>> 	2. pre-write data encryption - requires data copy
>>> 	3. pre-write data checksums - data read only
>>> 	4. write the data
>>> 	5. post-write metadata updates
>>>
>>> We cannot usefully perform compression after encryption -
>>> random data doesn't compress - and the checksum must match what is
>>> written to disk, so it has to come after all other transformations
>>> have been done.
>>>
>>> For reads, the order is:
>>>
>>> 	1. read the data
>>> 	2. verify the data checksum
>>> 	3. decrypt the data - requires data copy
>>> 	4. decompress the data - requires data copy
>>> 	5. place the plain text data in the page cache
>>
>> Just some random notes for reference: currently fsverity builds the
>> merkle tree over the plain text,
> 
> Well, that is specifically an existing implementation detail -
> the fsverity core does not care what data is asked to measure as long
> as it is the same data that it is asked to verify.
> 
> Hence a filesystem could ask fsverity to measure compressed,
> encrypted data, and as long as the filesystem also asks fsverity to
> measure the same compressed, encrypted data as it is read from disk
> it will work as expected.
> 
> We could do this quite easily - hand the compressed data record
> to fsverity one fsblock sized chunk at a time, and treat the empty
> regions between the end of the compressed record and the offset
> of the start of the next compressed record as a hole....

.. honestly, I'm not quite sure that is just an implementation
detail. For example, userspace can currently compare the root hash
digests of files to identify identical data, such as the same data A:
   A + LZ4 = A1
   A + DEFLATE = A2
   A + Zstd = A3
All three files have the same root digest under the current fsverity
use cases, but if merkle trees are applied to the transformed data,
the digests will differ and might not meet some users' use
cases anyway.
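The point can be demonstrated with toy stand-ins (FNV-1a for the merkle root digest, byte-pair swapping and XOR for two different compressors): the encoded byte streams all differ, while the plain text - and therefore a plain-text digest - stays identical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a stands in for the merkle-tree root digest; the two toy
 * encoders stand in for e.g. LZ4 vs. DEFLATE. */
static uint64_t digest(const uint8_t *p, size_t n)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Toy "compressor" 1: swap adjacent byte pairs. */
static void enc_swap(const uint8_t *in, uint8_t *out, size_t n)
{
	for (size_t i = 0; i + 1 < n; i += 2) {
		out[i] = in[i + 1];
		out[i + 1] = in[i];
	}
	if (n & 1)
		out[n - 1] = in[n - 1];
}

/* Toy "compressor" 2: XOR every byte with a constant. */
static void enc_xor(const uint8_t *in, uint8_t *out, size_t n)
{
	for (size_t i = 0; i < n; i++)
		out[i] = in[i] ^ 0x2a;
}
```

Digesting the plain text gives one value for A1/A2/A3; digesting the encoded bytes would give a different value per encoder, which is the compatibility concern above.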

> 
> So, yeah, I think that fsverity can be placed at the "verify data
> on disk" layer successfully rather than at the "verify plain text"
> layer without actually impacting its functionality.
> 
> ....
>>> Compression is where using xattrs gets interesting - the xattrs can
>>> have a fixed "offset" they belong to, but can store variable sized
>>> data records for that offset.
>>>
>>> If we say we have a 64kB compression block size, we can store the
>>> compressed data for a 64k block entirely in a remote xattr even if
>>> compression fails (i.e. we can store the raw data, not the expanded
>>> "compressed" data). The remote xattr can store any amount of smaller
>>> data, and we map the compressed data directly into the page cache at
>>> a high offset. Then decompression can run on the high offset pages
>>> with the destination being some other page cache offset....
>>
>> but compressed data itself can also be referenced multiple times
>> (reflink-like), so EROFS currently uses a separate pseudo inode
>> indexed by physical addresses for such cases.
> 
> Sure, but handling shared data extents and breaking of shared
> mappings on write is not an iomap/page cache problem - that's a
> problem the filesystem block mapping operations that iomap calls
> need to handle.
> 
> EROFS uses a separate pseudo inode so that it can share page cache
> as well as shared blocks on disk. I don't think that compression
> changes that at all - the page cache contents for all those blocks
> are still going to be identical...
> 
> As for the case of shared compressed data extents in XFS, I think
> that shared status just needs a shared bit added to the remote
> xattr extent record header. Nothing else will really have to change,
> because xattr record overwrites are naturally copy-on-write. Hence
> updating a record will always break sharing, and the "shared bit"
> simply propagates into the block freeing operation to indicate a
> refcount update for the blocks being freed is needed. I don't see
> supporting FICLONE on compressed inodes as a major issue.

Yes, I agree that for the XFS on-disk format it's quite easy.  My
comment related to a minor runtime point: "compressed data directly
into the page cache at a high offset".

That is, if a separate pseudo inode is used to hold cached
compressed data, only one copy and one I/O are needed for shared
compressed data when cached decompression is used.  Anyway, that is
XFS's proposal, so it was just a minor comment, though.

> 
>>> On the write side, compression can be done directly into the high
>>> offset page cache range for that 64kb offset range, then we can
>>> map that to a remote xattr block and write the xattr. The xattr
>>> naturally handles variable size blocks.
>>
>> Also, unlike plain text, each compression filesystem may keep its
>> encoded data in a different form (e.g. filesystems could add
>> headers or trailers to the on-disk compressed data, or add more
>> information to the extent metadata) for its own needs.
> 
> Sure, but that's something that the filesystem can add when encoding
> the data into the page cache. iomap treats the contents of the page
> cache as entirely opaque - how "transformed" data is encoded in the
> destination folios is completely up to the filesystem doing the
> transformation. All iomap needs to care about is the offset and
> length of the opaque transformed data the filesystem needs to reside
> in the cache to perform the transformation.
> 
> i.e. The example I gave above for XFS compression doesn't need
> metadata in the page cache data because it is held in an externally
> indexed xattr btree record. That's an XFS compression implementation
> detail, not an iomap concern.

Got it.

Thanks,
Gao Xiang

> 
> -Dave.



* Re: [PATCH 05/12] iomap: Introduce IOMAP_ENCODED
  2024-10-10  9:42     ` Christoph Hellwig
  2024-10-10 17:50       ` Goldwyn Rodrigues
@ 2024-10-15  3:43       ` Ritesh Harjani
  1 sibling, 0 replies; 36+ messages in thread
From: Ritesh Harjani @ 2024-10-15  3:43 UTC (permalink / raw)
  To: Christoph Hellwig, Goldwyn Rodrigues
  Cc: linux-kernel, linux-fsdevel, Goldwyn Rodrigues

Christoph Hellwig <hch@infradead.org> writes:

> On Tue, Oct 08, 2024 at 02:48:43AM -0700, Christoph Hellwig wrote:
>> In general I'm not a huge fan of the encoded magic here, but I'll
>> need to take a closer look at the caller if I can come up with
>> something better.
>
> I looked a bit more at the code.  I'm not entirely sure I fully
> understand it yet, but:
>
> I think most of the read side special casing would be handled by
> always submitting the bio at the end of an iomap.  Ritesh was
> looking into that for supporting ext2-like file systems that
> read indirect block ondemand, but I think it actually is fundamentally
> the right thing to do anyway.

yes, it was... 
    This patch optimizes the data access patterns for filesystems with
    indirect block mapping by implementing BH_Boundary handling within
    iomap.

    Currently the bios for reads within iomap are only submitted at
    2 places -
    1. If we cannot merge the new req. with previous bio, only then we
    submit the previous bio.
    2. Submit the bio at the end of the entire read processing.

    This means for filesystems with indirect block mapping, we call into
    ->iomap_begin() again w/o submitting the previous bios. That causes
    unoptimized data access patterns for blocks which are of BH_Boundary type.
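The behaviour described above can be modelled in userspace: flush the pending bio whenever the mapping changes, rather than only at the very end of the read (the structure and names are illustrative, not the real iomap API):

```c
#include <assert.h>

#define MAX_SUBMITS 16

/*
 * Userspace model: blocks accumulate into a pending "bio" while the
 * current mapping stays the same; when a block arrives under a new
 * mapping (i.e. the previous iomap has ended), the pending bio is
 * submitted first. This mirrors the BH_Boundary-style handling the
 * thread describes. Names are illustrative, not the iomap API.
 */
struct rbatch {
	int map_id;		/* "iomap" the pending blocks belong to */
	int pending;		/* blocks in the not-yet-submitted bio */
	int submits[MAX_SUBMITS];
	int nr_submits;		/* sizes of submitted bios, in order */
};

static void submit_pending(struct rbatch *b)
{
	if (b->pending && b->nr_submits < MAX_SUBMITS)
		b->submits[b->nr_submits++] = b->pending;
	b->pending = 0;
}

static void add_block(struct rbatch *b, int map_id)
{
	if (b->pending && b->map_id != map_id)
		submit_pending(b);	/* mapping changed: flush */
	b->map_id = map_id;
	b->pending++;
}
```

With this scheme a bio never spans two mappings, so the next ->iomap_begin() call (which may need to read an indirect block) runs with the previous I/O already issued.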

Here is the change which describes this [1]. The implementation part
needed to be changed to reduce the complexity. Willy gave a better
implementation alternative here [2].

[1]: https://lore.kernel.org/all/4e2752e99f55469c4eb5f2fe83e816d529110192.1714046808.git.ritesh.list@gmail.com/
[2]: https://lore.kernel.org/all/Zivu0gzb4aiazSNu@casper.infradead.org/

Sorry, I can resume this work once I am done with the other priority
work on my plate. Meanwhile, I would be happy to help if someone
would like to work on it.

-ritesh


end of thread, other threads:[~2024-10-15  3:55 UTC | newest]

Thread overview: 36+ messages
2024-10-04 20:04 [PATCH 00/12] btrfs reads through iomap Goldwyn Rodrigues
2024-10-04 20:04 ` [PATCH 01/12] iomap: check if folio size is equal to FS block size Goldwyn Rodrigues
2024-10-05  2:17   ` Matthew Wilcox
2024-10-07 16:57     ` Darrick J. Wong
2024-10-10 17:59       ` Goldwyn Rodrigues
2024-10-08  9:40   ` Christoph Hellwig
2024-10-04 20:04 ` [PATCH 02/12] iomap: Introduce iomap_read_folio_ops Goldwyn Rodrigues
2024-10-05  2:20   ` Matthew Wilcox
2024-10-07 17:01     ` Darrick J. Wong
2024-10-08  9:43     ` Christoph Hellwig
2024-10-04 20:04 ` [PATCH 03/12] iomap: add bioset in iomap_read_folio_ops for filesystems to use own bioset Goldwyn Rodrigues
2024-10-08  9:45   ` Christoph Hellwig
2024-10-10 17:51     ` Goldwyn Rodrigues
2024-10-04 20:04 ` [PATCH 04/12] iomap: include iomap_read_end_io() in header Goldwyn Rodrigues
2024-10-07 17:02   ` Darrick J. Wong
2024-10-10 18:12     ` Goldwyn Rodrigues
2024-10-04 20:04 ` [PATCH 05/12] iomap: Introduce IOMAP_ENCODED Goldwyn Rodrigues
2024-10-08  9:48   ` Christoph Hellwig
2024-10-10  9:42     ` Christoph Hellwig
2024-10-10 17:50       ` Goldwyn Rodrigues
2024-10-15  3:43       ` Ritesh Harjani
2024-10-04 20:04 ` [PATCH 06/12] iomap: Introduce read_inline() function hook Goldwyn Rodrigues
2024-10-05  2:30   ` Matthew Wilcox
2024-10-07 17:47     ` Darrick J. Wong
2024-10-10 18:10       ` Goldwyn Rodrigues
2024-10-11  0:43         ` Dave Chinner
2024-10-11  3:28           ` Gao Xiang
2024-10-11  4:52             ` Dave Chinner
2024-10-11  5:19               ` Gao Xiang
2024-10-04 20:04 ` [PATCH 07/12] btrfs: btrfs_em_to_iomap() to convert em to iomap Goldwyn Rodrigues
2024-10-04 20:04 ` [PATCH 08/12] btrfs: iomap_begin() for buffered reads Goldwyn Rodrigues
2024-10-04 20:04 ` [PATCH 09/12] btrfs: define btrfs_iomap_read_folio_ops Goldwyn Rodrigues
2024-10-04 20:04 ` [PATCH 10/12] btrfs: define btrfs_iomap_folio_ops Goldwyn Rodrigues
2024-10-04 20:04 ` [PATCH 11/12] btrfs: add read_inline for folio operations for read() calls Goldwyn Rodrigues
2024-10-04 20:04 ` [PATCH 12/12] btrfs: switch to iomap for buffered reads Goldwyn Rodrigues
2024-10-08  9:39 ` [PATCH 00/12] btrfs reads through iomap Christoph Hellwig
