* [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance
@ 2023-06-06 11:43 Ritesh Harjani (IBM)
  2023-06-06 11:43 ` [PATCHv8 1/5] iomap: Rename iomap_page to iomap_folio and others Ritesh Harjani (IBM)
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-06 11:43 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Darrick J. Wong, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM)

Hello All,

Please find PATCHv8, which adds per-block dirty tracking to iomap.
As discussed earlier, this is required to improve write performance and reduce
write amplification in cases where either the blocksize is less than the
pagesize (such as on the Power platform with its 64k pagesize) or when we have
a large folio (such as on xfs, which already supports large folios). For
example, with a 64k folio and a 4k blocksize, a single dirty block can
currently force all 16 blocks of the folio to be written back.

v7 -> v8
==========
1. Renamed iomap_page -> iomap_folio & iop -> iof in Patch-1 itself.
2. Changed the naming convention of the iomap_iop_** functions to take iop
   as a function argument (as Darrick suggested). Also renamed
   iomap_iop_test_full_uptodate() -> iomap_iop_is_fully_uptodate() &
   iomap_iop_test_block_uptodate() -> iomap_iop_is_block_uptodate()
   (similarly for the dirty state).
3. Added a comment in iomap_set_range_dirty() to explain why we pass the
   inode argument to it.
4. Added Darrick's Reviewed-by to Patch-3 & Patch-4, as these patches are
   unchanged in v8.

Thanks for all the review comments. I think the patch series is in much better
shape now. Please let me know if this looks good.

Patchv6 -> Patchv7
==================
1. Fixed __maybe_unused annotation.
2. Added patch-4:
   iomap: Refactor iomap_write_delalloc_punch() function out

RFCv5 -> PATCHv6:
=================
1. Addressed review comments from Brian, Christoph and Matthew.
   @Christoph:
     - I have renamed the higher level functions such as iop_alloc/iop_free() to
       iomap_iop_alloc/free() in v6.
     - As for the low level bitmap accessor functions, I couldn't find any better
       naming than iop_test_/set/clear_**. I could have gone for
       iomap_iop__test/set/clear/_** or iomap__iop_test/set/clear_**, but
       I wasn't convinced by either of the above, as they also lengthen the
       function names.
       Besides, the iop_test/set/clear_ accessor functions for uptodate and
       dirty status tracking make sense, since we are sure we have a valid iop
       in such cases. Please do let me know if this looks ok to you.
2. I tried testing gfs2 (initially with no patches) with xfstests, but I
   always ended up in one deadlock or another (I couldn't spend any time
   debugging that). I also ran it with -x log, but it was still always
   failing for me.
   @Andreas:
   - Could you please suggest how I can test gfs2 with these patches? I see
     gfs2 can have a smaller blocksize and it uses the iomap buffered I/O
     path. It would be good if we could get these patches tested on it too.

3. I can now say I have run a good amount of fstests on these patches on the
   platforms below, and I haven't found any new failures in my testing so far:
   arm64 (64k pagesize): with 4k -g quick
   Power: with 4k -g auto
   x86: 1k, 4k with -g auto and adv_auto

From my testing so far these patches look stable to me, and if they look good
to the reviewers as well, do you think this can be queued to linux-next for
wider testing?

Performance numbers copied from the last patch's commit message
===============================================================
Performance testing of the below fio workload reveals a ~16x performance
improvement using nvme with XFS (4k blocksize) on Power (64K pagesize).
FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.

1. <test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]
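
The above job file can be run directly with fio, e.g.:

	$ fio test_randwrite.fio

Note that dir=./ is relative, so fio needs to be invoked from a directory on
the filesystem under test.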

2. Also our internal performance team reported that this patch improves
   their database workload performance by around ~83% (with XFS on Power)

Ritesh Harjani (IBM) (5):
  iomap: Rename iomap_page to iomap_folio and others
  iomap: Renames and refactor iomap_folio state bitmap handling
  iomap: Refactor iomap_write_delalloc_punch() function out
  iomap: Allocate iof in ->write_begin() early
  iomap: Add per-block dirty state tracking to improve performance

 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 378 +++++++++++++++++++++++++++++------------
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/file.c       |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 269 insertions(+), 116 deletions(-)

--
2.40.1



* [PATCHv8 1/5] iomap: Rename iomap_page to iomap_folio and others
  2023-06-06 11:43 [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
@ 2023-06-06 11:43 ` Ritesh Harjani (IBM)
  2023-06-07  6:49   ` Christoph Hellwig
  2023-06-06 11:43 ` [PATCHv8 2/5] iomap: Renames and refactor iomap_folio state bitmap handling Ritesh Harjani (IBM)
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-06 11:43 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Darrick J. Wong, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM)

This patch renames -
1. struct iomap_page -> struct iomap_folio
2. iop -> iof
3. iomap_page_create() -> iomap_iof_alloc()
4. iomap_page_release() -> iomap_iof_free()
5. to_iomap_page() -> iomap_get_iof()

Since iomap works with folios, and all the necessary conversions are
already in place, rename structures like struct iomap_page while we are
already working in this area (later patches will add per-block dirty
status tracking to the iomap_folio state bitmap).
Later patches also add more functions for handling the iof structure with
the iomap_iof_** naming convention, so the iomap_iof_alloc/free() naming
makes more sense and keeps it consistent with all the APIs.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 138 ++++++++++++++++++++---------------------
 1 file changed, 69 insertions(+), 69 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 063133ec77f4..741baa10c517 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -27,14 +27,14 @@
  * Structure allocated for each folio when block size < folio size
  * to track sub-folio uptodate status and I/O completions.
  */
-struct iomap_page {
+struct iomap_folio {
 	atomic_t		read_bytes_pending;
 	atomic_t		write_bytes_pending;
 	spinlock_t		uptodate_lock;
 	unsigned long		uptodate[];
 };
 
-static inline struct iomap_page *to_iomap_page(struct folio *folio)
+static inline struct iomap_folio *iomap_get_iof(struct folio *folio)
 {
 	if (folio_test_private(folio))
 		return folio_get_private(folio);
@@ -43,45 +43,45 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 
 static struct bio_set iomap_ioend_bioset;
 
-static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
+static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
+				struct folio *folio, unsigned int flags)
 {
-	struct iomap_page *iop = to_iomap_page(folio);
+	struct iomap_folio *iof = iomap_get_iof(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 	gfp_t gfp;
 
-	if (iop || nr_blocks <= 1)
-		return iop;
+	if (iof || nr_blocks <= 1)
+		return iof;
 
 	if (flags & IOMAP_NOWAIT)
 		gfp = GFP_NOWAIT;
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+	iof = kzalloc(struct_size(iof, uptodate, BITS_TO_LONGS(nr_blocks)),
 		      gfp);
-	if (iop) {
-		spin_lock_init(&iop->uptodate_lock);
+	if (iof) {
+		spin_lock_init(&iof->uptodate_lock);
 		if (folio_test_uptodate(folio))
-			bitmap_fill(iop->uptodate, nr_blocks);
-		folio_attach_private(folio, iop);
+			bitmap_fill(iof->uptodate, nr_blocks);
+		folio_attach_private(folio, iof);
 	}
-	return iop;
+	return iof;
 }
 
-static void iomap_page_release(struct folio *folio)
+static void iomap_iof_free(struct folio *folio)
 {
-	struct iomap_page *iop = folio_detach_private(folio);
+	struct iomap_folio *iof = folio_detach_private(folio);
 	struct inode *inode = folio->mapping->host;
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
-	if (!iop)
+	if (!iof)
 		return;
-	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
-	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
-	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
+	WARN_ON_ONCE(atomic_read(&iof->read_bytes_pending));
+	WARN_ON_ONCE(atomic_read(&iof->write_bytes_pending));
+	WARN_ON_ONCE(bitmap_full(iof->uptodate, nr_blocks) !=
 			folio_test_uptodate(folio));
-	kfree(iop);
+	kfree(iof);
 }
 
 /*
@@ -90,7 +90,7 @@ static void iomap_page_release(struct folio *folio)
 static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 		loff_t *pos, loff_t length, size_t *offp, size_t *lenp)
 {
-	struct iomap_page *iop = to_iomap_page(folio);
+	struct iomap_folio *iof = iomap_get_iof(folio);
 	loff_t orig_pos = *pos;
 	loff_t isize = i_size_read(inode);
 	unsigned block_bits = inode->i_blkbits;
@@ -105,12 +105,12 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	 * per-block uptodate status and adjust the offset and length if needed
 	 * to avoid reading in already uptodate ranges.
 	 */
-	if (iop) {
+	if (iof) {
 		unsigned int i;
 
 		/* move forward for each leading block marked uptodate */
 		for (i = first; i <= last; i++) {
-			if (!test_bit(i, iop->uptodate))
+			if (!test_bit(i, iof->uptodate))
 				break;
 			*pos += block_size;
 			poff += block_size;
@@ -120,7 +120,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* truncate len if we find any trailing uptodate block(s) */
 		for ( ; i <= last; i++) {
-			if (test_bit(i, iop->uptodate)) {
+			if (test_bit(i, iof->uptodate)) {
 				plen -= (last - i + 1) * block_size;
 				last = i - 1;
 				break;
@@ -144,26 +144,26 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	*lenp = plen;
 }
 
-static void iomap_iop_set_range_uptodate(struct folio *folio,
-		struct iomap_page *iop, size_t off, size_t len)
+static void iomap_iof_set_range_uptodate(struct folio *folio,
+		struct iomap_folio *iof, size_t off, size_t len)
 {
 	struct inode *inode = folio->mapping->host;
 	unsigned first = off >> inode->i_blkbits;
 	unsigned last = (off + len - 1) >> inode->i_blkbits;
 	unsigned long flags;
 
-	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	bitmap_set(iop->uptodate, first, last - first + 1);
-	if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
+	spin_lock_irqsave(&iof->uptodate_lock, flags);
+	bitmap_set(iof->uptodate, first, last - first + 1);
+	if (bitmap_full(iof->uptodate, i_blocks_per_folio(inode, folio)))
 		folio_mark_uptodate(folio);
-	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
+	spin_unlock_irqrestore(&iof->uptodate_lock, flags);
 }
 
 static void iomap_set_range_uptodate(struct folio *folio,
-		struct iomap_page *iop, size_t off, size_t len)
+		struct iomap_folio *iof, size_t off, size_t len)
 {
-	if (iop)
-		iomap_iop_set_range_uptodate(folio, iop, off, len);
+	if (iof)
+		iomap_iof_set_range_uptodate(folio, iof, off, len);
 	else
 		folio_mark_uptodate(folio);
 }
@@ -171,16 +171,16 @@ static void iomap_set_range_uptodate(struct folio *folio,
 static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 		size_t len, int error)
 {
-	struct iomap_page *iop = to_iomap_page(folio);
+	struct iomap_folio *iof = iomap_get_iof(folio);
 
 	if (unlikely(error)) {
 		folio_clear_uptodate(folio);
 		folio_set_error(folio);
 	} else {
-		iomap_set_range_uptodate(folio, iop, offset, len);
+		iomap_set_range_uptodate(folio, iof, offset, len);
 	}
 
-	if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
+	if (!iof || atomic_sub_and_test(len, &iof->read_bytes_pending))
 		folio_unlock(folio);
 }
 
@@ -213,7 +213,7 @@ struct iomap_readpage_ctx {
 static int iomap_read_inline_data(const struct iomap_iter *iter,
 		struct folio *folio)
 {
-	struct iomap_page *iop;
+	struct iomap_folio *iof;
 	const struct iomap *iomap = iomap_iter_srcmap(iter);
 	size_t size = i_size_read(iter->inode) - iomap->offset;
 	size_t poff = offset_in_page(iomap->offset);
@@ -231,15 +231,15 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iomap_page_create(iter->inode, folio, iter->flags);
+		iof = iomap_iof_alloc(iter->inode, folio, iter->flags);
 	else
-		iop = to_iomap_page(folio);
+		iof = iomap_get_iof(folio);
 
 	addr = kmap_local_folio(folio, offset);
 	memcpy(addr, iomap->inline_data, size);
 	memset(addr + size, 0, PAGE_SIZE - poff - size);
 	kunmap_local(addr);
-	iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
+	iomap_set_range_uptodate(folio, iof, offset, PAGE_SIZE - poff);
 	return 0;
 }
 
@@ -260,7 +260,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 	loff_t pos = iter->pos + offset;
 	loff_t length = iomap_length(iter) - offset;
 	struct folio *folio = ctx->cur_folio;
-	struct iomap_page *iop;
+	struct iomap_folio *iof;
 	loff_t orig_pos = pos;
 	size_t poff, plen;
 	sector_t sector;
@@ -269,20 +269,20 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);
 
 	/* zero post-eof blocks as the page may be mapped */
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iof = iomap_iof_alloc(iter->inode, folio, iter->flags);
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
 
 	if (iomap_block_needs_zeroing(iter, pos)) {
 		folio_zero_range(folio, poff, plen);
-		iomap_set_range_uptodate(folio, iop, poff, plen);
+		iomap_set_range_uptodate(folio, iof, poff, plen);
 		goto done;
 	}
 
 	ctx->cur_folio_in_bio = true;
-	if (iop)
-		atomic_add(plen, &iop->read_bytes_pending);
+	if (iof)
+		atomic_add(plen, &iof->read_bytes_pending);
 
 	sector = iomap_sector(iomap, pos);
 	if (!ctx->bio ||
@@ -436,11 +436,11 @@ EXPORT_SYMBOL_GPL(iomap_readahead);
  */
 bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 {
-	struct iomap_page *iop = to_iomap_page(folio);
+	struct iomap_folio *iof = iomap_get_iof(folio);
 	struct inode *inode = folio->mapping->host;
 	unsigned first, last, i;
 
-	if (!iop)
+	if (!iof)
 		return false;
 
 	/* Caller's range may extend past the end of this folio */
@@ -451,7 +451,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 	last = (from + count - 1) >> inode->i_blkbits;
 
 	for (i = first; i <= last; i++)
-		if (!test_bit(i, iop->uptodate))
+		if (!test_bit(i, iof->uptodate))
 			return false;
 	return true;
 }
@@ -490,7 +490,7 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
 	 */
 	if (folio_test_dirty(folio) || folio_test_writeback(folio))
 		return false;
-	iomap_page_release(folio);
+	iomap_iof_free(folio);
 	return true;
 }
 EXPORT_SYMBOL_GPL(iomap_release_folio);
@@ -507,12 +507,12 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 	if (offset == 0 && len == folio_size(folio)) {
 		WARN_ON_ONCE(folio_test_writeback(folio));
 		folio_cancel_dirty(folio);
-		iomap_page_release(folio);
+		iomap_iof_free(folio);
 	} else if (folio_test_large(folio)) {
-		/* Must release the iop so the page can be split */
+		/* Must release the iof so the page can be split */
 		WARN_ON_ONCE(!folio_test_uptodate(folio) &&
 			     folio_test_dirty(folio));
-		iomap_page_release(folio);
+		iomap_iof_free(folio);
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
@@ -547,7 +547,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 		size_t len, struct folio *folio)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
-	struct iomap_page *iop;
+	struct iomap_folio *iof;
 	loff_t block_size = i_blocksize(iter->inode);
 	loff_t block_start = round_down(pos, block_size);
 	loff_t block_end = round_up(pos + len, block_size);
@@ -559,8 +559,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 		return 0;
 	folio_clear_error(folio);
 
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
-	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
+	iof = iomap_iof_alloc(iter->inode, folio, iter->flags);
+	if ((iter->flags & IOMAP_NOWAIT) && !iof && nr_blocks > 1)
 		return -EAGAIN;
 
 	do {
@@ -589,7 +589,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 			if (status)
 				return status;
 		}
-		iomap_set_range_uptodate(folio, iop, poff, plen);
+		iomap_set_range_uptodate(folio, iof, poff, plen);
 	} while ((block_start += plen) < block_end);
 
 	return 0;
@@ -696,7 +696,7 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
-	struct iomap_page *iop = to_iomap_page(folio);
+	struct iomap_folio *iof = iomap_get_iof(folio);
 	flush_dcache_folio(folio);
 
 	/*
@@ -712,7 +712,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	 */
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
-	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
+	iomap_set_range_uptodate(folio, iof, offset_in_folio(folio, pos), len);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -1290,17 +1290,17 @@ EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
 static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
 		size_t len, int error)
 {
-	struct iomap_page *iop = to_iomap_page(folio);
+	struct iomap_folio *iof = iomap_get_iof(folio);
 
 	if (error) {
 		folio_set_error(folio);
 		mapping_set_error(inode->i_mapping, error);
 	}
 
-	WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iop);
-	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);
+	WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iof);
+	WARN_ON_ONCE(iof && atomic_read(&iof->write_bytes_pending) <= 0);
 
-	if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
+	if (!iof || atomic_sub_and_test(len, &iof->write_bytes_pending))
 		folio_end_writeback(folio);
 }
 
@@ -1567,7 +1567,7 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
  */
 static void
 iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
-		struct iomap_page *iop, struct iomap_writepage_ctx *wpc,
+		struct iomap_folio *iof, struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct list_head *iolist)
 {
 	sector_t sector = iomap_sector(&wpc->iomap, pos);
@@ -1585,8 +1585,8 @@ iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
 		bio_add_folio(wpc->ioend->io_bio, folio, len, poff);
 	}
 
-	if (iop)
-		atomic_add(len, &iop->write_bytes_pending);
+	if (iof)
+		atomic_add(len, &iof->write_bytes_pending);
 	wpc->ioend->io_size += len;
 	wbc_account_cgroup_owner(wbc, &folio->page, len);
 }
@@ -1612,7 +1612,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iomap_page_create(inode, folio, 0);
+	struct iomap_folio *iof = iomap_iof_alloc(inode, folio, 0);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1620,7 +1620,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	int error = 0, count = 0, i;
 	LIST_HEAD(submit_list);
 
-	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
+	WARN_ON_ONCE(iof && atomic_read(&iof->write_bytes_pending) != 0);
 
 	/*
 	 * Walk through the folio to find areas to write back. If we
@@ -1628,7 +1628,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !test_bit(i, iop->uptodate))
+		if (iof && !test_bit(i, iof->uptodate))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
@@ -1639,7 +1639,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 			continue;
 		if (wpc->iomap.type == IOMAP_HOLE)
 			continue;
-		iomap_add_to_ioend(inode, pos, folio, iop, wpc, wbc,
+		iomap_add_to_ioend(inode, pos, folio, iof, wpc, wbc,
 				 &submit_list);
 		count++;
 	}
-- 
2.40.1



* [PATCHv8 2/5] iomap: Renames and refactor iomap_folio state bitmap handling
  2023-06-06 11:43 [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
  2023-06-06 11:43 ` [PATCHv8 1/5] iomap: Rename iomap_page to iomap_folio and others Ritesh Harjani (IBM)
@ 2023-06-06 11:43 ` Ritesh Harjani (IBM)
  2023-06-06 15:13   ` Darrick J. Wong
  2023-06-07  6:50   ` Christoph Hellwig
  2023-06-06 11:43 ` [PATCHv8 3/5] iomap: Refactor iomap_write_delalloc_punch() function out Ritesh Harjani (IBM)
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 27+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-06 11:43 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Darrick J. Wong, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM)

This patch renames iomap_folio's uptodate bitmap to a state bitmap.
It also refactors the iof->state handling and adds helper functions for
the uptodate state.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 107 +++++++++++++++++++++++------------------
 1 file changed, 59 insertions(+), 48 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 741baa10c517..08f2a1cf0a66 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -24,14 +24,14 @@
 #define IOEND_BATCH_SIZE	4096
 
 /*
- * Structure allocated for each folio when block size < folio size
- * to track sub-folio uptodate status and I/O completions.
+ * Structure allocated for each folio to track per-block uptodate state
+ * and I/O completions.
  */
 struct iomap_folio {
 	atomic_t		read_bytes_pending;
 	atomic_t		write_bytes_pending;
-	spinlock_t		uptodate_lock;
-	unsigned long		uptodate[];
+	spinlock_t		state_lock;
+	unsigned long		state[];
 };
 
 static inline struct iomap_folio *iomap_get_iof(struct folio *folio)
@@ -43,6 +43,47 @@ static inline struct iomap_folio *iomap_get_iof(struct folio *folio)
 
 static struct bio_set iomap_ioend_bioset;
 
+static inline bool iomap_iof_is_fully_uptodate(struct folio *folio,
+					       struct iomap_folio *iof)
+{
+	struct inode *inode = folio->mapping->host;
+
+	return bitmap_full(iof->state, i_blocks_per_folio(inode, folio));
+}
+
+static inline bool iomap_iof_is_block_uptodate(struct iomap_folio *iof,
+					       unsigned int block)
+{
+	return test_bit(block, iof->state);
+}
+
+static void iomap_iof_set_range_uptodate(struct folio *folio,
+			struct iomap_folio *iof, size_t off, size_t len)
+{
+	struct inode *inode = folio->mapping->host;
+	unsigned int first_blk = off >> inode->i_blkbits;
+	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iof->state_lock, flags);
+	bitmap_set(iof->state, first_blk, nr_blks);
+	if (iomap_iof_is_fully_uptodate(folio, iof))
+		folio_mark_uptodate(folio);
+	spin_unlock_irqrestore(&iof->state_lock, flags);
+}
+
+static void iomap_set_range_uptodate(struct folio *folio, size_t off,
+				     size_t len)
+{
+	struct iomap_folio *iof = iomap_get_iof(folio);
+
+	if (iof)
+		iomap_iof_set_range_uptodate(folio, iof, off, len);
+	else
+		folio_mark_uptodate(folio);
+}
+
 static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
 				struct folio *folio, unsigned int flags)
 {
@@ -58,12 +99,12 @@ static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iof = kzalloc(struct_size(iof, uptodate, BITS_TO_LONGS(nr_blocks)),
+	iof = kzalloc(struct_size(iof, state, BITS_TO_LONGS(nr_blocks)),
 		      gfp);
 	if (iof) {
-		spin_lock_init(&iof->uptodate_lock);
+		spin_lock_init(&iof->state_lock);
 		if (folio_test_uptodate(folio))
-			bitmap_fill(iof->uptodate, nr_blocks);
+			bitmap_fill(iof->state, nr_blocks);
 		folio_attach_private(folio, iof);
 	}
 	return iof;
@@ -72,14 +113,12 @@ static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
 static void iomap_iof_free(struct folio *folio)
 {
 	struct iomap_folio *iof = folio_detach_private(folio);
-	struct inode *inode = folio->mapping->host;
-	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
 	if (!iof)
 		return;
 	WARN_ON_ONCE(atomic_read(&iof->read_bytes_pending));
 	WARN_ON_ONCE(atomic_read(&iof->write_bytes_pending));
-	WARN_ON_ONCE(bitmap_full(iof->uptodate, nr_blocks) !=
+	WARN_ON_ONCE(iomap_iof_is_fully_uptodate(folio, iof) !=
 			folio_test_uptodate(folio));
 	kfree(iof);
 }
@@ -110,7 +149,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* move forward for each leading block marked uptodate */
 		for (i = first; i <= last; i++) {
-			if (!test_bit(i, iof->uptodate))
+			if (!iomap_iof_is_block_uptodate(iof, i))
 				break;
 			*pos += block_size;
 			poff += block_size;
@@ -120,7 +159,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* truncate len if we find any trailing uptodate block(s) */
 		for ( ; i <= last; i++) {
-			if (test_bit(i, iof->uptodate)) {
+			if (iomap_iof_is_block_uptodate(iof, i)) {
 				plen -= (last - i + 1) * block_size;
 				last = i - 1;
 				break;
@@ -144,30 +183,6 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	*lenp = plen;
 }
 
-static void iomap_iof_set_range_uptodate(struct folio *folio,
-		struct iomap_folio *iof, size_t off, size_t len)
-{
-	struct inode *inode = folio->mapping->host;
-	unsigned first = off >> inode->i_blkbits;
-	unsigned last = (off + len - 1) >> inode->i_blkbits;
-	unsigned long flags;
-
-	spin_lock_irqsave(&iof->uptodate_lock, flags);
-	bitmap_set(iof->uptodate, first, last - first + 1);
-	if (bitmap_full(iof->uptodate, i_blocks_per_folio(inode, folio)))
-		folio_mark_uptodate(folio);
-	spin_unlock_irqrestore(&iof->uptodate_lock, flags);
-}
-
-static void iomap_set_range_uptodate(struct folio *folio,
-		struct iomap_folio *iof, size_t off, size_t len)
-{
-	if (iof)
-		iomap_iof_set_range_uptodate(folio, iof, off, len);
-	else
-		folio_mark_uptodate(folio);
-}
-
 static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 		size_t len, int error)
 {
@@ -177,7 +192,7 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 		folio_clear_uptodate(folio);
 		folio_set_error(folio);
 	} else {
-		iomap_set_range_uptodate(folio, iof, offset, len);
+		iomap_set_range_uptodate(folio, offset, len);
 	}
 
 	if (!iof || atomic_sub_and_test(len, &iof->read_bytes_pending))
@@ -213,7 +228,6 @@ struct iomap_readpage_ctx {
 static int iomap_read_inline_data(const struct iomap_iter *iter,
 		struct folio *folio)
 {
-	struct iomap_folio *iof;
 	const struct iomap *iomap = iomap_iter_srcmap(iter);
 	size_t size = i_size_read(iter->inode) - iomap->offset;
 	size_t poff = offset_in_page(iomap->offset);
@@ -231,15 +245,13 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iof = iomap_iof_alloc(iter->inode, folio, iter->flags);
-	else
-		iof = iomap_get_iof(folio);
+		iomap_iof_alloc(iter->inode, folio, iter->flags);
 
 	addr = kmap_local_folio(folio, offset);
 	memcpy(addr, iomap->inline_data, size);
 	memset(addr + size, 0, PAGE_SIZE - poff - size);
 	kunmap_local(addr);
-	iomap_set_range_uptodate(folio, iof, offset, PAGE_SIZE - poff);
+	iomap_set_range_uptodate(folio, offset, PAGE_SIZE - poff);
 	return 0;
 }
 
@@ -276,7 +288,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 
 	if (iomap_block_needs_zeroing(iter, pos)) {
 		folio_zero_range(folio, poff, plen);
-		iomap_set_range_uptodate(folio, iof, poff, plen);
+		iomap_set_range_uptodate(folio, poff, plen);
 		goto done;
 	}
 
@@ -451,7 +463,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 	last = (from + count - 1) >> inode->i_blkbits;
 
 	for (i = first; i <= last; i++)
-		if (!test_bit(i, iof->uptodate))
+		if (!iomap_iof_is_block_uptodate(iof, i))
 			return false;
 	return true;
 }
@@ -589,7 +601,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 			if (status)
 				return status;
 		}
-		iomap_set_range_uptodate(folio, iof, poff, plen);
+		iomap_set_range_uptodate(folio, poff, plen);
 	} while ((block_start += plen) < block_end);
 
 	return 0;
@@ -696,7 +708,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
-	struct iomap_folio *iof = iomap_get_iof(folio);
 	flush_dcache_folio(folio);
 
 	/*
@@ -712,7 +723,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	 */
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
-	iomap_set_range_uptodate(folio, iof, offset_in_folio(folio, pos), len);
+	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -1628,7 +1639,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iof && !test_bit(i, iof->uptodate))
+		if (iof && !iomap_iof_is_block_uptodate(iof, i))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
-- 
2.40.1



* [PATCHv8 3/5] iomap: Refactor iomap_write_delalloc_punch() function out
  2023-06-06 11:43 [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
  2023-06-06 11:43 ` [PATCHv8 1/5] iomap: Rename iomap_page to iomap_folio and others Ritesh Harjani (IBM)
  2023-06-06 11:43 ` [PATCHv8 2/5] iomap: Renames and refactor iomap_folio state bitmap handling Ritesh Harjani (IBM)
@ 2023-06-06 11:43 ` Ritesh Harjani (IBM)
  2023-06-07  6:52   ` Christoph Hellwig
  2023-06-06 11:43 ` [PATCHv8 4/5] iomap: Allocate iof in ->write_begin() early Ritesh Harjani (IBM)
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-06 11:43 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Darrick J. Wong, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM)

This patch factors iomap_write_delalloc_punch() out of
iomap_write_delalloc_scan(). There is no functional change in this patch.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 54 ++++++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 20 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 08f2a1cf0a66..89489aed49c0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -888,6 +888,33 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
+static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
+		loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
+		int (*punch)(struct inode *inode, loff_t offset, loff_t length))
+{
+	int ret = 0;
+
+	if (!folio_test_dirty(folio))
+		return ret;
+
+	/* if dirty, punch up to offset */
+	if (start_byte > *punch_start_byte) {
+		ret = punch(inode, *punch_start_byte,
+				start_byte - *punch_start_byte);
+		if (ret)
+			goto out;
+	}
+	/*
+	 * Make sure the next punch start is correctly bound to
+	 * the end of this data range, not the end of the folio.
+	 */
+	*punch_start_byte = min_t(loff_t, end_byte,
+				  folio_next_index(folio) << PAGE_SHIFT);
+
+out:
+	return ret;
+}
+
 /*
  * Scan the data range passed to us for dirty page cache folios. If we find a
 * dirty folio, punch out the preceding range and update the offset from which
@@ -911,6 +938,7 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 {
 	while (start_byte < end_byte) {
 		struct folio	*folio;
+		int ret;
 
 		/* grab locked page */
 		folio = filemap_lock_folio(inode->i_mapping,
@@ -921,26 +949,12 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 			continue;
 		}
 
-		/* if dirty, punch up to offset */
-		if (folio_test_dirty(folio)) {
-			if (start_byte > *punch_start_byte) {
-				int	error;
-
-				error = punch(inode, *punch_start_byte,
-						start_byte - *punch_start_byte);
-				if (error) {
-					folio_unlock(folio);
-					folio_put(folio);
-					return error;
-				}
-			}
-
-			/*
-			 * Make sure the next punch start is correctly bound to
-			 * the end of this data range, not the end of the folio.
-			 */
-			*punch_start_byte = min_t(loff_t, end_byte,
-					folio_next_index(folio) << PAGE_SHIFT);
+		ret = iomap_write_delalloc_punch(inode, folio, punch_start_byte,
+						 start_byte, end_byte, punch);
+		if (ret) {
+			folio_unlock(folio);
+			folio_put(folio);
+			return ret;
 		}
 
 		/* move offset to start of next folio in range */
-- 
2.40.1



* [PATCHv8 4/5] iomap: Allocate iof in ->write_begin() early
  2023-06-06 11:43 [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
                   ` (2 preceding siblings ...)
  2023-06-06 11:43 ` [PATCHv8 3/5] iomap: Refactor iomap_write_delalloc_punch() function out Ritesh Harjani (IBM)
@ 2023-06-06 11:43 ` Ritesh Harjani (IBM)
  2023-06-07  6:53   ` Christoph Hellwig
  2023-06-06 11:43 ` [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-06 11:43 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Darrick J. Wong, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM)

We don't need to allocate an iof in ->write_begin() for writes where the
position and length completely overlap the given folio. Therefore, such
cases are skipped.

Currently, when the folio is uptodate, we only allocate the iof at
writeback time (in iomap_writepage_map()). This is fine until now, but
when we add support for a per-block dirty state bitmap in the iof, it
could cause some performance degradation. The reason is that if we don't
allocate the iof during ->write_begin(), we will never mark the necessary
dirty bits in the ->write_end() call, and we would then have to mark all
the bits as dirty at writeback time, which would cause the same write
amplification and performance problems as we have now.
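
A simplified sketch of the ordering this patch prepares for (note that
iomap_set_range_dirty() is only introduced in the last patch of this
series, and it is a no-op when no iof is attached to the folio):

	/* ->write_begin(): attach an iof unless the write spans the whole folio */
	iof = iomap_iof_alloc(iter->inode, folio, iter->flags);

	/* ->write_end(): record per-block dirty bits, but only if an iof exists */
	iomap_set_range_dirty(inode, folio, offset_in_folio(folio, pos), copied);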

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 89489aed49c0..2b72ca3ba37a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -567,14 +567,23 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	size_t from = offset_in_folio(folio, pos), to = from + len;
 	size_t poff, plen;
 
-	if (folio_test_uptodate(folio))
+	/*
+	 * If the write completely overlaps the current folio, then
+	 * entire folio will be dirtied so there is no need for
+	 * per-block state tracking structures to be attached to this folio.
+	 */
+	if (pos <= folio_pos(folio) &&
+	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;
-	folio_clear_error(folio);
 
 	iof = iomap_iof_alloc(iter->inode, folio, iter->flags);
 	if ((iter->flags & IOMAP_NOWAIT) && !iof && nr_blocks > 1)
 		return -EAGAIN;
 
+	if (folio_test_uptodate(folio))
+		return 0;
+	folio_clear_error(folio);
+
 	do {
 		iomap_adjust_read_range(iter->inode, folio, &block_start,
 				block_end - block_start, &poff, &plen);
-- 
2.40.1



* [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-06-06 11:43 [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
                   ` (3 preceding siblings ...)
  2023-06-06 11:43 ` [PATCHv8 4/5] iomap: Allocate iof in ->write_begin() early Ritesh Harjani (IBM)
@ 2023-06-06 11:43 ` Ritesh Harjani (IBM)
  2023-06-06 15:06   ` Darrick J. Wong
  2023-06-07  7:04   ` Christoph Hellwig
  2023-06-06 12:37 ` [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Matthew Wilcox
  2023-06-07  6:48 ` Christoph Hellwig
  6 siblings, 2 replies; 27+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-06 11:43 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Darrick J. Wong, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM), Aravinda Herle

When filesystem blocksize is less than folio size (either with
mapping_large_folio_support() or with blocksize < pagesize) and when the
folio is uptodate in pagecache, then even a byte write can cause
an entire folio to be written to disk during writeback. This happens
because we currently don't have a mechanism to track per-block dirty
state within struct iomap_folio. We currently only track uptodate state.

This patch implements support for tracking per-block dirty state in
iomap_folio->state bitmap. This should help improve the filesystem write
performance and help reduce write amplification.
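
To make the layout concrete, here is a minimal sketch of the bit-indexing
scheme used by the helpers in this patch (nr_blocks here is illustrative,
the real code derives it via i_blocks_per_folio(); dirty_bit() is a
hypothetical helper, the patch open-codes this as block + blks_per_folio):

	/*
	 * iof->state packs two per-block bitmaps back to back:
	 *   bits [0, nr_blocks)             track per-block uptodate state
	 *   bits [nr_blocks, 2 * nr_blocks) track per-block dirty state
	 */
	static inline unsigned int dirty_bit(unsigned int block,
					     unsigned int nr_blocks)
	{
		/* dirty bits live in the second half of the bitmap */
		return block + nr_blocks;
	}

For example, a 64k folio with 4k blocks has nr_blocks = 16, so block 3's
uptodate state is bit 3 and its dirty state is bit 19.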

Performance testing of the below fio workload reveals a ~16x performance
improvement using nvme with XFS (4k blocksize) on Power (64K pagesize).
FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.

1. <test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]

2. Also our internal performance team reported that this patch improves
   their database workload performance by around ~83% (with XFS on Power)

Reported-by: Aravinda Herle <araherle@in.ibm.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 126 +++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/file.c       |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 126 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index a5f4be6b9213..75efec3c3b71 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -746,7 +746,7 @@ static const struct address_space_operations gfs2_aops = {
 	.writepages = gfs2_writepages,
 	.read_folio = gfs2_read_folio,
 	.readahead = gfs2_readahead,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.bmap = gfs2_bmap,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2b72ca3ba37a..b99d611f1cd5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -84,6 +84,70 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
 		folio_mark_uptodate(folio);
 }
 
+static inline bool iomap_iof_is_block_dirty(struct folio *folio,
+			struct iomap_folio *iof, int block)
+{
+	struct inode *inode = folio->mapping->host;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+
+	return test_bit(block + blks_per_folio, iof->state);
+}
+
+static void iomap_iof_clear_range_dirty(struct folio *folio,
+			struct iomap_folio *iof, size_t off, size_t len)
+{
+	struct inode *inode = folio->mapping->host;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = off >> inode->i_blkbits;
+	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iof->state_lock, flags);
+	bitmap_clear(iof->state, first_blk + blks_per_folio, nr_blks);
+	spin_unlock_irqrestore(&iof->state_lock, flags);
+}
+
+static void iomap_clear_range_dirty(struct folio *folio, size_t off, size_t len)
+{
+	struct iomap_folio *iof = iomap_get_iof(folio);
+
+	if (!iof)
+		return;
+	iomap_iof_clear_range_dirty(folio, iof, off, len);
+}
+
+static void iomap_iof_set_range_dirty(struct inode *inode, struct folio *folio,
+			struct iomap_folio *iof, size_t off, size_t len)
+{
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = off >> inode->i_blkbits;
+	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iof->state_lock, flags);
+	bitmap_set(iof->state, first_blk + blks_per_folio, nr_blks);
+	spin_unlock_irqrestore(&iof->state_lock, flags);
+}
+
+/*
+ * The reason we pass inode explicitly here is to avoid any race with truncate
+ * when iomap_set_range_dirty() gets called from iomap_dirty_folio().
+ * Check filemap_dirty_folio() & __folio_mark_dirty() for more details.
+ * Hence we explicitly pass mapping->host to iomap_set_range_dirty() from
+ * iomap_dirty_folio().
+ */
+static void iomap_set_range_dirty(struct inode *inode, struct folio *folio,
+				  size_t off, size_t len)
+{
+	struct iomap_folio *iof = iomap_get_iof(folio);
+
+	if (!iof)
+		return;
+	iomap_iof_set_range_dirty(inode, folio, iof, off, len);
+}
+
 static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
 				struct folio *folio, unsigned int flags)
 {
@@ -99,12 +163,20 @@ static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iof = kzalloc(struct_size(iof, state, BITS_TO_LONGS(nr_blocks)),
+	/*
+	 * iof->state tracks two sets of state flags when the
+	 * filesystem block size is smaller than the folio size.
+	 * The first state tracks per-block uptodate and the
+	 * second tracks per-block dirty state.
+	 */
+	iof = kzalloc(struct_size(iof, state, BITS_TO_LONGS(2 * nr_blocks)),
 		      gfp);
 	if (iof) {
 		spin_lock_init(&iof->state_lock);
 		if (folio_test_uptodate(folio))
-			bitmap_fill(iof->state, nr_blocks);
+			bitmap_set(iof->state, 0, nr_blocks);
+		if (folio_test_dirty(folio))
+			bitmap_set(iof->state, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iof);
 	}
 	return iof;
@@ -529,6 +601,17 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
 
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
+{
+	struct inode *inode = mapping->host;
+	size_t len = folio_size(folio);
+
+	iomap_iof_alloc(inode, folio, 0);
+	iomap_set_range_dirty(inode, folio, 0, len);
+	return filemap_dirty_folio(mapping, folio);
+}
+EXPORT_SYMBOL_GPL(iomap_dirty_folio);
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -733,6 +816,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
 	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
+	iomap_set_range_dirty(inode, folio, offset_in_folio(folio, pos),
+			      copied);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -902,6 +987,10 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
 		int (*punch)(struct inode *inode, loff_t offset, loff_t length))
 {
 	int ret = 0;
+	struct iomap_folio *iof;
+	unsigned int first_blk, last_blk, i;
+	loff_t last_byte;
+	u8 blkbits = inode->i_blkbits;
 
 	if (!folio_test_dirty(folio))
 		return ret;
@@ -913,6 +1002,29 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
 		if (ret)
 			goto out;
 	}
+	/*
+	 * When we have per-block dirty tracking, there can be
+	 * blocks within a folio which are marked uptodate
+	 * but not dirty. In that case it is necessary to punch
+	 * out such blocks to avoid leaking any delalloc blocks.
+	 */
+	iof = iomap_get_iof(folio);
+	if (!iof)
+		goto skip_iof_punch;
+
+	last_byte = min_t(loff_t, end_byte - 1,
+		(folio_next_index(folio) << PAGE_SHIFT) - 1);
+	first_blk = offset_in_folio(folio, start_byte) >> blkbits;
+	last_blk = offset_in_folio(folio, last_byte) >> blkbits;
+	for (i = first_blk; i <= last_blk; i++) {
+		if (!iomap_iof_is_block_dirty(folio, iof, i)) {
+			ret = punch(inode, i << blkbits, 1 << blkbits);
+			if (ret)
+				goto out;
+		}
+	}
+
+skip_iof_punch:
 	/*
 	 * Make sure the next punch start is correctly bound to
 	 * the end of this data range, not the end of the folio.
@@ -1646,7 +1758,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_folio *iof = iomap_iof_alloc(inode, folio, 0);
+	struct iomap_folio *iof = iomap_get_iof(folio);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1654,6 +1766,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	int error = 0, count = 0, i;
 	LIST_HEAD(submit_list);
 
+	if (!iof && nblocks > 1) {
+		iof = iomap_iof_alloc(inode, folio, 0);
+		iomap_set_range_dirty(inode, folio, 0, folio_size(folio));
+	}
+
 	WARN_ON_ONCE(iof && atomic_read(&iof->write_bytes_pending) != 0);
 
 	/*
@@ -1662,7 +1779,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iof && !iomap_iof_is_block_uptodate(iof, i))
+		if (iof && !iomap_iof_is_block_dirty(folio, iof, i))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
@@ -1706,6 +1823,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		}
 	}
 
+	iomap_clear_range_dirty(folio, 0, end_pos - folio_pos(folio));
 	folio_start_writeback(folio);
 	folio_unlock(folio);
 
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2ef78aa1d3f6..77c7332ae197 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.read_folio		= xfs_vm_read_folio,
 	.readahead		= xfs_vm_readahead,
 	.writepages		= xfs_vm_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.bmap			= xfs_vm_bmap,
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 132f01d3461f..e508c8e97372 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
 	.read_folio		= zonefs_read_folio,
 	.readahead		= zonefs_readahead,
 	.writepages		= zonefs_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.migrate_folio		= filemap_migrate_folio,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e2b836c2e119..eb9335c46bf3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
-- 
2.40.1



* Re: [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance
  2023-06-06 11:43 [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
                   ` (4 preceding siblings ...)
  2023-06-06 11:43 ` [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
@ 2023-06-06 12:37 ` Matthew Wilcox
  2023-06-06 13:00   ` Ritesh Harjani
  2023-06-07  6:54   ` Christoph Hellwig
  2023-06-07  6:48 ` Christoph Hellwig
  6 siblings, 2 replies; 27+ messages in thread
From: Matthew Wilcox @ 2023-06-06 12:37 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel

On Tue, Jun 06, 2023 at 05:13:47PM +0530, Ritesh Harjani (IBM) wrote:
> Hello All,
> 
> Please find PATCHv8, which adds per-block dirty tracking to iomap.
> As discussed earlier, this is required to improve write performance and reduce
> write amplification in cases where either the blocksize is less than the
> pagesize (such as on the Power platform with its 64k pagesize) or when we have
> a large folio (such as on xfs, which already supports large folios).

You're moving too fast.  Please, allow at least a few hours between
getting review comments and sending a new version.

> v7 -> v8
> ==========
> 1. Renamed iomap_page -> iomap_folio & iop -> iof in Patch-1 itself.

I don't think iomap_folio is the right name.  Indeed, I did not believe
that iomap_page was the right name.  As I said on #xfs recently ...

<willy> i'm still not crazy about iomap_page as the name of that
   data structure.  and calling the variable 'iop' just seems doomed
   to be trouble.  how do people feel about either iomap_block_state or
   folio_block_state ... or even just calling it block_state since it's
   local to iomap/buffered-io.c
<willy> we'd then call the variable either ibs or fbs, both of which
   have some collisions in the kernel, but none in filesystems
<dchinner> willy - sounds reasonable



* Re: [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance
  2023-06-06 12:37 ` [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Matthew Wilcox
@ 2023-06-06 13:00   ` Ritesh Harjani
  2023-06-06 15:16     ` Darrick J. Wong
  2023-06-07  6:54   ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Ritesh Harjani @ 2023-06-06 13:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel

Matthew Wilcox <willy@infradead.org> writes:

> On Tue, Jun 06, 2023 at 05:13:47PM +0530, Ritesh Harjani (IBM) wrote:
>> Hello All,
>> 
>> Please find PATCHv8, which adds per-block dirty tracking to iomap.
>> As discussed earlier, this is required to improve write performance and reduce
>> write amplification in cases where either the blocksize is less than the
>> pagesize (such as on the Power platform with its 64k pagesize) or when we have
>> a large folio (such as on xfs, which already supports large folios).
>
> You're moving too fast.  Please, allow at least a few hours between
> getting review comments and sending a new version.
>

Sorry about that. I felt those were mostly mechanical conversion changes.
I will keep that in mind.

>> v7 -> v8
>> ==========
>> 1. Renamed iomap_page -> iomap_folio & iop -> iof in Patch-1 itself.
>
> I don't think iomap_folio is the right name.  Indeed, I did not believe
> that iomap_page was the right name.  As I said on #xfs recently ...
>
> <willy> i'm still not crazy about iomap_page as the name of that
>    data structure.  and calling the variable 'iop' just seems doomed
>    to be trouble.  how do people feel about either iomap_block_state or
>    folio_block_state ... or even just calling it block_state since it's
>    local to iomap/buffered-io.c
> <willy> we'd then call the variable either ibs or fbs, both of which
>    have some collisions in the kernel, but none in filesystems
> <dchinner> willy - sounds reasonable

Both seem equally reasonable to me. If others are equally ok with both,
then shall we go with iomap_block_state and ibs?

I see that as an "iomap_block_state" which is local to the iomap buffered-io
layer, tracks per-block state within a folio, and gets attached to
folio->private.

-ritesh


* Re: [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-06-06 11:43 ` [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
@ 2023-06-06 15:06   ` Darrick J. Wong
  2023-06-07 10:27     ` Ritesh Harjani
  2023-06-07  7:04   ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2023-06-06 15:06 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Tue, Jun 06, 2023 at 05:13:52PM +0530, Ritesh Harjani (IBM) wrote:
> When filesystem blocksize is less than folio size (either with
> mapping_large_folio_support() or with blocksize < pagesize) and when the
> folio is uptodate in pagecache, then even a byte write can cause
> an entire folio to be written to disk during writeback. This happens
> because we currently don't have a mechanism to track per-block dirty
> state within struct iomap_folio. We currently only track uptodate state.
> 
> This patch implements support for tracking per-block dirty state in
> iomap_folio->state bitmap. This should help improve the filesystem write
> performance and help reduce write amplification.
> 
> Performance testing of the below fio workload reveals a ~16x performance
> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize).
> FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.
> 
> 1. <test_randwrite.fio>
> [global]
> 	ioengine=psync
> 	rw=randwrite
> 	overwrite=1
> 	pre_read=1
> 	direct=0
> 	bs=4k
> 	size=1G
> 	dir=./
> 	numjobs=8
> 	fdatasync=1
> 	runtime=60
> 	iodepth=64
> 	group_reporting=1
> 
> [fio-run]
> 
> 2. Also our internal performance team reported that this patch improves
>    their database workload performance by around ~83% (with XFS on Power)
> 
> Reported-by: Aravinda Herle <araherle@in.ibm.com>
> Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/gfs2/aops.c         |   2 +-
>  fs/iomap/buffered-io.c | 126 +++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_aops.c      |   2 +-
>  fs/zonefs/file.c       |   2 +-
>  include/linux/iomap.h  |   1 +
>  5 files changed, 126 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index a5f4be6b9213..75efec3c3b71 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -746,7 +746,7 @@ static const struct address_space_operations gfs2_aops = {
>  	.writepages = gfs2_writepages,
>  	.read_folio = gfs2_read_folio,
>  	.readahead = gfs2_readahead,
> -	.dirty_folio = filemap_dirty_folio,
> +	.dirty_folio = iomap_dirty_folio,
>  	.release_folio = iomap_release_folio,
>  	.invalidate_folio = iomap_invalidate_folio,
>  	.bmap = gfs2_bmap,
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 2b72ca3ba37a..b99d611f1cd5 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -84,6 +84,70 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
>  		folio_mark_uptodate(folio);
>  }
>  
> +static inline bool iomap_iof_is_block_dirty(struct folio *folio,
> +			struct iomap_folio *iof, int block)
> +{
> +	struct inode *inode = folio->mapping->host;
> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> +
> +	return test_bit(block + blks_per_folio, iof->state);
> +}
> +
> +static void iomap_iof_clear_range_dirty(struct folio *folio,
> +			struct iomap_folio *iof, size_t off, size_t len)
> +{
> +	struct inode *inode = folio->mapping->host;
> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> +	unsigned int first_blk = off >> inode->i_blkbits;
> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> +	unsigned int nr_blks = last_blk - first_blk + 1;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&iof->state_lock, flags);
> +	bitmap_clear(iof->state, first_blk + blks_per_folio, nr_blks);
> +	spin_unlock_irqrestore(&iof->state_lock, flags);
> +}
> +
> +static void iomap_clear_range_dirty(struct folio *folio, size_t off, size_t len)
> +{
> +	struct iomap_folio *iof = iomap_get_iof(folio);
> +
> +	if (!iof)
> +		return;
> +	iomap_iof_clear_range_dirty(folio, iof, off, len);
> +}
> +
> +static void iomap_iof_set_range_dirty(struct inode *inode, struct folio *folio,
> +			struct iomap_folio *iof, size_t off, size_t len)
> +{
> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> +	unsigned int first_blk = off >> inode->i_blkbits;
> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> +	unsigned int nr_blks = last_blk - first_blk + 1;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&iof->state_lock, flags);
> +	bitmap_set(iof->state, first_blk + blks_per_folio, nr_blks);
> +	spin_unlock_irqrestore(&iof->state_lock, flags);
> +}
> +
> +/*
> + * The reason we pass inode explicitly here is to avoid any race with truncate
> + * when iomap_set_range_dirty() gets called from iomap_dirty_folio().
> + * Check filemap_dirty_folio() & __folio_mark_dirty() for more details.
> + * Hence we explicitly pass mapping->host to iomap_set_range_dirty() from
> + * iomap_dirty_folio().
> + */
> +static void iomap_set_range_dirty(struct inode *inode, struct folio *folio,
> +				  size_t off, size_t len)
> +{
> +	struct iomap_folio *iof = iomap_get_iof(folio);
> +
> +	if (!iof)
> +		return;
> +	iomap_iof_set_range_dirty(inode, folio, iof, off, len);
> +}
> +
>  static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
>  				struct folio *folio, unsigned int flags)
>  {
> @@ -99,12 +163,20 @@ static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>  
> -	iof = kzalloc(struct_size(iof, state, BITS_TO_LONGS(nr_blocks)),
> +	/*
> +	 * iof->state tracks two sets of state flags when the
> +	 * filesystem block size is smaller than the folio size.
> +	 * The first state tracks per-block uptodate and the
> +	 * second tracks per-block dirty state.
> +	 */
> +	iof = kzalloc(struct_size(iof, state, BITS_TO_LONGS(2 * nr_blocks)),
>  		      gfp);
>  	if (iof) {
>  		spin_lock_init(&iof->state_lock);
>  		if (folio_test_uptodate(folio))
> -			bitmap_fill(iof->state, nr_blocks);
> +			bitmap_set(iof->state, 0, nr_blocks);
> +		if (folio_test_dirty(folio))
> +			bitmap_set(iof->state, nr_blocks, nr_blocks);
>  		folio_attach_private(folio, iof);
>  	}
>  	return iof;
> @@ -529,6 +601,17 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
>  }
>  EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
>  
> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> +{
> +	struct inode *inode = mapping->host;
> +	size_t len = folio_size(folio);
> +
> +	iomap_iof_alloc(inode, folio, 0);
> +	iomap_set_range_dirty(inode, folio, 0, len);
> +	return filemap_dirty_folio(mapping, folio);
> +}
> +EXPORT_SYMBOL_GPL(iomap_dirty_folio);
> +
>  static void
>  iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  {
> @@ -733,6 +816,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  	if (unlikely(copied < len && !folio_test_uptodate(folio)))
>  		return 0;
>  	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
> +	iomap_set_range_dirty(inode, folio, offset_in_folio(folio, pos),
> +			      copied);
>  	filemap_dirty_folio(inode->i_mapping, folio);
>  	return copied;
>  }
> @@ -902,6 +987,10 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
>  		int (*punch)(struct inode *inode, loff_t offset, loff_t length))
>  {
>  	int ret = 0;
> +	struct iomap_folio *iof;
> +	unsigned int first_blk, last_blk, i;
> +	loff_t last_byte;
> +	u8 blkbits = inode->i_blkbits;
>  
>  	if (!folio_test_dirty(folio))
>  		return ret;
> @@ -913,6 +1002,29 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
>  		if (ret)
>  			goto out;
>  	}
> +	/*
> +	 * When we have per-block dirty tracking, there can be
> +	 * blocks within a folio which are marked uptodate
> +	 * but not dirty. In that case it is necessary to punch
> +	 * out such blocks to avoid leaking any delalloc blocks.
> +	 */
> +	iof = iomap_get_iof(folio);
> +	if (!iof)
> +		goto skip_iof_punch;
> +
> +	last_byte = min_t(loff_t, end_byte - 1,
> +		(folio_next_index(folio) << PAGE_SHIFT) - 1);
> +	first_blk = offset_in_folio(folio, start_byte) >> blkbits;
> +	last_blk = offset_in_folio(folio, last_byte) >> blkbits;
> +	for (i = first_blk; i <= last_blk; i++) {
> +		if (!iomap_iof_is_block_dirty(folio, iof, i)) {
> +			ret = punch(inode, i << blkbits, 1 << blkbits);

Isn't the second argument to punch() supposed to be the offset within
the file, not the offset within the folio?

I /almost/ wonder if this chunk should be its own static inline
iomap_iof_delalloc_punch function, but ... eh.  My enthusiasm for
slicing and dicing is weak today.
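
(Purely to sketch that thought: a hypothetical iomap_iof_delalloc_punch(),
not code from the series, and with the punch offset taken relative to the
file via folio_pos(), per the question above:)

static int iomap_iof_delalloc_punch(struct inode *inode, struct folio *folio,
		struct iomap_folio *iof, loff_t start_byte, loff_t end_byte,
		int (*punch)(struct inode *inode, loff_t offset, loff_t length))
{
	u8 blkbits = inode->i_blkbits;
	loff_t last_byte = min_t(loff_t, end_byte - 1,
			folio_pos(folio) + folio_size(folio) - 1);
	unsigned int first_blk = offset_in_folio(folio, start_byte) >> blkbits;
	unsigned int last_blk = offset_in_folio(folio, last_byte) >> blkbits;
	unsigned int i;
	int ret;

	/* punch out uptodate-but-clean blocks so no delalloc extents leak */
	for (i = first_blk; i <= last_blk; i++) {
		if (iomap_iof_is_block_dirty(folio, iof, i))
			continue;
		ret = punch(inode, folio_pos(folio) + (i << blkbits),
			    1 << blkbits);
		if (ret)
			return ret;
	}
	return 0;
}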

--D

> +			if (ret)
> +				goto out;
> +		}
> +	}
> +
> +skip_iof_punch:
>  	/*
>  	 * Make sure the next punch start is correctly bound to
>  	 * the end of this data range, not the end of the folio.
> @@ -1646,7 +1758,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_folio *iof = iomap_iof_alloc(inode, folio, 0);
> +	struct iomap_folio *iof = iomap_get_iof(folio);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1654,6 +1766,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	int error = 0, count = 0, i;
>  	LIST_HEAD(submit_list);
>  
> +	if (!iof && nblocks > 1) {
> +		iof = iomap_iof_alloc(inode, folio, 0);
> +		iomap_set_range_dirty(inode, folio, 0, folio_size(folio));
> +	}
> +
>  	WARN_ON_ONCE(iof && atomic_read(&iof->write_bytes_pending) != 0);
>  
>  	/*
> @@ -1662,7 +1779,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iof && !iomap_iof_is_block_uptodate(iof, i))
> +		if (iof && !iomap_iof_is_block_dirty(folio, iof, i))
>  			continue;
>  
>  		error = wpc->ops->map_blocks(wpc, inode, pos);
> @@ -1706,6 +1823,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		}
>  	}
>  
> +	iomap_clear_range_dirty(folio, 0, end_pos - folio_pos(folio));
>  	folio_start_writeback(folio);
>  	folio_unlock(folio);
>  
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 2ef78aa1d3f6..77c7332ae197 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
>  	.read_folio		= xfs_vm_read_folio,
>  	.readahead		= xfs_vm_readahead,
>  	.writepages		= xfs_vm_writepages,
> -	.dirty_folio		= filemap_dirty_folio,
> +	.dirty_folio		= iomap_dirty_folio,
>  	.release_folio		= iomap_release_folio,
>  	.invalidate_folio	= iomap_invalidate_folio,
>  	.bmap			= xfs_vm_bmap,
> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
> index 132f01d3461f..e508c8e97372 100644
> --- a/fs/zonefs/file.c
> +++ b/fs/zonefs/file.c
> @@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
>  	.read_folio		= zonefs_read_folio,
>  	.readahead		= zonefs_readahead,
>  	.writepages		= zonefs_writepages,
> -	.dirty_folio		= filemap_dirty_folio,
> +	.dirty_folio		= iomap_dirty_folio,
>  	.release_folio		= iomap_release_folio,
>  	.invalidate_folio	= iomap_invalidate_folio,
>  	.migrate_folio		= filemap_migrate_folio,
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index e2b836c2e119..eb9335c46bf3 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
>  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops);
>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 2/5] iomap: Renames and refactor iomap_folio state bitmap handling
  2023-06-06 11:43 ` [PATCHv8 2/5] iomap: Renames and refactor iomap_folio state bitmap handling Ritesh Harjani (IBM)
@ 2023-06-06 15:13   ` Darrick J. Wong
  2023-06-07  6:50   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2023-06-06 15:13 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel

On Tue, Jun 06, 2023 at 05:13:49PM +0530, Ritesh Harjani (IBM) wrote:
> This patch renames iomap_folio's uptodate bitmap to state bitmap.
> Also refactors and adds iof->state handling functions for uptodate
> state.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

Looks good to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/iomap/buffered-io.c | 107 +++++++++++++++++++++++------------------
>  1 file changed, 59 insertions(+), 48 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 741baa10c517..08f2a1cf0a66 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -24,14 +24,14 @@
>  #define IOEND_BATCH_SIZE	4096
>  
>  /*
> - * Structure allocated for each folio when block size < folio size
> - * to track sub-folio uptodate status and I/O completions.
> + * Structure allocated for each folio to track per-block uptodate state
> + * and I/O completions.
>   */
>  struct iomap_folio {
>  	atomic_t		read_bytes_pending;
>  	atomic_t		write_bytes_pending;
> -	spinlock_t		uptodate_lock;
> -	unsigned long		uptodate[];
> +	spinlock_t		state_lock;
> +	unsigned long		state[];
>  };
>  
>  static inline struct iomap_folio *iomap_get_iof(struct folio *folio)
> @@ -43,6 +43,47 @@ static inline struct iomap_folio *iomap_get_iof(struct folio *folio)
>  
>  static struct bio_set iomap_ioend_bioset;
>  
> +static inline bool iomap_iof_is_fully_uptodate(struct folio *folio,
> +					       struct iomap_folio *iof)
> +{
> +	struct inode *inode = folio->mapping->host;
> +
> +	return bitmap_full(iof->state, i_blocks_per_folio(inode, folio));
> +}
> +
> +static inline bool iomap_iof_is_block_uptodate(struct iomap_folio *iof,
> +					       unsigned int block)
> +{
> +	return test_bit(block, iof->state);
> +}
> +
> +static void iomap_iof_set_range_uptodate(struct folio *folio,
> +			struct iomap_folio *iof, size_t off, size_t len)
> +{
> +	struct inode *inode = folio->mapping->host;
> +	unsigned int first_blk = off >> inode->i_blkbits;
> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> +	unsigned int nr_blks = last_blk - first_blk + 1;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&iof->state_lock, flags);
> +	bitmap_set(iof->state, first_blk, nr_blks);
> +	if (iomap_iof_is_fully_uptodate(folio, iof))
> +		folio_mark_uptodate(folio);
> +	spin_unlock_irqrestore(&iof->state_lock, flags);
> +}
> +
> +static void iomap_set_range_uptodate(struct folio *folio, size_t off,
> +				     size_t len)
> +{
> +	struct iomap_folio *iof = iomap_get_iof(folio);
> +
> +	if (iof)
> +		iomap_iof_set_range_uptodate(folio, iof, off, len);
> +	else
> +		folio_mark_uptodate(folio);
> +}
> +
>  static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
>  				struct folio *folio, unsigned int flags)
>  {
> @@ -58,12 +99,12 @@ static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>  
> -	iof = kzalloc(struct_size(iof, uptodate, BITS_TO_LONGS(nr_blocks)),
> +	iof = kzalloc(struct_size(iof, state, BITS_TO_LONGS(nr_blocks)),
>  		      gfp);
>  	if (iof) {
> -		spin_lock_init(&iof->uptodate_lock);
> +		spin_lock_init(&iof->state_lock);
>  		if (folio_test_uptodate(folio))
> -			bitmap_fill(iof->uptodate, nr_blocks);
> +			bitmap_fill(iof->state, nr_blocks);
>  		folio_attach_private(folio, iof);
>  	}
>  	return iof;
> @@ -72,14 +113,12 @@ static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
>  static void iomap_iof_free(struct folio *folio)
>  {
>  	struct iomap_folio *iof = folio_detach_private(folio);
> -	struct inode *inode = folio->mapping->host;
> -	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>  
>  	if (!iof)
>  		return;
>  	WARN_ON_ONCE(atomic_read(&iof->read_bytes_pending));
>  	WARN_ON_ONCE(atomic_read(&iof->write_bytes_pending));
> -	WARN_ON_ONCE(bitmap_full(iof->uptodate, nr_blocks) !=
> +	WARN_ON_ONCE(iomap_iof_is_fully_uptodate(folio, iof) !=
>  			folio_test_uptodate(folio));
>  	kfree(iof);
>  }
> @@ -110,7 +149,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>  
>  		/* move forward for each leading block marked uptodate */
>  		for (i = first; i <= last; i++) {
> -			if (!test_bit(i, iof->uptodate))
> +			if (!iomap_iof_is_block_uptodate(iof, i))
>  				break;
>  			*pos += block_size;
>  			poff += block_size;
> @@ -120,7 +159,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>  
>  		/* truncate len if we find any trailing uptodate block(s) */
>  		for ( ; i <= last; i++) {
> -			if (test_bit(i, iof->uptodate)) {
> +			if (iomap_iof_is_block_uptodate(iof, i)) {
>  				plen -= (last - i + 1) * block_size;
>  				last = i - 1;
>  				break;
> @@ -144,30 +183,6 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>  	*lenp = plen;
>  }
>  
> -static void iomap_iof_set_range_uptodate(struct folio *folio,
> -		struct iomap_folio *iof, size_t off, size_t len)
> -{
> -	struct inode *inode = folio->mapping->host;
> -	unsigned first = off >> inode->i_blkbits;
> -	unsigned last = (off + len - 1) >> inode->i_blkbits;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&iof->uptodate_lock, flags);
> -	bitmap_set(iof->uptodate, first, last - first + 1);
> -	if (bitmap_full(iof->uptodate, i_blocks_per_folio(inode, folio)))
> -		folio_mark_uptodate(folio);
> -	spin_unlock_irqrestore(&iof->uptodate_lock, flags);
> -}
> -
> -static void iomap_set_range_uptodate(struct folio *folio,
> -		struct iomap_folio *iof, size_t off, size_t len)
> -{
> -	if (iof)
> -		iomap_iof_set_range_uptodate(folio, iof, off, len);
> -	else
> -		folio_mark_uptodate(folio);
> -}
> -
>  static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>  		size_t len, int error)
>  {
> @@ -177,7 +192,7 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>  		folio_clear_uptodate(folio);
>  		folio_set_error(folio);
>  	} else {
> -		iomap_set_range_uptodate(folio, iof, offset, len);
> +		iomap_set_range_uptodate(folio, offset, len);
>  	}
>  
>  	if (!iof || atomic_sub_and_test(len, &iof->read_bytes_pending))
> @@ -213,7 +228,6 @@ struct iomap_readpage_ctx {
>  static int iomap_read_inline_data(const struct iomap_iter *iter,
>  		struct folio *folio)
>  {
> -	struct iomap_folio *iof;
>  	const struct iomap *iomap = iomap_iter_srcmap(iter);
>  	size_t size = i_size_read(iter->inode) - iomap->offset;
>  	size_t poff = offset_in_page(iomap->offset);
> @@ -231,15 +245,13 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
>  	if (WARN_ON_ONCE(size > iomap->length))
>  		return -EIO;
>  	if (offset > 0)
> -		iof = iomap_iof_alloc(iter->inode, folio, iter->flags);
> -	else
> -		iof = iomap_get_iof(folio);
> +		iomap_iof_alloc(iter->inode, folio, iter->flags);
>  
>  	addr = kmap_local_folio(folio, offset);
>  	memcpy(addr, iomap->inline_data, size);
>  	memset(addr + size, 0, PAGE_SIZE - poff - size);
>  	kunmap_local(addr);
> -	iomap_set_range_uptodate(folio, iof, offset, PAGE_SIZE - poff);
> +	iomap_set_range_uptodate(folio, offset, PAGE_SIZE - poff);
>  	return 0;
>  }
>  
> @@ -276,7 +288,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  
>  	if (iomap_block_needs_zeroing(iter, pos)) {
>  		folio_zero_range(folio, poff, plen);
> -		iomap_set_range_uptodate(folio, iof, poff, plen);
> +		iomap_set_range_uptodate(folio, poff, plen);
>  		goto done;
>  	}
>  
> @@ -451,7 +463,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
>  	last = (from + count - 1) >> inode->i_blkbits;
>  
>  	for (i = first; i <= last; i++)
> -		if (!test_bit(i, iof->uptodate))
> +		if (!iomap_iof_is_block_uptodate(iof, i))
>  			return false;
>  	return true;
>  }
> @@ -589,7 +601,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  			if (status)
>  				return status;
>  		}
> -		iomap_set_range_uptodate(folio, iof, poff, plen);
> +		iomap_set_range_uptodate(folio, poff, plen);
>  	} while ((block_start += plen) < block_end);
>  
>  	return 0;
> @@ -696,7 +708,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
>  static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  		size_t copied, struct folio *folio)
>  {
> -	struct iomap_folio *iof = iomap_get_iof(folio);
>  	flush_dcache_folio(folio);
>  
>  	/*
> @@ -712,7 +723,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  	 */
>  	if (unlikely(copied < len && !folio_test_uptodate(folio)))
>  		return 0;
> -	iomap_set_range_uptodate(folio, iof, offset_in_folio(folio, pos), len);
> +	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
>  	filemap_dirty_folio(inode->i_mapping, folio);
>  	return copied;
>  }
> @@ -1628,7 +1639,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iof && !test_bit(i, iof->uptodate))
> +		if (iof && !iomap_iof_is_block_uptodate(iof, i))
>  			continue;
>  
>  		error = wpc->ops->map_blocks(wpc, inode, pos);
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance
  2023-06-06 13:00   ` Ritesh Harjani
@ 2023-06-06 15:16     ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2023-06-06 15:16 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Matthew Wilcox, linux-xfs, linux-fsdevel, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel

On Tue, Jun 06, 2023 at 06:30:35PM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Tue, Jun 06, 2023 at 05:13:47PM +0530, Ritesh Harjani (IBM) wrote:
> >> Hello All,
> >> 
> >> Please find PATCHv8 which adds per-block dirty tracking to iomap.
> >> As discussed earlier this is required to improve write performance and reduce
> >> write amplification for cases where either blocksize is less than pagesize (such
> >> as Power platform with 64k pagesize) or when we have a large folio (such as xfs
> >> which currently supports large folio).
> >
> > You're moving too fast.  Please, allow at least a few hours between
> > getting review comments and sending a new version.
> >
> 
> Sorry about that. I felt those were mostly mechanical conversion
> changes. Will keep in mind.
> 
> >> v7 -> v8
> >> ==========
> >> 1. Renamed iomap_page -> iomap_folio & iop -> iof in Patch-1 itself.
> >
> > I don't think iomap_folio is the right name.  Indeed, I did not believe
> > that iomap_page was the right name.  As I said on #xfs recently ...
> >
> > <willy> i'm still not crazy about iomap_page as the name of that
> >    data structure.  and calling the variable 'iop' just seems doomed
> >    to be trouble.  how do people feel about either iomap_block_state or
> >    folio_block_state ... or even just calling it block_state since it's
> >    local to iomap/buffered-io.c
> > <willy> we'd then call the variable either ibs or fbs, both of which
> >    have some collisions in the kernel, but none in filesystems
> > <dchinner> willy - sounds reasonable
> 
> Both seem equally reasonable to me. If others are ok with either,
> then shall we go with iomap_block_state and ibs?

I've a slight preference for struct folio_block_state/folio_bs/fbs.

Or even struct iomap_folio_state/iofs/ifs.

IBS is an acronym for a rather uncomfortable medical condition... ;)

--D

> I see that as "iomap_block_state", which is local to the iomap buffered-io
> layer, tracks per-block state within a folio, and gets attached to
> folio->private.
> 
> -ritesh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance
  2023-06-06 11:43 [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
                   ` (5 preceding siblings ...)
  2023-06-06 12:37 ` [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Matthew Wilcox
@ 2023-06-07  6:48 ` Christoph Hellwig
  2023-06-07 10:22   ` Ritesh Harjani
  6 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2023-06-07  6:48 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Matthew Wilcox,
	Dave Chinner, Brian Foster, Christoph Hellwig,
	Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel

FYI, daily reposts while there is still heavy discussion ongoing
make it really hard to chime into the discussion if you get preempted.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 1/5] iomap: Rename iomap_page to iomap_folio and others
  2023-06-06 11:43 ` [PATCHv8 1/5] iomap: Rename iomap_page to iomap_folio and others Ritesh Harjani (IBM)
@ 2023-06-07  6:49   ` Christoph Hellwig
  2023-06-07 10:28     ` Ritesh Harjani
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2023-06-07  6:49 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Matthew Wilcox,
	Dave Chinner, Brian Foster, Christoph Hellwig,
	Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel

> -static struct iomap_page *
> -iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
> +static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
> +				struct folio *folio, unsigned int flags)

This is really weird indentation.  Please just use two tabs like the
rest of the code.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 2/5] iomap: Renames and refactor iomap_folio state bitmap handling
  2023-06-06 11:43 ` [PATCHv8 2/5] iomap: Renames and refactor iomap_folio state bitmap handling Ritesh Harjani (IBM)
  2023-06-06 15:13   ` Darrick J. Wong
@ 2023-06-07  6:50   ` Christoph Hellwig
  2023-06-07 10:44     ` Ritesh Harjani
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2023-06-07  6:50 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Matthew Wilcox,
	Dave Chinner, Brian Foster, Christoph Hellwig,
	Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel

On Tue, Jun 06, 2023 at 05:13:49PM +0530, Ritesh Harjani (IBM) wrote:
> Also refactors and adds iof->state handling functions for uptodate
> state.

What does this mean?  And please don't mix renames and other changes in
a single patch.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 3/5] iomap: Refactor iomap_write_delalloc_punch() function out
  2023-06-06 11:43 ` [PATCHv8 3/5] iomap: Refactor iomap_write_delalloc_punch() function out Ritesh Harjani (IBM)
@ 2023-06-07  6:52   ` Christoph Hellwig
  2023-06-07 10:55     ` Ritesh Harjani
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2023-06-07  6:52 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Matthew Wilcox,
	Dave Chinner, Brian Foster, Christoph Hellwig,
	Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel

On Tue, Jun 06, 2023 at 05:13:50PM +0530, Ritesh Harjani (IBM) wrote:
> This patch moves iomap_write_delalloc_punch() out of
> iomap_write_delalloc_scan(). No functionality change in this patch.

Please choose one: refactor (the existing function), or factor out (the
new function).  The mix doesn't make much sense.

Also please explain why you're doing that.  The fact that a new helper
is split out is pretty obvious from the patch, but I have no idea why
you want it.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 4/5] iomap: Allocate iof in ->write_begin() early
  2023-06-06 11:43 ` [PATCHv8 4/5] iomap: Allocate iof in ->write_begin() early Ritesh Harjani (IBM)
@ 2023-06-07  6:53   ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2023-06-07  6:53 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Matthew Wilcox,
	Dave Chinner, Brian Foster, Christoph Hellwig,
	Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel

On Tue, Jun 06, 2023 at 05:13:51PM +0530, Ritesh Harjani (IBM) wrote:
> We don't need to allocate an iof in ->write_begin() for writes where the
> position and length completely overlap with the given folio.
> Therefore, such cases are skipped.
> 
> Currently, when the folio is uptodate, we only allocate the iof at
> writeback time (in iomap_writepage_map()). This has been ok until now, but
> once we add support for a per-block dirty state bitmap in the iof, this
> could cause some performance degradation. The reason is that if we don't
> allocate the iof during ->write_begin(), then we will never mark the
> necessary dirty bits in the ->write_end() call. And we would have to mark
> all the bits as dirty at writeback time, which could cause the same write
> amplification and performance problems as now.
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

(and didn't I ack an earlier version of this already?)
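
(For reference, the early-allocation idea in the quoted commit message
amounts to roughly the following in __iomap_write_begin(); the exact
condition is an assumption here, not copied from the patch:)

	size_t from = offset_in_folio(folio, pos);

	/*
	 * Skip the allocation only when the write covers the whole folio;
	 * otherwise allocate the iof now so that ->write_end() can record
	 * per-block dirty bits instead of dirtying everything at writeback.
	 */
	if (from != 0 || len < folio_size(folio))
		iof = iomap_iof_alloc(iter->inode, folio, iter->flags);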

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance
  2023-06-06 12:37 ` [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Matthew Wilcox
  2023-06-06 13:00   ` Ritesh Harjani
@ 2023-06-07  6:54   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2023-06-07  6:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ritesh Harjani (IBM), linux-xfs, linux-fsdevel, Darrick J. Wong,
	Dave Chinner, Brian Foster, Christoph Hellwig,
	Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel

On Tue, Jun 06, 2023 at 01:37:48PM +0100, Matthew Wilcox wrote:
> > 1. Renamed iomap_page -> iomap_folio & iop -> iof in Patch-1 itself.
> 
> I don't think iomap_folio is the right name.  Indeed, I did not believe
> that iomap_page was the right name.  As I said on #xfs recently ...
> 
> <willy> i'm still not crazy about iomap_page as the name of that
>    data structure.  and calling the variable 'iop' just seems doomed
>    to be trouble.  how do people feel about either iomap_block_state or
>    folio_block_state ... or even just calling it block_state since it's
>    local to iomap/buffered-io.c
> <willy> we'd then call the variable either ibs or fbs, both of which
>    have some collisions in the kernel, but none in filesystems
> <dchinner> willy - sounds reasonable

I'd keep an iomap prefix, but block_state looks fine to me.  So
iomap_block_state would be my preference.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-06-06 11:43 ` [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
  2023-06-06 15:06   ` Darrick J. Wong
@ 2023-06-07  7:04   ` Christoph Hellwig
  2023-06-07 15:37     ` Ritesh Harjani
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2023-06-07  7:04 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Matthew Wilcox,
	Dave Chinner, Brian Foster, Christoph Hellwig,
	Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel, Aravinda Herle

> +static inline bool iomap_iof_is_block_dirty(struct folio *folio,
> +			struct iomap_folio *iof, int block)

Two-tab indents here please, and for various other functions.

> +{
> +	struct inode *inode = folio->mapping->host;
> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> +	unsigned int first_blk = off >> inode->i_blkbits;
> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> +	unsigned int nr_blks = last_blk - first_blk + 1;

Given how many places we open-code this logic, I wonder if a helper
would be useful, even if the calling conventions are a bit odd.

To make this nicer it would also be good to take a neat trick from
the btrfs subpage support and use an enum for the bits, e.g.:

enum iomap_block_state {
	IOMAP_ST_UPTODATE,
	IOMAP_ST_DIRTY,

	IOMAP_ST_MAX,
};


static void iomap_ibs_calc_range(struct folio *folio, size_t off, size_t len,
		enum iomap_block_state state, unsigned int *first_blk,
		unsigned int *nr_blks)
{
	struct inode *inode = folio->mapping->host;
	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
	unsigned int first = off >> inode->i_blkbits;
	unsigned int last = (off + len - 1) >> inode->i_blkbits;

	*first_blk = state * blks_per_folio + first;
	*nr_blks = last - first + 1;
}
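
(As a usage sketch only, combining the helper above with the iof naming
from this series; this is not code from the posted patches:)

static void iomap_iof_clear_range_dirty(struct folio *folio,
		struct iomap_folio *iof, size_t off, size_t len)
{
	unsigned int first_blk, nr_blks;
	unsigned long flags;

	/* locate this range's bits in the dirty region of the bitmap */
	iomap_ibs_calc_range(folio, off, len, IOMAP_ST_DIRTY,
			     &first_blk, &nr_blks);
	spin_lock_irqsave(&iof->state_lock, flags);
	bitmap_clear(iof->state, first_blk, nr_blks);
	spin_unlock_irqrestore(&iof->state_lock, flags);
}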

> +	/*
> +	 * iof->state tracks two sets of state flags when the
> +	 * filesystem block size is smaller than the folio size.
> +	 * The first state tracks per-block uptodate and the
> +	 * second tracks per-block dirty state.
> +	 */
> +	iof = kzalloc(struct_size(iof, state, BITS_TO_LONGS(2 * nr_blocks)),
>  		      gfp);

with the above this can use IOMAP_ST_MAX and make the whole thing a
little more robust.

>
>  	if (iof) {

Now that this adds a lot more initialization I'd do a

	if (!iof)
		return NULL;

here and unindent the rest.
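
(Combining the suggestions, the tail of the allocator would then read
roughly as follows; a sketch, not the posted code:)

	iof = kzalloc(struct_size(iof, state,
			BITS_TO_LONGS(IOMAP_ST_MAX * nr_blocks)), gfp);
	if (!iof)
		return NULL;

	spin_lock_init(&iof->state_lock);
	/* bits [0, nr_blocks): uptodate; bits [nr_blocks, 2 * nr_blocks): dirty */
	if (folio_test_uptodate(folio))
		bitmap_set(iof->state, IOMAP_ST_UPTODATE * nr_blocks, nr_blocks);
	if (folio_test_dirty(folio))
		bitmap_set(iof->state, IOMAP_ST_DIRTY * nr_blocks, nr_blocks);
	folio_attach_private(folio, iof);
	return iof;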

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance
  2023-06-07  6:48 ` Christoph Hellwig
@ 2023-06-07 10:22   ` Ritesh Harjani
  0 siblings, 0 replies; 27+ messages in thread
From: Ritesh Harjani @ 2023-06-07 10:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Matthew Wilcox,
	Dave Chinner, Brian Foster, Christoph Hellwig,
	Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel

Christoph Hellwig <hch@infradead.org> writes:

> FYI, daily reposts while there is still heavy discussion ongoing
> make it really hard to chime into the discussion if you get preempted.

Sure, noted. 
Thanks again for the help and review!

-ritesh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-06-06 15:06   ` Darrick J. Wong
@ 2023-06-07 10:27     ` Ritesh Harjani
  0 siblings, 0 replies; 27+ messages in thread
From: Ritesh Harjani @ 2023-06-07 10:27 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Tue, Jun 06, 2023 at 05:13:52PM +0530, Ritesh Harjani (IBM) wrote:
>> [...]
>> @@ -913,6 +1002,29 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
>>  		if (ret)
>>  			goto out;
>>  	}
>> +	/*
>> +	 * When we have per-block dirty tracking, there can be
>> +	 * blocks within a folio which are marked uptodate
>> +	 * but not dirty. In that case it is necessary to punch
>> +	 * out such blocks to avoid leaking any delalloc blocks.
>> +	 */
>> +	iof = iomap_get_iof(folio);
>> +	if (!iof)
>> +		goto skip_iof_punch;
>> +
>> +	last_byte = min_t(loff_t, end_byte - 1,
>> +		(folio_next_index(folio) << PAGE_SHIFT) - 1);
>> +	first_blk = offset_in_folio(folio, start_byte) >> blkbits;
>> +	last_blk = offset_in_folio(folio, last_byte) >> blkbits;
>> +	for (i = first_blk; i <= last_blk; i++) {
>> +		if (!iomap_iof_is_block_dirty(folio, iof, i)) {
>> +			ret = punch(inode, i << blkbits, 1 << blkbits);
>
> Isn't the second argument to punch() supposed to be the offset within
> the file, not the offset within the folio?
>

Thanks for spotting this. Somehow it got missed.
Yes, it should be the byte position within the file. Will fix in the next rev.

    punch(inode, folio_pos(folio) + (i << blkbits), 1 << blkbits);

> I /almost/ wonder if this chunk should be its own static inline
> iomap_iof_delalloc_punch function, but ... eh.  My enthusiasm for
> slicing and dicing is weak today.
>

umm, I felt having all the punch logic in one function would be better.
Hence iomap_write_delalloc_punch().

-ritesh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 1/5] iomap: Rename iomap_page to iomap_folio and others
  2023-06-07  6:49   ` Christoph Hellwig
@ 2023-06-07 10:28     ` Ritesh Harjani
  0 siblings, 0 replies; 27+ messages in thread
From: Ritesh Harjani @ 2023-06-07 10:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Matthew Wilcox,
	Dave Chinner, Brian Foster, Christoph Hellwig,
	Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel

Christoph Hellwig <hch@infradead.org> writes:

>> -static struct iomap_page *
>> -iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
>> +static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
>> +				struct folio *folio, unsigned int flags)
>
> This is really weird indentation.  Please just use two tabs like the
> rest of the code.

Ok, sure. Will fix this indentation in the next rev.

-ritesh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 2/5] iomap: Renames and refactor iomap_folio state bitmap handling
  2023-06-07  6:50   ` Christoph Hellwig
@ 2023-06-07 10:44     ` Ritesh Harjani
  2023-06-07 13:29       ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Ritesh Harjani @ 2023-06-07 10:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Matthew Wilcox,
	Dave Chinner, Brian Foster, Christoph Hellwig,
	Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel

Christoph Hellwig <hch@infradead.org> writes:

> On Tue, Jun 06, 2023 at 05:13:49PM +0530, Ritesh Harjani (IBM) wrote:
>> Also refactors and adds iof->state handling functions for uptodate
>> state.
>
> What does this mean?

It is this part.

+static inline bool iomap_iof_is_fully_uptodate(struct folio *folio,
+					       struct iomap_folio *iof)

+static inline bool iomap_iof_is_block_uptodate(struct iomap_folio *iof,
+					       unsigned int block)

+static void iomap_iof_set_range_uptodate(struct folio *folio,
+			struct iomap_folio *iof, size_t off, size_t len)

+static void iomap_set_range_uptodate(struct folio *folio, size_t off,
+				     size_t len)


> And please don't mix renames and other changes in
> a single patch.

All of this is related to the uptodate bitmap handling code, i.e.:
- Renaming the "uptodate" bitmap to the "state" bitmap in struct iomap_page.
- Renaming "uptodate_lock" to "state_lock" in struct iomap_page.
- Adding helper routines for uptodate bitmap handling.
- A small refactoring of the iomap_set_range_uptodate() function to drop
  "iop" as a function argument, and moving its definition above
  iomap_iof_alloc().


Ok, so would you prefer this split into 3 separate patches?

1. Rename uptodate and uptodate_lock to state and state_lock.
2. Refactor the iomap_set_range_uptodate() function to drop struct
   iomap_page from its arguments and move it above iomap_iof_alloc() (or
   iomap_ibs_alloc or iomap_fbs_alloc, whichever name we settle on).
3. Add uptodate bitmap helper routines, e.g.
   iomap_iof_is_block_uptodate() and iomap_iof_is_fully_uptodate().

-ritesh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 3/5] iomap: Refactor iomap_write_delalloc_punch() function out
  2023-06-07  6:52   ` Christoph Hellwig
@ 2023-06-07 10:55     ` Ritesh Harjani
  0 siblings, 0 replies; 27+ messages in thread
From: Ritesh Harjani @ 2023-06-07 10:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Matthew Wilcox,
	Dave Chinner, Brian Foster, Christoph Hellwig,
	Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel

Christoph Hellwig <hch@infradead.org> writes:

> On Tue, Jun 06, 2023 at 05:13:50PM +0530, Ritesh Harjani (IBM) wrote:
>> This patch moves iomap_write_delalloc_punch() out of
>> iomap_write_delalloc_scan(). No functionality change in this patch.
>
> Please choose one: refactor (the existing function), or factor out (the
> new function).  The mix doesn't make much sense.
>
> Also please explain why you're doing that.  The fact that a new helper
> is split out is pretty obvious from the patch, but I have no idea why
> you want it.

This patch factors the iomap_write_delalloc_punch() function out. This
function is responsible for the actual punch-out operation.
The reason for this is to avoid deep indentation when a later patch brings
in punch-out of individual non-dirty blocks within a dirty folio to avoid
leaking delalloc blocks. That later patch is the one that adds per-block
dirty state handling to iomap.

I will rephrase the commit message to the above ^^^.
Let me know if any other changes are required.

-ritesh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 2/5] iomap: Renames and refactor iomap_folio state bitmap handling
  2023-06-07 10:44     ` Ritesh Harjani
@ 2023-06-07 13:29       ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2023-06-07 13:29 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Darrick J. Wong,
	Matthew Wilcox, Dave Chinner, Brian Foster, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel

On Wed, Jun 07, 2023 at 04:14:21PM +0530, Ritesh Harjani wrote:
> 1. Rename uptodate and uptodate_lock to state and state_lock.
> 2. Refactor the iomap_set_range_uptodate() function to drop struct
>    iomap_page from its arguments and move it above iomap_iof_alloc() (or
>    iomap_ibs_alloc or iomap_fbs_alloc, whichever name we settle on).
> 3. Add uptodate bitmap helper routines, e.g.
>    iomap_iof_is_block_uptodate() and iomap_iof_is_fully_uptodate().

Yes, please.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-06-07  7:04   ` Christoph Hellwig
@ 2023-06-07 15:37     ` Ritesh Harjani
  2023-06-08  5:33       ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Ritesh Harjani @ 2023-06-07 15:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Darrick J. Wong, Matthew Wilcox,
	Dave Chinner, Brian Foster, Christoph Hellwig,
	Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel, Aravinda Herle

Christoph Hellwig <hch@infradead.org> writes:

>> +static inline bool iomap_iof_is_block_dirty(struct folio *folio,
>> +			struct iomap_folio *iof, int block)
>
> Two-tab indents here please, and for various other functions.
>

Sure. 

>> +{
>> +	struct inode *inode = folio->mapping->host;
>> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>> +	unsigned int first_blk = off >> inode->i_blkbits;
>> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
>> +	unsigned int nr_blks = last_blk - first_blk + 1;
>
> Given how many places we open-code this logic, I wonder if a helper
> would be useful, even if the calling conventions are a bit odd.

I agree that moving it to a common helper would be useful, as it is
open-coded in 3 places.

>
> To make this nicer it would also be good to take a neat trick from
> the btrfs subpage support and use an enum for the bits, e.g.:
>
> enum iomap_block_state {
> 	IOMAP_ST_UPTODATE,
> 	IOMAP_ST_DIRTY,
>
> 	IOMAP_ST_MAX,
> };
>

I think the only remaining piece is the naming of this enum and struct
iomap_page.

Ok, so here is what I think my preference would be -

This enum perfectly qualifies for "iomap_block_state", as it holds the
per-block state.
Then the struct iomap_page (iop) should be renamed to struct
"iomap_folio_state" (ifs), because that holds the state information of all the
blocks within a folio.

Does this sound ok to others too?

>
> static void iomap_ibs_calc_range(struct folio *folio, size_t off, size_t len,
> 		enum iomap_block_state state, unsigned int *first_blk,
> 		unsigned int *nr_blks)
> {
> 	struct inode *inode = folio->mapping->host;
> 	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> 	unsigned int first = off >> inode->i_blkbits;
> 	unsigned int last = (off + len - 1) >> inode->i_blkbits;
>
> 	*first_blk = state * blks_per_folio + first;
> 	*nr_blks = last - first + 1;
> }
>

Sure, I like the idea of the state enum. Will make the changes.

>> +	/*
>> +	 * iof->state tracks two sets of state flags when the
>> +	 * filesystem block size is smaller than the folio size.
>> +	 * The first state tracks per-block uptodate and the
>> +	 * second tracks per-block dirty state.
>> +	 */
>> +	iof = kzalloc(struct_size(iof, state, BITS_TO_LONGS(2 * nr_blocks)),
>>  		      gfp);
>
> with the above this can use IOMAP_ST_MAX and make the whole thing a
> little more robust.
>

yes.

>>
>>  	if (iof) {
>
> Now that this adds a lot more initialization I'd do a
>
> 	if (!iof)
> 		return NULL;
>
> here and unindent the rest.

Sure. It is ok to fold this change into the same patch, right?
Or does it qualify for a separate patch?

-ritesh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-06-07 15:37     ` Ritesh Harjani
@ 2023-06-08  5:33       ` Christoph Hellwig
  2023-06-08 14:45         ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2023-06-08  5:33 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Darrick J. Wong,
	Matthew Wilcox, Dave Chinner, Brian Foster, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Wed, Jun 07, 2023 at 09:07:19PM +0530, Ritesh Harjani wrote:
> I think the only remaining piece is the naming of this enum and struct
> iomap_page.
> 
> Ok, so here is what I think my preference would be -
> 
> This enum perfectly qualifies for "iomap_block_state", as it holds the
> per-block state.
> Then the struct iomap_page (iop) should be renamed to struct
> "iomap_folio_state" (ifs), because that holds the state information of all the
> blocks within a folio.

Fine with me.

> > 	if (!iof)
> > 		return NULL;
> >
> > here and unindent the rest.
> 
> Sure. It is ok to fold this change into the same patch, right?
> Or does it qualify for a separate patch?

Either way is fine with me.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-06-08  5:33       ` Christoph Hellwig
@ 2023-06-08 14:45         ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2023-06-08 14:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ritesh Harjani, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Dave Chinner, Brian Foster, Andreas Gruenbacher, Ojaswin Mujoo,
	Disha Goel, Aravinda Herle

On Wed, Jun 07, 2023 at 10:33:52PM -0700, Christoph Hellwig wrote:
> On Wed, Jun 07, 2023 at 09:07:19PM +0530, Ritesh Harjani wrote:
> > I think the only remaining piece is the naming of this enum and struct
> > iomap_page.
> > 
> > Ok, so here is what I think my preference would be -
> > 
> > This enum perfectly qualifies for "iomap_block_state", as it holds the
> > per-block state.
> > Then the struct iomap_page (iop) should be renamed to struct
> > "iomap_folio_state" (ifs), because that holds the state information of all the
> > blocks within a folio.
> 
> Fine with me.

Yeah, fine with me too.

> > > 	if (!iof)
> > > 		return NULL;
> > >
> > > here and unindent the rest.
> > 
> > Sure. It is ok to fold this change into the same patch, right?
> > Or does it qualify for a separate patch?
> 
> Either way is fine with me.

Same patch is ok.

--D

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-06-08 14:45 UTC | newest]

Thread overview: 27+ messages
2023-06-06 11:43 [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
2023-06-06 11:43 ` [PATCHv8 1/5] iomap: Rename iomap_page to iomap_folio and others Ritesh Harjani (IBM)
2023-06-07  6:49   ` Christoph Hellwig
2023-06-07 10:28     ` Ritesh Harjani
2023-06-06 11:43 ` [PATCHv8 2/5] iomap: Renames and refactor iomap_folio state bitmap handling Ritesh Harjani (IBM)
2023-06-06 15:13   ` Darrick J. Wong
2023-06-07  6:50   ` Christoph Hellwig
2023-06-07 10:44     ` Ritesh Harjani
2023-06-07 13:29       ` Christoph Hellwig
2023-06-06 11:43 ` [PATCHv8 3/5] iomap: Refactor iomap_write_delalloc_punch() function out Ritesh Harjani (IBM)
2023-06-07  6:52   ` Christoph Hellwig
2023-06-07 10:55     ` Ritesh Harjani
2023-06-06 11:43 ` [PATCHv8 4/5] iomap: Allocate iof in ->write_begin() early Ritesh Harjani (IBM)
2023-06-07  6:53   ` Christoph Hellwig
2023-06-06 11:43 ` [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
2023-06-06 15:06   ` Darrick J. Wong
2023-06-07 10:27     ` Ritesh Harjani
2023-06-07  7:04   ` Christoph Hellwig
2023-06-07 15:37     ` Ritesh Harjani
2023-06-08  5:33       ` Christoph Hellwig
2023-06-08 14:45         ` Darrick J. Wong
2023-06-06 12:37 ` [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Matthew Wilcox
2023-06-06 13:00   ` Ritesh Harjani
2023-06-06 15:16     ` Darrick J. Wong
2023-06-07  6:54   ` Christoph Hellwig
2023-06-07  6:48 ` Christoph Hellwig
2023-06-07 10:22   ` Ritesh Harjani
