* [PATCHv6 0/5] iomap: Add support for per-block dirty state to improve write performance
@ 2023-06-05 1:31 Ritesh Harjani (IBM)
2023-06-05 1:31 ` [PATCHv6 1/5] iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free() Ritesh Harjani (IBM)
` (5 more replies)
0 siblings, 6 replies; 11+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 1:31 UTC (permalink / raw)
To: linux-xfs
Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel,
Ritesh Harjani (IBM)
Hello All,
Please find PATCHv6, which adds per-block dirty tracking to iomap.
As discussed earlier, this is required to improve write performance and reduce
write amplification in cases where either the blocksize is less than the pagesize
(such as on the Power platform with a 64k pagesize) or a large folio is in use
(as with xfs, which already supports large folios).
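As a rough illustration of the write amplification being addressed (numbers
chosen to match the configuration tested below): with a 64k folio and a 4k
filesystem blocksize, each folio covers 16 blocks. Without per-block dirty
tracking, dirtying a single 4k block forces the entire 64k folio to be written
back, i.e. up to 16x write amplification, which lines up with the ~16x
improvement reported in the performance numbers below.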
RFCv5 -> PATCHv6:
=================
1. Addresses review comments from Brian, Christoph and Matthew.
@Christoph:
- I have renamed the higher level functions such as iop_alloc/iop_free() to
iomap_iop_alloc/free() in v6.
- As for the low level bitmap accessor functions, I couldn't find any better
naming than iop_test_/set/clear_**. I could have gone for
iomap_iop__test/set/clear/_** or iomap__iop_test/set/clear_**, but
I wasn't convinced by either of the above, as they also make the function
names longer.
Besides, the iop_test/set/clear_** accessor functions for uptodate and dirty
state tracking make sense because we are sure we have a valid iop in those
cases. Please do let me know if this looks ok to you.
2. I tried testing gfs2 (initially with no patches applied) with xfstests, but I
always ended up in one deadlock or another (I couldn't spend any time debugging
that). I also ran it with -x log, but it still kept failing for me.
@Andreas:
- could you please suggest how I can test gfs2 with these patches? I see gfs2
can use a smaller blocksize and it uses the iomap buffered I/O path, so it would
be good to get these patches tested on it too.
3. I have now run a good amount of fstests on these patches on the platforms
below and I haven't found any new failures in my testing so far:
arm64 (64k pagesize): with 4k -g quick
Power: with 4k -g auto
x86: 1k, 4k with -g auto and adv_auto
From my testing so far these patches look stable to me. If they look good
to the reviewers as well, do you think they can be queued to linux-next for
wider testing?
Performance numbers copied from the last patch's commit message
===============================================================
Performance testing of the below fio workload reveals a ~16x performance
improvement using nvme with XFS (4k blocksize) on Power (64K pagesize).
FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.
1. <test_randwrite.fio>
[global]
ioengine=psync
rw=randwrite
overwrite=1
pre_read=1
direct=0
bs=4k
size=1G
dir=./
numjobs=8
fdatasync=1
runtime=60
iodepth=64
group_reporting=1
[fio-run]
2. Also, our internal performance team reported that this patch improves
their database workload performance by around ~83% (with XFS on Power).
Ritesh Harjani (IBM) (5):
iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free()
iomap: Move folio_detach_private() in iomap_iop_free() to the end
iomap: Refactor some iop related accessor functions
iomap: Allocate iop in ->write_begin() early
iomap: Add per-block dirty state tracking to improve performance
fs/gfs2/aops.c | 2 +-
fs/iomap/buffered-io.c | 309 ++++++++++++++++++++++++++++++-----------
fs/xfs/xfs_aops.c | 2 +-
fs/zonefs/file.c | 2 +-
include/linux/iomap.h | 1 +
5 files changed, 235 insertions(+), 81 deletions(-)
--
2.40.1
* [PATCHv6 1/5] iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free()
2023-06-05 1:31 [PATCHv6 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
@ 2023-06-05 1:31 ` Ritesh Harjani (IBM)
2023-06-05 1:31 ` [PATCHv6 2/5] iomap: Move folio_detach_private() in iomap_iop_free() to the end Ritesh Harjani (IBM)
` (4 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 1:31 UTC (permalink / raw)
To: linux-xfs
Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel,
Ritesh Harjani (IBM)
This patch renames the iomap_page_create/release() functions to
iomap_iop_alloc/free(). Later patches add more functions for
handling the iop structure with the iomap_iop_** naming convention,
so renaming to iomap_iop_alloc/free() keeps all the APIs consistent.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
fs/iomap/buffered-io.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 063133ec77f4..4567bdd4fff9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -43,8 +43,8 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
static struct bio_set iomap_ioend_bioset;
-static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
+static struct iomap_page *iomap_iop_alloc(struct inode *inode,
+ struct folio *folio, unsigned int flags)
{
struct iomap_page *iop = to_iomap_page(folio);
unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -69,7 +69,7 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
return iop;
}
-static void iomap_page_release(struct folio *folio)
+static void iomap_iop_free(struct folio *folio)
{
struct iomap_page *iop = folio_detach_private(folio);
struct inode *inode = folio->mapping->host;
@@ -231,7 +231,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
if (WARN_ON_ONCE(size > iomap->length))
return -EIO;
if (offset > 0)
- iop = iomap_page_create(iter->inode, folio, iter->flags);
+ iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
else
iop = to_iomap_page(folio);
@@ -269,7 +269,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
return iomap_read_inline_data(iter, folio);
/* zero post-eof blocks as the page may be mapped */
- iop = iomap_page_create(iter->inode, folio, iter->flags);
+ iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
if (plen == 0)
goto done;
@@ -490,7 +490,7 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
*/
if (folio_test_dirty(folio) || folio_test_writeback(folio))
return false;
- iomap_page_release(folio);
+ iomap_iop_free(folio);
return true;
}
EXPORT_SYMBOL_GPL(iomap_release_folio);
@@ -507,12 +507,12 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
if (offset == 0 && len == folio_size(folio)) {
WARN_ON_ONCE(folio_test_writeback(folio));
folio_cancel_dirty(folio);
- iomap_page_release(folio);
+ iomap_iop_free(folio);
} else if (folio_test_large(folio)) {
/* Must release the iop so the page can be split */
WARN_ON_ONCE(!folio_test_uptodate(folio) &&
folio_test_dirty(folio));
- iomap_page_release(folio);
+ iomap_iop_free(folio);
}
}
EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
@@ -559,7 +559,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
return 0;
folio_clear_error(folio);
- iop = iomap_page_create(iter->inode, folio, iter->flags);
+ iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
+
if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
return -EAGAIN;
@@ -1612,7 +1613,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
struct writeback_control *wbc, struct inode *inode,
struct folio *folio, u64 end_pos)
{
- struct iomap_page *iop = iomap_page_create(inode, folio, 0);
+ struct iomap_page *iop = iomap_iop_alloc(inode, folio, 0);
struct iomap_ioend *ioend, *next;
unsigned len = i_blocksize(inode);
unsigned nblocks = i_blocks_per_folio(inode, folio);
--
2.40.1
* [PATCHv6 2/5] iomap: Move folio_detach_private() in iomap_iop_free() to the end
2023-06-05 1:31 [PATCHv6 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
2023-06-05 1:31 ` [PATCHv6 1/5] iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free() Ritesh Harjani (IBM)
@ 2023-06-05 1:31 ` Ritesh Harjani (IBM)
2023-06-05 1:31 ` [PATCHv6 3/5] iomap: Refactor some iop related accessor functions Ritesh Harjani (IBM)
` (3 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 1:31 UTC (permalink / raw)
To: linux-xfs
Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel,
Ritesh Harjani (IBM)
Later patches will add other accessor APIs which take an inode and a
folio to operate on struct iomap_page. Since those functions need the
folio's private data (the iomap_page), this patch moves the detaching of
the folio's private to the end, just before calling kfree(iop).
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
fs/iomap/buffered-io.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4567bdd4fff9..6fffda355c45 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -71,7 +71,7 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
static void iomap_iop_free(struct folio *folio)
{
- struct iomap_page *iop = folio_detach_private(folio);
+ struct iomap_page *iop = to_iomap_page(folio);
struct inode *inode = folio->mapping->host;
unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -81,6 +81,7 @@ static void iomap_iop_free(struct folio *folio)
WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
folio_test_uptodate(folio));
+ folio_detach_private(folio);
kfree(iop);
}
--
2.40.1
* [PATCHv6 3/5] iomap: Refactor some iop related accessor functions
2023-06-05 1:31 [PATCHv6 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
2023-06-05 1:31 ` [PATCHv6 1/5] iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free() Ritesh Harjani (IBM)
2023-06-05 1:31 ` [PATCHv6 2/5] iomap: Move folio_detach_private() in iomap_iop_free() to the end Ritesh Harjani (IBM)
@ 2023-06-05 1:31 ` Ritesh Harjani (IBM)
2023-06-05 3:33 ` Matthew Wilcox
2023-06-05 1:31 ` [PATCHv6 4/5] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
` (2 subsequent siblings)
5 siblings, 1 reply; 11+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 1:31 UTC (permalink / raw)
To: linux-xfs
Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel,
Ritesh Harjani (IBM)
The rest of the buffered-io iomap code will eventually use the
iomap_iop_** function naming. This patch updates the function arguments
and renames iomap_set_range_uptodate() to iomap_iop_set_range_uptodate().
iop_set_range_uptodate() then becomes a low-level accessor function used
by the iomap_iop_** functions.
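The resulting convention, roughly (a simplified sketch of what the diff below
adds): the iop_* helpers assume a valid iop, while the iomap_iop_* wrappers
handle the !iop case:
	static void iomap_iop_set_range_uptodate(struct inode *inode,
			struct folio *folio, size_t off, size_t len)
	{
		struct iomap_page *iop = to_iomap_page(folio);

		if (iop)
			iop_set_range_uptodate(inode, folio, off, len);
		else
			folio_mark_uptodate(folio);
	}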
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
fs/iomap/buffered-io.c | 108 ++++++++++++++++++++++++-----------------
1 file changed, 63 insertions(+), 45 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6fffda355c45..e264ff0fa36e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -24,14 +24,14 @@
#define IOEND_BATCH_SIZE 4096
/*
- * Structure allocated for each folio when block size < folio size
- * to track sub-folio uptodate status and I/O completions.
+ * Structure allocated for each folio to track per-block uptodate state
+ * and I/O completions.
*/
struct iomap_page {
atomic_t read_bytes_pending;
atomic_t write_bytes_pending;
- spinlock_t uptodate_lock;
- unsigned long uptodate[];
+ spinlock_t state_lock;
+ unsigned long state[];
};
static inline struct iomap_page *to_iomap_page(struct folio *folio)
@@ -43,6 +43,48 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
static struct bio_set iomap_ioend_bioset;
+static bool iop_test_full_uptodate(struct folio *folio)
+{
+ struct iomap_page *iop = to_iomap_page(folio);
+ struct inode *inode = folio->mapping->host;
+
+ return bitmap_full(iop->state, i_blocks_per_folio(inode, folio));
+}
+
+static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
+{
+ struct iomap_page *iop = to_iomap_page(folio);
+
+ return test_bit(block, iop->state);
+}
+
+static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
+ size_t off, size_t len)
+{
+ struct iomap_page *iop = to_iomap_page(folio);
+ unsigned int first_blk = off >> inode->i_blkbits;
+ unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
+ unsigned int nr_blks = last_blk - first_blk + 1;
+ unsigned long flags;
+
+ spin_lock_irqsave(&iop->state_lock, flags);
+ bitmap_set(iop->state, first_blk, nr_blks);
+ if (iop_test_full_uptodate(folio))
+ folio_mark_uptodate(folio);
+ spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iomap_iop_set_range_uptodate(struct inode *inode,
+ struct folio *folio, size_t off, size_t len)
+{
+ struct iomap_page *iop = to_iomap_page(folio);
+
+ if (iop)
+ iop_set_range_uptodate(inode, folio, off, len);
+ else
+ folio_mark_uptodate(folio);
+}
+
static struct iomap_page *iomap_iop_alloc(struct inode *inode,
struct folio *folio, unsigned int flags)
{
@@ -58,12 +100,12 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
else
gfp = GFP_NOFS | __GFP_NOFAIL;
- iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+ iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
gfp);
if (iop) {
- spin_lock_init(&iop->uptodate_lock);
+ spin_lock_init(&iop->state_lock);
if (folio_test_uptodate(folio))
- bitmap_fill(iop->uptodate, nr_blocks);
+ bitmap_fill(iop->state, nr_blocks);
folio_attach_private(folio, iop);
}
return iop;
@@ -72,14 +114,12 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
static void iomap_iop_free(struct folio *folio)
{
struct iomap_page *iop = to_iomap_page(folio);
- struct inode *inode = folio->mapping->host;
- unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
if (!iop)
return;
WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
- WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
+ WARN_ON_ONCE(iop_test_full_uptodate(folio) !=
folio_test_uptodate(folio));
folio_detach_private(folio);
kfree(iop);
@@ -111,7 +151,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
/* move forward for each leading block marked uptodate */
for (i = first; i <= last; i++) {
- if (!test_bit(i, iop->uptodate))
+ if (!iop_test_block_uptodate(folio, i))
break;
*pos += block_size;
poff += block_size;
@@ -121,7 +161,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
/* truncate len if we find any trailing uptodate block(s) */
for ( ; i <= last; i++) {
- if (test_bit(i, iop->uptodate)) {
+ if (iop_test_block_uptodate(folio, i)) {
plen -= (last - i + 1) * block_size;
last = i - 1;
break;
@@ -145,30 +185,6 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
*lenp = plen;
}
-static void iomap_iop_set_range_uptodate(struct folio *folio,
- struct iomap_page *iop, size_t off, size_t len)
-{
- struct inode *inode = folio->mapping->host;
- unsigned first = off >> inode->i_blkbits;
- unsigned last = (off + len - 1) >> inode->i_blkbits;
- unsigned long flags;
-
- spin_lock_irqsave(&iop->uptodate_lock, flags);
- bitmap_set(iop->uptodate, first, last - first + 1);
- if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
- folio_mark_uptodate(folio);
- spin_unlock_irqrestore(&iop->uptodate_lock, flags);
-}
-
-static void iomap_set_range_uptodate(struct folio *folio,
- struct iomap_page *iop, size_t off, size_t len)
-{
- if (iop)
- iomap_iop_set_range_uptodate(folio, iop, off, len);
- else
- folio_mark_uptodate(folio);
-}
-
static void iomap_finish_folio_read(struct folio *folio, size_t offset,
size_t len, int error)
{
@@ -178,7 +194,8 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
folio_clear_uptodate(folio);
folio_set_error(folio);
} else {
- iomap_set_range_uptodate(folio, iop, offset, len);
+ iomap_iop_set_range_uptodate(folio->mapping->host, folio,
+ offset, len);
}
if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
@@ -214,7 +231,7 @@ struct iomap_readpage_ctx {
static int iomap_read_inline_data(const struct iomap_iter *iter,
struct folio *folio)
{
- struct iomap_page *iop;
+ struct iomap_page __maybe_unused *iop;
const struct iomap *iomap = iomap_iter_srcmap(iter);
size_t size = i_size_read(iter->inode) - iomap->offset;
size_t poff = offset_in_page(iomap->offset);
@@ -240,7 +257,8 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
memcpy(addr, iomap->inline_data, size);
memset(addr + size, 0, PAGE_SIZE - poff - size);
kunmap_local(addr);
- iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
+ iomap_iop_set_range_uptodate(iter->inode, folio, offset,
+ PAGE_SIZE - poff);
return 0;
}
@@ -277,7 +295,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
if (iomap_block_needs_zeroing(iter, pos)) {
folio_zero_range(folio, poff, plen);
- iomap_set_range_uptodate(folio, iop, poff, plen);
+ iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);
goto done;
}
@@ -452,7 +470,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
last = (from + count - 1) >> inode->i_blkbits;
for (i = first; i <= last; i++)
- if (!test_bit(i, iop->uptodate))
+ if (!iop_test_block_uptodate(folio, i))
return false;
return true;
}
@@ -591,7 +609,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
if (status)
return status;
}
- iomap_set_range_uptodate(folio, iop, poff, plen);
+ iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);
} while ((block_start += plen) < block_end);
return 0;
@@ -698,7 +716,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
size_t copied, struct folio *folio)
{
- struct iomap_page *iop = to_iomap_page(folio);
flush_dcache_folio(folio);
/*
@@ -714,7 +731,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
*/
if (unlikely(copied < len && !folio_test_uptodate(folio)))
return 0;
- iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
+ iomap_iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos),
+ len);
filemap_dirty_folio(inode->i_mapping, folio);
return copied;
}
@@ -1630,7 +1648,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
* invalid, grab a new one.
*/
for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
- if (iop && !test_bit(i, iop->uptodate))
+ if (iop && !iop_test_block_uptodate(folio, i))
continue;
error = wpc->ops->map_blocks(wpc, inode, pos);
--
2.40.1
* [PATCHv6 4/5] iomap: Allocate iop in ->write_begin() early
2023-06-05 1:31 [PATCHv6 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
` (2 preceding siblings ...)
2023-06-05 1:31 ` [PATCHv6 3/5] iomap: Refactor some iop related accessor functions Ritesh Harjani (IBM)
@ 2023-06-05 1:31 ` Ritesh Harjani (IBM)
2023-06-05 1:31 ` [PATCHv6 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
2023-06-06 12:17 ` [PATCHv6 0/5] iomap: Add support for per-block dirty state to improve write performance Andreas Grünbacher
5 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 1:31 UTC (permalink / raw)
To: linux-xfs
Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel,
Ritesh Harjani (IBM)
We don't need to allocate an iop in ->write_begin() for writes where the
position and length completely overlap the given folio. Therefore, such
cases are skipped.
Currently, when the folio is uptodate, we only allocate the iop at
writeback time (in iomap_writepage_map()). That is fine today, but once
we add support for a per-block dirty state bitmap in the iop, it could
cause a performance degradation: if we don't allocate the iop during
->write_begin(), we can never mark the necessary dirty bits in the
->write_end() call, and we would then have to mark all the bits as dirty
at writeback time, causing the same write amplification and performance
problems we have now.
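A rough sketch of the resulting __iomap_write_begin() ordering (simplified
from the diff below; nr_blocks and iter are as in the existing code):
	/* Whole-folio overwrite: the entire folio will be dirtied anyway. */
	if (pos <= folio_pos(folio) &&
	    pos + len >= folio_pos(folio) + folio_size(folio))
		return 0;
	/* Allocate the iop early so a later patch can set dirty bits in ->write_end(). */
	iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
		return -EAGAIN;
	if (folio_test_uptodate(folio))
		return 0;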
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
fs/iomap/buffered-io.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e264ff0fa36e..a70242cb32b1 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -574,15 +574,24 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
size_t from = offset_in_folio(folio, pos), to = from + len;
size_t poff, plen;
- if (folio_test_uptodate(folio))
+ /*
+ * If the write completely overlaps the current folio, then
+ * entire folio will be dirtied so there is no need for
+ * per-block state tracking structures to be attached to this folio.
+ */
+ if (pos <= folio_pos(folio) &&
+ pos + len >= folio_pos(folio) + folio_size(folio))
return 0;
- folio_clear_error(folio);
iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
return -EAGAIN;
+ if (folio_test_uptodate(folio))
+ return 0;
+ folio_clear_error(folio);
+
do {
iomap_adjust_read_range(iter->inode, folio, &block_start,
block_end - block_start, &poff, &plen);
--
2.40.1
* [PATCHv6 5/5] iomap: Add per-block dirty state tracking to improve performance
2023-06-05 1:31 [PATCHv6 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
` (3 preceding siblings ...)
2023-06-05 1:31 ` [PATCHv6 4/5] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
@ 2023-06-05 1:31 ` Ritesh Harjani (IBM)
2023-06-05 4:03 ` Matthew Wilcox
2023-06-06 12:17 ` [PATCHv6 0/5] iomap: Add support for per-block dirty state to improve write performance Andreas Grünbacher
5 siblings, 1 reply; 11+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 1:31 UTC (permalink / raw)
To: linux-xfs
Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel,
Ritesh Harjani (IBM), Aravinda Herle
When the filesystem blocksize is less than the folio size (either with
mapping_large_folio_support() or with blocksize < pagesize) and the
folio is uptodate in the pagecache, even a one-byte write can cause an
entire folio to be written to disk during writeback. This happens
because we currently don't have a mechanism to track per-block dirty
state within struct iomap_page; we only track uptodate state.
This patch implements support for tracking per-block dirty state in the
iomap_page->state bitmap. This should help improve filesystem write
performance and reduce write amplification.
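To illustrate, the iop->state bitmap is laid out roughly as follows (a sketch
based on the helpers added in the diff below, with nr_blocks being
i_blocks_per_folio()):
	/*
	 * iop->state bitmap layout (per folio):
	 *
	 *   bits [0, nr_blocks)              per-block uptodate state
	 *   bits [nr_blocks, 2 * nr_blocks)  per-block dirty state
	 *
	 * e.g. the dirty state of block i is test_bit(i + nr_blocks, iop->state).
	 */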
Performance testing of the below fio workload reveals a ~16x performance
improvement using nvme with XFS (4k blocksize) on Power (64K pagesize).
FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.
1. <test_randwrite.fio>
[global]
ioengine=psync
rw=randwrite
overwrite=1
pre_read=1
direct=0
bs=4k
size=1G
dir=./
numjobs=8
fdatasync=1
runtime=60
iodepth=64
group_reporting=1
[fio-run]
2. Also, our internal performance team reported that this patch improves
their database workload performance by around ~83% (with XFS on Power).
Reported-by: Aravinda Herle <araherle@in.ibm.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
fs/gfs2/aops.c | 2 +-
fs/iomap/buffered-io.c | 172 +++++++++++++++++++++++++++++++++++------
fs/xfs/xfs_aops.c | 2 +-
fs/zonefs/file.c | 2 +-
include/linux/iomap.h | 1 +
5 files changed, 152 insertions(+), 27 deletions(-)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index a5f4be6b9213..75efec3c3b71 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -746,7 +746,7 @@ static const struct address_space_operations gfs2_aops = {
.writepages = gfs2_writepages,
.read_folio = gfs2_read_folio,
.readahead = gfs2_readahead,
- .dirty_folio = filemap_dirty_folio,
+ .dirty_folio = iomap_dirty_folio,
.release_folio = iomap_release_folio,
.invalidate_folio = iomap_invalidate_folio,
.bmap = gfs2_bmap,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index a70242cb32b1..dee81d16804e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -85,6 +85,63 @@ static void iomap_iop_set_range_uptodate(struct inode *inode,
folio_mark_uptodate(folio);
}
+static bool iop_test_block_dirty(struct folio *folio, int block)
+{
+ struct iomap_page *iop = to_iomap_page(folio);
+ struct inode *inode = folio->mapping->host;
+ unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+
+ return test_bit(block + blks_per_folio, iop->state);
+}
+
+static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
+ size_t off, size_t len)
+{
+ struct iomap_page *iop = to_iomap_page(folio);
+ unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+ unsigned int first_blk = off >> inode->i_blkbits;
+ unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
+ unsigned int nr_blks = last_blk - first_blk + 1;
+ unsigned long flags;
+
+ spin_lock_irqsave(&iop->state_lock, flags);
+ bitmap_set(iop->state, first_blk + blks_per_folio, nr_blks);
+ spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iomap_iop_set_range_dirty(struct inode *inode, struct folio *folio,
+ size_t off, size_t len)
+{
+ struct iomap_page *iop = to_iomap_page(folio);
+
+ if (iop)
+ iop_set_range_dirty(inode, folio, off, len);
+}
+
+static void iop_clear_range_dirty(struct inode *inode, struct folio *folio,
+ size_t off, size_t len)
+{
+ struct iomap_page *iop = to_iomap_page(folio);
+ unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+ unsigned int first_blk = off >> inode->i_blkbits;
+ unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
+ unsigned int nr_blks = last_blk - first_blk + 1;
+ unsigned long flags;
+
+ spin_lock_irqsave(&iop->state_lock, flags);
+ bitmap_clear(iop->state, first_blk + blks_per_folio, nr_blks);
+ spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iomap_iop_clear_range_dirty(struct inode *inode,
+ struct folio *folio, size_t off, size_t len)
+{
+ struct iomap_page *iop = to_iomap_page(folio);
+
+ if (iop)
+ iop_clear_range_dirty(inode, folio, off, len);
+}
+
static struct iomap_page *iomap_iop_alloc(struct inode *inode,
struct folio *folio, unsigned int flags)
{
@@ -100,12 +157,20 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
else
gfp = GFP_NOFS | __GFP_NOFAIL;
- iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
+ /*
+ * iop->state tracks two sets of state flags when the
+ * filesystem block size is smaller than the folio size.
+ * The first state tracks per-block uptodate and the
+ * second tracks per-block dirty state.
+ */
+ iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
gfp);
if (iop) {
spin_lock_init(&iop->state_lock);
if (folio_test_uptodate(folio))
- bitmap_fill(iop->state, nr_blocks);
+ bitmap_set(iop->state, 0, nr_blocks);
+ if (folio_test_dirty(folio))
+ bitmap_set(iop->state, nr_blocks, nr_blocks);
folio_attach_private(folio, iop);
}
return iop;
@@ -536,6 +601,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
}
EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
+{
+ struct iomap_page __maybe_unused *iop;
+ struct inode *inode = mapping->host;
+ size_t len = folio_size(folio);
+
+ iop = iomap_iop_alloc(inode, folio, 0);
+ iomap_iop_set_range_dirty(inode, folio, 0, len);
+ return filemap_dirty_folio(mapping, folio);
+}
+EXPORT_SYMBOL_GPL(iomap_dirty_folio);
+
static void
iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
{
@@ -742,6 +819,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
return 0;
iomap_iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos),
len);
+ iomap_iop_set_range_dirty(inode, folio, offset_in_folio(folio, pos),
+ copied);
filemap_dirty_folio(inode->i_mapping, folio);
return copied;
}
@@ -906,6 +985,61 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
}
EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
+static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
+ loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
+ int (*punch)(struct inode *inode, loff_t offset, loff_t length))
+{
+ struct iomap_page *iop;
+ unsigned int first_blk, last_blk, i;
+ loff_t last_byte;
+ u8 blkbits = inode->i_blkbits;
+ int ret = 0;
+
+ if (start_byte > *punch_start_byte) {
+ ret = punch(inode, *punch_start_byte,
+ start_byte - *punch_start_byte);
+ if (ret)
+ goto out_err;
+ }
+ /*
+ * When we have per-block dirty tracking, there can be
+ * blocks within a folio which are marked uptodate
+ * but not dirty. In that case it is necessary to punch
+ * out such blocks to avoid leaking any delalloc blocks.
+ */
+ iop = to_iomap_page(folio);
+ if (!iop)
+ goto skip_iop_punch;
+
+ last_byte = min_t(loff_t, end_byte - 1,
+ (folio_next_index(folio) << PAGE_SHIFT) - 1);
+ first_blk = offset_in_folio(folio, start_byte) >> blkbits;
+ last_blk = offset_in_folio(folio, last_byte) >> blkbits;
+ for (i = first_blk; i <= last_blk; i++) {
+ if (!iop_test_block_dirty(folio, i)) {
+ ret = punch(inode, i << blkbits, 1 << blkbits);
+ if (ret)
+ goto out_err;
+ }
+ }
+
+skip_iop_punch:
+ /*
+ * Make sure the next punch start is correctly bound to
+ * the end of this data range, not the end of the folio.
+ */
+ *punch_start_byte = min_t(loff_t, end_byte,
+ folio_next_index(folio) << PAGE_SHIFT);
+
+ return ret;
+
+out_err:
+ folio_unlock(folio);
+ folio_put(folio);
+ return ret;
+
+}
+
/*
* Scan the data range passed to us for dirty page cache folios. If we find a
* dirty folio, punch out the preceeding range and update the offset from which
@@ -940,26 +1074,9 @@ static int iomap_write_delalloc_scan(struct inode *inode,
}
/* if dirty, punch up to offset */
- if (folio_test_dirty(folio)) {
- if (start_byte > *punch_start_byte) {
- int error;
-
- error = punch(inode, *punch_start_byte,
- start_byte - *punch_start_byte);
- if (error) {
- folio_unlock(folio);
- folio_put(folio);
- return error;
- }
- }
-
- /*
- * Make sure the next punch start is correctly bound to
- * the end of this data range, not the end of the folio.
- */
- *punch_start_byte = min_t(loff_t, end_byte,
- folio_next_index(folio) << PAGE_SHIFT);
- }
+ if (folio_test_dirty(folio))
+ iomap_write_delalloc_punch(inode, folio, punch_start_byte,
+ start_byte, end_byte, punch);
/* move offset to start of next folio in range */
start_byte = folio_next_index(folio) << PAGE_SHIFT;
@@ -1641,7 +1758,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
struct writeback_control *wbc, struct inode *inode,
struct folio *folio, u64 end_pos)
{
- struct iomap_page *iop = iomap_iop_alloc(inode, folio, 0);
+ struct iomap_page *iop = to_iomap_page(folio);
struct iomap_ioend *ioend, *next;
unsigned len = i_blocksize(inode);
unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1649,6 +1766,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
int error = 0, count = 0, i;
LIST_HEAD(submit_list);
+ if (!iop && nblocks > 1) {
+ iop = iomap_iop_alloc(inode, folio, 0);
+ iomap_iop_set_range_dirty(inode, folio, 0, folio_size(folio));
+ }
+
WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
/*
@@ -1657,7 +1779,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
* invalid, grab a new one.
*/
for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
- if (iop && !iop_test_block_uptodate(folio, i))
+ if (iop && !iop_test_block_dirty(folio, i))
continue;
error = wpc->ops->map_blocks(wpc, inode, pos);
@@ -1701,6 +1823,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
}
}
+ iomap_iop_clear_range_dirty(inode, folio, 0,
+ end_pos - folio_pos(folio));
folio_start_writeback(folio);
folio_unlock(folio);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2ef78aa1d3f6..77c7332ae197 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
.read_folio = xfs_vm_read_folio,
.readahead = xfs_vm_readahead,
.writepages = xfs_vm_writepages,
- .dirty_folio = filemap_dirty_folio,
+ .dirty_folio = iomap_dirty_folio,
.release_folio = iomap_release_folio,
.invalidate_folio = iomap_invalidate_folio,
.bmap = xfs_vm_bmap,
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 132f01d3461f..e508c8e97372 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
.read_folio = zonefs_read_folio,
.readahead = zonefs_readahead,
.writepages = zonefs_writepages,
- .dirty_folio = filemap_dirty_folio,
+ .dirty_folio = iomap_dirty_folio,
.release_folio = iomap_release_folio,
.invalidate_folio = iomap_invalidate_folio,
.migrate_folio = filemap_migrate_folio,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e2b836c2e119..eb9335c46bf3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
const struct iomap_ops *ops);
int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
--
2.40.1
* Re: [PATCHv6 3/5] iomap: Refactor some iop related accessor functions
2023-06-05 1:31 ` [PATCHv6 3/5] iomap: Refactor some iop related accessor functions Ritesh Harjani (IBM)
@ 2023-06-05 3:33 ` Matthew Wilcox
2023-06-05 5:16 ` Ritesh Harjani
0 siblings, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2023-06-05 3:33 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-xfs, linux-fsdevel, Dave Chinner, Brian Foster,
Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel
On Mon, Jun 05, 2023 at 07:01:50AM +0530, Ritesh Harjani (IBM) wrote:
> @@ -214,7 +231,7 @@ struct iomap_readpage_ctx {
> static int iomap_read_inline_data(const struct iomap_iter *iter,
> struct folio *folio)
> {
> - struct iomap_page *iop;
> + struct iomap_page __maybe_unused *iop;
Ummm ... definitely unused, right?
> const struct iomap *iomap = iomap_iter_srcmap(iter);
> size_t size = i_size_read(iter->inode) - iomap->offset;
> size_t poff = offset_in_page(iomap->offset);
> @@ -240,7 +257,8 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
> memcpy(addr, iomap->inline_data, size);
> memset(addr + size, 0, PAGE_SIZE - poff - size);
> kunmap_local(addr);
> - iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
> + iomap_iop_set_range_uptodate(iter->inode, folio, offset,
> + PAGE_SIZE - poff);
Once you make this change, iop is set in this function, but never used.
So you still want to call iomap_page_create() if offset > 0, but you
can ignore the return value. And you don't need to call to_iomap_page().
Or did I miss something elsewhere in this patch series?
* Re: [PATCHv6 5/5] iomap: Add per-block dirty state tracking to improve performance
2023-06-05 1:31 ` [PATCHv6 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
@ 2023-06-05 4:03 ` Matthew Wilcox
2023-06-05 5:20 ` Ritesh Harjani
0 siblings, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2023-06-05 4:03 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-xfs, linux-fsdevel, Dave Chinner, Brian Foster,
Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel,
Aravinda Herle
On Mon, Jun 05, 2023 at 07:01:52AM +0530, Ritesh Harjani (IBM) wrote:
> +static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
> + size_t off, size_t len)
> +{
> + struct iomap_page *iop = to_iomap_page(folio);
> + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> + unsigned int first_blk = off >> inode->i_blkbits;
> + unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> + unsigned int nr_blks = last_blk - first_blk + 1;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&iop->state_lock, flags);
> + bitmap_set(iop->state, first_blk + blks_per_folio, nr_blks);
> + spin_unlock_irqrestore(&iop->state_lock, flags);
> +}
> +
> +static void iomap_iop_set_range_dirty(struct inode *inode, struct folio *folio,
> + size_t off, size_t len)
> +{
> + struct iomap_page *iop = to_iomap_page(folio);
> +
> + if (iop)
> + iop_set_range_dirty(inode, folio, off, len);
> +}
Why are these separate functions? It'd be much better written as:
static void iomap_iop_set_range_dirty(struct inode *inode, struct folio *folio,
size_t off, size_t len)
{
struct iomap_page *iop = to_iomap_page(folio);
unsigned int start, first, last;
unsigned long flags;
if (!iop)
return;
start = i_blocks_per_folio(inode, folio);
first = off >> inode->i_blkbits;
last = (off + len - 1) >> inode->i_blkbits;
spin_lock_irqsave(&iop->state_lock, flags);
bitmap_set(iop->state, start + first, last - first + 1);
spin_unlock_irqrestore(&iop->state_lock, flags);
}
> +static void iop_clear_range_dirty(struct inode *inode, struct folio *folio,
> + size_t off, size_t len)
> +{
> + struct iomap_page *iop = to_iomap_page(folio);
> + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> + unsigned int first_blk = off >> inode->i_blkbits;
> + unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> + unsigned int nr_blks = last_blk - first_blk + 1;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&iop->state_lock, flags);
> + bitmap_clear(iop->state, first_blk + blks_per_folio, nr_blks);
> + spin_unlock_irqrestore(&iop->state_lock, flags);
> +}
> +
> +static void iomap_iop_clear_range_dirty(struct inode *inode,
> + struct folio *folio, size_t off, size_t len)
> +{
> + struct iomap_page *iop = to_iomap_page(folio);
> +
> + if (iop)
> + iop_clear_range_dirty(inode, folio, off, len);
> +}
Similarly
> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> +{
> + struct iomap_page __maybe_unused *iop;
> + struct inode *inode = mapping->host;
> + size_t len = folio_size(folio);
> +
> + iop = iomap_iop_alloc(inode, folio, 0);
Why do you keep doing this? Just throw away the return value from
iomap_iop_alloc(). Don't clutter the source with the unnecessary
variable declaration and annotation that it's not used!
> +static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
> + loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
> + int (*punch)(struct inode *inode, loff_t offset, loff_t length))
> +{
> + struct iomap_page *iop;
> + unsigned int first_blk, last_blk, i;
> + loff_t last_byte;
> + u8 blkbits = inode->i_blkbits;
> + int ret = 0;
> +
> + if (start_byte > *punch_start_byte) {
> + ret = punch(inode, *punch_start_byte,
> + start_byte - *punch_start_byte);
> + if (ret)
> + goto out_err;
> + }
> + /*
> + * When we have per-block dirty tracking, there can be
> + * blocks within a folio which are marked uptodate
> + * but not dirty. In that case it is necessary to punch
> + * out such blocks to avoid leaking any delalloc blocks.
> + */
> + iop = to_iomap_page(folio);
> + if (!iop)
> + goto skip_iop_punch;
> +
> + last_byte = min_t(loff_t, end_byte - 1,
> + (folio_next_index(folio) << PAGE_SHIFT) - 1);
> + first_blk = offset_in_folio(folio, start_byte) >> blkbits;
> + last_blk = offset_in_folio(folio, last_byte) >> blkbits;
> + for (i = first_blk; i <= last_blk; i++) {
> + if (!iop_test_block_dirty(folio, i)) {
> + ret = punch(inode, i << blkbits, 1 << blkbits);
> + if (ret)
> + goto out_err;
> + }
> + }
> +
> +skip_iop_punch:
> + /*
> + * Make sure the next punch start is correctly bound to
> + * the end of this data range, not the end of the folio.
> + */
> + *punch_start_byte = min_t(loff_t, end_byte,
> + folio_next_index(folio) << PAGE_SHIFT);
> +
> + return ret;
> +
> +out_err:
> + folio_unlock(folio);
> + folio_put(folio);
> + return ret;
> +
> +}
> +
> /*
> * Scan the data range passed to us for dirty page cache folios. If we find a
> * dirty folio, punch out the preceeding range and update the offset from which
> @@ -940,26 +1074,9 @@ static int iomap_write_delalloc_scan(struct inode *inode,
> }
>
> /* if dirty, punch up to offset */
> - if (folio_test_dirty(folio)) {
> - if (start_byte > *punch_start_byte) {
> - int error;
> -
> - error = punch(inode, *punch_start_byte,
> - start_byte - *punch_start_byte);
> - if (error) {
> - folio_unlock(folio);
> - folio_put(folio);
> - return error;
> - }
> - }
> -
> - /*
> - * Make sure the next punch start is correctly bound to
> - * the end of this data range, not the end of the folio.
> - */
> - *punch_start_byte = min_t(loff_t, end_byte,
> - folio_next_index(folio) << PAGE_SHIFT);
> - }
> + if (folio_test_dirty(folio))
> + iomap_write_delalloc_punch(inode, folio, punch_start_byte,
> + start_byte, end_byte, punch);
>
> /* move offset to start of next folio in range */
> start_byte = folio_next_index(folio) << PAGE_SHIFT;
I'm having trouble following this refactoring + modification. Perhaps
I'm just tired.
* Re: [PATCHv6 3/5] iomap: Refactor some iop related accessor functions
2023-06-05 3:33 ` Matthew Wilcox
@ 2023-06-05 5:16 ` Ritesh Harjani
0 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2023-06-05 5:16 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-xfs, linux-fsdevel, Dave Chinner, Brian Foster,
Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel
On Mon, Jun 5, 2023 at 9:03 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jun 05, 2023 at 07:01:50AM +0530, Ritesh Harjani (IBM) wrote:
> > @@ -214,7 +231,7 @@ struct iomap_readpage_ctx {
> > static int iomap_read_inline_data(const struct iomap_iter *iter,
> > struct folio *folio)
> > {
> > - struct iomap_page *iop;
> > + struct iomap_page __maybe_unused *iop;
>
> Ummm ... definitely unused, right?
>
Yes, I will fix it in the next rev. Will send it out soon.
> > const struct iomap *iomap = iomap_iter_srcmap(iter);
> > size_t size = i_size_read(iter->inode) - iomap->offset;
> > size_t poff = offset_in_page(iomap->offset);
> > @@ -240,7 +257,8 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
> > memcpy(addr, iomap->inline_data, size);
> > memset(addr + size, 0, PAGE_SIZE - poff - size);
> > kunmap_local(addr);
> > - iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
> > + iomap_iop_set_range_uptodate(iter->inode, folio, offset,
> > + PAGE_SIZE - poff);
>
> Once you make this change, iop is set in this function, but never used.
> So you still want to call iomap_page_create() if offset > 0, but you
> can ignore the return value. And you don't need to call to_iomap_page().
>
> Or did I miss something elsewhere in this patch series?
No, I added __maybe_unused earlier to avoid W=1 warnings and then
forgot to fix that part of the code before sending.
-ritesh
* Re: [PATCHv6 5/5] iomap: Add per-block dirty state tracking to improve performance
2023-06-05 4:03 ` Matthew Wilcox
@ 2023-06-05 5:20 ` Ritesh Harjani
0 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2023-06-05 5:20 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-xfs, linux-fsdevel, Dave Chinner, Brian Foster,
Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo, Disha Goel,
Aravinda Herle
On Mon, Jun 5, 2023 at 9:33 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jun 05, 2023 at 07:01:52AM +0530, Ritesh Harjani (IBM) wrote:
> > +static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
> > + size_t off, size_t len)
> > +{
> > + struct iomap_page *iop = to_iomap_page(folio);
> > + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> > + unsigned int first_blk = off >> inode->i_blkbits;
> > + unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> > + unsigned int nr_blks = last_blk - first_blk + 1;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&iop->state_lock, flags);
> > + bitmap_set(iop->state, first_blk + blks_per_folio, nr_blks);
> > + spin_unlock_irqrestore(&iop->state_lock, flags);
> > +}
> > +
> > +static void iomap_iop_set_range_dirty(struct inode *inode, struct folio *folio,
> > + size_t off, size_t len)
> > +{
> > + struct iomap_page *iop = to_iomap_page(folio);
> > +
> > + if (iop)
> > + iop_set_range_dirty(inode, folio, off, len);
> > +}
>
> Why are these separate functions? It'd be much better written as:
It was discussed here [1]; the preference was to have it in a
separate helper.
[1]: https://lore.kernel.org/linux-xfs/ZGYnzcoGuzWKa7lh@infradead.org/
>
> static void iomap_iop_set_range_dirty(struct inode *inode, struct folio *folio,
> size_t off, size_t len)
> {
> struct iomap_page *iop = to_iomap_page(folio);
> unsigned int start, first, last;
> unsigned long flags;
>
> if (!iop)
> return;
>
> start = i_blocks_per_folio(inode, folio);
> first = off >> inode->i_blkbits;
> last = (off + len - 1) >> inode->i_blkbits;
>
> spin_lock_irqsave(&iop->state_lock, flags);
> bitmap_set(iop->state, start + first, last - first + 1);
> spin_unlock_irqrestore(&iop->state_lock, flags);
> }
>
> > +static void iop_clear_range_dirty(struct inode *inode, struct folio *folio,
> > + size_t off, size_t len)
> > +{
> > + struct iomap_page *iop = to_iomap_page(folio);
> > + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> > + unsigned int first_blk = off >> inode->i_blkbits;
> > + unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> > + unsigned int nr_blks = last_blk - first_blk + 1;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&iop->state_lock, flags);
> > + bitmap_clear(iop->state, first_blk + blks_per_folio, nr_blks);
> > + spin_unlock_irqrestore(&iop->state_lock, flags);
> > +}
> > +
> > +static void iomap_iop_clear_range_dirty(struct inode *inode,
> > + struct folio *folio, size_t off, size_t len)
> > +{
> > + struct iomap_page *iop = to_iomap_page(folio);
> > +
> > + if (iop)
> > + iop_clear_range_dirty(inode, folio, off, len);
> > +}
>
> Similarly
>
> > +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> > +{
> > + struct iomap_page __maybe_unused *iop;
> > + struct inode *inode = mapping->host;
> > + size_t len = folio_size(folio);
> > +
> > + iop = iomap_iop_alloc(inode, folio, 0);
>
> Why do you keep doing this? Just throw away the return value from
> iomap_iop_alloc(). Don't clutter the source with the unnecessary
> variable declaration and annotation that it's not used!
>
Sorry, it got left over. I will quickly fix this.
> > +static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
> > + loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
> > + int (*punch)(struct inode *inode, loff_t offset, loff_t length))
> > +{
> > + struct iomap_page *iop;
> > + unsigned int first_blk, last_blk, i;
> > + loff_t last_byte;
> > + u8 blkbits = inode->i_blkbits;
> > + int ret = 0;
> > +
> > + if (start_byte > *punch_start_byte) {
> > + ret = punch(inode, *punch_start_byte,
> > + start_byte - *punch_start_byte);
> > + if (ret)
> > + goto out_err;
> > + }
> > + /*
> > + * When we have per-block dirty tracking, there can be
> > + * blocks within a folio which are marked uptodate
> > + * but not dirty. In that case it is necessary to punch
> > + * out such blocks to avoid leaking any delalloc blocks.
> > + */
> > + iop = to_iomap_page(folio);
> > + if (!iop)
> > + goto skip_iop_punch;
> > +
> > + last_byte = min_t(loff_t, end_byte - 1,
> > + (folio_next_index(folio) << PAGE_SHIFT) - 1);
> > + first_blk = offset_in_folio(folio, start_byte) >> blkbits;
> > + last_blk = offset_in_folio(folio, last_byte) >> blkbits;
> > + for (i = first_blk; i <= last_blk; i++) {
> > + if (!iop_test_block_dirty(folio, i)) {
> > + ret = punch(inode, i << blkbits, 1 << blkbits);
> > + if (ret)
> > + goto out_err;
> > + }
> > + }
> > +
> > +skip_iop_punch:
> > + /*
> > + * Make sure the next punch start is correctly bound to
> > + * the end of this data range, not the end of the folio.
> > + */
> > + *punch_start_byte = min_t(loff_t, end_byte,
> > + folio_next_index(folio) << PAGE_SHIFT);
> > +
> > + return ret;
> > +
> > +out_err:
> > + folio_unlock(folio);
> > + folio_put(folio);
> > + return ret;
> > +
> > +}
> > +
> > /*
> > * Scan the data range passed to us for dirty page cache folios. If we find a
> > * dirty folio, punch out the preceeding range and update the offset from which
> > @@ -940,26 +1074,9 @@ static int iomap_write_delalloc_scan(struct inode *inode,
> > }
> >
> > /* if dirty, punch up to offset */
> > - if (folio_test_dirty(folio)) {
> > - if (start_byte > *punch_start_byte) {
> > - int error;
> > -
> > - error = punch(inode, *punch_start_byte,
> > - start_byte - *punch_start_byte);
> > - if (error) {
> > - folio_unlock(folio);
> > - folio_put(folio);
> > - return error;
> > - }
> > - }
> > -
> > - /*
> > - * Make sure the next punch start is correctly bound to
> > - * the end of this data range, not the end of the folio.
> > - */
> > - *punch_start_byte = min_t(loff_t, end_byte,
> > - folio_next_index(folio) << PAGE_SHIFT);
> > - }
> > + if (folio_test_dirty(folio))
> > + iomap_write_delalloc_punch(inode, folio, punch_start_byte,
> > + start_byte, end_byte, punch);
> >
> > /* move offset to start of next folio in range */
> > start_byte = folio_next_index(folio) << PAGE_SHIFT;
>
> I'm having trouble following this refactoring + modification. Perhaps
> I'm just tired.
>
Let me refactor this part out in the next revision.
-ritesh
* Re: [PATCHv6 0/5] iomap: Add support for per-block dirty state to improve write performance
2023-06-05 1:31 [PATCHv6 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
` (4 preceding siblings ...)
2023-06-05 1:31 ` [PATCHv6 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
@ 2023-06-06 12:17 ` Andreas Grünbacher
5 siblings, 0 replies; 11+ messages in thread
From: Andreas Grünbacher @ 2023-06-06 12:17 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
Ojaswin Mujoo, Disha Goel
Ritesh,
On Mon, Jun 5, 2023 at 3:33 AM Ritesh Harjani (IBM)
<ritesh.list@gmail.com> wrote:
> Hello All,
>
> Please find PATCHv6 which adds per-block dirty tracking to iomap.
> As discussed earlier this is required to improve write performance and reduce
> write amplification for cases where either blocksize is less than pagesize (such
> as Power platform with 64k pagesize) or when we have a large folio (such as xfs
> which currently supports large folio).
>
> RFCv5 -> PATCHv6:
> =================
> 1. Addresses review comments from Brian, Christoph and Matthew.
> @Christoph:
> - I have renamed the higher level functions such as iop_alloc/iop_free() to
> iomap_iop_alloc/free() in v6.
> - As for the low level bitmap accessor functions I couldn't find any better
> naming then iop_test_/set/clear_**. I could have gone for
> iomap_iop__test/set/clear/_** or iomap__iop_test/set/clear_**, but
> I wasn't convinced with either of above as it also increases function
> name.
> Besides iop_test/set_clear_ accessors functions for uptodate and dirty
> status tracking make sense as we are sure we have a valid iop in such
> cases. Please do let me know if this looks ok to you.
> 2. I tried testing gfs2 (initially with no patches) with xfstests. But I always ended up
> in some or the other deadlock (I couldn't spend any time debugging that).
> I also ran it with -x log, but still it was always failing for me.
> @Andreas:
> - could you please suggest how can I test gfs2 with these patches. I see gfs2
> can have a smaller blocksize and it uses iomap buffered io path. It will be
> good if we can get these patches tested on it too.
here's a minimal list of tests we're running automatically on a daily basis:
https://gitlab.com/redhat/centos-stream/tests/kernel/kernel-tests/-/blob/main/filesystems/xfs/xfstests/RUNTESTS.gfs2
Please note that the inode_to_wb() function in <linux/backing-dev.h>
includes some asserts that are active when CONFIG_LOCKDEP is enabled.
Those asserts will cause every single fstest to fail, so either
disable CONFIG_LOCKDEP or remove the asserts in inode_to_wb() for now.
Thanks,
Andreas
> 3. I can now say I have run some good amount of fstests on these patches on
> these platforms and I haven't found any new failure in my testing so far.
> arm64 (64k pagesize): with 4k -g quick
> Power: with 4k -g auto
> x86: 1k, 4k with -g auto and adv_auto
>
> From my testing so far these patches looks stable to me and if this looks good
> to reviewers as well, do you think this can be queued to linux-next for wider
> testing?
>
> Performance numbers copied from last patch commit message
> ==================================================
> Performance testing of below fio workload reveals ~16x performance
> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
>
> 1. <test_randwrite.fio>
> [global]
> ioengine=psync
> rw=randwrite
> overwrite=1
> pre_read=1
> direct=0
> bs=4k
> size=1G
> dir=./
> numjobs=8
> fdatasync=1
> runtime=60
> iodepth=64
> group_reporting=1
>
> [fio-run]
>
> 2. Also our internal performance team reported that this patch improves
> their database workload performance by around ~83% (with XFS on Power)
>
> Ritesh Harjani (IBM) (5):
> iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free()
> iomap: Move folio_detach_private() in iomap_iop_free() to the end
> iomap: Refactor some iop related accessor functions
> iomap: Allocate iop in ->write_begin() early
> iomap: Add per-block dirty state tracking to improve performance
>
> fs/gfs2/aops.c | 2 +-
> fs/iomap/buffered-io.c | 309 ++++++++++++++++++++++++++++++-----------
> fs/xfs/xfs_aops.c | 2 +-
> fs/zonefs/file.c | 2 +-
> include/linux/iomap.h | 1 +
> 5 files changed, 235 insertions(+), 81 deletions(-)
>
> --
> 2.40.1
>