* [PATCH 0/6] iomap: some minor non-critical fixes and improvements when block size < folio size
@ 2024-07-31 9:12 Zhang Yi
2024-07-31 9:13 ` [PATCH 1/6] iomap: correct the range of a partial dirty clear Zhang Yi
` (5 more replies)
0 siblings, 6 replies; 19+ messages in thread
From: Zhang Yi @ 2024-07-31 9:12 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, jack, yi.zhang,
yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
This series contains some minor non-critical fixes and performance
improvements for filesystems with block size < folio size.
The first 4 patches fix the handling of setting and clearing folio ifs
dirty bits when marking the folio dirty and when invalidating the folio.
Although none of these mistakes causes a real problem now, they still
deserve a fix to correct the behavior.
The last 2 patches drop the unnecessary state_lock in ifs when setting
and clearing dirty/uptodate bits in the buffered write path, which could
improve buffered write performance by about 10% on my machine. I tested
it with UnixBench on my x86_64 (Xeon Gold 6151) and arm64 (Kunpeng-920)
virtual machines with a 50GB ramdisk and an xfs filesystem; the results
are shown below.
UnixBench test cmd:
./Run -i 1 -c 1 fstime-w
Before:
  x86    File Write 1024 bufsize 2000 maxblocks    524708.0 KBps
  arm64  File Write 1024 bufsize 2000 maxblocks    801965.0 KBps
After:
  x86    File Write 1024 bufsize 2000 maxblocks    571315.0 KBps
  arm64  File Write 1024 bufsize 2000 maxblocks    876077.0 KBps
Thanks,
Yi.
Zhang Yi (6):
iomap: correct the range of a partial dirty clear
iomap: support invalidating partial folios
iomap: advance the ifs allocation if we have more than one block per
folio
iomap: correct the dirty length in page mkwrite
iomap: drop unnecessary state_lock when setting ifs uptodate bits
iomap: drop unnecessary state_lock when changing ifs dirty bits
fs/iomap/buffered-io.c | 55 ++++++++++++++++++++++++++----------------
1 file changed, 34 insertions(+), 21 deletions(-)
--
2.39.2
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/6] iomap: correct the range of a partial dirty clear
2024-07-31 9:12 [PATCH 0/6] iomap: some minor non-critical fixes and improvements when block size < folio size Zhang Yi
@ 2024-07-31 9:13 ` Zhang Yi
2024-07-31 9:13 ` [PATCH 2/6] iomap: support invalidating partial folios Zhang Yi
` (4 subsequent siblings)
5 siblings, 0 replies; 19+ messages in thread
From: Zhang Yi @ 2024-07-31 9:13 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, jack, yi.zhang,
yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
The block range calculation in ifs_clear_range_dirty() is incorrect when
partially clearing a range in a folio. We must not clear the dirty bit of
the first or the last block if the start or end offset is not blocksize
aligned; this has not caused any issue yet since we always clear a whole
folio in iomap_writepage_map()->iomap_clear_range_dirty(). Fix this by
rounding up the first block, rounding down the last block, and correcting
the calculation of nr_blks.
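For example, with a 1k block size (i_blkbits == 10), partially clearing
off = 512, len = 2048 in a 4k folio works out as follows (a worked
example for illustration):
  old: first_blk = 512 >> 10 = 0
       last_blk  = (512 + 2048 - 1) >> 10 = 2
       nr_blks   = 3   /* wrongly clears the partial blocks 0 and 2 */
  new: first_blk = DIV_ROUND_UP(512, 1024) = 1
       last_blk  = (512 + 2048) >> 10 = 2
       nr_blks   = 1   /* clears only the fully covered block 1 */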
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/iomap/buffered-io.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d46558990279..a896d15c191a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -138,11 +138,14 @@ static void ifs_clear_range_dirty(struct folio *folio,
{
struct inode *inode = folio->mapping->host;
unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
- unsigned int first_blk = (off >> inode->i_blkbits);
- unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
- unsigned int nr_blks = last_blk - first_blk + 1;
+ unsigned int first_blk = DIV_ROUND_UP(off, i_blocksize(inode));
+ unsigned int last_blk = (off + len) >> inode->i_blkbits;
+ unsigned int nr_blks = last_blk - first_blk;
unsigned long flags;
+ if (!nr_blks)
+ return;
+
spin_lock_irqsave(&ifs->state_lock, flags);
bitmap_clear(ifs->state, first_blk + blks_per_folio, nr_blks);
spin_unlock_irqrestore(&ifs->state_lock, flags);
--
2.39.2
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 2/6] iomap: support invalidating partial folios
2024-07-31 9:12 [PATCH 0/6] iomap: some minor non-critical fixes and improvements when block size < folio size Zhang Yi
2024-07-31 9:13 ` [PATCH 1/6] iomap: correct the range of a partial dirty clear Zhang Yi
@ 2024-07-31 9:13 ` Zhang Yi
2024-07-31 9:13 ` [PATCH 3/6] iomap: advance the ifs allocation if we have more than one block per folio Zhang Yi
` (3 subsequent siblings)
5 siblings, 0 replies; 19+ messages in thread
From: Zhang Yi @ 2024-07-31 9:13 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, jack, yi.zhang,
yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
Currently, iomap_invalidate_folio() can only invalidate an entire folio.
If we truncate a partial folio on a filesystem with blocksize < folio
size, it leaves the dirty bits of the truncated/punched blocks set, and
the writeback process will try to map the invalid hole range. Fortunately
this hasn't triggered any real problems yet since ->map_blocks() will fix
up the length. Fix this by supporting the invalidation of partial folios.
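For example, with 4k blocks in a dirty 16k folio, punching a hole over
bytes [4096, 12288) now reaches the new else branch and clears only the
dirty bits of blocks 1 and 2, leaving blocks 0 and 3 dirty (a sketch of
the intended behavior, assuming a block-aligned offset and length):
  iomap_invalidate_folio(folio, 4096, 8192);
  /* -> iomap_clear_range_dirty(folio, 4096, 8192) */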
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/iomap/buffered-io.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index a896d15c191a..64c4808fab31 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -631,6 +631,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
WARN_ON_ONCE(folio_test_writeback(folio));
folio_cancel_dirty(folio);
ifs_free(folio);
+ } else {
+ iomap_clear_range_dirty(folio, offset, len);
}
}
EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
--
2.39.2
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 3/6] iomap: advance the ifs allocation if we have more than one block per folio
2024-07-31 9:12 [PATCH 0/6] iomap: some minor non-critical fixes and improvements when block size < folio size Zhang Yi
2024-07-31 9:13 ` [PATCH 1/6] iomap: correct the range of a partial dirty clear Zhang Yi
2024-07-31 9:13 ` [PATCH 2/6] iomap: support invalidating partial folios Zhang Yi
@ 2024-07-31 9:13 ` Zhang Yi
2024-07-31 9:13 ` [PATCH 4/6] iomap: correct the dirty length in page mkwrite Zhang Yi
` (2 subsequent siblings)
5 siblings, 0 replies; 19+ messages in thread
From: Zhang Yi @ 2024-07-31 9:13 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, jack, yi.zhang,
yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
Now we allocate an ifs when writing back dirty folios in
iomap_writepage_map() if i_blocks_per_folio is larger than one, so a
buffered write that covers an entire folio doesn't attach an ifs until
the folio starts writing back. If we partially truncate that folio before
then, iomap_invalidate_folio() can't clear the corresponding blocks'
dirty bits as expected. Fix this by advancing the ifs allocation to
__iomap_write_begin().
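A rough sketch of the problematic sequence (assuming 4k blocks in a 16k
folio):
  1. a write() covers the entire folio, so __iomap_write_begin() returns
     early without calling ifs_alloc()
  2. a partial truncate calls iomap_invalidate_folio(folio, 8192, 8192),
     but with no ifs attached there are no per-block dirty bits to clear
  3. when writeback later allocates the ifs, it marks the whole range
     dirty again, including the already invalidated blocks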
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/iomap/buffered-io.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 64c4808fab31..ec17bf8d62e9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -686,6 +686,12 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
size_t from = offset_in_folio(folio, pos), to = from + len;
size_t poff, plen;
+ if (nr_blocks > 1) {
+ ifs = ifs_alloc(iter->inode, folio, iter->flags);
+ if ((iter->flags & IOMAP_NOWAIT) && !ifs)
+ return -EAGAIN;
+ }
+
/*
* If the write or zeroing completely overlaps the current folio, then
* entire folio will be dirtied so there is no need for
@@ -697,10 +703,6 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
pos + len >= folio_pos(folio) + folio_size(folio))
return 0;
- ifs = ifs_alloc(iter->inode, folio, iter->flags);
- if ((iter->flags & IOMAP_NOWAIT) && !ifs && nr_blocks > 1)
- return -EAGAIN;
-
if (folio_test_uptodate(folio))
return 0;
folio_clear_error(folio);
@@ -1913,7 +1915,12 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
WARN_ON_ONCE(end_pos <= pos);
if (i_blocks_per_folio(inode, folio) > 1) {
- if (!ifs) {
+ /*
+ * This should not happen since we always allocate an ifs in
+ * iomap_folio_mkwrite_iter() and in __iomap_write_begin()
+ * when there is more than one block per folio.
+ */
+ if (WARN_ON_ONCE(!ifs)) {
ifs = ifs_alloc(inode, folio, 0);
iomap_set_range_dirty(folio, 0, end_pos - pos);
}
--
2.39.2
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 4/6] iomap: correct the dirty length in page mkwrite
2024-07-31 9:12 [PATCH 0/6] iomap: some minor non-critical fixes and improvements when block size < folio size Zhang Yi
` (2 preceding siblings ...)
2024-07-31 9:13 ` [PATCH 3/6] iomap: advance the ifs allocation if we have more than one block per folio Zhang Yi
@ 2024-07-31 9:13 ` Zhang Yi
2024-07-31 9:13 ` [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits Zhang Yi
2024-07-31 9:13 ` [PATCH 6/6] iomap: drop unnecessary state_lock when changing ifs dirty bits Zhang Yi
5 siblings, 0 replies; 19+ messages in thread
From: Zhang Yi @ 2024-07-31 9:13 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, jack, yi.zhang,
yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
When doing a page mkwrite, iomap_folio_mkwrite_iter() dirties the entire
folio through folio_mark_dirty() even if the mapped length is shorter
than one folio. However, on a filesystem with more than one block per
folio, we'd better only set the corresponding blocks' dirty bits
according to iomap_length(), so open code folio_mark_dirty() and pass
the correct length.
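For reference, folio_mark_dirty() on an iomap address space ends up in
iomap_dirty_folio(), which performs the same three steps but always with
len = folio_size(folio):
  bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
  {
          struct inode *inode = mapping->host;
          size_t len = folio_size(folio);

          ifs_alloc(inode, folio, 0);
          iomap_set_range_dirty(folio, 0, len);
          return filemap_dirty_folio(mapping, folio);
  }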
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/iomap/buffered-io.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ec17bf8d62e9..f5668895df66 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1475,7 +1475,10 @@ static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter,
block_commit_write(&folio->page, 0, length);
} else {
WARN_ON_ONCE(!folio_test_uptodate(folio));
- folio_mark_dirty(folio);
+
+ ifs_alloc(iter->inode, folio, 0);
+ iomap_set_range_dirty(folio, 0, length);
+ filemap_dirty_folio(iter->inode->i_mapping, folio);
}
return length;
--
2.39.2
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-07-31 9:12 [PATCH 0/6] iomap: some minor non-critical fixes and improvements when block size < folio size Zhang Yi
` (3 preceding siblings ...)
2024-07-31 9:13 ` [PATCH 4/6] iomap: correct the dirty length in page mkwrite Zhang Yi
@ 2024-07-31 9:13 ` Zhang Yi
2024-07-31 16:52 ` Matthew Wilcox
2024-08-02 0:05 ` Dave Chinner
2024-07-31 9:13 ` [PATCH 6/6] iomap: drop unnecessary state_lock when changing ifs dirty bits Zhang Yi
5 siblings, 2 replies; 19+ messages in thread
From: Zhang Yi @ 2024-07-31 9:13 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, jack, yi.zhang,
yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
race issue when submitting multiple read bios for a page that spans more than
one file system block by adding a spinlock (which is now named state_lock)
to make setting the page uptodate synchronous. However, the race condition
only happens between the read I/O submitting and completing threads, so it's
sufficient to use page lock to protect other paths, e.g. the buffered write
path. After large folios are supported, the spinlock could affect the
buffered write performance even more, so dropping it could reduce some
unnecessary locking overhead.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/iomap/buffered-io.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f5668895df66..248f4a586f8f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -73,14 +73,10 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
size_t len)
{
struct iomap_folio_state *ifs = folio->private;
- unsigned long flags;
bool uptodate = true;
- if (ifs) {
- spin_lock_irqsave(&ifs->state_lock, flags);
+ if (ifs)
uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
- spin_unlock_irqrestore(&ifs->state_lock, flags);
- }
if (uptodate)
folio_mark_uptodate(folio);
@@ -395,7 +391,18 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
if (iomap_block_needs_zeroing(iter, pos)) {
folio_zero_range(folio, poff, plen);
- iomap_set_range_uptodate(folio, poff, plen);
+ if (ifs) {
+ /*
+ * Hold the state_lock since we may submit multiple read
+ * bios for the same locked folio and then get their
+ * completion callbacks in parallel.
+ */
+ spin_lock_irq(&ifs->state_lock);
+ iomap_set_range_uptodate(folio, poff, plen);
+ spin_unlock_irq(&ifs->state_lock);
+ } else {
+ folio_mark_uptodate(folio);
+ }
goto done;
}
--
2.39.2
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 6/6] iomap: drop unnecessary state_lock when changing ifs dirty bits
2024-07-31 9:12 [PATCH 0/6] iomap: some minor non-critical fixes and improvements when block size < folio size Zhang Yi
` (4 preceding siblings ...)
2024-07-31 9:13 ` [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits Zhang Yi
@ 2024-07-31 9:13 ` Zhang Yi
5 siblings, 0 replies; 19+ messages in thread
From: Zhang Yi @ 2024-07-31 9:13 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, jack, yi.zhang,
yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
Holding the state_lock when setting and clearing ifs dirty bits is
unnecessary since both paths are protected by the folio lock, so it's
safe to drop the state_lock, which could reduce some unnecessary locking
overhead and improve buffered write performance a bit.
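A simplified sketch of the two paths this relies on (call chains
abridged for illustration; both run with the folio locked):
  buffered write:                    writeback:
    iomap_write_iter()                 iomap_writepage_map()
      __iomap_write_end()                iomap_clear_range_dirty()
        iomap_set_range_dirty()            ifs_clear_range_dirty()
          ifs_set_range_dirty()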
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/iomap/buffered-io.c | 9 ---------
1 file changed, 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 248f4a586f8f..22ce6062cfd1 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -137,14 +137,8 @@ static void ifs_clear_range_dirty(struct folio *folio,
unsigned int first_blk = DIV_ROUND_UP(off, i_blocksize(inode));
unsigned int last_blk = (off + len) >> inode->i_blkbits;
unsigned int nr_blks = last_blk - first_blk;
- unsigned long flags;
- if (!nr_blks)
- return;
-
- spin_lock_irqsave(&ifs->state_lock, flags);
bitmap_clear(ifs->state, first_blk + blks_per_folio, nr_blks);
- spin_unlock_irqrestore(&ifs->state_lock, flags);
}
static void iomap_clear_range_dirty(struct folio *folio, size_t off, size_t len)
@@ -163,11 +157,8 @@ static void ifs_set_range_dirty(struct folio *folio,
unsigned int first_blk = (off >> inode->i_blkbits);
unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
unsigned int nr_blks = last_blk - first_blk + 1;
- unsigned long flags;
- spin_lock_irqsave(&ifs->state_lock, flags);
bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
- spin_unlock_irqrestore(&ifs->state_lock, flags);
}
static void iomap_set_range_dirty(struct folio *folio, size_t off, size_t len)
--
2.39.2
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-07-31 9:13 ` [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits Zhang Yi
@ 2024-07-31 16:52 ` Matthew Wilcox
2024-08-01 1:52 ` Zhang Yi
2024-08-02 0:05 ` Dave Chinner
1 sibling, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2024-07-31 16:52 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
david, jack, yi.zhang, chengzhihao1, yukuai3
On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
> race issue when submitting multiple read bios for a page that spans more than
> one file system block by adding a spinlock (which is now named state_lock)
> to make setting the page uptodate synchronous. However, the race condition
> only happens between the read I/O submitting and completing threads, so it's
> sufficient to use page lock to protect other paths, e.g. the buffered write
> path. After large folios are supported, the spinlock could affect the
> buffered write performance even more, so dropping it could reduce some
> unnecessary locking overhead.
This patch doesn't work. If we get two read completions at the same
time for blocks belonging to the same folio, they will both write to
the uptodate array at the same time.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-07-31 16:52 ` Matthew Wilcox
@ 2024-08-01 1:52 ` Zhang Yi
2024-08-01 4:24 ` Matthew Wilcox
0 siblings, 1 reply; 19+ messages in thread
From: Zhang Yi @ 2024-08-01 1:52 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
david, jack, yi.zhang, chengzhihao1, yukuai3
On 2024/8/1 0:52, Matthew Wilcox wrote:
> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
>> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
>> race issue when submitting multiple read bios for a page that spans more than
>> one file system block by adding a spinlock (which is now named state_lock)
>> to make setting the page uptodate synchronous. However, the race condition
>> only happens between the read I/O submitting and completing threads, so it's
>> sufficient to use page lock to protect other paths, e.g. the buffered write
>> path. After large folios are supported, the spinlock could affect the
>> buffered write performance even more, so dropping it could reduce some
>> unnecessary locking overhead.
>
> This patch doesn't work. If we get two read completions at the same
> time for blocks belonging to the same folio, they will both write to
> the uptodate array at the same time.
>
This patch just drops the state_lock in the buffered write path and
doesn't affect the read path; the uptodate setting in the read completion
path is still protected by the state_lock, please see
iomap_finish_folio_read(). So I think this patch doesn't affect the case
you mentioned, or am I missing something?
Thanks,
Yi.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-08-01 1:52 ` Zhang Yi
@ 2024-08-01 4:24 ` Matthew Wilcox
2024-08-01 9:19 ` Zhang Yi
0 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2024-08-01 4:24 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
david, jack, yi.zhang, chengzhihao1, yukuai3
On Thu, Aug 01, 2024 at 09:52:49AM +0800, Zhang Yi wrote:
> On 2024/8/1 0:52, Matthew Wilcox wrote:
> > On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
> >> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
> >> race issue when submitting multiple read bios for a page that spans more than
> >> one file system block by adding a spinlock (which is now named state_lock)
> >> to make setting the page uptodate synchronous. However, the race condition
> >> only happens between the read I/O submitting and completing threads, so it's
> >> sufficient to use page lock to protect other paths, e.g. the buffered write
> >> path. After large folios are supported, the spinlock could affect the
> >> buffered write performance even more, so dropping it could reduce some
> >> unnecessary locking overhead.
> >
> > This patch doesn't work. If we get two read completions at the same
> > time for blocks belonging to the same folio, they will both write to
> > the uptodate array at the same time.
> >
> This patch just drops the state_lock in the buffered write path and
> doesn't affect the read path; the uptodate setting in the read completion
> path is still protected by the state_lock, please see
> iomap_finish_folio_read(). So I think this patch doesn't affect the case
> you mentioned, or am I missing something?
Oh, I see. So the argument for locking correctness is that:
A. If ifs_set_range_uptodate() is called from iomap_finish_folio_read(),
the state_lock is held.
B. If ifs_set_range_uptodate() is called from iomap_set_range_uptodate(),
either we know:
B1. The caller of iomap_set_range_uptodate() holds the folio lock, and this
is the only place that can call ifs_set_range_uptodate() for this folio
B2. The caller of iomap_set_range_uptodate() holds the state lock
But I think you've assigned iomap_read_inline_data() to case B1 when I
think it's B2. erofs can certainly have a file which consists of various
blocks elsewhere in the file and then a tail that is stored inline.
__iomap_write_begin() is case B1 because it holds the folio lock, and
submits its read(s) synchronously. Likewise __iomap_write_end() is
case B1.
But, um. Why do we need to call iomap_set_range_uptodate() in both
write_begin() and write_end()?
And I think this is actively buggy:
if (iomap_block_needs_zeroing(iter, block_start)) {
if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
return -EIO;
folio_zero_segments(folio, poff, from, to, poff + plen);
...
iomap_set_range_uptodate(folio, poff, plen);
because we zero from 'poff' to 'from', then from 'to' to 'poff+plen',
but mark the entire range as uptodate. And once a range is marked
as uptodate, it can be read from.
So we can do this:
- Get a write request for bytes 1-4094 over a hole
- allocate single page folio
- zero bytes 0 and 4095
- mark 0-4095 as uptodate
- take page fault while trying to access the user address
- read() to bytes 0-4095 now succeeds even though we haven't written
1-4094 yet
And that page fault can be uffd or a buffer that's in an mmap that's
out on disc. Plenty of time to make this race happen, and we leak
4094/4096 bytes of the previous contents of that folio to userspace.
Or did I miss something?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-08-01 4:24 ` Matthew Wilcox
@ 2024-08-01 9:19 ` Zhang Yi
0 siblings, 0 replies; 19+ messages in thread
From: Zhang Yi @ 2024-08-01 9:19 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
david, jack, yi.zhang, chengzhihao1, yukuai3
On 2024/8/1 12:24, Matthew Wilcox wrote:
> On Thu, Aug 01, 2024 at 09:52:49AM +0800, Zhang Yi wrote:
>> On 2024/8/1 0:52, Matthew Wilcox wrote:
>>> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
>>>> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
>>>> race issue when submitting multiple read bios for a page that spans more than
>>>> one file system block by adding a spinlock (which is now named state_lock)
>>>> to make setting the page uptodate synchronous. However, the race condition
>>>> only happens between the read I/O submitting and completing threads, so it's
>>>> sufficient to use page lock to protect other paths, e.g. the buffered write
>>>> path. After large folios are supported, the spinlock could affect the
>>>> buffered write performance even more, so dropping it could reduce some
>>>> unnecessary locking overhead.
>>>
>>> This patch doesn't work. If we get two read completions at the same
>>> time for blocks belonging to the same folio, they will both write to
>>> the uptodate array at the same time.
>>>
>> This patch just drops the state_lock in the buffered write path and
>> doesn't affect the read path; the uptodate setting in the read completion
>> path is still protected by the state_lock, please see
>> iomap_finish_folio_read(). So I think this patch doesn't affect the case
>> you mentioned, or am I missing something?
>
> Oh, I see. So the argument for locking correctness is that:
>
> A. If ifs_set_range_uptodate() is called from iomap_finish_folio_read(),
> the state_lock is held.
> B. If ifs_set_range_uptodate() is called from iomap_set_range_uptodate(),
> either we know:
> B1. The caller of iomap_set_range_uptodate() holds the folio lock, and this
> is the only place that can call ifs_set_range_uptodate() for this folio
> B2. The caller of iomap_set_range_uptodate() holds the state lock
>
> But I think you've assigned iomap_read_inline_data() to case B1 when I
> think it's B2. erofs can certainly have a file which consists of various
> blocks elsewhere in the file and then a tail that is stored inline.
Oh, you are right, thanks for pointing this out. I missed the case of
having both file blocks and inline data in one folio on erofs. So we
also need to hold the state_lock in iomap_read_inline_data(); it looks
like we'd better introduce a new common helper to do this job for B2.
>
> __iomap_write_begin() is case B1 because it holds the folio lock, and
> submits its read(s) synchronously. Likewise __iomap_write_end() is
> case B1.
>
> But, um. Why do we need to call iomap_set_range_uptodate() in both
> write_begin() and write_end()?
>
> And I think this is actively buggy:
>
> if (iomap_block_needs_zeroing(iter, block_start)) {
> if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
> return -EIO;
> folio_zero_segments(folio, poff, from, to, poff + plen);
> ...
> iomap_set_range_uptodate(folio, poff, plen);
>
> because we zero from 'poff' to 'from', then from 'to' to 'poff+plen',
> but mark the entire range as uptodate. And once a range is marked
> as uptodate, it can be read from.
>
> So we can do this:
>
> - Get a write request for bytes 1-4094 over a hole
> - allocate single page folio
> - zero bytes 0 and 4095
> - mark 0-4095 as uptodate
> - take page fault while trying to access the user address
> - read() to bytes 0-4095 now succeeds even though we haven't written
> 1-4094 yet
>
> And that page fault can be uffd or a buffer that's in an mmap that's
> out on disc. Plenty of time to make this race happen, and we leak
> 4094/4096 bytes of the previous contents of that folio to userspace.
>
> Or did I miss something?
>
Indeed, this could happen on a filesystem that doesn't take the inode
lock in the buffered read path (I've verified it on my ext4 buffered
iomap branch), and I guess it could also happen after a short copy in
the write path. We don't need iomap_set_range_uptodate() for the
zeroing case in __iomap_write_begin().
Thanks,
Yi.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-07-31 9:13 ` [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits Zhang Yi
2024-07-31 16:52 ` Matthew Wilcox
@ 2024-08-02 0:05 ` Dave Chinner
2024-08-02 2:57 ` Zhang Yi
1 sibling, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2024-08-02 0:05 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
jack, yi.zhang, chengzhihao1, yukuai3
On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
> race issue when submitting multiple read bios for a page that spans more than
> one file system block by adding a spinlock (which is now named state_lock)
> to make setting the page uptodate synchronous. However, the race condition
> only happens between the read I/O submitting and completing threads,
When we do writeback on a folio that has multiple blocks on it, we
can submit multiple bios for that, too. Hence the write completions
can race with each other and with write submission, too.
Yes, write bio submission and completion only need to update ifs
accounting using an atomic operation, but the same race condition
exists even though the folio is fully locked at the point of bio
submission.
> so it's
> sufficient to use page lock to protect other paths, e.g. the buffered write
^^^^ folio
> path.
>
> After large folios are supported, the spinlock could affect the
> buffered write performance even more, so dropping it could reduce some
> unnecessary locking overhead.
From the point of view of simple to understand and maintain code, I
think this is a bad idea. The data structure is currently protected
by the state lock in all situations, but this change now makes it
protected by the state lock in one case and the folio lock in a
different case.
Making this change also misses the elephant in the room: the
buffered write path still needs the ifs->state_lock to update the
dirty bitmap. Hence we're effectively changing the serialisation
mechanism for only one of the two ifs state bitmaps that the
buffered write path has to update.
Indeed, we can't get rid of the ifs->state_lock from the dirty range
updates because iomap_dirty_folio() can be called without the folio
being locked through folio_mark_dirty() calling the ->dirty_folio()
aop.
IOWs, getting rid of the state lock out of the uptodate range
changes does not actually get rid of it from the buffered IO path.
We still have to take it to update the dirty range, and so there's
an obvious way to optimise the state lock usage without changing any
of the bitmap access serialisation behaviour. i.e. We combine the
uptodate and dirty range updates in __iomap_write_end() into a
single lock context such as:
iomap_set_range_dirty_uptodate(struct folio *folio, size_t off, size_t len)
{
struct iomap_folio_state *ifs = folio->private;
struct inode *inode;
unsigned int blks_per_folio;
unsigned int first_blk;
unsigned int last_blk;
unsigned int nr_blks;
unsigned long flags;
if (!ifs)
return;
inode = folio->mapping->host;
blks_per_folio = i_blocks_per_folio(inode, folio);
first_blk = (off >> inode->i_blkbits);
last_blk = (off + len - 1) >> inode->i_blkbits;
nr_blks = last_blk - first_blk + 1;
spin_lock_irqsave(&ifs->state_lock, flags);
bitmap_set(ifs->state, first_blk, nr_blks);
bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
spin_unlock_irqrestore(&ifs->state_lock, flags);
}
This means we calculate the bitmap offsets only once, we take the
state lock only once, and we don't do anything if there is no
sub-folio state.
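The __iomap_write_end() call site would then collapse the two separate
bitmap updates into a single call, something like (a hypothetical sketch
of the suggestion above, not code from this thread):
  /* replaces iomap_set_range_uptodate() + iomap_set_range_dirty() */
  iomap_set_range_dirty_uptodate(folio, offset_in_folio(folio, pos), copied);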
If we then fix the __iomap_write_begin() code as Willy pointed out
to elide the erroneous uptodate range update, then we end up only
taking the state lock once per buffered write instead of 3 times per
write.
This patch only reduces it to twice per buffered write, so doing the
above should provide even better performance without needing to
change the underlying serialisation mechanism at all.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-08-02 0:05 ` Dave Chinner
@ 2024-08-02 2:57 ` Zhang Yi
2024-08-02 6:29 ` Dave Chinner
0 siblings, 1 reply; 19+ messages in thread
From: Zhang Yi @ 2024-08-02 2:57 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
jack, yi.zhang, chengzhihao1, yukuai3
On 2024/8/2 8:05, Dave Chinner wrote:
> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
>> race issue when submitting multiple read bios for a page that spans more than
>> one file system block by adding a spinlock (which is now named state_lock)
>> to make setting the page uptodate synchronous. However, the race condition
>> only happens between the read I/O submitting and completing threads,
>
> When we do writeback on a folio that has multiple blocks on it, we
> can submit multiple bios for that, too. Hence the write completions
> can race with each other and with write submission, too.
>
> Yes, write bio submission and completion only need to update ifs
> accounting using an atomic operation, but the same race condition
> exists even though the folio is fully locked at the point of bio
> submission.
>
>
>> so it's
>> sufficient to use page lock to protect other paths, e.g. the buffered write
> ^^^^ folio
>> path.
>>
>> After large folios are supported, the spinlock could affect the
>> buffered write performance even more, so dropping it could reduce some
>> unnecessary locking overhead.
>
> From the point of view of simple to understand and maintain code, I
> think this is a bad idea. The data structure is currently protected
> by the state lock in all situations, but this change now makes it
> protected by the state lock in one case and the folio lock in a
> different case.
Yeah, I agree that this is a side effect of this change; after this patch,
we have to be careful to distinguish between the two cases B1 and B2 below,
as Willy mentioned.
B. If ifs_set_range_uptodate() is called from iomap_set_range_uptodate(),
either we know:
B1. The caller of iomap_set_range_uptodate() holds the folio lock, and this
is the only place that can call ifs_set_range_uptodate() for this folio
B2. The caller of iomap_set_range_uptodate() holds the state lock
>
> Making this change also misses the elephant in the room: the
> buffered write path still needs the ifs->state_lock to update the
> dirty bitmap. Hence we're effectively changing the serialisation
> mechanism for only one of the two ifs state bitmaps that the
> buffered write path has to update.
>
> Indeed, we can't get rid of the ifs->state_lock from the dirty range
> updates because iomap_dirty_folio() can be called without the folio
> being locked through folio_mark_dirty() calling the ->dirty_folio()
> aop.
>
Sorry, I don't understand why folio_mark_dirty() could be called without
the folio lock (isn't this supposed to be a bug)? IIUC, all file-backed
folios must be locked before being marked dirty. Are there any exceptions
or am I missing something?
> IOWs, getting rid of the state lock out of the uptodate range
> changes does not actually get rid of it from the buffered IO path.
> We still have to take it to update the dirty range, and so there's
> an obvious way to optimise the state lock usage without changing any
> of the bitmap access serialisation behaviour. i.e. We combine the
> uptodate and dirty range updates in __iomap_write_end() into a
> single lock context such as:
>
> iomap_set_range_dirty_uptodate(struct folio *folio, size_t off, size_t len)
> {
> struct iomap_folio_state *ifs = folio->private;
> struct inode *inode;
> unsigned int blks_per_folio;
> unsigned int first_blk;
> unsigned int last_blk;
> unsigned int nr_blks;
> unsigned long flags;
>
> if (!ifs)
> return;
>
> inode = folio->mapping->host;
> blks_per_folio = i_blocks_per_folio(inode, folio);
> first_blk = (off >> inode->i_blkbits);
> last_blk = (off + len - 1) >> inode->i_blkbits;
> nr_blks = last_blk - first_blk + 1;
>
> spin_lock_irqsave(&ifs->state_lock, flags);
> bitmap_set(ifs->state, first_blk, nr_blks);
> bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
> spin_unlock_irqrestore(&ifs->state_lock, flags);
> }
>
> This means we calculate the bitmap offsets only once, we take the
> state lock only once, and we don't do anything if there is no
> sub-folio state.
>
> If we then fix the __iomap_write_begin() code as Willy pointed out
> to elide the erroneous uptodate range update, then we end up only
> taking the state lock once per buffered write instead of 3 times per
> write.
>
> This patch only reduces it to twice per buffered write, so doing the
> above should provide even better performance without needing to
> change the underlying serialisation mechanism at all.
>
Thanks for the suggestion. I've thought about this solution too, but I
didn't think we need the state_lock when setting the ifs dirty bits since
the folio lock should work, so I changed my mind and planned to drop all
ifs state_lock usage in the write path (please see patch 6). Please let
me know if I'm wrong.
Thanks,
Yi.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-08-02 2:57 ` Zhang Yi
@ 2024-08-02 6:29 ` Dave Chinner
2024-08-02 11:13 ` Zhang Yi
0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2024-08-02 6:29 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
jack, yi.zhang, chengzhihao1, yukuai3
On Fri, Aug 02, 2024 at 10:57:41AM +0800, Zhang Yi wrote:
> On 2024/8/2 8:05, Dave Chinner wrote:
> > On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
> >> race issue when submitting multiple read bios for a page that spans more than
> >> one file system block by adding a spinlock (which is now named state_lock)
> >> to make setting the page uptodate synchronous. However, the race condition
> >> only happens between the read I/O submitting and completing threads,
> >
> > When we do writeback on a folio that has multiple blocks on it, we
> > can submit multiple bios for that, too. Hence the write completions
> > can race with each other and with write submission, too.
> >
> > Yes, write bio submission and completion only need to update ifs
> > accounting using an atomic operation, but the same race condition
> > exists even though the folio is fully locked at the point of bio
> > submission.
> >
> >
> >> so it's
> >> sufficient to use page lock to protect other paths, e.g. the buffered write
> > ^^^^ folio
> >> path.
> >>
> >> After large folios are supported, the spinlock could affect the
> >> buffered write performance even more, so dropping it could reduce some
> >> unnecessary locking overhead.
> >
> > From the point of view of simple to understand and maintain code, I
> > think this is a bad idea. The data structure is currently protected
> > by the state lock in all situations, but this change now makes it
> > protected by the state lock in one case and the folio lock in a
> > different case.
>
> Yeah, I agree that this is a side effect of this change; after this patch,
> we have to be careful to distinguish between the two cases B1 and B2 below,
> as Willy mentioned.
>
> B. If ifs_set_range_uptodate() is called from iomap_set_range_uptodate(),
> either we know:
> B1. The caller of iomap_set_range_uptodate() holds the folio lock, and this
> is the only place that can call ifs_set_range_uptodate() for this folio
> B2. The caller of iomap_set_range_uptodate() holds the state lock
Yes, I read that before I commented that I think it's a bad idea.
And then provided a method where we don't need to care about this at
all.
>
> >
> > Making this change also misses the elephant in the room: the
> > buffered write path still needs the ifs->state_lock to update the
> > dirty bitmap. Hence we're effectively changing the serialisation
> > mechanism for only one of the two ifs state bitmaps that the
> > buffered write path has to update.
> >
> > Indeed, we can't get rid of the ifs->state_lock from the dirty range
> > updates because iomap_dirty_folio() can be called without the folio
> > being locked through folio_mark_dirty() calling the ->dirty_folio()
> > aop.
> >
>
> Sorry, I don't understand why folio_mark_dirty() could be called without
> the folio lock (isn't this supposed to be a bug)? IIUC, all file-backed
> folios must be locked before being marked dirty. Are there any exceptions
> or am I missing something?
Yes: reading the code I pointed you at.
/**
* folio_mark_dirty - Mark a folio as being modified.
* @folio: The folio.
*
* The folio may not be truncated while this function is running.
* Holding the folio lock is sufficient to prevent truncation, but some
* callers cannot acquire a sleeping lock. These callers instead hold
* the page table lock for a page table which contains at least one page
* in this folio. Truncation will block on the page table lock as it
* unmaps pages before removing the folio from its mapping.
*
* Return: True if the folio was newly dirtied, false if it was already dirty.
*/
So, yes, ->dirty_folio() can indeed be called without the folio
being locked and it is not a bug.
Hence we have to serialise ->dirty_folio against both
__iomap_write_begin() dirtying the folio and iomap_writepage_map()
clearing the dirty range.
And that means we always need to take the ifs->state_lock in
__iomap_write_begin() when we have an ifs attached to the folio.
Hence it is a) not correct and b) makes no sense to try to do
uptodate bitmap updates without it held...
> > IOWs, getting rid of the state lock out of the uptodate range
> > changes does not actually get rid of it from the buffered IO path.
> > We still have to take it to update the dirty range, and so there's
> > an obvious way to optimise the state lock usage without changing any
> > of the bitmap access serialisation behaviour. i.e. We combine the
> > uptodate and dirty range updates in __iomap_write_end() into a
> > single lock context such as:
> >
> > iomap_set_range_dirty_uptodate(struct folio *folio, size_t off, size_t len)
> > {
> > struct iomap_folio_state *ifs = folio->private;
> > struct inode *inode;
> > unsigned int blks_per_folio;
> > unsigned int first_blk;
> > unsigned int last_blk;
> > unsigned int nr_blks;
> > unsigned long flags;
> >
> > if (!ifs)
> > return;
> >
> > inode = folio->mapping->host;
> > blks_per_folio = i_blocks_per_folio(inode, folio);
> > first_blk = (off >> inode->i_blkbits);
> > last_blk = (off + len - 1) >> inode->i_blkbits;
> > nr_blks = last_blk - first_blk + 1;
> >
> > spin_lock_irqsave(&ifs->state_lock, flags);
> > bitmap_set(ifs->state, first_blk, nr_blks);
> > bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
> > spin_unlock_irqrestore(&ifs->state_lock, flags);
> > }
> >
> > This means we calculate the bitmap offsets only once, we take the
> > state lock only once, and we don't do anything if there is no
> > sub-folio state.
> >
> > If we then fix the __iomap_write_begin() code as Willy pointed out
> > to elide the erroneous uptodate range update, then we end up only
> > taking the state lock once per buffered write instead of 3 times per
> > write.
> >
> > This patch only reduces it to twice per buffered write, so doing the
> > above should provide even better performance without needing to
> > change the underlying serialisation mechanism at all.
> >
>
> Thanks for the suggestion. I've thought about this solution too, but I
> didn't think we need the state_lock when setting the ifs dirty bits since
> the folio lock should work, so I changed my mind and planned to drop all
> ifs state_lock usage in the write path (please see patch 6). Please let
> me know if I'm wrong.
Whether it works or not is irrelevant: it is badly designed code
that you have proposed. We can achieve the same result without
changing the locking rules for the bitmap data via a small amount of
refactoring, and that is a much better solution than creating
complex and subtle locking rules for the object.
"But it works" doesn't mean the code is robust, maintainable code.
So, good optimisation, but NACK in this form. Please rework it to
only take the ifs->state_lock once for both bitmap updates in the
__iomap_write_end() path.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-08-02 6:29 ` Dave Chinner
@ 2024-08-02 11:13 ` Zhang Yi
2024-08-05 12:42 ` Jan Kara
0 siblings, 1 reply; 19+ messages in thread
From: Zhang Yi @ 2024-08-02 11:13 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
jack, yi.zhang, chengzhihao1, yukuai3
On 2024/8/2 14:29, Dave Chinner wrote:
> On Fri, Aug 02, 2024 at 10:57:41AM +0800, Zhang Yi wrote:
>> On 2024/8/2 8:05, Dave Chinner wrote:
>>> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
>>>> race issue when submitting multiple read bios for a page that spans more than
>>>> one file system block by adding a spinlock (which is now named state_lock)
>>>> to make setting the page uptodate synchronous. However, the race condition
>>>> only happens between the read I/O submitting and completing threads,
>>>
>>> When we do writeback on a folio that has multiple blocks on it, we
>>> can submit multiple bios for that, too. Hence the write completions
>>> can race with each other and with write submission, too.
>>>
>>> Yes, write bio submission and completion only need to update ifs
>>> accounting using an atomic operation, but the same race condition
>>> exists even though the folio is fully locked at the point of bio
>>> submission.
>>>
>>>
>>>> so it's
>>>> sufficient to use page lock to protect other paths, e.g. the buffered write
>>> ^^^^ folio
>>>> path.
>>>>
>>>> After large folios are supported, the spinlock could affect the
>>>> buffered write performance even more, so dropping it could reduce some
>>>> unnecessary locking overhead.
>>>
>>> From the point of view of simple to understand and maintain code, I
>>> think this is a bad idea. The data structure is currently protected
>>> by the state lock in all situations, but this change now makes it
>>> protected by the state lock in one case and the folio lock in a
>>> different case.
>>
>> Yeah, I agree that this is a side effect of this change; after this patch,
>> we have to be careful to distinguish between the two cases B1 and B2 below,
>> as Willy mentioned.
>>
>> B. If ifs_set_range_uptodate() is called from iomap_set_range_uptodate(),
>> either we know:
>> B1. The caller of iomap_set_range_uptodate() holds the folio lock, and this
>> is the only place that can call ifs_set_range_uptodate() for this folio
>> B2. The caller of iomap_set_range_uptodate() holds the state lock
>
> Yes, I read that before I commented that I think it's a bad idea.
> And then provided a method where we don't need to care about this at
> all.
>>
>>>
>>> Making this change also misses the elephant in the room: the
>>> buffered write path still needs the ifs->state_lock to update the
>>> dirty bitmap. Hence we're effectively changing the serialisation
>>> mechanism for only one of the two ifs state bitmaps that the
>>> buffered write path has to update.
>>>
>>> Indeed, we can't get rid of the ifs->state_lock from the dirty range
>>> updates because iomap_dirty_folio() can be called without the folio
>>> being locked through folio_mark_dirty() calling the ->dirty_folio()
>>> aop.
>>>
>>
> >> Sorry, I don't understand why folio_mark_dirty() could be called without
> >> the folio lock (isn't this supposed to be a bug)? IIUC, all file-backed
> >> folios must be locked before being marked dirty. Are there any exceptions
> >> or am I missing something?
>
> Yes: reading the code I pointed you at.
>
> /**
> * folio_mark_dirty - Mark a folio as being modified.
> * @folio: The folio.
> *
> * The folio may not be truncated while this function is running.
> * Holding the folio lock is sufficient to prevent truncation, but some
> * callers cannot acquire a sleeping lock. These callers instead hold
> * the page table lock for a page table which contains at least one page
> * in this folio. Truncation will block on the page table lock as it
> * unmaps pages before removing the folio from its mapping.
> *
> * Return: True if the folio was newly dirtied, false if it was already dirty.
> */
>
> So, yes, ->dirty_folio() can indeed be called without the folio
> being locked and it is not a bug.
Ha, right, I missed the comments on this function; it means that there are
some special callers that hold the page table lock instead of the folio
lock, is it pte_alloc_map_lock?
I checked all the filesystem-related callers and didn't find any real
caller that marks a folio dirty without holding the folio lock and that
could affect current filesystems using the iomap framework; it's just
a potential possibility in the future, am I right?
>
> Hence we have to serialise ->dirty_folio against both
> __iomap_write_begin() dirtying the folio and iomap_writepage_map()
> clearing the dirty range.
>
Both __iomap_write_begin() and iomap_writepage_map() are under the folio
lock now (locked in iomap_get_folio() and writeback_get_folio()), is there
anything special about this case?
> And that means we always need to take the ifs->state_lock in
> __iomap_write_begin() when we have an ifs attached to the folio.
> Hence it is a) not correct and b) makes no sense to try to do
> uptodate bitmap updates without it held...
>
>>> IOWs, getting rid of the state lock out of the uptodate range
>>> changes does not actually get rid of it from the buffered IO path.
>>> We still have to take it to update the dirty range, and so there's
>>> an obvious way to optimise the state lock usage without changing any
>>> of the bitmap access serialisation behaviour. i.e. We combine the
>>> uptodate and dirty range updates in __iomap_write_end() into a
>>> single lock context such as:
>>>
>>> iomap_set_range_dirty_uptodate(struct folio *folio, size_t off, size_t len)
>>> {
>>> struct iomap_folio_state *ifs = folio->private;
>>> struct inode *inode;
>>> unsigned int blks_per_folio;
>>> unsigned int first_blk;
>>> unsigned int last_blk;
>>> unsigned int nr_blks;
>>> unsigned long flags;
>>>
>>> if (!ifs)
>>> return;
>>>
>>> inode = folio->mapping->host;
>>> blks_per_folio = i_blocks_per_folio(inode, folio);
>>> first_blk = (off >> inode->i_blkbits);
>>> last_blk = (off + len - 1) >> inode->i_blkbits;
>>> nr_blks = last_blk - first_blk + 1;
>>>
>>> spin_lock_irqsave(&ifs->state_lock, flags);
>>> bitmap_set(ifs->state, first_blk, nr_blks);
>>> bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
>>> spin_unlock_irqrestore(&ifs->state_lock, flags);
>>> }
>>>
>>> This means we calculate the bitmap offsets only once, we take the
>>> state lock only once, and we don't do anything if there is no
>>> sub-folio state.
>>>
>>> If we then fix the __iomap_write_begin() code as Willy pointed out
>>> to elide the erroneous uptodate range update, then we end up only
>>> taking the state lock once per buffered write instead of 3 times per
>>> write.
>>>
>>> This patch only reduces it to twice per buffered write, so doing the
>>> above should provide even better performance without needing to
>>> change the underlying serialisation mechanism at all.
>>>
>>
>> Thanks for the suggestion. I've thought about this solution too, but I
>> didn't think we need the state_lock when setting the ifs dirty bits since
>> the folio lock should work, so I changed my mind and planned to drop all
>> ifs state_lock usage in the write path (please see patch 6). Please let
>> me know if I'm wrong.
>
> Whether it works or not is irrelevant: it is badly designed code
> that you have proposed. We can achieve the same result without
> changing the locking rules for the bitmap data via a small amount of
> refactoring, and that is a much better solution than creating
> complex and subtle locking rules for the object.
>
> "But it works" doesn't mean the code is robust, maintainable code.
>
> So, good optimisation, but NACK in this form. Please rework it to
> only take the ifs->state_lock once for both bitmap updates in the
> __iomap_write_end() path.
>
OK, sure, this looks reasonable to me. I will revise it as you suggested
and check the performance gain again in my next iteration.
Thanks,
Yi.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-08-02 11:13 ` Zhang Yi
@ 2024-08-05 12:42 ` Jan Kara
2024-08-05 14:00 ` Jan Kara
0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2024-08-05 12:42 UTC (permalink / raw)
To: Zhang Yi
Cc: Dave Chinner, linux-xfs, linux-fsdevel, linux-kernel, djwong, hch,
brauner, jack, yi.zhang, chengzhihao1, yukuai3
On Fri 02-08-24 19:13:11, Zhang Yi wrote:
> On 2024/8/2 14:29, Dave Chinner wrote:
> > On Fri, Aug 02, 2024 at 10:57:41AM +0800, Zhang Yi wrote:
> >> On 2024/8/2 8:05, Dave Chinner wrote:
> >>> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
> >>> Making this change also misses the elephant in the room: the
> >>> buffered write path still needs the ifs->state_lock to update the
> >>> dirty bitmap. Hence we're effectively changing the serialisation
> >>> mechanism for only one of the two ifs state bitmaps that the
> >>> buffered write path has to update.
> >>>
> >>> Indeed, we can't get rid of the ifs->state_lock from the dirty range
> >>> updates because iomap_dirty_folio() can be called without the folio
> >>> being locked through folio_mark_dirty() calling the ->dirty_folio()
> >>> aop.
> >>>
> >>
> >> Sorry, I don't understand why folio_mark_dirty() could be called without
> >> the folio lock (isn't this supposed to be a bug)? IIUC, all file-backed
> >> folios must be locked before being marked dirty. Are there any exceptions
> >> or am I missing something?
> >
> > Yes: reading the code I pointed you at.
> >
> > /**
> > * folio_mark_dirty - Mark a folio as being modified.
> > * @folio: The folio.
> > *
> > * The folio may not be truncated while this function is running.
> > * Holding the folio lock is sufficient to prevent truncation, but some
> > * callers cannot acquire a sleeping lock. These callers instead hold
> > * the page table lock for a page table which contains at least one page
> > * in this folio. Truncation will block on the page table lock as it
> > * unmaps pages before removing the folio from its mapping.
> > *
> > * Return: True if the folio was newly dirtied, false if it was already dirty.
> > */
> >
> > So, yes, ->dirty_folio() can indeed be called without the folio
> > being locked and it is not a bug.
>
> Ha, right, I missed the comments on this function; it means that there are
> some special callers that hold the page table lock instead of the folio
> lock, is it pte_alloc_map_lock?
>
> I checked all the filesystem-related callers and didn't find any real
> caller that marks a folio dirty without holding the folio lock and that
> could affect current filesystems using the iomap framework; it's just
> a potential possibility in the future, am I right?
There used to be quite a few places doing that. Now that I've checked, all
the places I was aware of actually got converted to call folio_mark_dirty()
under the folio lock (in particular all the cases happening on IO completion,
folio unmap etc.). Matthew, are you aware of any place where folio_mark_dirty()
would be called for regular file page cache (block device page cache is in a
different situation obviously) without the folio lock held?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-08-05 12:42 ` Jan Kara
@ 2024-08-05 14:00 ` Jan Kara
2024-08-05 15:48 ` Matthew Wilcox
0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2024-08-05 14:00 UTC (permalink / raw)
To: Zhang Yi
Cc: Dave Chinner, linux-xfs, linux-fsdevel, linux-kernel, djwong, hch,
brauner, jack, yi.zhang, chengzhihao1, yukuai3, Matthew Wilcox
Actually add Matthew to CC ;)
On Mon 05-08-24 14:42:52, Jan Kara wrote:
> On Fri 02-08-24 19:13:11, Zhang Yi wrote:
> > On 2024/8/2 14:29, Dave Chinner wrote:
> > > On Fri, Aug 02, 2024 at 10:57:41AM +0800, Zhang Yi wrote:
> > >> On 2024/8/2 8:05, Dave Chinner wrote:
> > >>> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
> > >>> Making this change also misses the elephant in the room: the
> > >>> buffered write path still needs the ifs->state_lock to update the
> > >>> dirty bitmap. Hence we're effectively changing the serialisation
> > >>> mechanism for only one of the two ifs state bitmaps that the
> > >>> buffered write path has to update.
> > >>>
> > >>> Indeed, we can't get rid of the ifs->state_lock from the dirty range
> > >>> updates because iomap_dirty_folio() can be called without the folio
> > >>> being locked through folio_mark_dirty() calling the ->dirty_folio()
> > >>> aop.
> > >>>
> > >>
> > >> Sorry, I don't understand why folio_mark_dirty() could be called without
> > >> the folio lock (isn't this supposed to be a bug)? IIUC, all file-backed
> > >> folios must be locked before being marked dirty. Are there any exceptions
> > >> or am I missing something?
> > >
> > > Yes: reading the code I pointed you at.
> > >
> > > /**
> > > * folio_mark_dirty - Mark a folio as being modified.
> > > * @folio: The folio.
> > > *
> > > * The folio may not be truncated while this function is running.
> > > * Holding the folio lock is sufficient to prevent truncation, but some
> > > * callers cannot acquire a sleeping lock. These callers instead hold
> > > * the page table lock for a page table which contains at least one page
> > > * in this folio. Truncation will block on the page table lock as it
> > > * unmaps pages before removing the folio from its mapping.
> > > *
> > > * Return: True if the folio was newly dirtied, false if it was already dirty.
> > > */
> > >
> > > So, yes, ->dirty_folio() can indeed be called without the folio
> > > being locked and it is not a bug.
> >
> > Ha, right, I missed the comment on this function. It means that there are
> > some special callers that hold the page table lock instead of the folio
> > lock, is it pte_alloc_map_lock?
> >
> > I checked all the filesystem-related callers and didn't find any real
> > caller that marks a folio dirty without holding the folio lock and that
> > could affect the current filesystems using the iomap framework; it's just
> > a potential possibility in the future, am I right?
>
> There used to be quite a few places doing that. Now that I've checked, all
> the places I was aware of have actually been converted to call
> folio_mark_dirty() under the folio lock (in particular all the cases
> happening on IO completion, folio unmap, etc.). Matthew, are you aware of
> any place where folio_mark_dirty() would be called for the regular file
> page cache (the block device page cache is obviously in a different
> situation) without the folio lock held?
>
> Honza
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-08-05 14:00 ` Jan Kara
@ 2024-08-05 15:48 ` Matthew Wilcox
2024-08-07 11:39 ` Zhang Yi
0 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2024-08-05 15:48 UTC (permalink / raw)
To: Jan Kara
Cc: Zhang Yi, Dave Chinner, linux-xfs, linux-fsdevel, linux-kernel,
djwong, hch, brauner, yi.zhang, chengzhihao1, yukuai3
On Mon, Aug 05, 2024 at 04:00:23PM +0200, Jan Kara wrote:
> Actually add Matthew to CC ;)
It's OK, I was reading.
FWIW, I agree with Dave; the locking complexity in this patch was
horrendous. I was going to get to the same critique he had, but I first
wanted to understand what the thought process was.
> > > Ha, right, I missed the comment on this function. It means that there are
> > > some special callers that hold the page table lock instead of the folio
> > > lock, is it pte_alloc_map_lock?
> > >
> > > I checked all the filesystem-related callers and didn't find any real
> > > caller that marks a folio dirty without holding the folio lock and that
> > > could affect the current filesystems using the iomap framework; it's just
> > > a potential possibility in the future, am I right?
Filesystems are normally quite capable of taking the folio lock to
prevent truncation. It's the MM code that needs the "or holding the
page table lock" get-out clause. I forget exactly which callers it
is; I worked through them a few times. It's not hard to put a
WARN_ON_RATELIMIT() into folio_mark_dirty() and get a good sampling.
There's also a "or holding a buffer_head locked" get-out clause that
I'm not sure is documented anywhere, but obviously that doesn't apply
to the iomap code.
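As a minimal sketch of the sampling trick Matthew describes (illustrative
only, not an actual patch; the static name and the ratelimit parameters
are arbitrary), one could drop something like this at the top of
folio_mark_dirty():

	/*
	 * Illustrative diagnostic, not a real patch: emit a rate-limited
	 * stack trace whenever a caller dirties a folio without holding
	 * the folio lock, to get a sampling of the unlocked callers.
	 */
	static DEFINE_RATELIMIT_STATE(unlocked_dirty_rs,
				      DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);

	WARN_ON_RATELIMIT(!folio_test_locked(folio), &unlocked_dirty_rs);

Each warning prints a backtrace, so running a workload for a while and
grepping dmesg shows which call chains dirty folios unlocked.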
> > There used to be quite a few places doing that. Now that I've checked, all
> > the places I was aware of have actually been converted to call
> > folio_mark_dirty() under the folio lock (in particular all the cases
> > happening on IO completion, folio unmap, etc.). Matthew, are you aware of
> > any place where folio_mark_dirty() would be called for the regular file
> > page cache (the block device page cache is obviously in a different
> > situation) without the folio lock held?
Yes, the MM code definitely applies to regular files as well as block
devices.
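One concrete instance of this class of caller, as a minimal sketch rather
than the actual mm/memory.c code (the helper name here is made up and the
real shape varies by kernel version), is the unmap/zap path, which
transfers the pte dirty bit to the folio while holding only the page table
lock:

	/*
	 * Minimal sketch of the mm unmap path: the page table lock taken
	 * by pte_offset_map_lock() is what holds off truncation here, so
	 * folio_mark_dirty() runs without the folio lock and the
	 * filesystem's ->dirty_folio() must tolerate that.
	 */
	static void zap_pte_dirty_sketch(struct vm_area_struct *vma,
					 unsigned long addr, pte_t *pte)
	{
		pte_t ptent = ptep_get(pte);
		struct page *page = vm_normal_page(vma, addr, ptent);

		if (page && pte_dirty(ptent))
			folio_mark_dirty(page_folio(page)); /* folio unlocked */
	}

This is exactly the case the folio_mark_dirty() comment warns about, and
why iomap_dirty_folio() cannot drop ifs->state_lock for the dirty bitmap.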
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits
2024-08-05 15:48 ` Matthew Wilcox
@ 2024-08-07 11:39 ` Zhang Yi
0 siblings, 0 replies; 19+ messages in thread
From: Zhang Yi @ 2024-08-07 11:39 UTC (permalink / raw)
To: Matthew Wilcox, Jan Kara
Cc: Dave Chinner, linux-xfs, linux-fsdevel, linux-kernel, djwong, hch,
brauner, yi.zhang, chengzhihao1, yukuai3
On 2024/8/5 23:48, Matthew Wilcox wrote:
> On Mon, Aug 05, 2024 at 04:00:23PM +0200, Jan Kara wrote:
>> Actually add Matthew to CC ;)
>
> It's OK, I was reading.
>
> FWIW, I agree with Dave; the locking complexity in this patch was
> horrendous. I was going to get to the same critique he had, but I first
> wanted to understand what the thought process was.
Yes, I'd like to change it to use the solution Dave suggested.
>
>>>> Ha, right, I missed the comment on this function. It means that there are
>>>> some special callers that hold the page table lock instead of the folio
>>>> lock, is it pte_alloc_map_lock?
>>>>
>>>> I checked all the filesystem-related callers and didn't find any real
>>>> caller that marks a folio dirty without holding the folio lock and that
>>>> could affect the current filesystems using the iomap framework; it's just
>>>> a potential possibility in the future, am I right?
>
> Filesystems are normally quite capable of taking the folio lock to
> prevent truncation. It's the MM code that needs the "or holding the
> page table lock" get-out clause. I forget exactly which callers it
> is; I worked through them a few times. It's not hard to put a
> WARN_ON_RATELIMIT() into folio_mark_dirty() and get a good sampling.
>
> There's also a "or holding a buffer_head locked" get-out clause that
> I'm not sure is documented anywhere, but obviously that doesn't apply
> to the iomap code.
Thanks for your answer; I've found some callers.
Thanks,
Yi.
>
>>> There used to be quite a few places doing that. Now that I've checked, all
>>> the places I was aware of have actually been converted to call
>>> folio_mark_dirty() under the folio lock (in particular all the cases
>>> happening on IO completion, folio unmap, etc.). Matthew, are you aware of
>>> any place where folio_mark_dirty() would be called for the regular file
>>> page cache (the block device page cache is obviously in a different
>>> situation) without the folio lock held?
>
> Yes, the MM code definitely applies to regular files as well as block
> devices.
>
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-08-07 11:39 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-31 9:12 [PATCH 0/6] iomap: some minor non-critical fixes and improvements when block size < folio size Zhang Yi
2024-07-31 9:13 ` [PATCH 1/6] iomap: correct the range of a partial dirty clear Zhang Yi
2024-07-31 9:13 ` [PATCH 2/6] iomap: support invalidating partial folios Zhang Yi
2024-07-31 9:13 ` [PATCH 3/6] iomap: advance the ifs allocation if we have more than one blocks per folio Zhang Yi
2024-07-31 9:13 ` [PATCH 4/6] iomap: correct the dirty length in page mkwrite Zhang Yi
2024-07-31 9:13 ` [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits Zhang Yi
2024-07-31 16:52 ` Matthew Wilcox
2024-08-01 1:52 ` Zhang Yi
2024-08-01 4:24 ` Matthew Wilcox
2024-08-01 9:19 ` Zhang Yi
2024-08-02 0:05 ` Dave Chinner
2024-08-02 2:57 ` Zhang Yi
2024-08-02 6:29 ` Dave Chinner
2024-08-02 11:13 ` Zhang Yi
2024-08-05 12:42 ` Jan Kara
2024-08-05 14:00 ` Jan Kara
2024-08-05 15:48 ` Matthew Wilcox
2024-08-07 11:39 ` Zhang Yi
2024-07-31 9:13 ` [PATCH 6/6] iomap: drop unnecessary state_lock when changing ifs dirty bits Zhang Yi
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).