* [PATCH 0/5] fs/buffer: strack reduction on async read
@ 2024-12-18 2:26 Luis Chamberlain
2024-12-18 2:26 ` [PATCH 1/5] fs/buffer: move async batch read code into a helper Luis Chamberlain
` (6 more replies)
0 siblings, 7 replies; 17+ messages in thread
From: Luis Chamberlain @ 2024-12-18 2:26 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
This splits up a minor enhancement from the bs > ps device support
series into its own series for better review / focus / testing.
This series just addresses reducing the array size used and cleaning
up the async read so it is easier to read and maintain.
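To put numbers on it: a minimal sketch of the worst case, assuming the
usual MAX_BUF_PER_PAGE definition of PAGE_SIZE / 512 from
include/linux/buffer_head.h:

        /* On a 256 KiB base page system such as Hexagon: */
        #define MAX_BUF_PER_PAGE        (PAGE_SIZE / 512)       /* 512 entries */

        /* the on-stack batch in block_read_full_folio() then costs */
        struct buffer_head *arr[MAX_BUF_PER_PAGE];      /* 512 * 8 bytes = 4 KiB */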
Changes in this series since RFC v2:
- Fix the loop by ensuring we bump counts on continue, as noted by
Matthew Wilcox
- New patch to simplify the loop by using bh_offset()
- Bike shed love on comments
- Testing: tested on ext4 and XFS with fstests, with no regressions found
after willy's suggested loop fix on continue. I'm however running a new
set of tests after a rebase onto v6.13-rc3, a bit of patch reordering,
and the addition of bh_offset(). Preliminary tests show no issues.
[0] https://lkml.kernel.org/r/20241214031050.1337920-1-mcgrof@kernel.org
Luis Chamberlain (5):
fs/buffer: move async batch read code into a helper
fs/buffer: simplify block_read_full_folio() with bh_offset()
fs/buffer: add a for_each_bh() for block_read_full_folio()
fs/buffer: add iteration support for block_read_full_folio()
fs/buffer: reduce stack usage on bh_read_iter()
fs/buffer.c | 221 +++++++++++++++++++++++++++++++++++++---------------
1 file changed, 160 insertions(+), 61 deletions(-)
--
2.43.0
* [PATCH 1/5] fs/buffer: move async batch read code into a helper
2024-12-18 2:26 [PATCH 0/5] fs/buffer: strack reduction on async read Luis Chamberlain
@ 2024-12-18 2:26 ` Luis Chamberlain
2024-12-18 2:26 ` [PATCH 2/5] fs/buffer: simplify block_read_full_folio() with bh_offset() Luis Chamberlain
` (5 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Luis Chamberlain @ 2024-12-18 2:26 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Move the code in block_read_full_folio() that issues a batch of async
reads into a helper.
No functional changes.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 79 +++++++++++++++++++++++++++++++++--------------------
1 file changed, 49 insertions(+), 30 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f60251..7c6aac0742a6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2350,6 +2350,53 @@ bool block_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
}
EXPORT_SYMBOL(block_is_partially_uptodate);
+/*
+ * Stage one is to collect an array of buffer heads we need to issue reads
+ * for; this helper is then used afterwards. In that effort you should also
+ * check whether you really need a read, and whether we are already fully mapped.
+ */
+static void bh_read_batch_async(struct folio *folio,
+ int nr, struct buffer_head *arr[],
+ bool fully_mapped, bool no_reads,
+ bool any_get_block_error)
+{
+ int i;
+ struct buffer_head *bh;
+
+ if (fully_mapped)
+ folio_set_mappedtodisk(folio);
+
+ if (no_reads) {
+ /*
+ * All buffers are uptodate, or get_block() returned an
+ * error when trying to map *all* of them - we can
+ * finish the read.
+ */
+ folio_end_read(folio, !any_get_block_error);
+ return;
+ }
+
+ /* Stage two: lock the buffers */
+ for (i = 0; i < nr; i++) {
+ bh = arr[i];
+ lock_buffer(bh);
+ mark_buffer_async_read(bh);
+ }
+
+ /*
+ * Stage three: start the IO. Check for uptodateness
+ * inside the buffer lock in case another process reading
+ * the underlying blockdev brought it uptodate (the sct fix).
+ */
+ for (i = 0; i < nr; i++) {
+ bh = arr[i];
+ if (buffer_uptodate(bh))
+ end_buffer_async_read(bh, 1);
+ else
+ submit_bh(REQ_OP_READ, bh);
+ }
+}
+
/*
* Generic "read_folio" function for block devices that have the normal
* get_block functionality. This is most of the block device filesystems.
@@ -2383,6 +2430,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
nr = 0;
i = 0;
+ /* Stage one - collect buffer heads we need to issue a read for */
do {
if (buffer_uptodate(bh))
continue;
@@ -2414,37 +2462,8 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
arr[nr++] = bh;
} while (i++, iblock++, (bh = bh->b_this_page) != head);
- if (fully_mapped)
- folio_set_mappedtodisk(folio);
-
- if (!nr) {
- /*
- * All buffers are uptodate or get_block() returned an
- * error when trying to map them - we can finish the read.
- */
- folio_end_read(folio, !page_error);
- return 0;
- }
-
- /* Stage two: lock the buffers */
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- lock_buffer(bh);
- mark_buffer_async_read(bh);
- }
+ bh_read_batch_async(folio, nr, arr, fully_mapped, nr == 0, page_error);
- /*
- * Stage 3: start the IO. Check for uptodateness
- * inside the buffer lock in case another process reading
- * the underlying blockdev brought it uptodate (the sct fix).
- */
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- if (buffer_uptodate(bh))
- end_buffer_async_read(bh, 1);
- else
- submit_bh(REQ_OP_READ, bh);
- }
return 0;
}
EXPORT_SYMBOL(block_read_full_folio);
--
2.43.0
* [PATCH 2/5] fs/buffer: simplify block_read_full_folio() with bh_offset()
2024-12-18 2:26 [PATCH 0/5] fs/buffer: strack reduction on async read Luis Chamberlain
2024-12-18 2:26 ` [PATCH 1/5] fs/buffer: move async batch read code into a helper Luis Chamberlain
@ 2024-12-18 2:26 ` Luis Chamberlain
2024-12-18 2:26 ` [PATCH 3/5] fs/buffer: add a for_each_bh() for block_read_full_folio() Luis Chamberlain
` (4 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Luis Chamberlain @ 2024-12-18 2:26 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Remove the temporary variable i used when iterating over all buffer heads
on a folio and just use bh_offset(bh) to simplify the loop.
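For reference, bh_offset() derives the buffer's byte offset within its
page from b_data; roughly, as defined in include/linux/buffer_head.h:

        #define bh_offset(bh)   ((unsigned long)(bh)->b_data & ~PAGE_MASK)

which is what makes the separate i * blocksize bookkeeping redundant.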
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 7c6aac0742a6..8baf87db110d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2410,7 +2410,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
sector_t iblock, lblock;
struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
size_t blocksize;
- int nr, i;
+ int nr;
int fully_mapped = 1;
bool page_error = false;
loff_t limit = i_size_read(inode);
@@ -2428,7 +2428,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
lblock = div_u64(limit + blocksize - 1, blocksize);
bh = head;
nr = 0;
- i = 0;
/* Stage one - collect buffer heads we need to issue a read for */
do {
@@ -2446,7 +2445,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
page_error = true;
}
if (!buffer_mapped(bh)) {
- folio_zero_range(folio, i * blocksize,
+ folio_zero_range(folio, bh_offset(bh),
blocksize);
if (!err)
set_buffer_uptodate(bh);
@@ -2460,7 +2459,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
continue;
}
arr[nr++] = bh;
- } while (i++, iblock++, (bh = bh->b_this_page) != head);
+ } while (iblock++, (bh = bh->b_this_page) != head);
bh_read_batch_async(folio, nr, arr, fully_mapped, nr == 0, page_error);
--
2.43.0
* [PATCH 3/5] fs/buffer: add a for_each_bh() for block_read_full_folio()
2024-12-18 2:26 [PATCH 0/5] fs/buffer: strack reduction on async read Luis Chamberlain
2024-12-18 2:26 ` [PATCH 1/5] fs/buffer: move async batch read code into a helper Luis Chamberlain
2024-12-18 2:26 ` [PATCH 2/5] fs/buffer: simplify block_read_full_folio() with bh_offset() Luis Chamberlain
@ 2024-12-18 2:26 ` Luis Chamberlain
2024-12-18 19:20 ` Matthew Wilcox
2024-12-18 2:26 ` [PATCH 4/5] fs/buffer: add iteration support " Luis Chamberlain
` (3 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Luis Chamberlain @ 2024-12-18 2:26 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
We want to be able to work through all buffer heads on a folio for an
async read, but in the future we want to support the option to stop
before we've processed all linked buffer heads. To make the code easier
to read and follow, adopt a for_each_bh(tmp, head) loop instead of a
do { ... } while (), so it can be extended in subsequent patches.
This introduces no functional changes.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 8baf87db110d..1aeef7dd2281 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2397,6 +2397,17 @@ static void bh_read_batch_async(struct folio *folio,
}
}
+#define bh_is_last(__bh, __head) ((__bh)->b_this_page == (__head))
+
+#define bh_next(__bh, __head) \
+ (bh_is_last(__bh, __head) ? NULL : (__bh)->b_this_page)
+
+/* Starts from the provided head */
+#define for_each_bh(__tmp, __head) \
+ for ((__tmp) = (__head); \
+ (__tmp); \
+ (__tmp) = bh_next(__tmp, __head))
+
/*
* Generic "read_folio" function for block devices that have the normal
* get_block functionality. This is most of the block device filesystems.
@@ -2426,13 +2437,14 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
iblock = div_u64(folio_pos(folio), blocksize);
lblock = div_u64(limit + blocksize - 1, blocksize);
- bh = head;
nr = 0;
/* Stage one - collect buffer heads we need to issue a read for */
- do {
- if (buffer_uptodate(bh))
+ for_each_bh(bh, head) {
+ if (buffer_uptodate(bh)) {
+ iblock++;
continue;
+ }
if (!buffer_mapped(bh)) {
int err = 0;
@@ -2449,17 +2461,21 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
blocksize);
if (!err)
set_buffer_uptodate(bh);
+ iblock++;
continue;
}
/*
* get_block() might have updated the buffer
* synchronously
*/
- if (buffer_uptodate(bh))
+ if (buffer_uptodate(bh)) {
+ iblock++;
continue;
+ }
}
arr[nr++] = bh;
- } while (iblock++, (bh = bh->b_this_page) != head);
+ iblock++;
+ }
bh_read_batch_async(folio, nr, arr, fully_mapped, nr == 0, page_error);
--
2.43.0
* [PATCH 4/5] fs/buffer: add iteration support for block_read_full_folio()
2024-12-18 2:26 [PATCH 0/5] fs/buffer: strack reduction on async read Luis Chamberlain
` (2 preceding siblings ...)
2024-12-18 2:26 ` [PATCH 3/5] fs/buffer: add a for_each_bh() for block_read_full_folio() Luis Chamberlain
@ 2024-12-18 2:26 ` Luis Chamberlain
2024-12-18 2:26 ` [PATCH 5/5] fs/buffer: reduce stack usage on bh_read_iter() Luis Chamberlain
` (2 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Luis Chamberlain @ 2024-12-18 2:26 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Provide a helper to iterate over the buffer heads on a folio. We do
this as a preliminary step to make the subsequent changes easier to
read. Right now we use an on-stack array of size MAX_BUF_PER_PAGE to
loop over all buffer heads in a folio; on CPUs where the base page size
is much larger, such as Hexagon with its 256 KiB page size support,
this can mean the kernel ends up spewing stack growth warnings.
To be able to break this down, add support for processing smaller
chunks of buffer heads at a time. The array size used is not changed
yet, that will be done in a subsequent patch; this just adds the
iterator support and logic.
While at it, clarify that the booleans used in bh_read_batch_async()
are only meaningful once we've processed all buffer heads of a folio,
that is, when we're on the last buffer head in the folio:
* bh_folio_reads
* unmapped
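With this in place, block_read_full_folio() drives the iterator with a
loop like the one in the hunk below:

        for_each_bh(bh, head)
                bh = bh_read_iter(folio, bh, head, inode, &iter, lblock);

bh_read_iter() returns the last buffer head it processed, so reassigning
bh makes for_each_bh() resume from that pivot instead of revisiting
buffer heads we have already handled.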
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 134 +++++++++++++++++++++++++++++++++++++---------------
1 file changed, 97 insertions(+), 37 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 1aeef7dd2281..b8ba72f2f211 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2402,66 +2402,75 @@ static void bh_read_batch_async(struct folio *folio,
#define bh_next(__bh, __head) \
(bh_is_last(__bh, __head) ? NULL : (__bh)->b_this_page)
+/* Starts from a pivot which you initialize */
+#define for_each_bh_pivot(__pivot, __last, __head) \
+ for ((__pivot) = __last = (__pivot); \
+ (__pivot); \
+ (__pivot) = bh_next(__pivot, __head), \
+ (__last) = (__pivot) ? (__pivot) : (__last))
+
/* Starts from the provided head */
#define for_each_bh(__tmp, __head) \
for ((__tmp) = (__head); \
(__tmp); \
(__tmp) = bh_next(__tmp, __head))
+struct bh_iter {
+ sector_t iblock;
+ get_block_t *get_block;
+ bool any_get_block_error;
+ int unmapped;
+ int bh_folio_reads;
+};
+
/*
- * Generic "read_folio" function for block devices that have the normal
- * get_block functionality. This is most of the block device filesystems.
- * Reads the folio asynchronously --- the unlock_buffer() and
- * set/clear_buffer_uptodate() functions propagate buffer state into the
- * folio once IO has completed.
+ * Reads up to MAX_BUF_PER_PAGE buffer heads at a time on a folio over the given
+ * block range iblock to lblock and updates iter->bh_folio_reads with the
+ * number of buffer heads which were not uptodate or not mapped and for which
+ * we issued an async read, for the full folio. Returns the last buffer head we
+ * worked on.
*/
-int block_read_full_folio(struct folio *folio, get_block_t *get_block)
-{
- struct inode *inode = folio->mapping->host;
- sector_t iblock, lblock;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
- size_t blocksize;
- int nr;
- int fully_mapped = 1;
- bool page_error = false;
- loff_t limit = i_size_read(inode);
-
- /* This is needed for ext4. */
- if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
- limit = inode->i_sb->s_maxbytes;
+static struct buffer_head *bh_read_iter(struct folio *folio,
+ struct buffer_head *pivot,
+ struct buffer_head *head,
+ struct inode *inode,
+ struct bh_iter *iter, sector_t lblock)
+{
+ struct buffer_head *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh = pivot, *last;
+ int nr = 0, i = 0;
+ size_t blocksize = head->b_size;
+ bool no_reads = false;
+ bool fully_mapped = false;
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
+ /* Stage one - collect buffer heads we need to issue a read for */
- head = folio_create_buffers(folio, inode, 0);
- blocksize = head->b_size;
+ /* collect buffers not uptodate and not mapped yet */
+ for_each_bh_pivot(bh, last, head) {
+ BUG_ON(nr >= MAX_BUF_PER_PAGE);
- iblock = div_u64(folio_pos(folio), blocksize);
- lblock = div_u64(limit + blocksize - 1, blocksize);
- nr = 0;
-
- /* Stage one - collect buffer heads we need to issue a read for */
- for_each_bh(bh, head) {
if (buffer_uptodate(bh)) {
- iblock++;
+ iter->iblock++;
continue;
}
if (!buffer_mapped(bh)) {
int err = 0;
- fully_mapped = 0;
- if (iblock < lblock) {
+ iter->unmapped++;
+ if (iter->iblock < lblock) {
WARN_ON(bh->b_size != blocksize);
- err = get_block(inode, iblock, bh, 0);
+ err = iter->get_block(inode, iter->iblock,
+ bh, 0);
if (err)
- page_error = true;
+ iter->any_get_block_error = true;
}
if (!buffer_mapped(bh)) {
folio_zero_range(folio, bh_offset(bh),
blocksize);
if (!err)
set_buffer_uptodate(bh);
- iblock++;
+ iter->iblock++;
continue;
}
/*
@@ -2469,15 +2478,66 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
* synchronously
*/
if (buffer_uptodate(bh)) {
- iblock++;
+ iter->iblock++;
continue;
}
}
arr[nr++] = bh;
- iblock++;
+ iter->iblock++;
+ }
+
+ iter->bh_folio_reads += nr;
+
+ WARN_ON_ONCE(!bh_is_last(last, head));
+
+ if (bh_is_last(last, head)) {
+ if (!iter->bh_folio_reads)
+ no_reads = true;
+ if (!iter->unmapped)
+ fully_mapped = true;
}
- bh_read_batch_async(folio, nr, arr, fully_mapped, nr == 0, page_error);
+ bh_read_batch_async(folio, nr, arr, fully_mapped, no_reads,
+ iter->any_get_block_error);
+
+ return last;
+}
+
+/*
+ * Generic "read_folio" function for block devices that have the normal
+ * get_block functionality. This is most of the block device filesystems.
+ * Reads the folio asynchronously --- the unlock_buffer() and
+ * set/clear_buffer_uptodate() functions propagate buffer state into the
+ * folio once IO has completed.
+ */
+int block_read_full_folio(struct folio *folio, get_block_t *get_block)
+{
+ struct inode *inode = folio->mapping->host;
+ sector_t lblock;
+ size_t blocksize;
+ struct buffer_head *bh, *head;
+ struct bh_iter iter = {
+ .get_block = get_block,
+ .unmapped = 0,
+ .any_get_block_error = false,
+ .bh_folio_reads = 0,
+ };
+ loff_t limit = i_size_read(inode);
+
+ /* This is needed for ext4. */
+ if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
+ limit = inode->i_sb->s_maxbytes;
+
+ VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
+
+ head = folio_create_buffers(folio, inode, 0);
+ blocksize = head->b_size;
+
+ iter.iblock = div_u64(folio_pos(folio), blocksize);
+ lblock = div_u64(limit + blocksize - 1, blocksize);
+
+ for_each_bh(bh, head)
+ bh = bh_read_iter(folio, bh, head, inode, &iter, lblock);
return 0;
}
--
2.43.0
* [PATCH 5/5] fs/buffer: reduce stack usage on bh_read_iter()
2024-12-18 2:26 [PATCH 0/5] fs/buffer: strack reduction on async read Luis Chamberlain
` (3 preceding siblings ...)
2024-12-18 2:26 ` [PATCH 4/5] fs/buffer: add iteration support " Luis Chamberlain
@ 2024-12-18 2:26 ` Luis Chamberlain
2024-12-18 2:47 ` Luis Chamberlain
2024-12-18 20:05 ` [PATCH 0/5] fs/buffer: strack reduction on async read Matthew Wilcox
2024-12-19 6:28 ` Christoph Hellwig
6 siblings, 1 reply; 17+ messages in thread
From: Luis Chamberlain @ 2024-12-18 2:26 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Now that we can asynchronously read buffer heads from a folio in
chunks, we can chop bh_read_iter() down to a smaller array size.
Use an array of 8 to avoid stack growth warnings on systems with
huge base page sizes.
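As a rough sketch of the saving, assuming 64-bit pointers:

        struct buffer_head *arr[MAX_BUF_CHUNK];   /* 8 * 8 bytes = 64 bytes */

versus the 4 KiB (512 entries) that MAX_BUF_PER_PAGE implies on a
256 KiB base page; the stack cost is now independent of the base page
size.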
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index b8ba72f2f211..bfa9c09b8597 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2415,7 +2415,10 @@ static void bh_read_batch_async(struct folio *folio,
(__tmp); \
(__tmp) = bh_next(__tmp, __head))
+#define MAX_BUF_CHUNK 8
+
struct bh_iter {
+ int chunk_number;
sector_t iblock;
get_block_t *get_block;
bool any_get_block_error;
@@ -2424,7 +2427,7 @@ struct bh_iter {
};
/*
- * Reads up to MAX_BUF_PER_PAGE buffer heads at a time on a folio over the given
+ * Reads up to MAX_BUF_CHUNK buffer heads at a time on a folio over the given
* block range iblock to lblock and updates iter->bh_folio_reads with the
* number of buffer heads which were not uptodate or not mapped and for which
* we issued an async read, for the full folio. Returns the last buffer head we
@@ -2436,10 +2439,11 @@ static struct buffer_head *bh_read_iter(struct folio *folio,
struct inode *inode,
struct bh_iter *iter, sector_t lblock)
{
- struct buffer_head *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *arr[MAX_BUF_CHUNK];
struct buffer_head *bh = pivot, *last;
int nr = 0, i = 0;
size_t blocksize = head->b_size;
+ int chunk_idx = MAX_BUF_CHUNK * iter->chunk_number;
bool no_reads = false;
bool fully_mapped = false;
@@ -2447,7 +2451,8 @@ static struct buffer_head *bh_read_iter(struct folio *folio,
/* collect buffers not uptodate and not mapped yet */
for_each_bh_pivot(bh, last, head) {
- BUG_ON(nr >= MAX_BUF_PER_PAGE);
+ if (nr >= MAX_BUF_CHUNK)
+ break;
if (buffer_uptodate(bh)) {
iter->iblock++;
@@ -2487,8 +2492,7 @@ static struct buffer_head *bh_read_iter(struct folio *folio,
}
iter->bh_folio_reads += nr;
-
- WARN_ON_ONCE(!bh_is_last(last, head));
+ iter->chunk_number++;
if (bh_is_last(last, head)) {
if (!iter->bh_folio_reads)
@@ -2518,6 +2522,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
struct buffer_head *bh, *head;
struct bh_iter iter = {
.get_block = get_block,
+ .chunk_number = 0,
.unmapped = 0,
.any_get_block_error = false,
.bh_folio_reads = 0,
--
2.43.0
* Re: [PATCH 5/5] fs/buffer: reduce stack usage on bh_read_iter()
2024-12-18 2:26 ` [PATCH 5/5] fs/buffer: reduce stack usage on bh_read_iter() Luis Chamberlain
@ 2024-12-18 2:47 ` Luis Chamberlain
0 siblings, 0 replies; 17+ messages in thread
From: Luis Chamberlain @ 2024-12-18 2:47 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On Tue, Dec 17, 2024 at 06:26:26PM -0800, Luis Chamberlain wrote:
> Now that we can asynchronously read buffer heads from a folio in
> chunks, we can chop bh_read_iter() down to a smaller array size.
> Use an array of 8 to avoid stack growth warnings on systems with
> huge base page sizes.
>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> fs/buffer.c | 15 ++++++++++-----
> 1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b8ba72f2f211..bfa9c09b8597 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2415,7 +2415,10 @@ static void bh_read_batch_async(struct folio *folio,
> (__tmp); \
> (__tmp) = bh_next(__tmp, __head))
>
> +#define MAX_BUF_CHUNK 8
> +
> struct bh_iter {
> + int chunk_number;
> sector_t iblock;
> get_block_t *get_block;
> bool any_get_block_error;
Oh... this can be cleaned up even further: chunk_idx can be removed
now, and we can also remove the unused i variable... I'll wait for more
feedback before a v2.
Luis
* Re: [PATCH 3/5] fs/buffer: add a for_each_bh() for block_read_full_folio()
2024-12-18 2:26 ` [PATCH 3/5] fs/buffer: add a for_each_bh() for block_read_full_folio() Luis Chamberlain
@ 2024-12-18 19:20 ` Matthew Wilcox
0 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2024-12-18 19:20 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Tue, Dec 17, 2024 at 06:26:24PM -0800, Luis Chamberlain wrote:
> /* Stage one - collect buffer heads we need to issue a read for */
> - do {
> - if (buffer_uptodate(bh))
> + for_each_bh(bh, head) {
> + if (buffer_uptodate(bh)) {
> + iblock++;
> continue;
> + }
I'm not loving this. It's fragile to have to put 'iblock++' before each
continue.
* Re: [PATCH 0/5] fs/buffer: strack reduction on async read
2024-12-18 2:26 [PATCH 0/5] fs/buffer: strack reduction on async read Luis Chamberlain
` (4 preceding siblings ...)
2024-12-18 2:26 ` [PATCH 5/5] fs/buffer: reduce stack usage on bh_read_iter() Luis Chamberlain
@ 2024-12-18 20:05 ` Matthew Wilcox
2024-12-19 2:27 ` Luis Chamberlain
2024-12-19 6:28 ` Christoph Hellwig
6 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2024-12-18 20:05 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> This splits up a minor enhancement from the bs > ps device support
> series into its own series for better review / focus / testing.
> This series just addresses reducing the array size used and cleaning
> up the async read so it is easier to read and maintain.
How about this approach instead -- get rid of the batch entirely?
diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f60251..f50ebbc1f518 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2361,9 +2361,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
{
struct inode *inode = folio->mapping->host;
sector_t iblock, lblock;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh, *head;
size_t blocksize;
- int nr, i;
+ int i, submitted = 0;
int fully_mapped = 1;
bool page_error = false;
loff_t limit = i_size_read(inode);
@@ -2380,7 +2380,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
iblock = div_u64(folio_pos(folio), blocksize);
lblock = div_u64(limit + blocksize - 1, blocksize);
bh = head;
- nr = 0;
i = 0;
do {
@@ -2411,40 +2410,30 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
if (buffer_uptodate(bh))
continue;
}
- arr[nr++] = bh;
+
+ lock_buffer(bh);
+ if (buffer_uptodate(bh)) {
+ unlock_buffer(bh);
+ continue;
+ }
+
+ mark_buffer_async_read(bh);
+ submit_bh(REQ_OP_READ, bh);
+ submitted++;
} while (i++, iblock++, (bh = bh->b_this_page) != head);
if (fully_mapped)
folio_set_mappedtodisk(folio);
- if (!nr) {
- /*
- * All buffers are uptodate or get_block() returned an
- * error when trying to map them - we can finish the read.
- */
- folio_end_read(folio, !page_error);
- return 0;
- }
-
- /* Stage two: lock the buffers */
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- lock_buffer(bh);
- mark_buffer_async_read(bh);
- }
-
/*
- * Stage 3: start the IO. Check for uptodateness
- * inside the buffer lock in case another process reading
- * the underlying blockdev brought it uptodate (the sct fix).
+ * All buffers are uptodate or get_block() returned an error
+ * when trying to map them - we must finish the read because
+ * end_buffer_async_read() will never be called on any buffer
+ * in this folio.
*/
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- if (buffer_uptodate(bh))
- end_buffer_async_read(bh, 1);
- else
- submit_bh(REQ_OP_READ, bh);
- }
+ if (!submitted)
+ folio_end_read(folio, !page_error);
+
return 0;
}
EXPORT_SYMBOL(block_read_full_folio);
* Re: [PATCH 0/5] fs/buffer: strack reduction on async read
2024-12-18 20:05 ` [PATCH 0/5] fs/buffer: strack reduction on async read Matthew Wilcox
@ 2024-12-19 2:27 ` Luis Chamberlain
2024-12-19 3:51 ` Matthew Wilcox
0 siblings, 1 reply; 17+ messages in thread
From: Luis Chamberlain @ 2024-12-19 2:27 UTC (permalink / raw)
To: Matthew Wilcox
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > This splits up a minor enhancement from the bs > ps device support
> > series into its own series for better review / focus / testing.
> > This series just addresses reducing the array size used and cleaning
> > up the async read so it is easier to read and maintain.
>
> How about this approach instead -- get rid of the batch entirely?
Less is more! I wish it worked, but we end up with a null pointer on
ext4/032 (and indeed this is the test that helped me find most bugs in
what I was working on):
[ 105.942462] loop0: detected capacity change from 0 to 1342177280
[ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 106.036903] #PF: supervisor read access in kernel mode
[ 106.038366] #PF: error_code(0x0000) - not-present page
[ 106.039819] PGD 0 P4D 0
[ 106.040574] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 106.041967] CPU: 2 UID: 0 PID: 29 Comm: ksoftirqd/2 Not tainted 6.13.0-rc3+ #42
[ 106.044018] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2024.08-1 09/18/2024
[ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
[ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
[ 106.053016] RSP: 0018:ffffa85880137dd0 EFLAGS: 00010246
[ 106.054499] RAX: 0000000000000000 RBX: ffff95e38f22e5b0 RCX: ffff95e39c8753e0
[ 106.056507] RDX: ffff95e3809f8000 RSI: 0000000000000001 RDI: ffff95e38f22e5b0
[ 106.058530] RBP: 0000000000000400 R08: ffff95e3a326b040 R09: 0000000000000001
[ 106.060546] R10: ffffffffbb6070c0 R11: 00000000002dc6c0 R12: 0000000000000000
[ 106.062426] R13: ffff95e3960ea800 R14: ffff95e39586ae40 R15: 0000000000000400
[ 106.064223] FS: 0000000000000000(0000) GS:ffff95e3fbc80000(0000) knlGS:0000000000000000
[ 106.066155] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 106.067473] CR2: 0000000000000000 CR3: 00000001226e2006 CR4: 0000000000772ef0
[ 106.069085] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 106.070571] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 106.072050] PKRU: 55555554
[ 106.072632] Call Trace:
[ 106.073176] <TASK>
[ 106.073611] ? __die_body.cold+0x19/0x26
[ 106.074383] ? page_fault_oops+0xa2/0x230
[ 106.075155] ? __smp_call_single_queue+0xa7/0x110
[ 106.076077] ? do_user_addr_fault+0x63/0x640
[ 106.076916] ? exc_page_fault+0x7a/0x190
[ 106.077639] ? asm_exc_page_fault+0x22/0x30
[ 106.078394] ? end_buffer_async_read_io+0x11/0x90
[ 106.079245] end_bio_bh_io_sync+0x23/0x40
[ 106.079973] blk_update_request+0x178/0x420
[ 106.080727] ? finish_task_switch.isra.0+0x94/0x290
[ 106.081600] blk_mq_end_request+0x18/0x30
[ 106.082281] blk_complete_reqs+0x3d/0x50
[ 106.082954] handle_softirqs+0xf9/0x2c0
[ 106.083607] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 106.084393] run_ksoftirqd+0x37/0x50
[ 106.085012] smpboot_thread_fn+0x184/0x220
[ 106.085688] kthread+0xda/0x110
[ 106.086208] ? __pfx_kthread+0x10/0x10
[ 106.086824] ret_from_fork+0x2d/0x50
[ 106.087409] ? __pfx_kthread+0x10/0x10
[ 106.088013] ret_from_fork_asm+0x1a/0x30
[ 106.088658] </TASK>
[ 106.089045] Modules linked in: loop sunrpc 9p nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd cryptd virtio_balloon 9pnet_virtio virtio_console joydev button evdev serio_raw nvme_fabrics nvme_core dm_mod drm nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod virtio_net net_failover virtio_blk failover crc32_pclmul crc32c_intel psmouse virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
[ 106.097895] CR2: 0000000000000000
[ 106.098326] ---[ end trace 0000000000000000 ]---
Luis
* Re: [PATCH 0/5] fs/buffer: strack reduction on async read
2024-12-19 2:27 ` Luis Chamberlain
@ 2024-12-19 3:51 ` Matthew Wilcox
2024-12-30 17:30 ` Luis Chamberlain
2025-01-31 16:54 ` Luis Chamberlain
0 siblings, 2 replies; 17+ messages in thread
From: Matthew Wilcox @ 2024-12-19 3:51 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > This splits up a minor enhancement from the bs > ps device support
> > > series into its own series for better review / focus / testing.
> > > This series just addresses reducing the array size used and cleaning
> > > up the async read so it is easier to read and maintain.
> >
> > How about this approach instead -- get rid of the batch entirely?
>
> Less is more! I wish it worked, but we end up with a null pointer on
> ext4/032 (and indeed this is the test that helped me find most bugs in
> what I was working on):
Yeah, I did no testing; just wanted to give people a different approach
to consider.
> [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
That decodes as:
5: 53 push %rbx
6: 48 8b 47 10 mov 0x10(%rdi),%rax
a: 48 89 fb mov %rdi,%rbx
d: 48 8b 40 18 mov 0x18(%rax),%rax
11:* 48 8b 00 mov (%rax),%rax <-- trapping instruction
14: f6 40 0d 40 testb $0x40,0xd(%rax)
6: bh->b_folio
d: b_folio->mapping
11: mapping->host
So folio->mapping is NULL.
Ah, I see the problem. end_buffer_async_read() uses the buffer_async_read
test to decide if all buffers on the page are uptodate or not. So both
having no batch (ie this patch) and having a batch which is smaller than
the number of buffers in the folio can lead to folio_end_read() being
called prematurely (ie we'll unlock the folio before finishing reading
every buffer in the folio).
Once the folio is unlocked, it can be truncated. That's a second-order
problem, but it's the one your test happened to hit.
This should fix the problem; we always have at least one BH held in
the submission path with the async_read flag set, so
end_buffer_async_read() will not end it prematurely.
By the way, do you have CONFIG_VM_DEBUG enabled in your testing?
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
in folio_end_read() should have tripped before hitting the race with
truncate.
diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f60251..fd2633e4a5d2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2361,9 +2361,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
{
struct inode *inode = folio->mapping->host;
sector_t iblock, lblock;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh, *head, *prev = NULL;
size_t blocksize;
- int nr, i;
+ int i;
int fully_mapped = 1;
bool page_error = false;
loff_t limit = i_size_read(inode);
@@ -2380,7 +2380,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
iblock = div_u64(folio_pos(folio), blocksize);
lblock = div_u64(limit + blocksize - 1, blocksize);
bh = head;
- nr = 0;
i = 0;
do {
@@ -2411,40 +2410,33 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
if (buffer_uptodate(bh))
continue;
}
- arr[nr++] = bh;
+
+ lock_buffer(bh);
+ if (buffer_uptodate(bh)) {
+ unlock_buffer(bh);
+ continue;
+ }
+
+ mark_buffer_async_read(bh);
+ if (prev)
+ submit_bh(REQ_OP_READ, prev);
+ prev = bh;
} while (i++, iblock++, (bh = bh->b_this_page) != head);
if (fully_mapped)
folio_set_mappedtodisk(folio);
- if (!nr) {
- /*
- * All buffers are uptodate or get_block() returned an
- * error when trying to map them - we can finish the read.
- */
- folio_end_read(folio, !page_error);
- return 0;
- }
-
- /* Stage two: lock the buffers */
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- lock_buffer(bh);
- mark_buffer_async_read(bh);
- }
-
/*
- * Stage 3: start the IO. Check for uptodateness
- * inside the buffer lock in case another process reading
- * the underlying blockdev brought it uptodate (the sct fix).
+ * All buffers are uptodate or get_block() returned an error
+ * when trying to map them - we must finish the read because
+ * end_buffer_async_read() will never be called on any buffer
+ * in this folio.
*/
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- if (buffer_uptodate(bh))
- end_buffer_async_read(bh, 1);
- else
- submit_bh(REQ_OP_READ, bh);
- }
+ if (prev)
+ submit_bh(REQ_OP_READ, prev);
+ else
+ folio_end_read(folio, !page_error);
+
return 0;
}
EXPORT_SYMBOL(block_read_full_folio);
* Re: [PATCH 0/5] fs/buffer: strack reduction on async read
2024-12-18 2:26 [PATCH 0/5] fs/buffer: strack reduction on async read Luis Chamberlain
` (5 preceding siblings ...)
2024-12-18 20:05 ` [PATCH 0/5] fs/buffer: strack reduction on async read Matthew Wilcox
@ 2024-12-19 6:28 ` Christoph Hellwig
2024-12-19 17:53 ` Luis Chamberlain
6 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-12-19 6:28 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hare, willy, dave, david, djwong, kbusch, john.g.garry, hch,
ritesh.list, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
What is this strack that gets reduced here?
* Re: [PATCH 0/5] fs/buffer: strack reduction on async read
2024-12-19 6:28 ` Christoph Hellwig
@ 2024-12-19 17:53 ` Luis Chamberlain
0 siblings, 0 replies; 17+ messages in thread
From: Luis Chamberlain @ 2024-12-19 17:53 UTC (permalink / raw)
To: Christoph Hellwig
Cc: hare, willy, dave, david, djwong, kbusch, john.g.garry,
ritesh.list, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Thu, Dec 19, 2024 at 07:28:27AM +0100, Christoph Hellwig wrote:
> What is this strack that gets reduced here?
s/strack/stack
I seriously need a spell checker as part of my pipeline.
Luis
* Re: [PATCH 0/5] fs/buffer: strack reduction on async read
2024-12-19 3:51 ` Matthew Wilcox
@ 2024-12-30 17:30 ` Luis Chamberlain
2025-01-31 16:54 ` Luis Chamberlain
1 sibling, 0 replies; 17+ messages in thread
From: Luis Chamberlain @ 2024-12-30 17:30 UTC (permalink / raw)
To: Matthew Wilcox
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> So folio->mapping is NULL.
>
> Ah, I see the problem. end_buffer_async_read() uses the buffer_async_read
> test to decide if all buffers on the page are uptodate or not. So both
> having no batch (ie this patch) and having a batch which is smaller than
> the number of buffers in the folio can lead to folio_end_read() being
> called prematurely (ie we'll unlock the folio before finishing reading
> every buffer in the folio).
>
> Once the folio is unlocked, it can be truncated. That's a second-order
> problem, but it's the one your test happened to hit.
>
> This should fix the problem; we always have at least one BH held in
> the submission path with the async_read flag set, so
> end_buffer_async_read() will not end it prematurely.
Oh neat, yes.
> By the way, do you have CONFIG_VM_DEBUG enabled in your testing?
You mean DEBUG_VM? Yes:
grep DEBUG_VM .config
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_VM_IRQSOFF=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_MAPLE_TREE is not set
# CONFIG_DEBUG_VM_RB is not set
CONFIG_DEBUG_VM_PGFLAGS=y
# CONFIG_DEBUG_VM_PGTABLE is not set
> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> in folio_end_read() should have tripped before hitting the race with
> truncate.
Odd that it did not. I had run into that folio_test_locked() splat,
but in my attempts to simplify this without your trick I only ran into
the similar truncate race; your resolution to this is nice.
> diff --git a/fs/buffer.c b/fs/buffer.c
This is a nice resolution and simplification, thanks. I've tested it
and it passes without regressions on ext4. I'll take this into the
series as an alternative.
Luis
* Re: [PATCH 0/5] fs/buffer: strack reduction on async read
2024-12-19 3:51 ` Matthew Wilcox
2024-12-30 17:30 ` Luis Chamberlain
@ 2025-01-31 16:54 ` Luis Chamberlain
2025-01-31 22:01 ` Matthew Wilcox
1 sibling, 1 reply; 17+ messages in thread
From: Luis Chamberlain @ 2025-01-31 16:54 UTC (permalink / raw)
To: Matthew Wilcox
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> > On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > > This splits up a minor enhancement from the bs > ps device support
> > > > series into its own series for better review / focus / testing.
> > > > This series just addresses reducing the array size used and cleaning
> > > > up the async read so it is easier to read and maintain.
> > >
> > > How about this approach instead -- get rid of the batch entirely?
> >
> > Less is more! I wish it worked, but we end up with a null pointer on
> > ext4/032 (and indeed this is the test that helped me find most bugs in
> > what I was working on):
>
> Yeah, I did no testing; just wanted to give people a different approach
> to consider.
>
> > [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> > [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
>
> That decodes as:
>
> 5: 53 push %rbx
> 6: 48 8b 47 10 mov 0x10(%rdi),%rax
> a: 48 89 fb mov %rdi,%rbx
> d: 48 8b 40 18 mov 0x18(%rax),%rax
> 11:* 48 8b 00 mov (%rax),%rax <-- trapping instruction
> 14: f6 40 0d 40 testb $0x40,0xd(%rax)
>
> 6: bh->b_folio
> d: b_folio->mapping
> 11: mapping->host
>
> So folio->mapping is NULL.
>
> Ah, I see the problem. end_buffer_async_read() uses the buffer_async_read
> test to decide if all buffers on the page are uptodate or not. So both
> having no batch (ie this patch) and having a batch which is smaller than
> the number of buffers in the folio can lead to folio_end_read() being
> called prematurely (ie we'll unlock the folio before finishing reading
> every buffer in the folio).
But:
a) all batched buffers are locked in the old code, we only unlock
the currently evaluated buffer, and the buffers from our pivot onward
are locked and should also have the async flag set. The fact that
buffers ahead should have the async flag set should prevent us from
calling folio_end_read() prematurely, as I read the code, no?
b) In the case where we're evaluating the last buffer, we can unlock
and call folio_end_read(), but that seems fine given the previous batch
work was in charge of finding candidate buffers which need a read, so
in this case there should be no pending read.
So I don't yet see how we can call folio_end_read() prematurely.
We do however unlock the buffer in end_buffer_async_read(), but in
case of an inconsistency we simply bail on the loop, and since we only
called end_buffer_async_read() when the buffer is up to date, the only
inconsistency we check for is if (!buffer_uptodate(tmp)), in which case
folio_end_read() will be called but without success being noted.
> Once the folio is unlocked, it can be truncated. That's a second-order
> problem, but it's the one your test happened to hit.
>
>
> This should fix the problem; we always have at least one BH held in
> the submission path with the async_read flag set, so
> end_buffer_async_read() will not end it prematurely.
But this alternative does not call end_buffer_async_read(); in fact,
we only call folio_end_read() when no pending reads are needed.
> diff --git a/fs/buffer.c b/fs/buffer.c
> index cc8452f60251..fd2633e4a5d2 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2361,9 +2361,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> {
> struct inode *inode = folio->mapping->host;
> sector_t iblock, lblock;
> - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> + struct buffer_head *bh, *head, *prev = NULL;
> size_t blocksize;
> - int nr, i;
> + int i;
> int fully_mapped = 1;
> bool page_error = false;
> loff_t limit = i_size_read(inode);
> @@ -2380,7 +2380,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> iblock = div_u64(folio_pos(folio), blocksize);
> lblock = div_u64(limit + blocksize - 1, blocksize);
> bh = head;
> - nr = 0;
> i = 0;
>
> do {
> @@ -2411,40 +2410,33 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> if (buffer_uptodate(bh))
> continue;
> }
> - arr[nr++] = bh;
> +
> + lock_buffer(bh);
> + if (buffer_uptodate(bh)) {
> + unlock_buffer(bh);
> + continue;
> + }
> +
> + mark_buffer_async_read(bh);
> + if (prev)
> + submit_bh(REQ_OP_READ, prev);
> + prev = bh;
> } while (i++, iblock++, (bh = bh->b_this_page) != head);
>
> if (fully_mapped)
> folio_set_mappedtodisk(folio);
>
> - if (!nr) {
> - /*
> - * All buffers are uptodate or get_block() returned an
> - * error when trying to map them - we can finish the read.
> - */
> - folio_end_read(folio, !page_error);
> - return 0;
> - }
> -
> - /* Stage two: lock the buffers */
> - for (i = 0; i < nr; i++) {
> - bh = arr[i];
> - lock_buffer(bh);
> - mark_buffer_async_read(bh);
> - }
> -
> /*
> - * Stage 3: start the IO. Check for uptodateness
> - * inside the buffer lock in case another process reading
> - * the underlying blockdev brought it uptodate (the sct fix).
> + * All buffers are uptodate or get_block() returned an error
> + * when trying to map them - we must finish the read because
> + * end_buffer_async_read() will never be called on any buffer
> + * in this folio.
Do we want to keep mentioning end_buffer_async_read() here?
> */
> - for (i = 0; i < nr; i++) {
> - bh = arr[i];
> - if (buffer_uptodate(bh))
> - end_buffer_async_read(bh, 1);
> - else
> - submit_bh(REQ_OP_READ, bh);
> - }
> + if (prev)
> + submit_bh(REQ_OP_READ, prev);
> + else
> + folio_end_read(folio, !page_error);
> +
> return 0;
Because we only call folio_end_read() in the above code when we
determined there was no pending read IO. In case we had to issue at
least one read for one buffer, we never call folio_end_read(). We
didn't before either, unless we ran into a race where a pending batched
read coincided with a read being issued and updating our buffer by
chance, and we determined we either completed that read fine or with an
error.
The reason I'm asking these things is that I'm trying to determine
whether there is an issue we're fixing here other than the
simplification with the new un-batched strategy. I don't see it yet. If
we're fixing something here, it is still a bit obscure to me and I'd
like to make sure we document it properly.
Luis
* Re: [PATCH 0/5] fs/buffer: strack reduction on async read
2025-01-31 16:54 ` Luis Chamberlain
@ 2025-01-31 22:01 ` Matthew Wilcox
2025-02-03 14:00 ` Luis Chamberlain
0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2025-01-31 22:01 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Fri, Jan 31, 2025 at 08:54:31AM -0800, Luis Chamberlain wrote:
> On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> > On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> > > On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > > > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > > > This splits up a minor enhancement from the bs > ps device support
> > > > > series into its own series for better review / focus / testing.
> > > > > This series just addresses reducing the array size used and cleaning
> > > > > up the async read so it is easier to read and maintain.
> > > >
> > > > How about this approach instead -- get rid of the batch entirely?
> > >
> > > Less is more! I wish it worked, but we end up with a null pointer on
> > > ext4/032 (and indeed this is the test that helped me find most bugs in
> > > what I was working on):
> >
> > Yeah, I did no testing; just wanted to give people a different approach
> > to consider.
> >
> > > [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> > > [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
> >
> > That decodes as:
> >
> > 5: 53 push %rbx
> > 6: 48 8b 47 10 mov 0x10(%rdi),%rax
> > a: 48 89 fb mov %rdi,%rbx
> > d: 48 8b 40 18 mov 0x18(%rax),%rax
> > 11:* 48 8b 00 mov (%rax),%rax <-- trapping instruction
> > 14: f6 40 0d 40 testb $0x40,0xd(%rax)
> >
> > 6: bh->b_folio
> > d: b_folio->mapping
> > 11: mapping->host
> >
> > So folio->mapping is NULL.
> >
> > Ah, I see the problem. end_buffer_async_read() uses the buffer_async_read
> > test to decide if all buffers on the page are uptodate or not. So both
> > having no batch (ie this patch) and having a batch which is smaller than
> > the number of buffers in the folio can lead to folio_end_read() being
> > called prematurely (ie we'll unlock the folio before finishing reading
> > every buffer in the folio).
>
> But:
>
> a) all batched buffers are locked in the old code, we only unlock
> the currently evaluated buffer, the buffers from our pivot are locked
> and should also have the async flag set. That fact that buffers ahead
> should have the async flag set should prevent from calling
> folio_end_read() prematurely as I read the code, no?
I'm sure you know what you mean by "the old code", but I don't.
If you mean "the code in 6.13", here's what it does:
tmp = bh;
do {
if (!buffer_uptodate(tmp))
folio_uptodate = 0;
if (buffer_async_read(tmp)) {
BUG_ON(!buffer_locked(tmp));
goto still_busy;
}
tmp = tmp->b_this_page;
} while (tmp != bh);
folio_end_read(folio, folio_uptodate);
so it's going to cycle around every buffer on the page, and if it finds
none which are marked async_read, it'll call folio_end_read().
That's fine in 6.13 because in stage 2, all buffers which are part of
this folio are marked as async_read.
In your patch, you mark every buffer _in the batch_ as async_read
and then submit the entire batch. So if they all complete before you
mark the next bh as being uptodate, it'll think the read is complete
and call folio_end_read().
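A minimal sketch of that race, assuming a folio with two buffers where
only bh0 is in the submitted batch:

        /*
         * submission path                     IRQ completion path
         *
         * mark_buffer_async_read(bh0);
         * submit_bh(REQ_OP_READ, bh0);
         *                                     end_buffer_async_read(bh0):
         *                                       bh1 is not async_read yet,
         *                                       so folio_end_read() unlocks
         *                                       the folio
         * lock_buffer(bh1);                   the folio can now be truncated
         * mark_buffer_async_read(bh1);        and folio->mapping go NULL
         */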
* Re: [PATCH 0/5] fs/buffer: strack reduction on async read
2025-01-31 22:01 ` Matthew Wilcox
@ 2025-02-03 14:00 ` Luis Chamberlain
0 siblings, 0 replies; 17+ messages in thread
From: Luis Chamberlain @ 2025-02-03 14:00 UTC (permalink / raw)
To: Matthew Wilcox
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Fri, Jan 31, 2025 at 10:01:46PM +0000, Matthew Wilcox wrote:
> On Fri, Jan 31, 2025 at 08:54:31AM -0800, Luis Chamberlain wrote:
> > On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> > > On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> > > > On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > > > > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > > > > This splits up a minor enhancement from the bs > ps device support
> > > > > > series into its own series for better review / focus / testing.
> > > > > > This series just addresses reducing the array size used and cleaning
> > > > > > up the async read so it is easier to read and maintain.
> > > > >
> > > > > How about this approach instead -- get rid of the batch entirely?
> > > >
> > > > Less is more! I wish it worked, but we end up with a null pointer on
> > > > ext4/032 (and indeed this is the test that helped me find most bugs in
> > > > what I was working on):
> > >
> > > Yeah, I did no testing; just wanted to give people a different approach
> > > to consider.
> > >
> > > > [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> > > > [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
> > >
> > > That decodes as:
> > >
> > > 5: 53 push %rbx
> > > 6: 48 8b 47 10 mov 0x10(%rdi),%rax
> > > a: 48 89 fb mov %rdi,%rbx
> > > d: 48 8b 40 18 mov 0x18(%rax),%rax
> > > 11:* 48 8b 00 mov (%rax),%rax <-- trapping instruction
> > > 14: f6 40 0d 40 testb $0x40,0xd(%rax)
> > >
> > > 6: bh->b_folio
> > > d: b_folio->mapping
> > > 11: mapping->host
> > >
> > > So folio->mapping is NULL.
> > >
> > > Ah, I see the problem. end_buffer_async_read() uses the buffer_async_read
> > > test to decide if all buffers on the page are uptodate or not. So both
> > > having no batch (ie this patch) and having a batch which is smaller than
> > > the number of buffers in the folio can lead to folio_end_read() being
> > > called prematurely (ie we'll unlock the folio before finishing reading
> > > every buffer in the folio).
> >
> > But:
> >
> > a) all batched buffers are locked in the old code, we only unlock
> > the currently evaluated buffer, the buffers from our pivot are locked
> > and should also have the async flag set. That fact that buffers ahead
> > should have the async flag set should prevent from calling
> > folio_end_read() prematurely as I read the code, no?
>
> I'm sure you know what you mean by "the old code", but I don't.
>
> If you mean "the code in 6.13", here's what it does:
Yes that is what I meant, sorry.
>
> tmp = bh;
> do {
> if (!buffer_uptodate(tmp))
> folio_uptodate = 0;
> if (buffer_async_read(tmp)) {
> BUG_ON(!buffer_locked(tmp));
> goto still_busy;
> }
> tmp = tmp->b_this_page;
> } while (tmp != bh);
> folio_end_read(folio, folio_uptodate);
>
> so it's going to cycle around every buffer on the page, and if it finds
> none which are marked async_read, it'll call folio_end_read().
> That's fine in 6.13 because in stage 2, all buffers which are part of
> this folio are marked as async_read.
Indeed. Also, it's not just a single pass over every buffer on the
page: since we can call end_buffer_async_read() on every buffer in the
page, in the worst case each such call runs the loop above, starting
from the pivot buffer and walking up to the end of the page.
> In your patch, you mark every buffer _in the batch_ as async_read
> and then submit the entire batch. So if they all complete before you
> mark the next bh as being uptodate, it'll think the read is complete
> and call folio_end_read().
Ah yes, thanks, this clarifies to me what you meant!
Luis