* [PATCH 2/2] iomap: align writeback to RAID stripe boundaries
From: Tony Battersby @ 2025-07-29 16:13 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Christian Brauner, Darrick J. Wong,
Matthew Wilcox (Oracle)
Cc: linux-raid, linux-xfs, linux-fsdevel, linux-kernel
Improve writeback performance to RAID-4/5/6 by aligning writes to stripe
boundaries. This relies on io_opt being set to the stripe size (or
a multiple) when BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE is set.
Benchmark of sequential writing to a large file on XFS using
io_uring with 8-disk md-raid6:
Before: 601.0 MB/s
After: 614.5 MB/s
Improvement: +2.3%
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---
fs/iomap/buffered-io.c | 175 +++++++++++++++++++++++++----------------
1 file changed, 106 insertions(+), 69 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index fb4519158f3a..f9020f916268 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1685,81 +1685,118 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
struct inode *inode, loff_t pos, loff_t end_pos,
unsigned len)
{
- struct iomap_folio_state *ifs = folio->private;
- size_t poff = offset_in_folio(folio, pos);
- unsigned int ioend_flags = 0;
- int error;
-
- if (wpc->iomap.type == IOMAP_UNWRITTEN)
- ioend_flags |= IOMAP_IOEND_UNWRITTEN;
- if (wpc->iomap.flags & IOMAP_F_SHARED)
- ioend_flags |= IOMAP_IOEND_SHARED;
- if (folio_test_dropbehind(folio))
- ioend_flags |= IOMAP_IOEND_DONTCACHE;
- if (pos == wpc->iomap.offset && (wpc->iomap.flags & IOMAP_F_BOUNDARY))
- ioend_flags |= IOMAP_IOEND_BOUNDARY;
+ struct queue_limits *lim = bdev_limits(wpc->iomap.bdev);
+ unsigned int io_align =
+ (lim->features & BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE) ?
+ lim->io_opt >> SECTOR_SHIFT : 0;
- if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos, ioend_flags)) {
+ do {
+ struct iomap_folio_state *ifs = folio->private;
+ size_t poff = offset_in_folio(folio, pos);
+ unsigned int ioend_flags = 0;
+ unsigned int rem_len = 0;
+ int error;
+
+ if (wpc->iomap.type == IOMAP_UNWRITTEN)
+ ioend_flags |= IOMAP_IOEND_UNWRITTEN;
+ if (wpc->iomap.flags & IOMAP_F_SHARED)
+ ioend_flags |= IOMAP_IOEND_SHARED;
+ if (folio_test_dropbehind(folio))
+ ioend_flags |= IOMAP_IOEND_DONTCACHE;
+ if (pos == wpc->iomap.offset &&
+ (wpc->iomap.flags & IOMAP_F_BOUNDARY))
+ ioend_flags |= IOMAP_IOEND_BOUNDARY;
+
+ if (!wpc->ioend ||
+ !iomap_can_add_to_ioend(wpc, pos, ioend_flags)) {
new_ioend:
- error = iomap_submit_ioend(wpc, 0);
- if (error)
- return error;
- wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos,
- ioend_flags);
- }
+ error = iomap_submit_ioend(wpc, 0);
+ if (error)
+ return error;
+ wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos,
+ ioend_flags);
+ }
- if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff))
- goto new_ioend;
+ /* Align writes to io_align if given. */
+ if (io_align && !(wpc->iomap.flags & IOMAP_F_ANON_WRITE)) {
+ sector_t lba = bio_end_sector(&wpc->ioend->io_bio);
+ unsigned int mod = lba % io_align;
+ unsigned int max_len;
- if (ifs)
- atomic_add(len, &ifs->write_bytes_pending);
+ /*
+ * If the end sector is already aligned and the bio is
+ * nonempty, then start a new bio for the remainder.
+ */
+ if (!mod && wpc->ioend->io_bio.bi_iter.bi_size)
+ goto new_ioend;
- /*
- * Clamp io_offset and io_size to the incore EOF so that ondisk
- * file size updates in the ioend completion are byte-accurate.
- * This avoids recovering files with zeroed tail regions when
- * writeback races with appending writes:
- *
- * Thread 1: Thread 2:
- * ------------ -----------
- * write [A, A+B]
- * update inode size to A+B
- * submit I/O [A, A+BS]
- * write [A+B, A+B+C]
- * update inode size to A+B+C
- * <I/O completes, updates disk size to min(A+B+C, A+BS)>
- * <power failure>
- *
- * After reboot:
- * 1) with A+B+C < A+BS, the file has zero padding in range
- * [A+B, A+B+C]
- *
- * |< Block Size (BS) >|
- * |DDDDDDDDDDDD0000000000000|
- * ^ ^ ^
- * A A+B A+B+C
- * (EOF)
- *
- * 2) with A+B+C > A+BS, the file has zero padding in range
- * [A+B, A+BS]
- *
- * |< Block Size (BS) >|< Block Size (BS) >|
- * |DDDDDDDDDDDD0000000000000|00000000000000000000000000|
- * ^ ^ ^ ^
- * A A+B A+BS A+B+C
- * (EOF)
- *
- * D = Valid Data
- * 0 = Zero Padding
- *
- * Note that this defeats the ability to chain the ioends of
- * appending writes.
- */
- wpc->ioend->io_size += len;
- if (wpc->ioend->io_offset + wpc->ioend->io_size > end_pos)
- wpc->ioend->io_size = end_pos - wpc->ioend->io_offset;
+ /*
+ * Clip the end of the bio to the alignment boundary.
+ */
+ max_len = (io_align - mod) << SECTOR_SHIFT;
+ if (len > max_len) {
+ rem_len = len - max_len;
+ len = max_len;
+ }
+ }
+
+ if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff))
+ goto new_ioend;
+
+ if (ifs)
+ atomic_add(len, &ifs->write_bytes_pending);
+
+ /*
+ * Clamp io_offset and io_size to the incore EOF so that ondisk
+ * file size updates in the ioend completion are byte-accurate.
+ * This avoids recovering files with zeroed tail regions when
+ * writeback races with appending writes:
+ *
+ * Thread 1: Thread 2:
+ * ------------ -----------
+ * write [A, A+B]
+ * update inode size to A+B
+ * submit I/O [A, A+BS]
+ * write [A+B, A+B+C]
+ * update inode size to A+B+C
+ * <I/O completes, updates disk size to min(A+B+C, A+BS)>
+ * <power failure>
+ *
+ * After reboot:
+ * 1) with A+B+C < A+BS, the file has zero padding in range
+ * [A+B, A+B+C]
+ *
+ * |< Block Size (BS) >|
+ * |DDDDDDDDDDDD0000000000000|
+ * ^ ^ ^
+ * A A+B A+B+C
+ * (EOF)
+ *
+ * 2) with A+B+C > A+BS, the file has zero padding in range
+ * [A+B, A+BS]
+ *
+ * |< Block Size (BS) >|< Block Size (BS) >|
+ * |DDDDDDDDDDDD0000000000000|00000000000000000000000000|
+ * ^ ^ ^ ^
+ * A A+B A+BS A+B+C
+ * (EOF)
+ *
+ * D = Valid Data
+ * 0 = Zero Padding
+ *
+ * Note that this defeats the ability to chain the ioends of
+ * appending writes.
+ */
+ wpc->ioend->io_size += len;
+ if (wpc->ioend->io_offset + wpc->ioend->io_size > end_pos)
+ wpc->ioend->io_size = end_pos - wpc->ioend->io_offset;
+
+ wbc_account_cgroup_owner(wbc, folio, len);
+
+ pos += len;
+ len = rem_len;
+ } while (len);
- wbc_account_cgroup_owner(wbc, folio, len);
return 0;
}
--
2.43.0
* Re: [PATCH 2/2] iomap: align writeback to RAID stripe boundaries
From: Matthew Wilcox @ 2025-07-29 18:38 UTC (permalink / raw)
To: Tony Battersby
Cc: Song Liu, Yu Kuai, Christian Brauner, Darrick J. Wong, linux-raid,
linux-xfs, linux-fsdevel, linux-kernel
On Tue, Jul 29, 2025 at 12:13:42PM -0400, Tony Battersby wrote:
> Improve writeback performance to RAID-4/5/6 by aligning writes to stripe
> boundaries. This relies on io_opt being set to the stripe size (or
> a multiple) when BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE is set.
When you say "aligning writes to stripe boundaries", what you actually
seem to be doing here is sending writes down once we hit a write stripe
boundary, instead of accumulating writes that cross stripe boundaries.
Do I understand correctly?
If so, the performance gain we see here is presumably from the DM/MD
driver not having to split bios that cross boundaries?
Further, wouldn't it be simpler to just put a new condition in
iomap_can_add_to_ioend() rather than turning iomap_add_to_ioend()
into a nested loop?
* Re: [PATCH 2/2] iomap: align writeback to RAID stripe boundaries
From: Tony Battersby @ 2025-07-29 19:01 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Song Liu, Yu Kuai, Christian Brauner, Darrick J. Wong, linux-raid,
linux-xfs, linux-fsdevel, linux-kernel
On 7/29/25 14:38, Matthew Wilcox wrote:
> On Tue, Jul 29, 2025 at 12:13:42PM -0400, Tony Battersby wrote:
>> Improve writeback performance to RAID-4/5/6 by aligning writes to stripe
>> boundaries. This relies on io_opt being set to the stripe size (or
>> a multiple) when BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE is set.
> When you say "aligning writes to stripe boundaries", what you actually
> seem to be doing here is sending writes down once we hit a write stripe
> boundary, instead of accumulating writes that cross stripe boundaries.
> Do I understand correctly?
>
> If so, the performance gain we see here is presumably from the DM/MD
> driver not having to split bios that cross boundaries?
>
> Further, wouldn't it be simpler to just put a new condition in
> iomap_can_add_to_ioend() rather than turning iomap_add_to_ioend()
> into a nested loop?
>
Yes, you understand correctly. The test creates a number of sequential
writes, and this patch cuts the stream of sequential bios on the stripe
boundaries rather than letting the bios span stripes, so that MD doesn't
have to do extra work for writes that cross the boundary. I am actually
working on an out-of-tree RAID driver that benefits hugely from this
because it doesn't have the complexity of the MD caching layer. But
benchmarks showed that MD benefited from it (slightly) also, so I
figured it was worth submitting.
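To make the cut arithmetic concrete, here is a minimal userspace sketch of
the same computation the patch performs (the helper name and the numbers are
made up for illustration; it assumes a hypothetical 6-data-disk array with a
64 KiB chunk, i.e. io_opt = 384 KiB = 768 sectors):

#include <stdio.h>

#define SECTOR_SHIFT 9

/*
 * How many bytes of 'len' fit before the next io_align boundary.
 * Returning 0 mirrors the "already aligned, start a new bio" case.
 */
static unsigned int clip_to_boundary(unsigned long long end_sector,
				     unsigned int io_align, unsigned int len)
{
	unsigned int mod = end_sector % io_align;
	unsigned int max_len;

	if (!mod)
		return 0;
	max_len = (io_align - mod) << SECTOR_SHIFT;
	return len < max_len ? len : max_len;
}

int main(void)
{
	/*
	 * 64 KiB chunk * 6 data disks = 384 KiB stripe = 768 sectors.
	 * For a 1 MiB write whose bio currently ends at sector 1000:
	 * 1000 % 768 = 232, so 536 sectors (274432 bytes) fit before the
	 * cut; the remaining 774144 bytes are carried into later bios.
	 */
	printf("%u\n", clip_to_boundary(1000, 768, 1048576));
	return 0;
}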
The problem with using iomap_can_add_to_ioend() is that it returns
true/false, whereas sometimes it is necessary to add some of the folio
to the current bio and the rest to a new bio.
Tony Battersby
Cybernetics
* Re: [PATCH 2/2] iomap: align writeback to RAID stripe boundaries
From: Matthew Wilcox @ 2025-07-29 19:17 UTC (permalink / raw)
To: Tony Battersby
Cc: Song Liu, Yu Kuai, Christian Brauner, Darrick J. Wong, linux-raid,
linux-xfs, linux-fsdevel, linux-kernel
On Tue, Jul 29, 2025 at 03:01:28PM -0400, Tony Battersby wrote:
> Yes, you understand correctly. The test creates a number of sequential
> writes, and this patch cuts the stream of sequential bios on the stripe
> boundaries rather than letting the bios span stripes, so that MD doesn't
> have to do extra work for writes that cross the boundary. I am actually
> working on an out-of-tree RAID driver that benefits hugely from this
> because it doesn't have the complexity of the MD caching layer. But
> benchmarks showed that MD benefited from it (slightly) also, so I
> figured it was worth submitting.
>
> The problem with using iomap_can_add_to_ioend() is that it returns
> true/false, whereas sometimes it is necessary to add some of the folio
> to the current bio and the rest to a new bio.
Hm. Maybe something like this would be more clear?
(contents and indeed name of iomap_should_split_ioend() very much TBD)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9f541c05103b..429890fb7763 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1684,6 +1684,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
struct iomap_folio_state *ifs = folio->private;
size_t poff = offset_in_folio(folio, pos);
unsigned int ioend_flags = 0;
+ unsigned thislen;
int error;
if (wpc->iomap.type == IOMAP_UNWRITTEN)
@@ -1704,8 +1705,16 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
ioend_flags);
}
- if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff))
+ thislen = iomap_should_split_ioend(wpc, pos, len);
+
+ if (!bio_add_folio(&wpc->ioend->io_bio, folio, thislen, poff))
+ goto new_ioend;
+ if (thislen < len) {
+ pos += thislen;
+ len -= thislen;
+ wbc_account_cgroup_owner(wbc, folio, thislen);
goto new_ioend;
+ }
if (ifs)
atomic_add(len, &ifs->write_bytes_pending);
* Re: [PATCH 2/2] iomap: align writeback to RAID stripe boundaries
From: Tony Battersby @ 2025-07-29 20:12 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Song Liu, Yu Kuai, Christian Brauner, Darrick J. Wong, linux-raid,
linux-xfs, linux-fsdevel, linux-kernel
On 7/29/25 15:17, Matthew Wilcox wrote:
> Hm. Maybe something like this would be more clear?
>
> (contents and indeed name of iomap_should_split_ioend() very much TBD)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 9f541c05103b..429890fb7763 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1684,6 +1684,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
> struct iomap_folio_state *ifs = folio->private;
> size_t poff = offset_in_folio(folio, pos);
> unsigned int ioend_flags = 0;
> + unsigned thislen;
> int error;
>
> if (wpc->iomap.type == IOMAP_UNWRITTEN)
> @@ -1704,8 +1705,16 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
> ioend_flags);
> }
>
> - if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff))
> + thislen = iomap_should_split_ioend(wpc, pos, len);
> +
> + if (!bio_add_folio(&wpc->ioend->io_bio, folio, thislen, poff))
> + goto new_ioend;
> + if (thislen < len) {
> + pos += thislen;
> + len -= thislen;
> + wbc_account_cgroup_owner(wbc, folio, thislen);
> goto new_ioend;
> + }
>
> if (ifs)
> atomic_add(len, &ifs->write_bytes_pending);
How is this? Does ioend_flags need to be recomputed (particularly
IOMAP_IOEND_BOUNDARY) when processing the remainder of the folio?
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index fb4519158f3a..0967e6fd62a1 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1669,6 +1669,39 @@ static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos,
return true;
}
+static unsigned int iomap_should_split_ioend(struct iomap_writepage_ctx *wpc,
+ loff_t pos, unsigned int len)
+{
+ struct queue_limits *lim = bdev_limits(wpc->iomap.bdev);
+
+ if ((lim->features & BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE) &&
+ !(wpc->iomap.flags & IOMAP_F_ANON_WRITE)) {
+ unsigned int io_align = lim->io_opt >> SECTOR_SHIFT;
+
+ /* Split sequential writes along io_align boundaries. */
+ if (io_align) {
+ sector_t lba = bio_end_sector(&wpc->ioend->io_bio);
+ unsigned int mod = lba % io_align;
+ unsigned int max_len;
+
+ /*
+ * If the end sector is already aligned and the bio is
+ * nonempty, then start a new bio for the remainder.
+ */
+ if (!mod && wpc->ioend->io_bio.bi_iter.bi_size)
+ return 0;
+
+ /*
+ * Clip the end of the bio to the alignment boundary.
+ */
+ max_len = (io_align - mod) << SECTOR_SHIFT;
+ if (len > max_len)
+ len = max_len;
+ }
+ }
+ return len;
+}
+
/*
* Test to see if we have an existing ioend structure that we could append to
* first; otherwise finish off the current ioend and start another.
@@ -1688,6 +1721,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
struct iomap_folio_state *ifs = folio->private;
size_t poff = offset_in_folio(folio, pos);
unsigned int ioend_flags = 0;
+ unsigned int thislen;
int error;
if (wpc->iomap.type == IOMAP_UNWRITTEN)
@@ -1708,11 +1742,14 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
ioend_flags);
}
- if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff))
+ thislen = iomap_should_split_ioend(wpc, pos, len);
+ if (!thislen)
+ goto new_ioend;
+ if (!bio_add_folio(&wpc->ioend->io_bio, folio, thislen, poff))
goto new_ioend;
if (ifs)
- atomic_add(len, &ifs->write_bytes_pending);
+ atomic_add(thislen, &ifs->write_bytes_pending);
/*
* Clamp io_offset and io_size to the incore EOF so that ondisk
@@ -1755,11 +1792,18 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
* Note that this defeats the ability to chain the ioends of
* appending writes.
*/
- wpc->ioend->io_size += len;
+ wpc->ioend->io_size += thislen;
if (wpc->ioend->io_offset + wpc->ioend->io_size > end_pos)
wpc->ioend->io_size = end_pos - wpc->ioend->io_offset;
- wbc_account_cgroup_owner(wbc, folio, len);
+ wbc_account_cgroup_owner(wbc, folio, thislen);
+
+ if (thislen < len) {
+ pos += thislen;
+ len -= thislen;
+ goto new_ioend;
+ }
+
return 0;
}
--
2.43.0
* Re: [PATCH 2/2] iomap: align writeback to RAID stripe boundaries
From: Dave Chinner @ 2025-07-30 0:52 UTC (permalink / raw)
To: Tony Battersby
Cc: Song Liu, Yu Kuai, Christian Brauner, Darrick J. Wong,
Matthew Wilcox (Oracle), linux-raid, linux-xfs, linux-fsdevel,
linux-kernel
On Tue, Jul 29, 2025 at 12:13:42PM -0400, Tony Battersby wrote:
> Improve writeback performance to RAID-4/5/6 by aligning writes to stripe
> boundaries. This relies on io_opt being set to the stripe size (or
> a multiple) when BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE is set.
This is the wrong layer to be pulling filesystem write alignments
from.
Filesystems already have alignment information in their on-disk
formats. XFS has stripe unit and stripe width information in the
filesystem superblock that is set by mkfs.xfs.
This information comes from the block device io-opt/io-min values
exposed to userspace at mkfs time, so the filesystem already knows
what the optimal IO alignment parameters are for the storage stack
underneath it.
Indeed, we already align extent allocations to these parameters, so
aligning filesystem writeback to the same configured alignment makes
a lot more sense than pulling random stuff from block devices during
IO submission...
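As a purely illustrative sketch of that direction (not something posted in
this thread), the alignment could be derived from the mount's own stripe
geometry; xfs_wb_align_bytes() is a made-up name, and the value would still
have to be handed to the iomap writeback path by the filesystem, which this
sketch does not show:

/*
 * Hypothetical helper: writeback alignment from the filesystem's own
 * stripe geometry.  m_swidth is the stripe width in filesystem blocks,
 * populated from the superblock at mount time.
 */
static unsigned int xfs_wb_align_bytes(struct xfs_inode *ip)
{
	struct xfs_mount	*mp = ip->i_mount;

	if (!mp->m_swidth)
		return 0;	/* no stripe geometry configured */
	return XFS_FSB_TO_B(mp, mp->m_swidth);
}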
> @@ -1685,81 +1685,118 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
> struct inode *inode, loff_t pos, loff_t end_pos,
> unsigned len)
> {
> - struct iomap_folio_state *ifs = folio->private;
> - size_t poff = offset_in_folio(folio, pos);
> - unsigned int ioend_flags = 0;
> - int error;
> -
> - if (wpc->iomap.type == IOMAP_UNWRITTEN)
> - ioend_flags |= IOMAP_IOEND_UNWRITTEN;
> - if (wpc->iomap.flags & IOMAP_F_SHARED)
> - ioend_flags |= IOMAP_IOEND_SHARED;
> - if (folio_test_dropbehind(folio))
> - ioend_flags |= IOMAP_IOEND_DONTCACHE;
> - if (pos == wpc->iomap.offset && (wpc->iomap.flags & IOMAP_F_BOUNDARY))
> - ioend_flags |= IOMAP_IOEND_BOUNDARY;
> + struct queue_limits *lim = bdev_limits(wpc->iomap.bdev);
> + unsigned int io_align =
> + (lim->features & BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE) ?
> + lim->io_opt >> SECTOR_SHIFT : 0;
i.e. this alignment should come from the filesystem, not the block
device.
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/2] iomap: align writeback to RAID stripe boundaries
From: Christoph Hellwig @ 2025-07-30 14:14 UTC (permalink / raw)
To: Tony Battersby
Cc: Song Liu, Yu Kuai, Christian Brauner, Darrick J. Wong,
Matthew Wilcox (Oracle), linux-raid, linux-xfs, linux-fsdevel,
linux-kernel
On Tue, Jul 29, 2025 at 12:13:42PM -0400, Tony Battersby wrote:
> Improve writeback performance to RAID-4/5/6 by aligning writes to stripe
> boundaries. This relies on io_opt being set to the stripe size (or
> a multiple) when BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE is set.
You're not aligning anything. You are splitting I/O, which is exactly
what we've been trying to avoid by moving to the immutable bio_vec
model that moves the splitting to the place that needs it.
> Benchmark of sequential writing to a large file on XFS using
> io_uring with 8-disk md-raid6:
> Before: 601.0 MB/s
> After: 614.5 MB/s
> Improvement: +2.3%
Looks like you need to do some work on the bio splitting in RAID.
It would help to Cc the maintainers as the driver is actually
pretty actively worked on these days.
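For reference, the md personalities already do this kind of splitting inside
the driver, roughly like the in-kernel sketch below; the function and
parameter names are illustrative only, not a proposal for this patch:

/*
 * Illustrative only: clip the front of a bio at the next stripe
 * boundary inside the driver and requeue the tail, instead of asking
 * the filesystem to cut its writeback bios.
 */
static struct bio *example_clip_to_stripe(struct bio *bio,
					  unsigned int stripe_sectors,
					  struct bio_set *bs)
{
	sector_t start = bio->bi_iter.bi_sector;
	unsigned int offset = sector_div(start, stripe_sectors);
	unsigned int fit = stripe_sectors - offset;

	if (bio_sectors(bio) > fit) {
		struct bio *split = bio_split(bio, fit, GFP_NOIO, bs);

		bio_chain(split, bio);
		submit_bio_noacct(bio);	/* tail goes back to the queue */
		return split;		/* handle the aligned front now */
	}
	return bio;
}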