* [RFC 0/2] writeback: add support for filesystems to optimize parallel writeback
@ 2025-09-14 12:11 wangyufei
2025-09-14 12:11 ` [RFC 1/2] writeback: add support for filesystems to affine inodes to specific writeback ctx wangyufei
2025-09-14 12:11 ` [RFC 2/2] xfs: implement get_inode_wb_ctx_idx() for per-AG parallel writeback wangyufei
0 siblings, 2 replies; 5+ messages in thread
From: wangyufei @ 2025-09-14 12:11 UTC (permalink / raw)
To: viro, brauner, jack, cem
Cc: kundan.kumar, anuj20.g, hch, bernd, djwong, david, linux-kernel,
linux-xfs, linux-fsdevel, opensource.kernel, wangyufei
Based on our parallel writeback testing on XFS [1] and prior discussions,
we believe that the characteristics and architecture of the underlying
filesystem must be considered to optimize parallel writeback performance.
We introduce a filesystem interface to control the assignment of inodes
to writeback contexts based on the following insights:
- Following Dave's earlier suggestion [2], filesystems should determine
both the number of writeback contexts and how inodes are assigned to them.
Therefore, we provide an interface for filesystems to customize their
inode assignment strategy for writeback.
- Instead of dynamically resizing the number of writeback contexts at
filesystem initialization time, we let each filesystem determine how many
contexts it requires and push inodes only to those designated contexts.
To implement this, we have made the following changes:
- Introduce get_inode_wb_ctx_idx() in super_operations, called from
fetch_bdi_writeback_ctx(), allowing a filesystem to provide the writeback
context index for a given inode. This generic interface can be extended to
all filesystems.
- Implement the XFS adaptation: to reduce contention during delayed
allocation, all inodes from the same allocation group are bound to a
single writeback context.
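The dispatch described above can be modeled in userspace C. This is a hypothetical sketch, not the kernel implementation: the struct, the example policy, and pick_wb_ctx_idx() are invented stand-ins that mirror the patch's logic (use the filesystem hook if present, otherwise fall back to inode number modulo the context count).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; names follow the patch but none of
 * this is the actual kernel code. */
struct model_super_ops {
	unsigned int (*get_inode_wb_ctx_idx)(unsigned long ino, int nr_wb_ctx);
};

/* Example filesystem policy: confine inodes to the first two contexts. */
static unsigned int fs_policy(unsigned long ino, int nr_wb_ctx)
{
	(void)nr_wb_ctx;
	return ino % 2;
}

/* Mirrors fetch_bdi_writeback_ctx(): hook if implemented, else the
 * default ino % nr_wb_ctx round-robin. */
static unsigned int pick_wb_ctx_idx(const struct model_super_ops *ops,
				    unsigned long ino, int nr_wb_ctx)
{
	if (ops && ops->get_inode_wb_ctx_idx)
		return ops->get_inode_wb_ctx_idx(ino, nr_wb_ctx);
	return ino % nr_wb_ctx;
}
```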
Using the test setup from [1], we obtained the following results. Our
approach achieves throughput similar to nr_wb_ctx=4 but shows no further
improvement. Perf data collected afterwards shows that lock contention
during delayed allocation remains unresolved.
System config:
Number of CPUs = 8
System RAM = 4G
For XFS number of AGs = 4
Used NVMe SSD of 20GB (emulated via QEMU)
Result:
Default:
Parallel Writeback (nr_wb_ctx = 1) : 16.4MiB/s
Parallel Writeback (nr_wb_ctx = 2) : 32.3MiB/s
Parallel Writeback (nr_wb_ctx = 3) : 39.0MiB/s
Parallel Writeback (nr_wb_ctx = 4) : 47.3MiB/s
Parallel Writeback (nr_wb_ctx = 5) : 45.7MiB/s
Parallel Writeback (nr_wb_ctx = 6) : 46.0MiB/s
Parallel Writeback (nr_wb_ctx = 7) : 42.7MiB/s
Parallel Writeback (nr_wb_ctx = 8) : 40.8MiB/s
After optimization (4 AGs utilized):
Parallel Writeback (nr_wb_ctx = 8) : 47.1MiB/s (4 active contexts)
These results lead to the following discussions:
1. How can we design workloads that better expose the lock contention in
delayed allocation?
2. Given the lack of performance improvement, is there an oversight or
misunderstanding in our implementation of the XFS interface, or is there
some other performance bottleneck?
[1]
https://lore.kernel.org/linux-fsdevel/CALYkqXpOBb1Ak2kEKWbO2Kc5NaGwb4XsX1q4eEaNWmO_4SQq9w@mail.gmail.com/
[2]
https://lore.kernel.org/linux-fsdevel/Z5qw_1BOqiFum5Dn@dread.disaster.area/
wangyufei (2):
writeback: add support for filesystems to affine inodes to specific
writeback ctx
xfs: implement get_inode_wb_ctx_idx() for per-AG parallel writeback
fs/xfs/xfs_super.c | 14 ++++++++++++++
include/linux/backing-dev.h | 3 +++
include/linux/fs.h | 1 +
3 files changed, 18 insertions(+)
--
2.34.1
^ permalink raw reply [flat|nested] 5+ messages in thread

* [RFC 1/2] writeback: add support for filesystems to affine inodes to specific writeback ctx
2025-09-14 12:11 [RFC 0/2] writeback: add support for filesystems to optimize parallel writeback wangyufei
@ 2025-09-14 12:11 ` wangyufei
2025-09-14 12:11 ` [RFC 2/2] xfs: implement get_inode_wb_ctx_idx() for per-AG parallel writeback wangyufei
1 sibling, 0 replies; 5+ messages in thread
From: wangyufei @ 2025-09-14 12:11 UTC (permalink / raw)
To: viro, brauner, jack, cem
Cc: kundan.kumar, anuj20.g, hch, bernd, djwong, david, linux-kernel,
linux-xfs, linux-fsdevel, opensource.kernel, wangyufei
Introduce a new superblock operation get_inode_wb_ctx_idx() to allow
filesystems to decide how inodes are assigned to writeback contexts.
This helps optimize parallel writeback performance based on the
underlying filesystem architecture.
In fetch_bdi_writeback_ctx(), if this operation is implemented by the
filesystem, it is used to pick the writeback context for a given inode.
Otherwise, fall back to the default assignment based on the inode number.
Signed-off-by: wangyufei <wangyufei@vivo.com>
---
include/linux/backing-dev.h | 3 +++
include/linux/fs.h | 1 +
2 files changed, 4 insertions(+)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c93509f5e..d02536f6e 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -144,7 +144,10 @@ static inline struct bdi_writeback_ctx *
fetch_bdi_writeback_ctx(struct inode *inode)
{
struct backing_dev_info *bdi = inode_to_bdi(inode);
+ struct super_block *sb = inode->i_sb;
+ if (sb->s_op->get_inode_wb_ctx_idx)
+ return bdi->wb_ctx_arr[sb->s_op->get_inode_wb_ctx_idx(inode, bdi->nr_wb_ctx)];
return bdi->wb_ctx_arr[inode->i_ino % bdi->nr_wb_ctx];
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6c07228bd..fad7a75fd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2491,6 +2491,7 @@ struct super_operations {
*/
int (*remove_bdev)(struct super_block *sb, struct block_device *bdev);
void (*shutdown)(struct super_block *sb);
+ unsigned int (*get_inode_wb_ctx_idx)(struct inode *inode, int nr_wb_ctx);
};
/*
--
2.34.1
* [RFC 2/2] xfs: implement get_inode_wb_ctx_idx() for per-AG parallel writeback
2025-09-14 12:11 [RFC 0/2] writeback: add support for filesystems to optimize parallel writeback wangyufei
2025-09-14 12:11 ` [RFC 1/2] writeback: add support for filesystems to affine inodes to specific writeback ctx wangyufei
@ 2025-09-14 12:11 ` wangyufei
2025-09-22 16:56 ` Christoph Hellwig
1 sibling, 1 reply; 5+ messages in thread
From: wangyufei @ 2025-09-14 12:11 UTC (permalink / raw)
To: viro, brauner, jack, cem
Cc: kundan.kumar, anuj20.g, hch, bernd, djwong, david, linux-kernel,
linux-xfs, linux-fsdevel, opensource.kernel, wangyufei
The number of writeback contexts defaults to the number of CPUs. Allow
XFS to decide how to assign inodes to writeback contexts based on its
allocation groups.
Implement get_inode_wb_ctx_idx() in xfs_super_operations as follows:
- Limit the number of active writeback contexts to the number of AGs.
- Assign inodes from the same AG to a unique writeback context.
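The mapping above can be sketched as a small userspace model. The AGINO_LOG shift is an assumed stand-in: in real XFS, XFS_INO_TO_AGNO() derives the AG number from the inode number using the superblock's geometry, which is not reproduced here.

```c
#include <assert.h>

/* Assumed shift standing in for XFS's real inode-to-AG geometry. */
enum { AGINO_LOG = 20 };

/* Userspace model of the patch's policy: one context per AG when enough
 * contexts exist, otherwise fall back to round-robin. */
static unsigned int wb_ctx_idx(unsigned long ino, unsigned int agcount,
			       int nr_wb_ctx)
{
	unsigned int agno = (unsigned int)(ino >> AGINO_LOG); /* owning AG */

	if (agcount <= (unsigned int)nr_wb_ctx)
		return agno;		/* bind the whole AG to one context */
	return ino % nr_wb_ctx;		/* too few contexts: spread evenly */
}
```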
Signed-off-by: wangyufei <wangyufei@vivo.com>
---
fs/xfs/xfs_super.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 77acb3e5a..156df0397 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1279,6 +1279,19 @@ xfs_fs_show_stats(
return 0;
}
+static unsigned int
+xfs_fs_get_inode_wb_ctx_idx(
+ struct inode *inode,
+ int nr_wb_ctx)
+{
+ struct xfs_inode *xfs_inode = XFS_I(inode);
+ struct xfs_mount *mp = XFS_M(inode->i_sb);
+
+ if (mp->m_sb.sb_agcount <= nr_wb_ctx)
+ return XFS_INO_TO_AGNO(mp, xfs_inode->i_ino);
+ return xfs_inode->i_ino % nr_wb_ctx;
+}
+
static const struct super_operations xfs_super_operations = {
.alloc_inode = xfs_fs_alloc_inode,
.destroy_inode = xfs_fs_destroy_inode,
@@ -1295,6 +1308,7 @@ static const struct super_operations xfs_super_operations = {
.free_cached_objects = xfs_fs_free_cached_objects,
.shutdown = xfs_fs_shutdown,
.show_stats = xfs_fs_show_stats,
+ .get_inode_wb_ctx_idx = xfs_fs_get_inode_wb_ctx_idx,
};
static int
--
2.34.1
* Re: [RFC 2/2] xfs: implement get_inode_wb_ctx_idx() for per-AG parallel writeback
2025-09-14 12:11 ` [RFC 2/2] xfs: implement get_inode_wb_ctx_idx() for per-AG parallel writeback wangyufei
@ 2025-09-22 16:56 ` Christoph Hellwig
2025-09-23 18:46 ` Darrick J. Wong
0 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2025-09-22 16:56 UTC (permalink / raw)
To: wangyufei
Cc: viro, brauner, jack, cem, kundan.kumar, anuj20.g, hch, bernd,
djwong, david, linux-kernel, linux-xfs, linux-fsdevel,
opensource.kernel
On Sun, Sep 14, 2025 at 08:11:09PM +0800, wangyufei wrote:
> The number of writeback contexts is set to the number of CPUs by
> default. This allows XFS to decide how to assign inodes to writeback
> contexts based on its allocation groups.
>
> Implement get_inode_wb_ctx_idx() in xfs_super_operations as follows:
> - Limit the number of active writeback contexts to the number of AGs.
> - Assign inodes from the same AG to a unique writeback context.
I'm not sure this actually works. Data is spread over AGs, just with
a default to the parent inode AG if there is space, and even that isn't
true for the inode32 option or when using the RT subvolume.
> +
> + if (mp->m_sb.sb_agcount <= nr_wb_ctx)
> + return XFS_INO_TO_AGNO(mp, xfs_inode->i_ino);
> + return xfs_inode->i_ino % nr_wb_ctx;
> +}
> +
> static const struct super_operations xfs_super_operations = {
> .alloc_inode = xfs_fs_alloc_inode,
> .destroy_inode = xfs_fs_destroy_inode,
> @@ -1295,6 +1308,7 @@ static const struct super_operations xfs_super_operations = {
> .free_cached_objects = xfs_fs_free_cached_objects,
> .shutdown = xfs_fs_shutdown,
> .show_stats = xfs_fs_show_stats,
> + .get_inode_wb_ctx_idx = xfs_fs_get_inode_wb_ctx_idx,
> };
>
> static int
> --
> 2.34.1
---end quoted text---
* Re: [RFC 2/2] xfs: implement get_inode_wb_ctx_idx() for per-AG parallel writeback
2025-09-22 16:56 ` Christoph Hellwig
@ 2025-09-23 18:46 ` Darrick J. Wong
0 siblings, 0 replies; 5+ messages in thread
From: Darrick J. Wong @ 2025-09-23 18:46 UTC (permalink / raw)
To: Christoph Hellwig
Cc: wangyufei, viro, brauner, jack, cem, kundan.kumar, anuj20.g,
bernd, david, linux-kernel, linux-xfs, linux-fsdevel,
opensource.kernel
On Mon, Sep 22, 2025 at 06:56:42PM +0200, Christoph Hellwig wrote:
> On Sun, Sep 14, 2025 at 08:11:09PM +0800, wangyufei wrote:
> > The number of writeback contexts is set to the number of CPUs by
> > default. This allows XFS to decide how to assign inodes to writeback
> > contexts based on its allocation groups.
> >
> > Implement get_inode_wb_ctx_idx() in xfs_super_operations as follows:
> > - Limit the number of active writeback contexts to the number of AGs.
> > - Assign inodes from the same AG to a unique writeback context.
>
> I'm not sure this actually works. Data is spread over AGs, just with
> a default to the parent inode AG if there is space, and even that isn't
> true for the inode32 option or when using the RT subvolume.
I don't know of a better way to shard cheaply -- if you could group
inodes dynamically by a rough estimate of the AGs that map to the dirty
data (especially delalloc/unwritten/cow mappings) then that would be an
improvement, but that's still far from what I would consider the ideal.
Ideally (maybe?) one could shard dirty ranges first by the amount of
effort (pure overwrite; secondly backed-by-unwritten; thirdly
delalloc/cow). The first two groups could then be sharded by AG and
issued in parallel. The third group involve so much metadata changes
that you could probably just shard evenly across CPUs. Writebacks get
initiated in that order, and then we see where the bottlenecks lie in
ioend completion.
(But that's just my hazy untested brai^Widea :P)
--D
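[Editor's sketch: the three-tier sharding idea above, as a hypothetical userspace model. The enum, struct, and field names are all invented; this only illustrates the classify-then-shard shape, not any real XFS data structure.]

```c
#include <assert.h>

/* Effort classes from the suggestion above: cheapest first. */
enum wb_effort { WB_OVERWRITE, WB_UNWRITTEN, WB_DELALLOC_COW };

struct dirty_range {
	enum wb_effort effort;	/* kind of extent backing the range */
	unsigned int agno;	/* AG owning the range's blocks */
	unsigned long off;	/* file offset, used for CPU spreading */
};

/* Pure-overwrite and unwritten-backed ranges shard by AG and can be
 * issued in parallel per group; delalloc/CoW ranges, being metadata
 * heavy, shard evenly across CPUs instead. */
static unsigned int shard_for(const struct dirty_range *r,
			      unsigned int nr_ag_queues,
			      unsigned int nr_cpus)
{
	if (r->effort == WB_OVERWRITE || r->effort == WB_UNWRITTEN)
		return r->agno % nr_ag_queues;
	return (unsigned int)(r->off % nr_cpus);
}
```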
> > +
> > + if (mp->m_sb.sb_agcount <= nr_wb_ctx)
> > + return XFS_INO_TO_AGNO(mp, xfs_inode->i_ino);
> > + return xfs_inode->i_ino % nr_wb_ctx;
> > +}
> > +
> > static const struct super_operations xfs_super_operations = {
> > .alloc_inode = xfs_fs_alloc_inode,
> > .destroy_inode = xfs_fs_destroy_inode,
> > @@ -1295,6 +1308,7 @@ static const struct super_operations xfs_super_operations = {
> > .free_cached_objects = xfs_fs_free_cached_objects,
> > .shutdown = xfs_fs_shutdown,
> > .show_stats = xfs_fs_show_stats,
> > + .get_inode_wb_ctx_idx = xfs_fs_get_inode_wb_ctx_idx,
> > };
> >
> > static int
> > --
> > 2.34.1
> ---end quoted text---
>