* [PATCH 00/13] Parallelizing filesystem writeback
       [not found] <CGME20250529113215epcas5p2edd67e7b129621f386be005fdba53378@epcas5p2.samsung.com>
@ 2025-05-29 11:14 ` Kundan Kumar
       [not found]   ` <CGME20250529113219epcas5p4d8ccb25ea910faea7120f092623f321d@epcas5p4.samsung.com>
                     ` (14 more replies)
  0 siblings, 15 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar

Currently, pagecache writeback is performed by a single thread. Inodes
are added to a dirty list, and delayed writeback is triggered. The single
writeback thread then iterates through the dirty inode list and performs
writeback on each inode.
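
In today's code this is driven by a single bdi_writeback per bdi; as a
rough sketch of the existing flow (simplified, real function names from
fs/fs-writeback.c and mm/backing-dev.c):

	__mark_inode_dirty(inode, I_DIRTY);  /* inode lands on wb->b_dirty */
	wb_wakeup_delayed(wb);               /* schedules wb->dwork */
	/* wb_workfn() later walks the dirty list and issues writeback */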

This series parallelizes writeback by allowing multiple writeback
contexts per backing device (bdi). These writeback contexts are executed
as separate, independent threads, improving overall parallelism.

We would love to hear feedback in order to move this effort forward.

Design Overview
================
Following Jan Kara's suggestion [1], we have introduced a new bdi
writeback context within the backing_dev_info structure. Specifically,
we have created a new structure, bdi_writeback_ctx, which contains its
own set of members for each writeback context.

struct bdi_writeback_ctx {
        struct bdi_writeback wb;
        struct list_head wb_list; /* list of all wbs */
        struct radix_tree_root cgwb_tree;
        struct rw_semaphore wb_switch_rwsem;
        wait_queue_head_t wb_waitq;
};

A bdi can have multiple writeback contexts, which enables writeback
parallelism.

struct backing_dev_info {
...
        int nr_wb_ctx;
        struct bdi_writeback_ctx **wb_ctx_arr;
...
};
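
Code that used to operate on the bdi's single embedded wb now walks all
of these contexts; patch 2 of the series adds an iterator for this:

#define for_each_bdi_wb_ctx(bdi, wb_ctx) \
	for (int __i = 0; __i < (bdi)->nr_wb_ctx \
		&& ((wb_ctx) = (bdi)->wb_ctx_arr[__i]) != NULL; __i++)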

FS geometry and filesystem fragmentation
========================================
The community was concerned that parallelizing writeback would impact
delayed allocation and increase filesystem fragmentation.
Our analysis of XFS delayed allocation behavior showed that merging of
extents occurs within a specific inode. Earlier experiments with multiple
writeback contexts [2] resulted in increased fragmentation due to the
same inode being processed by different threads.

To address this, we now affine an inode to a specific writeback context,
ensuring that delayed allocation works effectively.
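
The affinity is a simple mapping by inode number, added as
fetch_bdi_writeback_ctx() in patch 4:

static inline struct bdi_writeback_ctx *
fetch_bdi_writeback_ctx(struct inode *inode)
{
	struct backing_dev_info *bdi = inode_to_bdi(inode);

	/* the same inode always maps to the same writeback context */
	return bdi->wb_ctx_arr[inode->i_ino % bdi->nr_wb_ctx];
}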

Number of writeback contexts
============================
The long-term plan is for nr_wb_ctx to default to 1, preserving the
current single-threaded behavior. In this version, however, we set the
number of writeback contexts equal to the number of CPUs. Later we will
make it configurable using a mount option, allowing filesystems to
choose the optimal number of writeback contexts.
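
As a sketch, assuming the exact form used in patch 13, the current
default amounts to:

	bdi->nr_wb_ctx = num_online_cpus();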

IOPS and throughput
===================
We see significant improvements in throughput across several filesystems
on both PMEM and NVMe devices.

Performance gains:
  - On PMEM:
	Base XFS		: 544 MiB/s
	Parallel Writeback XFS	: 1015 MiB/s  (+86%)
	Base EXT4		: 536 MiB/s
	Parallel Writeback EXT4	: 1047 MiB/s  (+95%)

  - On NVMe:
	Base XFS		: 651 MiB/s
	Parallel Writeback XFS	: 808 MiB/s  (+24%)
	Base EXT4		: 494 MiB/s
	Parallel Writeback EXT4	: 797 MiB/s  (+61%)

We also see that there is no increase in filesystem fragmentation; the
extent counts in fact decrease.
# of extents:
  - On XFS (on PMEM):
	Base XFS		: 1964
	Parallel Writeback XFS	: 1384

  - On EXT4 (on PMEM):
	Base EXT4		: 21
	Parallel Writeback EXT4	: 11

[1] Jan Kara's suggestion:
https://lore.kernel.org/all/gamxtewl5yzg4xwu7lpp7obhp44xh344swvvf7tmbiknvbd3ww@jowphz4h4zmb/
[2] Writeback using unaffined N (# of CPUs) threads:
https://lore.kernel.org/all/20250414102824.9901-1-kundan.kumar@samsung.com/

Kundan Kumar (13):
  writeback: add infra for parallel writeback
  writeback: add support to initialize and free multiple writeback ctxs
  writeback: link bdi_writeback to its corresponding bdi_writeback_ctx
  writeback: affine inode to a writeback ctx within a bdi
  writeback: modify bdi_writeback search logic to search across all wb
    ctxs
  writeback: invoke all writeback contexts for flusher and dirtytime
    writeback
  writeback: modify sync related functions to iterate over all writeback
    contexts
  writeback: add support to collect stats for all writeback ctxs
  f2fs: add support in f2fs to handle multiple writeback contexts
  fuse: add support for multiple writeback contexts in fuse
  gfs2: add support in gfs2 to handle multiple writeback contexts
  nfs: add support in nfs to handle multiple writeback contexts
  writeback: set the num of writeback contexts to number of online cpus

 fs/f2fs/node.c                   |  11 +-
 fs/f2fs/segment.h                |   7 +-
 fs/fs-writeback.c                | 146 +++++++++++++-------
 fs/fuse/file.c                   |   9 +-
 fs/gfs2/super.c                  |  11 +-
 fs/nfs/internal.h                |   4 +-
 fs/nfs/write.c                   |   5 +-
 include/linux/backing-dev-defs.h |  32 +++--
 include/linux/backing-dev.h      |  45 +++++--
 include/linux/fs.h               |   1 -
 mm/backing-dev.c                 | 225 ++++++++++++++++++++-----------
 mm/page-writeback.c              |   5 +-
 12 files changed, 333 insertions(+), 168 deletions(-)

-- 
2.25.1



* [PATCH 01/13] writeback: add infra for parallel writeback
       [not found]   ` <CGME20250529113219epcas5p4d8ccb25ea910faea7120f092623f321d@epcas5p4.samsung.com>
@ 2025-05-29 11:14     ` Kundan Kumar
  0 siblings, 0 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

This is a prep patch which introduces a new bdi_writeback_ctx structure
that enables us to have multiple writeback contexts for parallel
writeback. Each bdi can now have multiple writeback contexts, each with
its own cgwb tree.

Modify all the functions/places that operate on the bdi's wb, wb_list,
cgwb_tree, wb_switch_rwsem and wb_waitq, as these fields have now been
moved to bdi_writeback_ctx.

This patch mechanically replaces bdi->wb with bdi->wb_ctx_arr[0]->wb;
there is no functional change.
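
For example, cgroup wb lookup and creation now take the writeback
context explicitly; the diff below contains call sites such as:

	wb = wb_get_lookup(bdi->wb_ctx_arr[0], memcg_css);
	wb = wb_get_create(bdi, bdi->wb_ctx_arr[0], memcg_css, GFP_ATOMIC);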

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
 fs/f2fs/node.c                   |   4 +-
 fs/f2fs/segment.h                |   2 +-
 fs/fs-writeback.c                |  78 +++++++++++++--------
 fs/fuse/file.c                   |   6 +-
 fs/gfs2/super.c                  |   2 +-
 fs/nfs/internal.h                |   3 +-
 fs/nfs/write.c                   |   3 +-
 include/linux/backing-dev-defs.h |  32 +++++----
 include/linux/backing-dev.h      |  41 +++++++----
 include/linux/fs.h               |   1 -
 mm/backing-dev.c                 | 113 +++++++++++++++++++------------
 mm/page-writeback.c              |   5 +-
 12 files changed, 179 insertions(+), 111 deletions(-)

diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 5f15c224bf78..4b6568cd5bef 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -73,7 +73,7 @@ bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type)
 		if (excess_cached_nats(sbi))
 			res = false;
 	} else if (type == DIRTY_DENTS) {
-		if (sbi->sb->s_bdi->wb.dirty_exceeded)
+		if (sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded)
 			return false;
 		mem_size = get_pages(sbi, F2FS_DIRTY_DENTS);
 		res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1);
@@ -114,7 +114,7 @@ bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type)
 		res = false;
 #endif
 	} else {
-		if (!sbi->sb->s_bdi->wb.dirty_exceeded)
+		if (!sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded)
 			return true;
 	}
 	return res;
diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
index 0465dc00b349..a525ccd4cfc8 100644
--- a/fs/f2fs/segment.h
+++ b/fs/f2fs/segment.h
@@ -936,7 +936,7 @@ static inline bool sec_usage_check(struct f2fs_sb_info *sbi, unsigned int secno)
  */
 static inline int nr_pages_to_skip(struct f2fs_sb_info *sbi, int type)
 {
-	if (sbi->sb->s_bdi->wb.dirty_exceeded)
+	if (sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded)
 		return 0;
 
 	if (type == DATA)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index cc57367fb641..0959fff46235 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -265,23 +265,26 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct bdi_writeback *wb = NULL;
+	struct bdi_writeback_ctx *bdi_writeback_ctx = bdi->wb_ctx_arr[0];
 
 	if (inode_cgwb_enabled(inode)) {
 		struct cgroup_subsys_state *memcg_css;
 
 		if (folio) {
 			memcg_css = mem_cgroup_css_from_folio(folio);
-			wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+			wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css,
+					   GFP_ATOMIC);
 		} else {
 			/* must pin memcg_css, see wb_get_create() */
 			memcg_css = task_get_css(current, memory_cgrp_id);
-			wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+			wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css,
+					   GFP_ATOMIC);
 			css_put(memcg_css);
 		}
 	}
 
 	if (!wb)
-		wb = &bdi->wb;
+		wb = &bdi_writeback_ctx->wb;
 
 	/*
 	 * There may be multiple instances of this function racing to
@@ -307,7 +310,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
 	WARN_ON_ONCE(inode->i_state & I_FREEING);
 
 	inode->i_state &= ~I_SYNC_QUEUED;
-	if (wb != &wb->bdi->wb)
+	if (wb != &wb->bdi_wb_ctx->wb)
 		list_move(&inode->i_io_list, &wb->b_attached);
 	else
 		list_del_init(&inode->i_io_list);
@@ -382,14 +385,16 @@ struct inode_switch_wbs_context {
 	struct inode		*inodes[];
 };
 
-static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi)
+static void
+bdi_down_write_wb_ctx_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx)
 {
-	down_write(&bdi->wb_switch_rwsem);
+	down_write(&bdi_wb_ctx->wb_switch_rwsem);
 }
 
-static void bdi_up_write_wb_switch_rwsem(struct backing_dev_info *bdi)
+static void
+bdi_up_write_wb_ctx_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx)
 {
-	up_write(&bdi->wb_switch_rwsem);
+	up_write(&bdi_wb_ctx->wb_switch_rwsem);
 }
 
 static bool inode_do_switch_wbs(struct inode *inode,
@@ -490,7 +495,8 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
 {
 	struct inode_switch_wbs_context *isw =
 		container_of(to_rcu_work(work), struct inode_switch_wbs_context, work);
-	struct backing_dev_info *bdi = inode_to_bdi(isw->inodes[0]);
+	struct bdi_writeback_ctx *bdi_wb_ctx =
+		fetch_bdi_writeback_ctx(isw->inodes[0]);
 	struct bdi_writeback *old_wb = isw->inodes[0]->i_wb;
 	struct bdi_writeback *new_wb = isw->new_wb;
 	unsigned long nr_switched = 0;
@@ -500,7 +506,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
 	 * If @inode switches cgwb membership while sync_inodes_sb() is
 	 * being issued, sync_inodes_sb() might miss it.  Synchronize.
 	 */
-	down_read(&bdi->wb_switch_rwsem);
+	down_read(&bdi_wb_ctx->wb_switch_rwsem);
 
 	/*
 	 * By the time control reaches here, RCU grace period has passed
@@ -529,7 +535,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
 	spin_unlock(&new_wb->list_lock);
 	spin_unlock(&old_wb->list_lock);
 
-	up_read(&bdi->wb_switch_rwsem);
+	up_read(&bdi_wb_ctx->wb_switch_rwsem);
 
 	if (nr_switched) {
 		wb_wakeup(new_wb);
@@ -583,6 +589,7 @@ static bool inode_prepare_wbs_switch(struct inode *inode,
 static void inode_switch_wbs(struct inode *inode, int new_wb_id)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode);
 	struct cgroup_subsys_state *memcg_css;
 	struct inode_switch_wbs_context *isw;
 
@@ -609,7 +616,7 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
 	if (!memcg_css)
 		goto out_free;
 
-	isw->new_wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+	isw->new_wb = wb_get_create(bdi, bdi_wb_ctx, memcg_css, GFP_ATOMIC);
 	css_put(memcg_css);
 	if (!isw->new_wb)
 		goto out_free;
@@ -678,12 +685,14 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb)
 
 	for (memcg_css = wb->memcg_css->parent; memcg_css;
 	     memcg_css = memcg_css->parent) {
-		isw->new_wb = wb_get_create(wb->bdi, memcg_css, GFP_KERNEL);
+		isw->new_wb = wb_get_create(wb->bdi, wb->bdi_wb_ctx,
+					    memcg_css, GFP_KERNEL);
 		if (isw->new_wb)
 			break;
 	}
+	/* wb_get() is noop for bdi's wb */
 	if (unlikely(!isw->new_wb))
-		isw->new_wb = &wb->bdi->wb; /* wb_get() is noop for bdi's wb */
+		isw->new_wb = &wb->bdi_wb_ctx->wb;
 
 	nr = 0;
 	spin_lock(&wb->list_lock);
@@ -994,18 +1003,19 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
  * total active write bandwidth of @bdi.
  */
 static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
+				  struct bdi_writeback_ctx *bdi_wb_ctx,
 				  struct wb_writeback_work *base_work,
 				  bool skip_if_busy)
 {
 	struct bdi_writeback *last_wb = NULL;
-	struct bdi_writeback *wb = list_entry(&bdi->wb_list,
+	struct bdi_writeback *wb = list_entry(&bdi_wb_ctx->wb_list,
 					      struct bdi_writeback, bdi_node);
 
 	might_sleep();
 restart:
 	rcu_read_lock();
-	list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
-		DEFINE_WB_COMPLETION(fallback_work_done, bdi);
+	list_for_each_entry_continue_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) {
+		DEFINE_WB_COMPLETION(fallback_work_done, bdi_wb_ctx);
 		struct wb_writeback_work fallback_work;
 		struct wb_writeback_work *work;
 		long nr_pages;
@@ -1103,7 +1113,7 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id,
 	 * And find the associated wb.  If the wb isn't there already
 	 * there's nothing to flush, don't create one.
 	 */
-	wb = wb_get_lookup(bdi, memcg_css);
+	wb = wb_get_lookup(bdi->wb_ctx_arr[0], memcg_css);
 	if (!wb) {
 		ret = -ENOENT;
 		goto out_css_put;
@@ -1189,8 +1199,13 @@ fs_initcall(cgroup_writeback_init);
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
-static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi) { }
-static void bdi_up_write_wb_switch_rwsem(struct backing_dev_info *bdi) { }
+static void
+bdi_down_write_wb_ctx_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx)
+{ }
+
+static void
+bdi_up_write_wb_ctx_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx)
+{ }
 
 static void inode_cgwb_move_to_attached(struct inode *inode,
 					struct bdi_writeback *wb)
@@ -1231,14 +1246,15 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
 }
 
 static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
+				  struct bdi_writeback_ctx *bdi_wb_ctx,
 				  struct wb_writeback_work *base_work,
 				  bool skip_if_busy)
 {
 	might_sleep();
 
-	if (!skip_if_busy || !writeback_in_progress(&bdi->wb)) {
+	if (!skip_if_busy || !writeback_in_progress(&bdi_wb_ctx->wb)) {
 		base_work->auto_free = 0;
-		wb_queue_work(&bdi->wb, base_work);
+		wb_queue_work(&bdi_wb_ctx->wb, base_work);
 	}
 }
 
@@ -2371,7 +2387,7 @@ static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
 	if (!bdi_has_dirty_io(bdi))
 		return;
 
-	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
+	list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, bdi_node)
 		wb_start_writeback(wb, reason);
 }
 
@@ -2427,7 +2443,8 @@ static void wakeup_dirtytime_writeback(struct work_struct *w)
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
 		struct bdi_writeback *wb;
 
-		list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
+		list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list,
+					bdi_node)
 			if (!list_empty(&wb->b_dirty_time))
 				wb_wakeup(wb);
 	}
@@ -2729,7 +2746,7 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
 				     enum wb_reason reason, bool skip_if_busy)
 {
 	struct backing_dev_info *bdi = sb->s_bdi;
-	DEFINE_WB_COMPLETION(done, bdi);
+	DEFINE_WB_COMPLETION(done, bdi->wb_ctx_arr[0]);
 	struct wb_writeback_work work = {
 		.sb			= sb,
 		.sync_mode		= WB_SYNC_NONE,
@@ -2743,7 +2760,8 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy);
+	bdi_split_work_to_wbs(sb->s_bdi, bdi->wb_ctx_arr[0], &work,
+			      skip_if_busy);
 	wb_wait_for_completion(&done);
 }
 
@@ -2807,7 +2825,7 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
 void sync_inodes_sb(struct super_block *sb)
 {
 	struct backing_dev_info *bdi = sb->s_bdi;
-	DEFINE_WB_COMPLETION(done, bdi);
+	DEFINE_WB_COMPLETION(done, bdi->wb_ctx_arr[0]);
 	struct wb_writeback_work work = {
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_ALL,
@@ -2828,10 +2846,10 @@ void sync_inodes_sb(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	/* protect against inode wb switch, see inode_switch_wbs_work_fn() */
-	bdi_down_write_wb_switch_rwsem(bdi);
-	bdi_split_work_to_wbs(bdi, &work, false);
+	bdi_down_write_wb_ctx_switch_rwsem(bdi->wb_ctx_arr[0]);
+	bdi_split_work_to_wbs(bdi, bdi->wb_ctx_arr[0], &work, false);
 	wb_wait_for_completion(&done);
-	bdi_up_write_wb_switch_rwsem(bdi);
+	bdi_up_write_wb_ctx_switch_rwsem(bdi->wb_ctx_arr[0]);
 
 	wait_sb_inodes(sb);
 }
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 754378dd9f71..7817219d1599 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1853,9 +1853,9 @@ static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 
-	dec_wb_stat(&bdi->wb, WB_WRITEBACK);
+	dec_wb_stat(&bdi->wb_ctx_arr[0]->wb, WB_WRITEBACK);
 	node_stat_sub_folio(folio, NR_WRITEBACK_TEMP);
-	wb_writeout_inc(&bdi->wb);
+	wb_writeout_inc(&bdi->wb_ctx_arr[0]->wb);
 }
 
 static void fuse_writepage_finish(struct fuse_writepage_args *wpa)
@@ -2142,7 +2142,7 @@ static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struc
 	ap->descs[folio_index].offset = 0;
 	ap->descs[folio_index].length = PAGE_SIZE;
 
-	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
+	inc_wb_stat(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb, WB_WRITEBACK);
 	node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP);
 }
 
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 44e5658b896c..dfc83bd3def3 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -457,7 +457,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
 		gfs2_log_flush(GFS2_SB(inode), ip->i_gl,
 			       GFS2_LOG_HEAD_FLUSH_NORMAL |
 			       GFS2_LFC_WRITE_INODE);
-	if (bdi->wb.dirty_exceeded)
+	if (bdi->wb_ctx_arr[0]->wb.dirty_exceeded)
 		gfs2_ail1_flush(sdp, wbc);
 	else
 		filemap_fdatawrite(metamapping);
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 6655e5f32ec6..fd513bf9e875 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -844,7 +844,8 @@ static inline void nfs_folio_mark_unstable(struct folio *folio,
 		 * writeback is happening on the server now.
 		 */
 		node_stat_mod_folio(folio, NR_WRITEBACK, nr);
-		wb_stat_mod(&inode_to_bdi(inode)->wb, WB_WRITEBACK, nr);
+		wb_stat_mod(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb,
+			    WB_WRITEBACK, nr);
 		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 	}
 }
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 23df8b214474..ec48ec8c2db8 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -932,9 +932,10 @@ static void nfs_folio_clear_commit(struct folio *folio)
 {
 	if (folio) {
 		long nr = folio_nr_pages(folio);
+		struct inode *inode = folio->mapping->host;
 
 		node_stat_mod_folio(folio, NR_WRITEBACK, -nr);
-		wb_stat_mod(&inode_to_bdi(folio->mapping->host)->wb,
+		wb_stat_mod(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb,
 			    WB_WRITEBACK, -nr);
 	}
 }
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 2ad261082bba..ec0dd8df1a8c 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -75,10 +75,11 @@ struct wb_completion {
  * can wait for the completion of all using wb_wait_for_completion().  Work
  * items which are waited upon aren't freed automatically on completion.
  */
-#define WB_COMPLETION_INIT(bdi)		__WB_COMPLETION_INIT(&(bdi)->wb_waitq)
+#define WB_COMPLETION_INIT(bdi_wb_ctx)	\
+	__WB_COMPLETION_INIT(&(bdi_wb_ctx)->wb_waitq)
 
-#define DEFINE_WB_COMPLETION(cmpl, bdi)	\
-	struct wb_completion cmpl = WB_COMPLETION_INIT(bdi)
+#define DEFINE_WB_COMPLETION(cmpl, bdi_wb_ctx)	\
+	struct wb_completion cmpl = WB_COMPLETION_INIT(bdi_wb_ctx)
 
 /*
  * Each wb (bdi_writeback) can perform writeback operations, is measured
@@ -104,6 +105,7 @@ struct wb_completion {
  */
 struct bdi_writeback {
 	struct backing_dev_info *bdi;	/* our parent bdi */
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 
 	unsigned long state;		/* Always use atomic bitops on this */
 	unsigned long last_old_flush;	/* last old data flush */
@@ -160,6 +162,16 @@ struct bdi_writeback {
 #endif
 };
 
+struct bdi_writeback_ctx {
+	struct bdi_writeback wb;  /* the root writeback info for this bdi */
+	struct list_head wb_list; /* list of all wbs */
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
+	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
+#endif
+	wait_queue_head_t wb_waitq;
+};
+
 struct backing_dev_info {
 	u64 id;
 	struct rb_node rb_node; /* keyed by ->id */
@@ -183,15 +195,11 @@ struct backing_dev_info {
 	 */
 	unsigned long last_bdp_sleep;
 
-	struct bdi_writeback wb;  /* the root writeback info for this bdi */
-	struct list_head wb_list; /* list of all wbs */
+	int nr_wb_ctx;
+	struct bdi_writeback_ctx **wb_ctx_arr;
 #ifdef CONFIG_CGROUP_WRITEBACK
-	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
 	struct mutex cgwb_release_mutex;  /* protect shutdown of wb structs */
-	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
 #endif
-	wait_queue_head_t wb_waitq;
-
 	struct device *dev;
 	char dev_name[64];
 	struct device *owner;
@@ -216,7 +224,7 @@ struct wb_lock_cookie {
  */
 static inline bool wb_tryget(struct bdi_writeback *wb)
 {
-	if (wb != &wb->bdi->wb)
+	if (wb != &wb->bdi_wb_ctx->wb)
 		return percpu_ref_tryget(&wb->refcnt);
 	return true;
 }
@@ -227,7 +235,7 @@ static inline bool wb_tryget(struct bdi_writeback *wb)
  */
 static inline void wb_get(struct bdi_writeback *wb)
 {
-	if (wb != &wb->bdi->wb)
+	if (wb != &wb->bdi_wb_ctx->wb)
 		percpu_ref_get(&wb->refcnt);
 }
 
@@ -246,7 +254,7 @@ static inline void wb_put_many(struct bdi_writeback *wb, unsigned long nr)
 		return;
 	}
 
-	if (wb != &wb->bdi->wb)
+	if (wb != &wb->bdi_wb_ctx->wb)
 		percpu_ref_put_many(&wb->refcnt, nr);
 }
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index e721148c95d0..894968e98dd8 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -148,11 +148,20 @@ static inline bool mapping_can_writeback(struct address_space *mapping)
 	return inode_to_bdi(mapping->host)->capabilities & BDI_CAP_WRITEBACK;
 }
 
+static inline struct bdi_writeback_ctx *
+fetch_bdi_writeback_ctx(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	return bdi->wb_ctx_arr[0];
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
-struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi,
+struct bdi_writeback *wb_get_lookup(struct bdi_writeback_ctx *bdi_wb_ctx,
 				    struct cgroup_subsys_state *memcg_css);
 struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
+				    struct bdi_writeback_ctx *bdi_wb_ctx,
 				    struct cgroup_subsys_state *memcg_css,
 				    gfp_t gfp);
 void wb_memcg_offline(struct mem_cgroup *memcg);
@@ -187,16 +196,18 @@ static inline bool inode_cgwb_enabled(struct inode *inode)
  * Must be called under rcu_read_lock() which protects the returend wb.
  * NULL if not found.
  */
-static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
+static inline struct bdi_writeback *
+wb_find_current(struct backing_dev_info *bdi,
+		struct bdi_writeback_ctx *bdi_wb_ctx)
 {
 	struct cgroup_subsys_state *memcg_css;
 	struct bdi_writeback *wb;
 
 	memcg_css = task_css(current, memory_cgrp_id);
 	if (!memcg_css->parent)
-		return &bdi->wb;
+		return &bdi_wb_ctx->wb;
 
-	wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+	wb = radix_tree_lookup(&bdi_wb_ctx->cgwb_tree, memcg_css->id);
 
 	/*
 	 * %current's blkcg equals the effective blkcg of its memcg.  No
@@ -217,12 +228,13 @@ static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi
  * wb_find_current().
  */
 static inline struct bdi_writeback *
-wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
+wb_get_create_current(struct backing_dev_info *bdi,
+		      struct bdi_writeback_ctx *bdi_wb_ctx, gfp_t gfp)
 {
 	struct bdi_writeback *wb;
 
 	rcu_read_lock();
-	wb = wb_find_current(bdi);
+	wb = wb_find_current(bdi, bdi_wb_ctx);
 	if (wb && unlikely(!wb_tryget(wb)))
 		wb = NULL;
 	rcu_read_unlock();
@@ -231,7 +243,7 @@ wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
 		struct cgroup_subsys_state *memcg_css;
 
 		memcg_css = task_get_css(current, memory_cgrp_id);
-		wb = wb_get_create(bdi, memcg_css, gfp);
+		wb = wb_get_create(bdi, bdi_wb_ctx, memcg_css, gfp);
 		css_put(memcg_css);
 	}
 	return wb;
@@ -265,7 +277,7 @@ static inline struct bdi_writeback *inode_to_wb_wbc(
 	 * If wbc does not have inode attached, it means cgroup writeback was
 	 * disabled when wbc started. Just use the default wb in that case.
 	 */
-	return wbc->wb ? wbc->wb : &inode_to_bdi(inode)->wb;
+	return wbc->wb ? wbc->wb : &fetch_bdi_writeback_ctx(inode)->wb;
 }
 
 /**
@@ -325,20 +337,23 @@ static inline bool inode_cgwb_enabled(struct inode *inode)
 	return false;
 }
 
-static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
+static inline struct bdi_writeback *wb_find_current(
+		struct backing_dev_info *bdi,
+		struct bdi_writeback_ctx *bdi_wb_ctx)
 {
-	return &bdi->wb;
+	return &bdi_wb_ctx->wb;
 }
 
 static inline struct bdi_writeback *
-wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
+wb_get_create_current(struct backing_dev_info *bdi,
+		      struct bdi_writeback_ctx *bdi_wb_ctx, gfp_t gfp)
 {
-	return &bdi->wb;
+	return &bdi_wb_ctx->wb;
 }
 
 static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
 {
-	return &inode_to_bdi(inode)->wb;
+	return &fetch_bdi_writeback_ctx(inode)->wb;
 }
 
 static inline struct bdi_writeback *inode_to_wb_wbc(
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d5988867fe31..09575c399ccc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2289,7 +2289,6 @@ struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
 	void (*destroy_inode)(struct inode *);
 	void (*free_inode)(struct inode *);
-
    	void (*dirty_inode) (struct inode *, int flags);
 	int (*write_inode) (struct inode *, struct writeback_control *wbc);
 	int (*drop_inode) (struct inode *);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 783904d8c5ef..0efa9632011a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -84,13 +84,14 @@ static void collect_wb_stats(struct wb_stats *stats,
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
+
 static void bdi_collect_stats(struct backing_dev_info *bdi,
 			      struct wb_stats *stats)
 {
 	struct bdi_writeback *wb;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
+	list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, bdi_node) {
 		if (!wb_tryget(wb))
 			continue;
 
@@ -103,7 +104,7 @@ static void bdi_collect_stats(struct backing_dev_info *bdi,
 static void bdi_collect_stats(struct backing_dev_info *bdi,
 			      struct wb_stats *stats)
 {
-	collect_wb_stats(stats, &bdi->wb);
+	collect_wb_stats(stats, &bdi->wb_ctx_arr[0]->wb);
 }
 #endif
 
@@ -149,7 +150,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   stats.nr_io,
 		   stats.nr_more_io,
 		   stats.nr_dirty_time,
-		   !list_empty(&bdi->bdi_list), bdi->wb.state);
+		   !list_empty(&bdi->bdi_list), bdi->wb_ctx_arr[0]->wb.state);
 
 	return 0;
 }
@@ -193,14 +194,14 @@ static void wb_stats_show(struct seq_file *m, struct bdi_writeback *wb,
 static int cgwb_debug_stats_show(struct seq_file *m, void *v)
 {
 	struct backing_dev_info *bdi = m->private;
+	struct bdi_writeback *wb;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
-	struct bdi_writeback *wb;
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
+	list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, bdi_node) {
 		struct wb_stats stats = { .dirty_thresh = dirty_thresh };
 
 		if (!wb_tryget(wb))
@@ -520,6 +521,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	memset(wb, 0, sizeof(*wb));
 
 	wb->bdi = bdi;
+	wb->bdi_wb_ctx = bdi->wb_ctx_arr[0];
 	wb->last_old_flush = jiffies;
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
@@ -643,11 +645,12 @@ static void cgwb_release(struct percpu_ref *refcnt)
 	queue_work(cgwb_release_wq, &wb->release_work);
 }
 
-static void cgwb_kill(struct bdi_writeback *wb)
+static void cgwb_kill(struct bdi_writeback *wb,
+		      struct bdi_writeback_ctx *bdi_wb_ctx)
 {
 	lockdep_assert_held(&cgwb_lock);
 
-	WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
+	WARN_ON(!radix_tree_delete(&bdi_wb_ctx->cgwb_tree, wb->memcg_css->id));
 	list_del(&wb->memcg_node);
 	list_del(&wb->blkcg_node);
 	list_add(&wb->offline_node, &offline_cgwbs);
@@ -662,6 +665,7 @@ static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
 }
 
 static int cgwb_create(struct backing_dev_info *bdi,
+		       struct bdi_writeback_ctx *bdi_wb_ctx,
 		       struct cgroup_subsys_state *memcg_css, gfp_t gfp)
 {
 	struct mem_cgroup *memcg;
@@ -678,9 +682,9 @@ static int cgwb_create(struct backing_dev_info *bdi,
 
 	/* look up again under lock and discard on blkcg mismatch */
 	spin_lock_irqsave(&cgwb_lock, flags);
-	wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+	wb = radix_tree_lookup(&bdi_wb_ctx->cgwb_tree, memcg_css->id);
 	if (wb && wb->blkcg_css != blkcg_css) {
-		cgwb_kill(wb);
+		cgwb_kill(wb, bdi_wb_ctx);
 		wb = NULL;
 	}
 	spin_unlock_irqrestore(&cgwb_lock, flags);
@@ -721,12 +725,13 @@ static int cgwb_create(struct backing_dev_info *bdi,
 	 */
 	ret = -ENODEV;
 	spin_lock_irqsave(&cgwb_lock, flags);
-	if (test_bit(WB_registered, &bdi->wb.state) &&
+	if (test_bit(WB_registered, &bdi_wb_ctx->wb.state) &&
 	    blkcg_cgwb_list->next && memcg_cgwb_list->next) {
 		/* we might have raced another instance of this function */
-		ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
+		ret = radix_tree_insert(&bdi_wb_ctx->cgwb_tree,
+					memcg_css->id, wb);
 		if (!ret) {
-			list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
+			list_add_tail_rcu(&wb->bdi_node, &bdi_wb_ctx->wb_list);
 			list_add(&wb->memcg_node, memcg_cgwb_list);
 			list_add(&wb->blkcg_node, blkcg_cgwb_list);
 			blkcg_pin_online(blkcg_css);
@@ -779,16 +784,16 @@ static int cgwb_create(struct backing_dev_info *bdi,
  * each lookup.  On mismatch, the existing wb is discarded and a new one is
  * created.
  */
-struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi,
+struct bdi_writeback *wb_get_lookup(struct bdi_writeback_ctx *bdi_wb_ctx,
 				    struct cgroup_subsys_state *memcg_css)
 {
 	struct bdi_writeback *wb;
 
 	if (!memcg_css->parent)
-		return &bdi->wb;
+		return &bdi_wb_ctx->wb;
 
 	rcu_read_lock();
-	wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+	wb = radix_tree_lookup(&bdi_wb_ctx->cgwb_tree, memcg_css->id);
 	if (wb) {
 		struct cgroup_subsys_state *blkcg_css;
 
@@ -813,6 +818,7 @@ struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi,
  * create one.  See wb_get_lookup() for more details.
  */
 struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
+				    struct bdi_writeback_ctx *bdi_wb_ctx,
 				    struct cgroup_subsys_state *memcg_css,
 				    gfp_t gfp)
 {
@@ -821,8 +827,8 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 	might_alloc(gfp);
 
 	do {
-		wb = wb_get_lookup(bdi, memcg_css);
-	} while (!wb && !cgwb_create(bdi, memcg_css, gfp));
+		wb = wb_get_lookup(bdi_wb_ctx, memcg_css);
+	} while (!wb && !cgwb_create(bdi, bdi_wb_ctx, memcg_css, gfp));
 
 	return wb;
 }
@@ -830,36 +836,40 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 static int cgwb_bdi_init(struct backing_dev_info *bdi)
 {
 	int ret;
+	struct bdi_writeback_ctx *bdi_wb_ctx = bdi->wb_ctx_arr[0];
 
-	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
+	INIT_RADIX_TREE(&bdi_wb_ctx->cgwb_tree, GFP_ATOMIC);
 	mutex_init(&bdi->cgwb_release_mutex);
-	init_rwsem(&bdi->wb_switch_rwsem);
+	init_rwsem(&bdi_wb_ctx->wb_switch_rwsem);
 
-	ret = wb_init(&bdi->wb, bdi, GFP_KERNEL);
+	ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL);
 	if (!ret) {
-		bdi->wb.memcg_css = &root_mem_cgroup->css;
-		bdi->wb.blkcg_css = blkcg_root_css;
+		bdi_wb_ctx->wb.memcg_css = &root_mem_cgroup->css;
+		bdi_wb_ctx->wb.blkcg_css = blkcg_root_css;
 	}
 	return ret;
 }
 
-static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
+/* callers should create a loop and pass bdi_wb_ctx */
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi,
+	struct bdi_writeback_ctx *bdi_wb_ctx)
 {
 	struct radix_tree_iter iter;
 	void **slot;
 	struct bdi_writeback *wb;
 
-	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
+	WARN_ON(test_bit(WB_registered, &bdi_wb_ctx->wb.state));
 
 	spin_lock_irq(&cgwb_lock);
-	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
-		cgwb_kill(*slot);
+	radix_tree_for_each_slot(slot, &bdi_wb_ctx->cgwb_tree, &iter, 0)
+		cgwb_kill(*slot, bdi_wb_ctx);
 	spin_unlock_irq(&cgwb_lock);
 
 	mutex_lock(&bdi->cgwb_release_mutex);
 	spin_lock_irq(&cgwb_lock);
-	while (!list_empty(&bdi->wb_list)) {
-		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+	while (!list_empty(&bdi_wb_ctx->wb_list)) {
+		wb = list_first_entry(&bdi_wb_ctx->wb_list,
+				      struct bdi_writeback,
 				      bdi_node);
 		spin_unlock_irq(&cgwb_lock);
 		wb_shutdown(wb);
@@ -930,7 +940,7 @@ void wb_memcg_offline(struct mem_cgroup *memcg)
 
 	spin_lock_irq(&cgwb_lock);
 	list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
-		cgwb_kill(wb);
+		cgwb_kill(wb, wb->bdi_wb_ctx);
 	memcg_cgwb_list->next = NULL;	/* prevent new wb's */
 	spin_unlock_irq(&cgwb_lock);
 
@@ -950,15 +960,16 @@ void wb_blkcg_offline(struct cgroup_subsys_state *css)
 
 	spin_lock_irq(&cgwb_lock);
 	list_for_each_entry_safe(wb, next, list, blkcg_node)
-		cgwb_kill(wb);
+		cgwb_kill(wb, wb->bdi_wb_ctx);
 	list->next = NULL;	/* prevent new wb's */
 	spin_unlock_irq(&cgwb_lock);
 }
 
-static void cgwb_bdi_register(struct backing_dev_info *bdi)
+static void cgwb_bdi_register(struct backing_dev_info *bdi,
+		struct bdi_writeback_ctx *bdi_wb_ctx)
 {
 	spin_lock_irq(&cgwb_lock);
-	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+	list_add_tail_rcu(&bdi_wb_ctx->wb.bdi_node, &bdi_wb_ctx->wb_list);
 	spin_unlock_irq(&cgwb_lock);
 }
 
@@ -981,14 +992,18 @@ subsys_initcall(cgwb_init);
 
 static int cgwb_bdi_init(struct backing_dev_info *bdi)
 {
-	return wb_init(&bdi->wb, bdi, GFP_KERNEL);
+	return wb_init(&bdi->wb_ctx_arr[0]->wb, bdi, GFP_KERNEL);
 }
 
-static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi,
+		struct bdi_writeback_ctx *bdi_wb_ctx)
+{ }
 
-static void cgwb_bdi_register(struct backing_dev_info *bdi)
+/* callers should create a loop and pass bdi_wb_ctx */
+static void cgwb_bdi_register(struct backing_dev_info *bdi,
+		struct bdi_writeback_ctx *bdi_wb_ctx)
 {
-	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+	list_add_tail_rcu(&bdi_wb_ctx->wb.bdi_node, &bdi_wb_ctx->wb_list);
 }
 
 static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
@@ -1006,9 +1021,15 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100 * BDI_RATIO_SCALE;
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
+	bdi->nr_wb_ctx = 1;
+	bdi->wb_ctx_arr = kcalloc(bdi->nr_wb_ctx,
+				  sizeof(struct bdi_writeback_ctx *),
+				  GFP_KERNEL);
 	INIT_LIST_HEAD(&bdi->bdi_list);
-	INIT_LIST_HEAD(&bdi->wb_list);
-	init_waitqueue_head(&bdi->wb_waitq);
+	bdi->wb_ctx_arr[0] = (struct bdi_writeback_ctx *)
+			kzalloc(sizeof(struct bdi_writeback_ctx), GFP_KERNEL);
+	INIT_LIST_HEAD(&bdi->wb_ctx_arr[0]->wb_list);
+	init_waitqueue_head(&bdi->wb_ctx_arr[0]->wb_waitq);
 	bdi->last_bdp_sleep = jiffies;
 
 	return cgwb_bdi_init(bdi);
@@ -1023,6 +1044,8 @@ struct backing_dev_info *bdi_alloc(int node_id)
 		return NULL;
 
 	if (bdi_init(bdi)) {
+		kfree(bdi->wb_ctx_arr[0]);
+		kfree(bdi->wb_ctx_arr);
 		kfree(bdi);
 		return NULL;
 	}
@@ -1095,11 +1118,11 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
 	if (IS_ERR(dev))
 		return PTR_ERR(dev);
 
-	cgwb_bdi_register(bdi);
+	cgwb_bdi_register(bdi, bdi->wb_ctx_arr[0]);
+	set_bit(WB_registered, &bdi->wb_ctx_arr[0]->wb.state);
 	bdi->dev = dev;
 
 	bdi_debug_register(bdi, dev_name(dev));
-	set_bit(WB_registered, &bdi->wb.state);
 
 	spin_lock_bh(&bdi_lock);
 
@@ -1155,8 +1178,8 @@ void bdi_unregister(struct backing_dev_info *bdi)
 
 	/* make sure nobody finds us on the bdi_list anymore */
 	bdi_remove_from_list(bdi);
-	wb_shutdown(&bdi->wb);
-	cgwb_bdi_unregister(bdi);
+	wb_shutdown(&bdi->wb_ctx_arr[0]->wb);
+	cgwb_bdi_unregister(bdi, bdi->wb_ctx_arr[0]);
 
 	/*
 	 * If this BDI's min ratio has been set, use bdi_set_min_ratio() to
@@ -1183,9 +1206,11 @@ static void release_bdi(struct kref *ref)
 	struct backing_dev_info *bdi =
 			container_of(ref, struct backing_dev_info, refcnt);
 
-	WARN_ON_ONCE(test_bit(WB_registered, &bdi->wb.state));
 	WARN_ON_ONCE(bdi->dev);
-	wb_exit(&bdi->wb);
+	WARN_ON_ONCE(test_bit(WB_registered, &bdi->wb_ctx_arr[0]->wb.state));
+	wb_exit(&bdi->wb_ctx_arr[0]->wb);
+	kfree(bdi->wb_ctx_arr[0]);
+	kfree(bdi->wb_ctx_arr);
 	kfree(bdi);
 }
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c81624bc3969..b27416da569b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2049,6 +2049,7 @@ int balance_dirty_pages_ratelimited_flags(struct address_space *mapping,
 {
 	struct inode *inode = mapping->host;
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode);
 	struct bdi_writeback *wb = NULL;
 	int ratelimit;
 	int ret = 0;
@@ -2058,9 +2059,9 @@ int balance_dirty_pages_ratelimited_flags(struct address_space *mapping,
 		return ret;
 
 	if (inode_cgwb_enabled(inode))
-		wb = wb_get_create_current(bdi, GFP_KERNEL);
+		wb = wb_get_create_current(bdi, bdi_wb_ctx, GFP_KERNEL);
 	if (!wb)
-		wb = &bdi->wb;
+		wb = &bdi_wb_ctx->wb;
 
 	ratelimit = current->nr_dirtied_pause;
 	if (wb->dirty_exceeded)
-- 
2.25.1



* [PATCH 02/13] writeback: add support to initialize and free multiple writeback ctxs
       [not found]   ` <CGME20250529113224epcas5p2eea35fd0ebe445d8ad0471a144714b23@epcas5p2.samsung.com>
@ 2025-05-29 11:14     ` Kundan Kumar
  0 siblings, 0 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

Introduce a new macro, for_each_bdi_wb_ctx, to iterate over multiple
writeback ctxs. Add logic for allocation, initialization, freeing,
registration and unregistration of multiple writeback contexts within
a bdi.
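
For example, bdi_unregister() below now shuts down and unregisters every
context:

	struct bdi_writeback_ctx *bdi_wb_ctx;

	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
		wb_shutdown(&bdi_wb_ctx->wb);
		cgwb_bdi_unregister(bdi, bdi_wb_ctx);
	}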

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 include/linux/backing-dev.h |  4 ++
 mm/backing-dev.c            | 79 +++++++++++++++++++++++++++----------
 2 files changed, 62 insertions(+), 21 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 894968e98dd8..fbccb483e59c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -148,6 +148,10 @@ static inline bool mapping_can_writeback(struct address_space *mapping)
 	return inode_to_bdi(mapping->host)->capabilities & BDI_CAP_WRITEBACK;
 }
 
+#define for_each_bdi_wb_ctx(bdi, wb_ctx) \
+	for (int __i = 0; __i < (bdi)->nr_wb_ctx \
+		&& ((wb_ctx) = (bdi)->wb_ctx_arr[__i]) != NULL; __i++)
+
 static inline struct bdi_writeback_ctx *
 fetch_bdi_writeback_ctx(struct inode *inode)
 {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0efa9632011a..adf87b036827 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -836,16 +836,19 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 static int cgwb_bdi_init(struct backing_dev_info *bdi)
 {
 	int ret;
-	struct bdi_writeback_ctx *bdi_wb_ctx = bdi->wb_ctx_arr[0];
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 
-	INIT_RADIX_TREE(&bdi_wb_ctx->cgwb_tree, GFP_ATOMIC);
-	mutex_init(&bdi->cgwb_release_mutex);
-	init_rwsem(&bdi_wb_ctx->wb_switch_rwsem);
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
+		INIT_RADIX_TREE(&bdi_wb_ctx->cgwb_tree, GFP_ATOMIC);
+		mutex_init(&bdi->cgwb_release_mutex);
+		init_rwsem(&bdi_wb_ctx->wb_switch_rwsem);
 
-	ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL);
-	if (!ret) {
-		bdi_wb_ctx->wb.memcg_css = &root_mem_cgroup->css;
-		bdi_wb_ctx->wb.blkcg_css = blkcg_root_css;
+		ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL);
+		if (!ret) {
+			bdi_wb_ctx->wb.memcg_css = &root_mem_cgroup->css;
+			bdi_wb_ctx->wb.blkcg_css = blkcg_root_css;
+		} else
+			return ret;
 	}
 	return ret;
 }
@@ -992,7 +995,16 @@ subsys_initcall(cgwb_init);
 
 static int cgwb_bdi_init(struct backing_dev_info *bdi)
 {
-	return wb_init(&bdi->wb_ctx_arr[0]->wb, bdi, GFP_KERNEL);
+	struct bdi_writeback_ctx *bdi_wb_ctx;
+
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
+		int ret;
+
+		ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL);
+		if (ret)
+			return ret;
+	}
+	return 0;
 }
 
 static void cgwb_bdi_unregister(struct backing_dev_info *bdi,
@@ -1026,10 +1038,19 @@ int bdi_init(struct backing_dev_info *bdi)
 				  sizeof(struct bdi_writeback_ctx *),
 				  GFP_KERNEL);
 	INIT_LIST_HEAD(&bdi->bdi_list);
-	bdi->wb_ctx_arr[0] = (struct bdi_writeback_ctx *)
-			kzalloc(sizeof(struct bdi_writeback_ctx), GFP_KERNEL);
-	INIT_LIST_HEAD(&bdi->wb_ctx_arr[0]->wb_list);
-	init_waitqueue_head(&bdi->wb_ctx_arr[0]->wb_waitq);
+	for (int i = 0; i < bdi->nr_wb_ctx; i++) {
+		bdi->wb_ctx_arr[i] = (struct bdi_writeback_ctx *)
+			 kzalloc(sizeof(struct bdi_writeback_ctx), GFP_KERNEL);
+		if (!bdi->wb_ctx_arr[i]) {
+			pr_err("Failed to allocate %d", i);
+			while (--i >= 0)
+				kfree(bdi->wb_ctx_arr[i]);
+			kfree(bdi->wb_ctx_arr);
+			return -ENOMEM;
+		}
+		INIT_LIST_HEAD(&bdi->wb_ctx_arr[i]->wb_list);
+		init_waitqueue_head(&bdi->wb_ctx_arr[i]->wb_waitq);
+	}
 	bdi->last_bdp_sleep = jiffies;
 
 	return cgwb_bdi_init(bdi);
@@ -1038,13 +1059,16 @@ int bdi_init(struct backing_dev_info *bdi)
 struct backing_dev_info *bdi_alloc(int node_id)
 {
 	struct backing_dev_info *bdi;
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 
 	bdi = kzalloc_node(sizeof(*bdi), GFP_KERNEL, node_id);
 	if (!bdi)
 		return NULL;
 
 	if (bdi_init(bdi)) {
-		kfree(bdi->wb_ctx_arr[0]);
+		for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
+			kfree(bdi_wb_ctx);
+		}
 		kfree(bdi->wb_ctx_arr);
 		kfree(bdi);
 		return NULL;
@@ -1109,6 +1133,7 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
 {
 	struct device *dev;
 	struct rb_node *parent, **p;
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 
 	if (bdi->dev)	/* The driver needs to use separate queues per device */
 		return 0;
@@ -1118,8 +1143,11 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
 	if (IS_ERR(dev))
 		return PTR_ERR(dev);
 
-	cgwb_bdi_register(bdi, bdi->wb_ctx_arr[0]);
-	set_bit(WB_registered, &bdi->wb_ctx_arr[0]->wb.state);
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
+		cgwb_bdi_register(bdi, bdi_wb_ctx);
+		set_bit(WB_registered, &bdi_wb_ctx->wb.state);
+	}
+
 	bdi->dev = dev;
 
 	bdi_debug_register(bdi, dev_name(dev));
@@ -1174,12 +1202,17 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi)
 
 void bdi_unregister(struct backing_dev_info *bdi)
 {
+	struct bdi_writeback_ctx *bdi_wb_ctx;
+
 	timer_delete_sync(&bdi->laptop_mode_wb_timer);
 
 	/* make sure nobody finds us on the bdi_list anymore */
 	bdi_remove_from_list(bdi);
-	wb_shutdown(&bdi->wb_ctx_arr[0]->wb);
-	cgwb_bdi_unregister(bdi, bdi->wb_ctx_arr[0]);
+
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
+		wb_shutdown(&bdi_wb_ctx->wb);
+		cgwb_bdi_unregister(bdi, bdi_wb_ctx);
+	}
 
 	/*
 	 * If this BDI's min ratio has been set, use bdi_set_min_ratio() to
@@ -1205,11 +1238,15 @@ static void release_bdi(struct kref *ref)
 {
 	struct backing_dev_info *bdi =
 			container_of(ref, struct backing_dev_info, refcnt);
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 
 	WARN_ON_ONCE(bdi->dev);
-	WARN_ON_ONCE(test_bit(WB_registered, &bdi->wb_ctx_arr[0]->wb.state));
-	wb_exit(&bdi->wb_ctx_arr[0]->wb);
-	kfree(bdi->wb_ctx_arr[0]);
+
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
+		WARN_ON_ONCE(test_bit(WB_registered, &bdi_wb_ctx->wb.state));
+		wb_exit(&bdi_wb_ctx->wb);
+		kfree(bdi_wb_ctx);
+	}
 	kfree(bdi->wb_ctx_arr);
 	kfree(bdi);
 }
-- 
2.25.1



* [PATCH 03/13] writeback: link bdi_writeback to its corresponding bdi_writeback_ctx
       [not found]   ` <CGME20250529113228epcas5p1db88ab42c2dac0698d715e38bd5e0896@epcas5p1.samsung.com>
@ 2025-05-29 11:14     ` Kundan Kumar
  0 siblings, 0 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

Link each bdi_writeback to its corresponding bdi_writeback_ctx by
passing the owning context to wb_init(). This helps in fetching the
writeback context from the bdi_writeback.
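
With this, wb_init() records the owning context alongside the bdi (see
the diff below):

	wb->bdi = bdi;
	wb->bdi_wb_ctx = bdi_wb_ctx;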

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 mm/backing-dev.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index adf87b036827..5479e2d34160 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -513,15 +513,16 @@ static void wb_update_bandwidth_workfn(struct work_struct *work)
  */
 #define INIT_BW		(100 << (20 - PAGE_SHIFT))
 
-static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
-		   gfp_t gfp)
+static int wb_init(struct bdi_writeback *wb,
+		   struct bdi_writeback_ctx *bdi_wb_ctx,
+		   struct backing_dev_info *bdi, gfp_t gfp)
 {
 	int err;
 
 	memset(wb, 0, sizeof(*wb));
 
 	wb->bdi = bdi;
-	wb->bdi_wb_ctx = bdi->wb_ctx_arr[0];
+	wb->bdi_wb_ctx = bdi_wb_ctx;
 	wb->last_old_flush = jiffies;
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
@@ -698,7 +699,7 @@ static int cgwb_create(struct backing_dev_info *bdi,
 		goto out_put;
 	}
 
-	ret = wb_init(wb, bdi, gfp);
+	ret = wb_init(wb, bdi_wb_ctx, bdi, gfp);
 	if (ret)
 		goto err_free;
 
@@ -843,7 +844,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 		mutex_init(&bdi->cgwb_release_mutex);
 		init_rwsem(&bdi_wb_ctx->wb_switch_rwsem);
 
-		ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL);
+		ret = wb_init(&bdi_wb_ctx->wb, bdi_wb_ctx, bdi, GFP_KERNEL);
 		if (!ret) {
 			bdi_wb_ctx->wb.memcg_css = &root_mem_cgroup->css;
 			bdi_wb_ctx->wb.blkcg_css = blkcg_root_css;
@@ -1000,7 +1001,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
 		int ret;
 
-		ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL);
+		ret = wb_init(&bdi_wb_ctx->wb, bdi_wb_ctx, bdi, GFP_KERNEL);
 		if (ret)
 			return ret;
 	}
-- 
2.25.1



* [PATCH 04/13] writeback: affine inode to a writeback ctx within a bdi
       [not found]   ` <CGME20250529113232epcas5p4e6f3b2f03d3a5f8fcaace3ddd03298d0@epcas5p4.samsung.com>
@ 2025-05-29 11:14     ` Kundan Kumar
  2025-06-02 14:24       ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

Affine an inode to a writeback context. This helps in minimizing
filesystem fragmentation caused by the same inode being processed by
different threads.
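
For example, with nr_wb_ctx = 4, inode number 10 always maps to
writeback context 10 % 4 = 2, so all of that inode's pages are written
back by the same thread.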

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/fs-writeback.c           | 3 ++-
 include/linux/backing-dev.h | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0959fff46235..9529e16c9b66 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -265,7 +265,8 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct bdi_writeback *wb = NULL;
-	struct bdi_writeback_ctx *bdi_writeback_ctx = bdi->wb_ctx_arr[0];
+	struct bdi_writeback_ctx *bdi_writeback_ctx =
+						fetch_bdi_writeback_ctx(inode);
 
 	if (inode_cgwb_enabled(inode)) {
 		struct cgroup_subsys_state *memcg_css;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index fbccb483e59c..30a812fbd488 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -157,7 +157,7 @@ fetch_bdi_writeback_ctx(struct inode *inode)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 
-	return bdi->wb_ctx_arr[0];
+	return bdi->wb_ctx_arr[inode->i_ino % bdi->nr_wb_ctx];
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-- 
2.25.1



* [PATCH 05/13] writeback: modify bdi_writeback search logic to search across all wb ctxs
       [not found]   ` <CGME20250529113236epcas5p2049b6cc3be27d8727ac1f15697987ff5@epcas5p2.samsung.com>
@ 2025-05-29 11:14     ` Kundan Kumar
  0 siblings, 0 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

Since a bdi now has multiple cgwb trees, one embedded in each writeback
context, iterate over all of them to find the associated writeback.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/fs-writeback.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9529e16c9b66..72b73c3353fe 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1091,6 +1091,7 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id,
 	struct backing_dev_info *bdi;
 	struct cgroup_subsys_state *memcg_css;
 	struct bdi_writeback *wb;
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 	struct wb_writeback_work *work;
 	unsigned long dirty;
 	int ret;
@@ -1114,7 +1115,11 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id,
 	 * And find the associated wb.  If the wb isn't there already
 	 * there's nothing to flush, don't create one.
 	 */
-	wb = wb_get_lookup(bdi->wb_ctx_arr[0], memcg_css);
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
+		wb = wb_get_lookup(bdi_wb_ctx, memcg_css);
+		if (wb)
+			break;
+	}
 	if (!wb) {
 		ret = -ENOENT;
 		goto out_css_put;
-- 
2.25.1



* [PATCH 06/13] writeback: invoke all writeback contexts for flusher and dirtytime writeback
       [not found]   ` <CGME20250529113240epcas5p295dcf9a016cc28e5c3e88d698808f645@epcas5p2.samsung.com>
@ 2025-05-29 11:14     ` Kundan Kumar
  0 siblings, 0 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

Modify the flusher and dirtytime writeback logic to iterate over all the
writeback contexts.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/fs-writeback.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 72b73c3353fe..9b0940a6fe78 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2389,12 +2389,14 @@ static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
 					 enum wb_reason reason)
 {
 	struct bdi_writeback *wb;
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 
 	if (!bdi_has_dirty_io(bdi))
 		return;
 
-	list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, bdi_node)
-		wb_start_writeback(wb, reason);
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx)
+		list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node)
+			wb_start_writeback(wb, reason);
 }
 
 void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
@@ -2444,15 +2446,17 @@ static DECLARE_DELAYED_WORK(dirtytime_work, wakeup_dirtytime_writeback);
 static void wakeup_dirtytime_writeback(struct work_struct *w)
 {
 	struct backing_dev_info *bdi;
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
 		struct bdi_writeback *wb;
 
-		list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list,
-					bdi_node)
-			if (!list_empty(&wb->b_dirty_time))
-				wb_wakeup(wb);
+		for_each_bdi_wb_ctx(bdi, bdi_wb_ctx)
+			list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list,
+						bdi_node)
+				if (!list_empty(&wb->b_dirty_time))
+					wb_wakeup(wb);
 	}
 	rcu_read_unlock();
 	schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ);
-- 
2.25.1



* [PATCH 07/13] writeback: modify sync related functions to iterate over all writeback contexts
       [not found]   ` <CGME20250529113245epcas5p2978b77ce5ccf2d620f2a9ee5e796bee3@epcas5p2.samsung.com>
@ 2025-05-29 11:14     ` Kundan Kumar
  0 siblings, 0 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

Modify sync related functions to iterate over all writeback contexts.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/fs-writeback.c | 66 +++++++++++++++++++++++++++++++----------------
 1 file changed, 44 insertions(+), 22 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9b0940a6fe78..7558b8a33fe0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2752,11 +2752,13 @@ static void wait_sb_inodes(struct super_block *sb)
 	mutex_unlock(&sb->s_sync_lock);
 }
 
-static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
-				     enum wb_reason reason, bool skip_if_busy)
+static void __writeback_inodes_sb_nr_ctx(struct super_block *sb,
+					 unsigned long nr,
+					 enum wb_reason reason,
+					 bool skip_if_busy,
+					 struct bdi_writeback_ctx *bdi_wb_ctx)
 {
-	struct backing_dev_info *bdi = sb->s_bdi;
-	DEFINE_WB_COMPLETION(done, bdi->wb_ctx_arr[0]);
+	DEFINE_WB_COMPLETION(done, bdi_wb_ctx);
 	struct wb_writeback_work work = {
 		.sb			= sb,
 		.sync_mode		= WB_SYNC_NONE,
@@ -2766,13 +2768,23 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
 		.reason			= reason,
 	};
 
+	bdi_split_work_to_wbs(sb->s_bdi, bdi_wb_ctx, &work, skip_if_busy);
+	wb_wait_for_completion(&done);
+}
+
+static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
+				     enum wb_reason reason, bool skip_if_busy)
+{
+	struct backing_dev_info *bdi = sb->s_bdi;
+	struct bdi_writeback_ctx *bdi_wb_ctx;
+
 	if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	bdi_split_work_to_wbs(sb->s_bdi, bdi->wb_ctx_arr[0], &work,
-			      skip_if_busy);
-	wb_wait_for_completion(&done);
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx)
+		__writeback_inodes_sb_nr_ctx(sb, nr, reason, skip_if_busy,
+					     bdi_wb_ctx);
 }
 
 /**
@@ -2825,17 +2837,11 @@ void try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
 }
 EXPORT_SYMBOL(try_to_writeback_inodes_sb);
 
-/**
- * sync_inodes_sb	-	sync sb inode pages
- * @sb: the superblock
- *
- * This function writes and waits on any dirty inode belonging to this
- * super_block.
- */
-void sync_inodes_sb(struct super_block *sb)
+static void sync_inodes_bdi_wb_ctx(struct super_block *sb,
+				   struct backing_dev_info *bdi,
+				   struct bdi_writeback_ctx *bdi_wb_ctx)
 {
-	struct backing_dev_info *bdi = sb->s_bdi;
-	DEFINE_WB_COMPLETION(done, bdi->wb_ctx_arr[0]);
+	DEFINE_WB_COMPLETION(done, bdi_wb_ctx);
 	struct wb_writeback_work work = {
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_ALL,
@@ -2846,6 +2852,25 @@ void sync_inodes_sb(struct super_block *sb)
 		.for_sync	= 1,
 	};
 
+	/* protect against inode wb switch, see inode_switch_wbs_work_fn() */
+	bdi_down_write_wb_ctx_switch_rwsem(bdi_wb_ctx);
+	bdi_split_work_to_wbs(bdi, bdi_wb_ctx, &work, false);
+	wb_wait_for_completion(&done);
+	bdi_up_write_wb_ctx_switch_rwsem(bdi_wb_ctx);
+}
+
+/**
+ * sync_inodes_sb	-	sync sb inode pages
+ * @sb: the superblock
+ *
+ * This function writes and waits on any dirty inode belonging to this
+ * super_block.
+ */
+void sync_inodes_sb(struct super_block *sb)
+{
+	struct backing_dev_info *bdi = sb->s_bdi;
+	struct bdi_writeback_ctx *bdi_wb_ctx;
+
 	/*
 	 * Can't skip on !bdi_has_dirty() because we should wait for !dirty
 	 * inodes under writeback and I_DIRTY_TIME inodes ignored by
@@ -2855,11 +2880,8 @@ void sync_inodes_sb(struct super_block *sb)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	/* protect against inode wb switch, see inode_switch_wbs_work_fn() */
-	bdi_down_write_wb_ctx_switch_rwsem(bdi->wb_ctx_arr[0]);
-	bdi_split_work_to_wbs(bdi, bdi->wb_ctx_arr[0], &work, false);
-	wb_wait_for_completion(&done);
-	bdi_up_write_wb_ctx_switch_rwsem(bdi->wb_ctx_arr[0]);
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx)
+		sync_inodes_bdi_wb_ctx(sb, bdi, bdi_wb_ctx);
 
 	wait_sb_inodes(sb);
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread
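
One detail worth noting: DEFINE_WB_COMPLETION() in mainline is keyed to
the bdi-wide wait queue, while this series passes the per-context
structure, so sync completions now wait on that context's wb_waitq. A
sketch of the adjusted macro, inferred from its usage above (the real
definition lives in an earlier patch of the series):

#define DEFINE_WB_COMPLETION(cmpl, bdi_wb_ctx)				\
	struct wb_completion cmpl =					\
		__WB_COMPLETION_INIT(&(bdi_wb_ctx)->wb_waitq)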

* [PATCH 08/13] writeback: add support to collect stats for all writeback ctxs
       [not found]   ` <CGME20250529113249epcas5p38b29d3c6256337eadc2d1644181f9b74@epcas5p3.samsung.com>
@ 2025-05-29 11:14     ` Kundan Kumar
  0 siblings, 0 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

Modify stats collection to gather stats from all the writeback
contexts within a bdi.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 mm/backing-dev.c | 72 ++++++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 30 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5479e2d34160..d416122e2914 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -50,6 +50,7 @@ struct wb_stats {
 	unsigned long nr_written;
 	unsigned long dirty_thresh;
 	unsigned long wb_thresh;
+	unsigned long state;
 };
 
 static struct dentry *bdi_debug_root;
@@ -81,6 +82,7 @@ static void collect_wb_stats(struct wb_stats *stats,
 	stats->nr_dirtied += wb_stat(wb, WB_DIRTIED);
 	stats->nr_written += wb_stat(wb, WB_WRITTEN);
 	stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh);
+	stats->state |= wb->state;
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -89,22 +91,27 @@ static void bdi_collect_stats(struct backing_dev_info *bdi,
 			      struct wb_stats *stats)
 {
 	struct bdi_writeback *wb;
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, bdi_node) {
-		if (!wb_tryget(wb))
-			continue;
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx)
+		list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) {
+			if (!wb_tryget(wb))
+				continue;
 
-		collect_wb_stats(stats, wb);
-		wb_put(wb);
-	}
+			collect_wb_stats(stats, wb);
+			wb_put(wb);
+		}
 	rcu_read_unlock();
 }
 #else
 static void bdi_collect_stats(struct backing_dev_info *bdi,
 			      struct wb_stats *stats)
 {
-	collect_wb_stats(stats, &bdi->wb_ctx_arr[0]->wb);
+	struct bdi_writeback_ctx *bdi_wb_ctx;
+
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx)
+		collect_wb_stats(stats, &bdi_wb_ctx->wb);
 }
 #endif
 
@@ -150,7 +157,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   stats.nr_io,
 		   stats.nr_more_io,
 		   stats.nr_dirty_time,
-		   !list_empty(&bdi->bdi_list), bdi->wb_ctx_arr[0]->wb.state);
+		   !list_empty(&bdi->bdi_list), stats.state);
 
 	return 0;
 }
@@ -195,35 +202,40 @@ static int cgwb_debug_stats_show(struct seq_file *m, void *v)
 {
 	struct backing_dev_info *bdi = m->private;
 	struct bdi_writeback *wb;
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
+	struct wb_stats stats;
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
+	stats.dirty_thresh = dirty_thresh;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, bdi_node) {
-		struct wb_stats stats = { .dirty_thresh = dirty_thresh };
-
-		if (!wb_tryget(wb))
-			continue;
-
-		collect_wb_stats(&stats, wb);
-
-		/*
-		 * Calculate thresh of wb in writeback cgroup which is min of
-		 * thresh in global domain and thresh in cgroup domain. Drop
-		 * rcu lock because cgwb_calc_thresh may sleep in
-		 * cgroup_rstat_flush. We can do so here because we have a ref.
-		 */
-		if (mem_cgroup_wb_domain(wb)) {
-			rcu_read_unlock();
-			stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
-			rcu_read_lock();
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
+		list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) {
+			if (!wb_tryget(wb))
+				continue;
+
+			collect_wb_stats(&stats, wb);
+
+			/*
+			 * Calculate thresh of wb in writeback cgroup which is
+			 * min of thresh in global domain and thresh in cgroup
+			 * domain. Drop rcu lock because cgwb_calc_thresh may
+			 * sleep in cgroup_rstat_flush. We can do so here
+			 * because we have a ref.
+			 */
+			if (mem_cgroup_wb_domain(wb)) {
+				rcu_read_unlock();
+				stats.wb_thresh = min(stats.wb_thresh,
+						      cgwb_calc_thresh(wb));
+				rcu_read_lock();
+			}
+
+			wb_stats_show(m, wb, &stats);
+
+			wb_put(wb);
 		}
-
-		wb_stats_show(m, wb, &stats);
-
-		wb_put(wb);
 	}
 	rcu_read_unlock();
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread
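
A side effect of the aggregation above: the new wb_stats.state field ORs
together wb->state from every context, so the state value in the debugfs
output now reports the union of all flusher states rather than the state
of context 0 alone.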

* [PATCH 09/13] f2fs: add support in f2fs to handle multiple writeback contexts
       [not found]   ` <CGME20250529113253epcas5p1a28e77b2d9824d55f594ccb053725ece@epcas5p1.samsung.com>
@ 2025-05-29 11:15     ` Kundan Kumar
  2025-06-02 14:20       ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:15 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

Add support in f2fs to handle multiple writeback contexts and check
dirty_exceeded across all of them.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/f2fs/node.c    | 11 +++++++----
 fs/f2fs/segment.h |  7 +++++--
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 4b6568cd5bef..19f208d6c6d3 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -50,6 +50,7 @@ bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type)
 	unsigned long avail_ram;
 	unsigned long mem_size = 0;
 	bool res = false;
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 
 	if (!nm_i)
 		return true;
@@ -73,8 +74,9 @@ bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type)
 		if (excess_cached_nats(sbi))
 			res = false;
 	} else if (type == DIRTY_DENTS) {
-		if (sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded)
-			return false;
+		for_each_bdi_wb_ctx(sbi->sb->s_bdi, bdi_wb_ctx)
+			if (bdi_wb_ctx->wb.dirty_exceeded)
+				return false;
 		mem_size = get_pages(sbi, F2FS_DIRTY_DENTS);
 		res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1);
 	} else if (type == INO_ENTRIES) {
@@ -114,8 +116,9 @@ bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type)
 		res = false;
 #endif
 	} else {
-		if (!sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded)
-			return true;
+		for_each_bdi_wb_ctx(sbi->sb->s_bdi, bdi_wb_ctx)
+			if (bdi_wb_ctx->wb.dirty_exceeded)
+				return false;
 	}
 	return res;
 }
diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
index a525ccd4cfc8..2eea08549d73 100644
--- a/fs/f2fs/segment.h
+++ b/fs/f2fs/segment.h
@@ -936,8 +936,11 @@ static inline bool sec_usage_check(struct f2fs_sb_info *sbi, unsigned int secno)
  */
 static inline int nr_pages_to_skip(struct f2fs_sb_info *sbi, int type)
 {
-	if (sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded)
-		return 0;
+	struct bdi_writeback_ctx *bdi_wb_ctx;
+
+	for_each_bdi_wb_ctx(sbi->sb->s_bdi, bdi_wb_ctx)
+		if (bdi_wb_ctx->wb.dirty_exceeded)
+			return 0;
 
 	if (type == DATA)
 		return BLKS_PER_SEG(sbi);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 10/13] fuse: add support for multiple writeback contexts in fuse
       [not found]   ` <CGME20250529113257epcas5p4dbaf9c8e2dc362767c8553399632c1ea@epcas5p4.samsung.com>
@ 2025-05-29 11:15     ` Kundan Kumar
  2025-06-02 14:21       ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:15 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

Fetch the writeback context to which an inode is affined and use it to
perform writeback-related operations.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/fuse/file.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7817219d1599..803359b02383 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1851,11 +1851,11 @@ static void fuse_writepage_free(struct fuse_writepage_args *wpa)
 
 static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio)
 {
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode);
 
-	dec_wb_stat(&bdi->wb_ctx_arr[0]->wb, WB_WRITEBACK);
+	dec_wb_stat(&bdi_wb_ctx->wb, WB_WRITEBACK);
 	node_stat_sub_folio(folio, NR_WRITEBACK_TEMP);
-	wb_writeout_inc(&bdi->wb_ctx_arr[0]->wb);
+	wb_writeout_inc(&bdi_wb_ctx->wb);
 }
 
 static void fuse_writepage_finish(struct fuse_writepage_args *wpa)
@@ -2134,6 +2134,7 @@ static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struc
 					  struct folio *tmp_folio, uint32_t folio_index)
 {
 	struct inode *inode = folio->mapping->host;
+	struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode);
 	struct fuse_args_pages *ap = &wpa->ia.ap;
 
 	folio_copy(tmp_folio, folio);
@@ -2142,7 +2143,7 @@ static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struc
 	ap->descs[folio_index].offset = 0;
 	ap->descs[folio_index].length = PAGE_SIZE;
 
-	inc_wb_stat(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb, WB_WRITEBACK);
+	inc_wb_stat(&bdi_wb_ctx->wb, WB_WRITEBACK);
 	node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP);
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 11/13] gfs2: add support in gfs2 to handle multiple writeback contexts
       [not found]   ` <CGME20250529113302epcas5p3bdae265288af32172fb7380a727383eb@epcas5p3.samsung.com>
@ 2025-05-29 11:15     ` Kundan Kumar
  0 siblings, 0 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:15 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

Add support in gfs2 to handle multiple writeback contexts and check
dirty_exceeded across all of them.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/gfs2/super.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index dfc83bd3def3..d4fdab4a4201 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -450,6 +450,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	struct address_space *metamapping = gfs2_glock2aspace(ip->i_gl);
 	struct backing_dev_info *bdi = inode_to_bdi(metamapping->host);
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 	int ret = 0;
 	bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
 
@@ -457,10 +458,12 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
 		gfs2_log_flush(GFS2_SB(inode), ip->i_gl,
 			       GFS2_LOG_HEAD_FLUSH_NORMAL |
 			       GFS2_LFC_WRITE_INODE);
-	if (bdi->wb_ctx_arr[0]->wb.dirty_exceeded)
-		gfs2_ail1_flush(sdp, wbc);
-	else
-		filemap_fdatawrite(metamapping);
+
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx)
+		if (bdi_wb_ctx->wb.dirty_exceeded)
+			gfs2_ail1_flush(sdp, wbc);
+		else
+			filemap_fdatawrite(metamapping);
 	if (flush_all)
 		ret = filemap_fdatawait(metamapping);
 	if (ret)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 12/13] nfs: add support in nfs to handle multiple writeback contexts
       [not found]   ` <CGME20250529113306epcas5p3d10606ae4ea7c3491e93bde9ae408c9f@epcas5p3.samsung.com>
@ 2025-05-29 11:15     ` Kundan Kumar
  2025-06-02 14:22       ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:15 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

Fetch the writeback context to which an inode is affined and use it to
perform writeback-related operations.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/nfs/internal.h | 5 +++--
 fs/nfs/write.c    | 6 +++---
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index fd513bf9e875..a7cacaf484c9 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -838,14 +838,15 @@ static inline void nfs_folio_mark_unstable(struct folio *folio,
 {
 	if (folio && !cinfo->dreq) {
 		struct inode *inode = folio->mapping->host;
+		struct bdi_writeback_ctx *bdi_wb_ctx =
+						fetch_bdi_writeback_ctx(inode);
 		long nr = folio_nr_pages(folio);
 
 		/* This page is really still in write-back - just that the
 		 * writeback is happening on the server now.
 		 */
 		node_stat_mod_folio(folio, NR_WRITEBACK, nr);
-		wb_stat_mod(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb,
-			    WB_WRITEBACK, nr);
+		wb_stat_mod(&bdi_wb_ctx->wb, WB_WRITEBACK, nr);
 		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 	}
 }
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index ec48ec8c2db8..ca0823debce7 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -932,11 +932,11 @@ static void nfs_folio_clear_commit(struct folio *folio)
 {
 	if (folio) {
 		long nr = folio_nr_pages(folio);
-		struct inode *inode = folio->mapping->host;
+		struct bdi_writeback_ctx *bdi_wb_ctx =
+				fetch_bdi_writeback_ctx(folio->mapping->host);
 
 		node_stat_mod_folio(folio, NR_WRITEBACK, -nr);
-		wb_stat_mod(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb,
-			    WB_WRITEBACK, -nr);
+		wb_stat_mod(&bdi_wb_ctx->wb, WB_WRITEBACK, -nr);
 	}
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 13/13] writeback: set the num of writeback contexts to number of online cpus
       [not found]   ` <CGME20250529113311epcas5p3c8f1785b34680481e2126fda3ab51ad9@epcas5p3.samsung.com>
@ 2025-05-29 11:15     ` Kundan Kumar
  2025-06-03 14:36       ` kernel test robot
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:15 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

We create N writeback contexts, where N is the number of online CPUs.
Inodes get distributed across the different writeback contexts, enabling
parallel writeback.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 mm/backing-dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d416122e2914..55c07c9be4cd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1046,7 +1046,7 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100 * BDI_RATIO_SCALE;
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
-	bdi->nr_wb_ctx = 1;
+	bdi->nr_wb_ctx = num_online_cpus();
 	bdi->wb_ctx_arr = kcalloc(bdi->nr_wb_ctx,
 				  sizeof(struct bdi_writeback_ctx *),
 				  GFP_KERNEL);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread
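
Note that num_online_cpus() is sampled once here, at bdi_init() time, so
CPUs that come online later do not add contexts; a hotplug-aware variant
would have to resize wb_ctx_arr. The review discussion below also covers
picking a saner default and a sysfs override.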

* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-05-29 11:14 ` [PATCH 00/13] Parallelizing filesystem writeback Kundan Kumar
                     ` (12 preceding siblings ...)
       [not found]   ` <CGME20250529113311epcas5p3c8f1785b34680481e2126fda3ab51ad9@epcas5p3.samsung.com>
@ 2025-05-30  3:37   ` Andrew Morton
  2025-06-25 15:44     ` Kundan Kumar
  2025-06-02 14:19   ` Christoph Hellwig
  14 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2025-05-30  3:37 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev

On Thu, 29 May 2025 16:44:51 +0530 Kundan Kumar <kundan.kumar@samsung.com> wrote:

> Currently, pagecache writeback is performed by a single thread. Inodes
> are added to a dirty list, and delayed writeback is triggered. The single
> writeback thread then iterates through the dirty inode list, and executes
> the writeback.
> 
> This series parallelizes the writeback by allowing multiple writeback
> contexts per backing device (bdi). These writebacks contexts are executed
> as separate, independent threads, improving overall parallelism.
> 
> Would love to hear feedback in-order to move this effort forward.
> 
> Design Overview
> ================
> Following Jan Kara's suggestion [1], we have introduced a new bdi
> writeback context within the backing_dev_info structure. Specifically,
> we have created a new structure, bdi_writeback_context, which contains
> its own set of members for each writeback context.
> 
> struct bdi_writeback_ctx {
>         struct bdi_writeback wb;
>         struct list_head wb_list; /* list of all wbs */
>         struct radix_tree_root cgwb_tree;
>         struct rw_semaphore wb_switch_rwsem;
>         wait_queue_head_t wb_waitq;
> };
> 
> There can be multiple writeback contexts in a bdi, which helps in
> achieving writeback parallelism.
> 
> struct backing_dev_info {
> ...
>         int nr_wb_ctx;
>         struct bdi_writeback_ctx **wb_ctx_arr;

I don't think the "_arr" adds value. bdi->wb_contexts[i]?

> ...
> };
> 
> FS geometry and filesystem fragmentation
> ========================================
> The community was concerned that parallelizing writeback would impact
> delayed allocation and increase filesystem fragmentation.
> Our analysis of XFS delayed allocation behavior showed that merging of
> extents occurs within a specific inode. Earlier experiments with multiple
> writeback contexts [2] resulted in increased fragmentation due to the
> same inode being processed by different threads.
> 
> To address this, we now affine an inode to a specific writeback context
> ensuring that delayed allocation works effectively.
> 
> Number of writeback contexts
> ===========================
> The plan is to keep the nr_wb_ctx as 1, ensuring default single threaded
> behavior. However, we set the number of writeback contexts equal to
> number of CPUs in the current version.

Makes sense.  It would be good to test this on a non-SMP machine, if
you can find one ;)

> Later we will make it configurable
> using a mount option, allowing filesystems to choose the optimal number
> of writeback contexts.
> 
> IOPS and throughput
> ===================
> We see significant improvement in IOPS across several filesystem on both
> PMEM and NVMe devices.
> 
> Performance gains:
>   - On PMEM:
> 	Base XFS		: 544 MiB/s
> 	Parallel Writeback XFS	: 1015 MiB/s  (+86%)
> 	Base EXT4		: 536 MiB/s
> 	Parallel Writeback EXT4	: 1047 MiB/s  (+95%)
> 
>   - On NVMe:
> 	Base XFS		: 651 MiB/s
> 	Parallel Writeback XFS	: 808 MiB/s  (+24%)
> 	Base EXT4		: 494 MiB/s
> 	Parallel Writeback EXT4	: 797 MiB/s  (+61%)
> 
> We also see that there is no increase in filesystem fragmentation
> # of extents:
>   - On XFS (on PMEM):
> 	Base XFS		: 1964
> 	Parallel Writeback XFS	: 1384
> 
>   - On EXT4 (on PMEM):
> 	Base EXT4		: 21
> 	Parallel Writeback EXT4	: 11

Please test the performance on spinning disks, and with more filesystems?



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-05-29 11:14 ` [PATCH 00/13] Parallelizing filesystem writeback Kundan Kumar
                     ` (13 preceding siblings ...)
  2025-05-30  3:37   ` [PATCH 00/13] Parallelizing filesystem writeback Andrew Morton
@ 2025-06-02 14:19   ` Christoph Hellwig
  2025-06-03  9:16     ` Anuj Gupta/Anuj Gupta
  14 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-02 14:19 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev

On Thu, May 29, 2025 at 04:44:51PM +0530, Kundan Kumar wrote:
> Number of writeback contexts
> ===========================
> The plan is to keep the nr_wb_ctx as 1, ensuring default single threaded
> behavior. However, we set the number of writeback contexts equal to
> number of CPUs in the current version. Later we will make it configurable
> using a mount option, allowing filesystems to choose the optimal number
> of writeback contexts.

Well, the proper thing would be to figure out a good default and not
just keep things as-is, no?

> IOPS and throughput
> ===================
> We see significant improvement in IOPS across several filesystem on both
> PMEM and NVMe devices.
> 
> Performance gains:
>   - On PMEM:
> 	Base XFS		: 544 MiB/s
> 	Parallel Writeback XFS	: 1015 MiB/s  (+86%)
> 	Base EXT4		: 536 MiB/s
> 	Parallel Writeback EXT4	: 1047 MiB/s  (+95%)
> 
>   - On NVMe:
> 	Base XFS		: 651 MiB/s
> 	Parallel Writeback XFS	: 808 MiB/s  (+24%)
> 	Base EXT4		: 494 MiB/s
> 	Parallel Writeback EXT4	: 797 MiB/s  (+61%)

What workload was this?

How many CPU cores did the system have, how many AGs/BGs did the file
systems have?   What SSD/Pmem was this?  Did this change the write
amp as measured by the media writes on the NVMe SSD?

Also I'd be really curious to see numbers on hard drives.

> We also see that there is no increase in filesystem fragmentation
> # of extents:
>   - On XFS (on PMEM):
> 	Base XFS		: 1964
> 	Parallel Writeback XFS	: 1384
> 
>   - On EXT4 (on PMEM):
> 	Base EXT4		: 21
> 	Parallel Writeback EXT4	: 11

How were the numbers of extents counted, given that they look so wildly
different?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/13] f2fs: add support in f2fs to handle multiple writeback contexts
  2025-05-29 11:15     ` [PATCH 09/13] f2fs: add support in f2fs to handle multiple writeback contexts Kundan Kumar
@ 2025-06-02 14:20       ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-02 14:20 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Anuj Gupta

>  	} else if (type == DIRTY_DENTS) {
> -		if (sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded)
> -			return false;
> +		for_each_bdi_wb_ctx(sbi->sb->s_bdi, bdi_wb_ctx)
> +			if (bdi_wb_ctx->wb.dirty_exceeded)
> +				return false;

I think we need to figure out what the dirty_exceeded here and in
the other places in f2fs and gfs2 is trying to do and factor that into
well-documented core helpers instead of adding these loops in places
that should not really poke into writeback internals.


^ permalink raw reply	[flat|nested] 40+ messages in thread
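
A possible shape for such a core helper; the name and placement below
are guesses, not part of the posted series:

/* include/linux/backing-dev.h (hypothetical placement) */
static inline bool bdi_dirty_exceeded(struct backing_dev_info *bdi)
{
	struct bdi_writeback_ctx *bdi_wb_ctx;

	/* true if any writeback context has exceeded its dirty limits */
	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx)
		if (bdi_wb_ctx->wb.dirty_exceeded)
			return true;
	return false;
}

The f2fs checks above and gfs2's gfs2_write_inode() could then call this
instead of open-coding the loop.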

* Re: [PATCH 10/13] fuse: add support for multiple writeback contexts in fuse
  2025-05-29 11:15     ` [PATCH 10/13] fuse: add support for multiple writeback contexts in fuse Kundan Kumar
@ 2025-06-02 14:21       ` Christoph Hellwig
  2025-06-02 15:50         ` Bernd Schubert
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-02 14:21 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Anuj Gupta

>  static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio)
>  {
> -	struct backing_dev_info *bdi = inode_to_bdi(inode);
> +	struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode);
>  
> -	dec_wb_stat(&bdi->wb_ctx_arr[0]->wb, WB_WRITEBACK);
> +	dec_wb_stat(&bdi_wb_ctx->wb, WB_WRITEBACK);
>  	node_stat_sub_folio(folio, NR_WRITEBACK_TEMP);
> -	wb_writeout_inc(&bdi->wb_ctx_arr[0]->wb);
> +	wb_writeout_inc(&bdi_wb_ctx->wb);
>  }

There's nothing fuse-specific here except that nothing but fuse uses
NR_WRITEBACK_TEMP.  Can we try to move this into the core first so that
the patches don't have to touch file system code?

> -	inc_wb_stat(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb, WB_WRITEBACK);
> +	inc_wb_stat(&bdi_wb_ctx->wb, WB_WRITEBACK);
>  	node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP);

Same here.  One pattern is that fuse and nfs both touch the node stat
and the wb stat, and having a common helper doing both would probably
also be very helpful.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 12/13] nfs: add support in nfs to handle multiple writeback contexts
  2025-05-29 11:15     ` [PATCH 12/13] nfs: add support in nfs " Kundan Kumar
@ 2025-06-02 14:22       ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-02 14:22 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Anuj Gupta

On Thu, May 29, 2025 at 04:45:03PM +0530, Kundan Kumar wrote:
>  	if (folio && !cinfo->dreq) {
>  		struct inode *inode = folio->mapping->host;
> +		struct bdi_writeback_ctx *bdi_wb_ctx =
> +						fetch_bdi_writeback_ctx(inode);
>  		long nr = folio_nr_pages(folio);
>  
>  		/* This page is really still in write-back - just that the
>  		 * writeback is happening on the server now.
>  		 */
>  		node_stat_mod_folio(folio, NR_WRITEBACK, nr);
> -		wb_stat_mod(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb,
> -			    WB_WRITEBACK, nr);
> +		wb_stat_mod(&bdi_wb_ctx->wb, WB_WRITEBACK, nr);

Similar comments to fuse here as well, except that nfs also really
should be using the node stat helpers automatically counting the
numbers of pages in a folio instead of duplicating the logic.


^ permalink raw reply	[flat|nested] 40+ messages in thread
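
A possible shape for the common helper suggested in this thread; the
name and the signed-count convention are illustrative only:

static inline void wb_node_stat_mod_folio(struct bdi_writeback *wb,
					  struct folio *folio, long sign)
{
	long nr = sign * folio_nr_pages(folio);

	/* keep the node-level and per-wb writeback counters in step */
	node_stat_mod_folio(folio, NR_WRITEBACK, nr);
	wb_stat_mod(wb, WB_WRITEBACK, nr);
}

nfs_folio_mark_unstable() would pass +1 and nfs_folio_clear_commit()
would pass -1; the fuse paths could share it once NR_WRITEBACK_TEMP is
gone (see the follow-up on patch 10/13 below).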

* Re: [PATCH 04/13] writeback: affine inode to a writeback ctx within a bdi
  2025-05-29 11:14     ` [PATCH 04/13] writeback: affine inode to a writeback ctx within a bdi Kundan Kumar
@ 2025-06-02 14:24       ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-02 14:24 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Anuj Gupta

On Thu, May 29, 2025 at 04:44:55PM +0530, Kundan Kumar wrote:
> @@ -157,7 +157,7 @@ fetch_bdi_writeback_ctx(struct inode *inode)
>  {
>  	struct backing_dev_info *bdi = inode_to_bdi(inode);
>  
> -	return bdi->wb_ctx_arr[0];
> +	return bdi->wb_ctx_arr[inode->i_ino % bdi->nr_wb_ctx];

Most modern file systems use 64-bit inode numbers, while i_ino sadly
is still only an ino_t that can be 32 bits wide.  So we'll either need
an ugly fs hook here, or maybe convince Linus that it finally is time
for a 64-bit i_ino (which would also clean up a lot of mess in the
file systems and this constant source of confusion).


^ permalink raw reply	[flat|nested] 40+ messages in thread
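
Until that is resolved, one way to limit the damage of a 32-bit i_ino is
to hash it rather than take a raw modulo. The sketch below reuses the
series' fetch_bdi_writeback_ctx() name; the hashing choice is an
illustration only:

#include <linux/hash.h>

static inline struct bdi_writeback_ctx *
fetch_bdi_writeback_ctx(struct inode *inode)
{
	struct backing_dev_info *bdi = inode_to_bdi(inode);
	/* hash_64() spreads clustered inode numbers across contexts;
	 * an fs hook could feed a full 64-bit identity here instead. */
	u32 idx = hash_64(inode->i_ino, 32) % bdi->nr_wb_ctx;

	return bdi->wb_ctx_arr[idx];
}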

* Re: [PATCH 10/13] fuse: add support for multiple writeback contexts in fuse
  2025-06-02 14:21       ` Christoph Hellwig
@ 2025-06-02 15:50         ` Bernd Schubert
  2025-06-02 15:55           ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Bernd Schubert @ 2025-06-02 15:50 UTC (permalink / raw)
  To: Christoph Hellwig, Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe,
	ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Anuj Gupta,
	Joanne Koong



On 6/2/25 16:21, Christoph Hellwig wrote:
>>  static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio)
>>  {
>> -	struct backing_dev_info *bdi = inode_to_bdi(inode);
>> +	struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode);
>>  
>> -	dec_wb_stat(&bdi->wb_ctx_arr[0]->wb, WB_WRITEBACK);
>> +	dec_wb_stat(&bdi_wb_ctx->wb, WB_WRITEBACK);
>>  	node_stat_sub_folio(folio, NR_WRITEBACK_TEMP);
>> -	wb_writeout_inc(&bdi->wb_ctx_arr[0]->wb);
>> +	wb_writeout_inc(&bdi_wb_ctx->wb);
>>  }
> 
> There's nothing fuse-specific here except that nothing but fuse uses
> NR_WRITEBACK_TEMP.  Can we try to move this into the core first so that
> the patches don't have to touch file system code?
> 
>> -	inc_wb_stat(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb, WB_WRITEBACK);
>> +	inc_wb_stat(&bdi_wb_ctx->wb, WB_WRITEBACK);
>>  	node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP);
> 
> > Same here.  One pattern is that fuse and nfs both touch the node stat
> > and the wb stat, and having a common helper doing both would probably
> also be very helpful.
> 
> 

Note that Miklos' PR for 6.16 removes NR_WRITEBACK_TEMP through
patches from Joanne, i.e. only 

dec_wb_stat(&bdi->wb, WB_WRITEBACK);

is left over.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/13] fuse: add support for multiple writeback contexts in fuse
  2025-06-02 15:50         ` Bernd Schubert
@ 2025-06-02 15:55           ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-02 15:55 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Christoph Hellwig, Kundan Kumar, jaegeuk, chao, viro, brauner,
	jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm,
	david, amir73il, axboe, ritesh.list, djwong, dave, p.raghav,
	da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs,
	linux-mm, gost.dev, Anuj Gupta, Joanne Koong

On Mon, Jun 02, 2025 at 05:50:48PM +0200, Bernd Schubert wrote:
> > Same here.  One pattern is that fuse and nfs both touch the node stat
> > and the wb stat, and having a common helper doing both would probably
> > also be very helpful.
> > 
> > 
> 
> Note that Miklos' PR for 6.16 removes NR_WRITEBACK_TEMP through
> patches from Joanne, i.e. only 
> 
> dec_wb_stat(&bdi->wb, WB_WRITEBACK);
> 
> is left over.

That'll make it even easier to consolidate with the NFS version.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-02 14:19   ` Christoph Hellwig
@ 2025-06-03  9:16     ` Anuj Gupta/Anuj Gupta
  2025-06-03 13:24       ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Anuj Gupta/Anuj Gupta @ 2025-06-03  9:16 UTC (permalink / raw)
  To: Christoph Hellwig, Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe,
	ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, anuj1072538,
	kundanthebest

On 6/2/2025 7:49 PM, Christoph Hellwig wrote:
> On Thu, May 29, 2025 at 04:44:51PM +0530, Kundan Kumar wrote:
> Well, the proper thing would be to figure out a good default and not
> just keep things as-is, no?

We observed that some filesystems, such as Btrfs, don't benefit from
this infra due to their distinct writeback architecture. To preserve
current behavior and avoid unintended changes for such filesystems,
we have kept nr_wb_ctx=1 as the default. Filesystems that can take
advantage of parallel writeback (xfs, ext4) can opt in via a mount
option. Also, we wanted to reduce risk during initial integration and
hence kept it as opt-in.

> 
>> IOPS and throughput
>> ===================
>> We see significant improvement in IOPS across several filesystem on both
>> PMEM and NVMe devices.
>>
>> Performance gains:
>>    - On PMEM:
>> 	Base XFS		: 544 MiB/s
>> 	Parallel Writeback XFS	: 1015 MiB/s  (+86%)
>> 	Base EXT4		: 536 MiB/s
>> 	Parallel Writeback EXT4	: 1047 MiB/s  (+95%)
>>
>>    - On NVMe:
>> 	Base XFS		: 651 MiB/s
>> 	Parallel Writeback XFS	: 808 MiB/s  (+24%)
>> 	Base EXT4		: 494 MiB/s
>> 	Parallel Writeback EXT4	: 797 MiB/s  (+61%)
> 
> What workload was this?

Number of CPUs = 12
System RAM = 16G
For XFS number of AGs = 4
For EXT4 BG count = 28616
Used PMEM of 6G and NVMe SSD of 3.84 TB

fio command line :
fio --directory=/mnt --name=test --bs=4k --iodepth=1024 --rw=randwrite 
--ioengine=io_uring --time_based=1 -runtime=60 --numjobs=12 --size=450M 
--direct=0  --eta-interval=1 --eta-newline=1 --group_reporting

Will measure the write-amp and share.
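
(One way to measure it, assuming a drive that reports the standard NVMe
SMART counters, is to sample "Data Units Written" before and after the
run, e.g.:

  nvme smart-log /dev/nvme0n1 | grep 'Data Units Written'

where each unit is 1000 sectors of 512 bytes, and compare the delta
against the bytes fio reports as written.)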

> 
> How many CPU cores did the system have, how many AGs/BGs did the file
> systems have?   What SSD/Pmem was this?  Did this change the write
> amp as measured by the media writes on the NVMe SSD?
> 
> Also I'd be really curious to see numbers on hard drives.
> 
>> We also see that there is no increase in filesystem fragmentation
>> # of extents:
>>    - On XFS (on PMEM):
>> 	Base XFS		: 1964
>> 	Parallel Writeback XFS	: 1384
>>
>>    - On EXT4 (on PMEM):
>> 	Base EXT4		: 21
>> 	Parallel Writeback EXT4	: 11
> 
> How were the numbers of extents counted, given that they look so wildly
> different?
> 
> 

Issued a 1G random write using fio with fallocate=none and then
measured the number of extents after a delay of 30 secs:
fio --filename=/mnt/testfile --name=test --bs=4k --iodepth=1024 
--rw=randwrite --ioengine=io_uring  --fallocate=none --numjobs=1 
--size=1G --direct=0 --eta-interval=1 --eta-newline=1 --group_reporting

For xfs used this command:
xfs_io -c "stat" /mnt/testfile

And for ext4 used this:
filefrag /mnt/testfile

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-03  9:16     ` Anuj Gupta/Anuj Gupta
@ 2025-06-03 13:24       ` Christoph Hellwig
  2025-06-03 13:52         ` Anuj gupta
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-03 13:24 UTC (permalink / raw)
  To: Anuj Gupta/Anuj Gupta
  Cc: Christoph Hellwig, Kundan Kumar, jaegeuk, chao, viro, brauner,
	jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm,
	david, amir73il, axboe, ritesh.list, djwong, dave, p.raghav,
	da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs,
	linux-mm, gost.dev, anuj1072538, kundanthebest

On Tue, Jun 03, 2025 at 02:46:20PM +0530, Anuj Gupta/Anuj Gupta wrote:
> On 6/2/2025 7:49 PM, Christoph Hellwig wrote:
> > On Thu, May 29, 2025 at 04:44:51PM +0530, Kundan Kumar wrote:
> > Well, the proper thing would be to figure out a good default and not
> > just keep things as-is, no?
> 
> We observed that some filesystems, such as Btrfs, don't benefit from
> this infra due to their distinct writeback architecture. To preserve
> current behavior and avoid unintended changes for such filesystems,
> we have kept nr_wb_ctx=1 as the default. Filesystems that can take
> advantage of parallel writeback (xfs, ext4) can opt in via a mount
> option. Also, we wanted to reduce risk during initial integration and
> hence kept it as opt-in.

A mount option is about the worst possible interface for behavior
that depends on file system implementation and possibly hardware
characteristics.  This needs to be set by the file systems, possibly
using generic helpers based on hardware information.

> Used PMEM of 6G

battery/capacitor backed DRAM, or optane?

>
> and NVMe SSD of 3.84 TB

Consumer drive, enterprise drive?

> For xfs used this command:
> xfs_io -c "stat" /mnt/testfile
> And for ext4 used this:
> filefrag /mnt/testfile

filefrag merges contiguous extents, and only counts up for discontiguous
mappings, while fsxattr.nextents counts all extents even if they are
contiguous.  So you probably want to use filefrag for both cases.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-03 13:24       ` Christoph Hellwig
@ 2025-06-03 13:52         ` Anuj gupta
  2025-06-03 14:04           ` Christoph Hellwig
  2025-06-04  9:22           ` Kundan Kumar
  0 siblings, 2 replies; 40+ messages in thread
From: Anuj gupta @ 2025-06-03 13:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anuj Gupta/Anuj Gupta, Kundan Kumar, jaegeuk, chao, viro, brauner,
	jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm,
	david, amir73il, axboe, ritesh.list, djwong, dave, p.raghav,
	da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs,
	linux-mm, gost.dev, kundanthebest

On Tue, Jun 3, 2025 at 6:54 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Jun 03, 2025 at 02:46:20PM +0530, Anuj Gupta/Anuj Gupta wrote:
> > On 6/2/2025 7:49 PM, Christoph Hellwig wrote:
> > > On Thu, May 29, 2025 at 04:44:51PM +0530, Kundan Kumar wrote:
> > > Well, the proper thing would be to figure out a good default and not
> > > just keep things as-is, no?
> >
> > We observed that some filesystems, such as Btrfs, don't benefit from
> > this infra due to their distinct writeback architecture. To preserve
> > current behavior and avoid unintended changes for such filesystems,
> > we have kept nr_wb_ctx=1 as the default. Filesystems that can take
> > advantage of parallel writeback (xfs, ext4) can opt in via a mount
> > option. Also, we wanted to reduce risk during initial integration and
> > hence kept it as opt-in.
>
> A mount option is about the worst possible interface for behavior
> that depends on file system implementation and possibly hardware
> characteristics.  This needs to be set by the file systems, possibly
> using generic helpers based on hardware information.

Right, that makes sense. Instead of using a mount option, we can
introduce generic helpers to initialize multiple writeback contexts
based on underlying hardware characteristics — e.g., number of CPUs or
NUMA topology. Filesystems like XFS and EXT4 can then call these helpers
during mount to opt into parallel writeback in a controlled way.
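
Something along these lines, perhaps; the helper name and the cap are
invented for illustration:

static inline unsigned int bdi_default_nr_wb_ctx(void)
{
	/* one context per CPU, capped so huge machines do not spawn
	 * an excessive number of flusher threads */
	return min_t(unsigned int, num_online_cpus(), 16);
}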

>
> > Used PMEM of 6G
>
> battery/capacitor backed DRAM, or optane?

We emulated PMEM using DRAM by following the steps here:
https://www.intel.com/content/www/us/en/developer/articles/training/how-to-emulate-persistent-memory-on-an-intel-architecture-server.html

>
> >
> > and NVMe SSD of 3.84 TB
>
> Consumer drive, enterprise drive?

It's an enterprise-grade drive — Samsung PM1733

>
> > For xfs used this command:
> > xfs_io -c "stat" /mnt/testfile
> > And for ext4 used this:
> > filefrag /mnt/testfile
>
> filefrag merges contiguous extents, and only counts up for discontiguous
> mappings, while fsxattr.nextents counts all extents even if they are
> contiguous.  So you probably want to use filefrag for both cases.

Got it — thanks for the clarification. We'll switch to using filefrag
and will share updated extent count numbers accordingly.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-03 13:52         ` Anuj gupta
@ 2025-06-03 14:04           ` Christoph Hellwig
  2025-06-03 14:05             ` Christoph Hellwig
  2025-06-04  9:22           ` Kundan Kumar
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-03 14:04 UTC (permalink / raw)
  To: Anuj gupta
  Cc: Christoph Hellwig, Anuj Gupta/Anuj Gupta, Kundan Kumar, jaegeuk,
	chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm,
	willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, djwong,
	dave, p.raghav, da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2,
	linux-nfs, linux-mm, gost.dev, kundanthebest

On Tue, Jun 03, 2025 at 07:22:18PM +0530, Anuj gupta wrote:
> > A mount option is about the worst possible interface for behavior
> > that depends on file system implementation and possibly hardware
> > characteristics.  This needs to be set by the file systems, possibly
> > using generic helpers based on hardware information.
> 
> Right, that makes sense. Instead of using a mount option, we can
> introduce generic helpers to initialize multiple writeback contexts
> based on underlying hardware characteristics — e.g., number of CPUs or
> NUMA topology. Filesystems like XFS and EXT4 can then call these helpers
> during mount to opt into parallel writeback in a controlled way.

Yes.  A mount option might still be useful to override this default,
but it should not be needed for the normal use case.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-03 14:04           ` Christoph Hellwig
@ 2025-06-03 14:05             ` Christoph Hellwig
  2025-06-06  5:04               ` Kundan Kumar
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-03 14:05 UTC (permalink / raw)
  To: Anuj gupta
  Cc: Christoph Hellwig, Anuj Gupta/Anuj Gupta, Kundan Kumar, jaegeuk,
	chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm,
	willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, djwong,
	dave, p.raghav, da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2,
	linux-nfs, linux-mm, gost.dev, kundanthebest

On Tue, Jun 03, 2025 at 04:04:45PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 03, 2025 at 07:22:18PM +0530, Anuj gupta wrote:
> > > A mount option is about the worst possible interface for behavior
> > > that depends on file system implementation and possibly hardware
> > > characteristics.  This needs to be set by the file systems, possibly
> > > using generic helpers based on hardware information.
> > 
> > Right, that makes sense. Instead of using a mount option, we can
> > introduce generic helpers to initialize multiple writeback contexts
> > based on underlying hardware characteristics — e.g., number of CPUs or
> > NUMA topology. Filesystems like XFS and EXT4 can then call these helpers
> > during mount to opt into parallel writeback in a controlled way.
> 
> Yes.  A mount option might still be useful to override this default,
> but it should not be needed for the normal use case.

.. actually a sysfs file on the bdi is probably the better interface
for the override than a mount option.

^ permalink raw reply	[flat|nested] 40+ messages in thread
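
Following the read_ahead_kb attribute pattern in mm/backing-dev.c, such
an override could start out read-only; the attribute name below is made
up, and a writable variant would additionally have to quiesce writeback
and reallocate the context array:

static ssize_t nr_wb_ctx_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	struct backing_dev_info *bdi = dev_get_drvdata(dev);

	return sysfs_emit(buf, "%d\n", bdi->nr_wb_ctx);
}
static DEVICE_ATTR_RO(nr_wb_ctx);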

* Re: [PATCH 13/13] writeback: set the num of writeback contexts to number of online cpus
  2025-05-29 11:15     ` [PATCH 13/13] writeback: set the num of writeback contexts to number of online cpus Kundan Kumar
@ 2025-06-03 14:36       ` kernel test robot
  0 siblings, 0 replies; 40+ messages in thread
From: kernel test robot @ 2025-06-03 14:36 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: oe-lkp, lkp, Anuj Gupta, linux-mm, jaegeuk, chao, viro, brauner,
	jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm,
	david, amir73il, axboe, hch, ritesh.list, djwong, dave, p.raghav,
	da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs,
	gost.dev, Kundan Kumar, oliver.sang



Hello,

kernel test robot noticed a 53.9% improvement of fsmark.files_per_sec on:


commit: 2850eee23dbc4ff9878d88625b1f84965eefcce6 ("[PATCH 13/13] writeback: set the num of writeback contexts to number of online cpus")
url: https://github.com/intel-lab-lkp/linux/commits/Kundan-Kumar/writeback-add-infra-for-parallel-writeback/20250529-193523
base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/all/20250529111504.89912-14-kundan.kumar@samsung.com/
patch subject: [PATCH 13/13] writeback: set the num of writeback contexts to number of online cpus

testcase: fsmark
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz (Cascade Lake) with 176G memory
parameters:

	iterations: 1x
	nr_threads: 32t
	disk: 1SSD
	fs: ext4
	filesize: 16MB
	test_size: 60G
	sync_method: NoSync
	nr_directories: 16d
	nr_files_per_directory: 256fpd
	cpufreq_governor: performance


In addition to that, the commit also has significant impact on the following tests:

+------------------+------------------------------------------------------------------------------------------------+
| testcase: change | filebench: filebench.sum_operations/s 4.3% improvement                                         |
| test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory |
| test parameters  | cpufreq_governor=performance                                                                   |
|                  | disk=1HDD                                                                                      |
|                  | fs=xfs                                                                                         |
|                  | test=fivestreamwrite.f                                                                         |
+------------------+------------------------------------------------------------------------------------------------+




Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250603/202506032246.89ddc1a2-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/disk/filesize/fs/iterations/kconfig/nr_directories/nr_files_per_directory/nr_threads/rootfs/sync_method/tbox_group/test_size/testcase:
  gcc-12/performance/1SSD/16MB/ext4/1x/x86_64-rhel-9.4/16d/256fpd/32t/debian-12-x86_64-20240206.cgz/NoSync/lkp-csl-2sp10/60G/fsmark

commit: 
  a2dadb7ea8 ("nfs: add support in nfs to handle multiple writeback contexts")
  2850eee23d ("writeback: set the num of writeback contexts to number of online cpus")

a2dadb7ea862d5c1 2850eee23dbc4ff9878d88625b1 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
   1641480           +13.3%    1860148 ±  2%  cpuidle..usage
    302.00 ±  8%     +14.1%     344.67 ±  7%  perf-c2c.HITM.remote
     24963 ±  4%     -13.3%      21647 ±  6%  uptime.idle
     91.64           -22.2%      71.26 ±  7%  iostat.cpu.idle
      7.34 ±  4%    +275.7%      27.59 ± 19%  iostat.cpu.iowait
      0.46 ±141%      +2.2        2.63 ± 66%  perf-profile.calltrace.cycles-pp.setlocale
      0.46 ±141%      +2.2        2.63 ± 66%  perf-profile.children.cycles-pp.setlocale
    194019            -7.5%     179552        fsmark.app_overhead
    108.10 ±  8%     +53.9%     166.40 ± 10%  fsmark.files_per_sec
     43295 ±  7%     +35.8%      58787 ±  5%  fsmark.time.voluntary_context_switches
  19970922           -10.5%   17867270 ±  2%  meminfo.Dirty
    493817           +13.1%     558422        meminfo.SUnreclaim
    141708         +1439.0%    2180863 ± 14%  meminfo.Writeback
   4993428           -10.3%    4480219 ±  2%  proc-vmstat.nr_dirty
     34285            +5.8%      36262        proc-vmstat.nr_kernel_stack
    123504           +13.1%     139636        proc-vmstat.nr_slab_unreclaimable
     36381 ±  4%   +1403.0%     546810 ± 14%  proc-vmstat.nr_writeback
     91.54           -22.1%      71.32 ±  7%  vmstat.cpu.id
      7.22 ±  4%    +280.4%      27.47 ± 19%  vmstat.cpu.wa
     14.58 ±  4%    +537.4%      92.92 ±  8%  vmstat.procs.b
      6140 ±  2%     +90.1%      11673 ±  9%  vmstat.system.cs
     91.46           -20.9       70.56 ±  7%  mpstat.cpu.all.idle%
      7.52 ±  4%     +20.8       28.29 ± 19%  mpstat.cpu.all.iowait%
      0.12 ±  7%      +0.0        0.14 ±  7%  mpstat.cpu.all.irq%
      0.35 ±  6%      +0.1        0.43 ±  5%  mpstat.cpu.all.sys%
     11.24 ±  8%     +20.7%      13.56 ±  3%  mpstat.max_utilization_pct
     34947 ±  5%     +14.3%      39928 ±  4%  numa-vmstat.node0.nr_slab_unreclaimable
      9001 ± 14%   +1553.7%     148860 ± 19%  numa-vmstat.node0.nr_writeback
   1329092 ±  4%     -20.9%    1051569 ±  9%  numa-vmstat.node1.nr_dirty
     10019 ±  7%   +1490.0%     159311 ± 14%  numa-vmstat.node1.nr_writeback
   2808522 ±  8%     -17.7%    2310216 ±  2%  numa-vmstat.node2.nr_file_pages
   2638799 ±  3%     -12.7%    2304024 ±  2%  numa-vmstat.node2.nr_inactive_file
      7810 ±  9%   +1035.8%      88707 ± 16%  numa-vmstat.node2.nr_writeback
   2638797 ±  3%     -12.7%    2304025 ±  2%  numa-vmstat.node2.nr_zone_inactive_file
     29952 ±  3%     +13.4%      33964 ±  4%  numa-vmstat.node3.nr_slab_unreclaimable
     10686 ±  9%   +1351.3%     155091 ± 12%  numa-vmstat.node3.nr_writeback
    139656 ±  5%     +14.2%     159539 ±  4%  numa-meminfo.node0.SUnreclaim
     35586 ± 13%   +1565.8%     592799 ± 18%  numa-meminfo.node0.Writeback
   5304285 ±  4%     -20.8%    4198452 ± 10%  numa-meminfo.node1.Dirty
     40011 ±  5%   +1484.2%     633862 ± 14%  numa-meminfo.node1.Writeback
  11211668 ±  7%     -17.7%    9222157 ±  2%  numa-meminfo.node2.FilePages
  10532776 ±  3%     -12.7%    9197387 ±  2%  numa-meminfo.node2.Inactive
  10532776 ±  3%     -12.7%    9197387 ±  2%  numa-meminfo.node2.Inactive(file)
  12378624 ±  7%     -15.0%   10520827 ±  2%  numa-meminfo.node2.MemUsed
     29574 ±  9%   +1087.0%     351055 ± 16%  numa-meminfo.node2.Writeback
    119679 ±  3%     +13.4%     135718 ±  4%  numa-meminfo.node3.SUnreclaim
     41446 ± 10%   +1380.1%     613443 ± 11%  numa-meminfo.node3.Writeback
     23.38 ±  2%      -6.7       16.72        perf-stat.i.cache-miss-rate%
  38590732 ±  3%     +53.9%   59394561 ±  5%  perf-stat.i.cache-references
      5973 ±  2%     +96.3%      11729 ± 10%  perf-stat.i.context-switches
      0.92            +7.4%       0.99        perf-stat.i.cpi
 7.023e+09 ±  3%     +12.5%  7.898e+09 ±  4%  perf-stat.i.cpu-cycles
    237.41 ±  2%    +393.6%       1171 ± 20%  perf-stat.i.cpu-migrations
      1035 ±  3%      +9.3%       1132 ±  4%  perf-stat.i.cycles-between-cache-misses
      1.15            -5.7%       1.09        perf-stat.i.ipc
     25.41 ±  2%      -7.4       18.03 ±  2%  perf-stat.overall.cache-miss-rate%
      0.94            +7.2%       1.01        perf-stat.overall.cpi
      1.06            -6.8%       0.99        perf-stat.overall.ipc
  38042801 ±  3%     +54.0%   58576659 ±  5%  perf-stat.ps.cache-references
      5897 ±  2%     +96.3%      11577 ± 10%  perf-stat.ps.context-switches
 6.925e+09 ±  3%     +12.4%  7.787e+09 ±  4%  perf-stat.ps.cpu-cycles
    234.11 ±  2%    +394.2%       1156 ± 20%  perf-stat.ps.cpu-migrations
 5.892e+11            +1.4%  5.977e+11        perf-stat.total.instructions
      0.08 ±  6%     -36.2%       0.05 ± 33%  perf-sched.sch_delay.avg.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_noprof.__filemap_get_folio
      0.01 ± 73%    +159.8%       0.04 ± 13%  perf-sched.sch_delay.avg.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
      0.07 ± 10%     -33.0%       0.05 ± 44%  perf-sched.sch_delay.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
      0.01 ± 73%    +695.6%       0.09 ± 25%  perf-sched.sch_delay.avg.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio
      0.03 ± 30%     -45.4%       0.02 ± 24%  perf-sched.sch_delay.avg.ms.irqentry_exit_to_user_mode.asm_sysvec_reschedule_ipi.[unknown]
      0.06 ± 21%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.kjournald2.kthread.ret_from_fork.ret_from_fork_asm
      0.05 ± 45%    +305.0%       0.19 ± 66%  perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      0.01 ±210%   +8303.3%       0.85 ±183%  perf-sched.sch_delay.max.ms.__cond_resched.down_write.mpage_map_and_submit_extent.ext4_do_writepages.ext4_writepages
      0.05 ± 49%     +92.4%       0.10 ± 35%  perf-sched.sch_delay.max.ms.__cond_resched.process_one_work.worker_thread.kthread.ret_from_fork
      0.03 ± 76%    +165.2%       0.07 ± 28%  perf-sched.sch_delay.max.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
      0.07 ± 46%  +83423.5%      59.86 ±146%  perf-sched.sch_delay.max.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio
      0.06 ± 21%    -100.0%       0.00        perf-sched.sch_delay.max.ms.kjournald2.kthread.ret_from_fork.ret_from_fork_asm
      0.09 ±  7%     +19.7%       0.10 ±  8%  perf-sched.sch_delay.max.ms.rcu_gp_kthread.kthread.ret_from_fork.ret_from_fork_asm
      0.10 ± 13%   +3587.7%       3.60 ±172%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
     18911          +109.0%      39526 ± 25%  perf-sched.total_wait_and_delay.count.ms
      4.42 ± 25%     +77.8%       7.86 ± 15%  perf-sched.wait_and_delay.avg.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
    150.39 ±  6%     -14.9%     127.97 ±  8%  perf-sched.wait_and_delay.avg.ms.anon_pipe_read.fifo_pipe_read.vfs_read.ksys_read
      8.03 ± 89%     -74.5%       2.05 ±143%  perf-sched.wait_and_delay.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.24 ±  8%   +2018.4%      26.26 ± 26%  perf-sched.wait_and_delay.avg.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio
      0.83 ±  2%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
     86.14 ±  9%     +33.0%     114.52 ±  7%  perf-sched.wait_and_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      1047 ±  6%      -8.2%     960.83        perf-sched.wait_and_delay.count.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
    171.50 ±  7%     -69.1%      53.00 ±141%  perf-sched.wait_and_delay.count.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
    162.50 ±  7%     -69.3%      49.83 ±141%  perf-sched.wait_and_delay.count.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
      3635 ±  6%    +451.1%      20036 ± 35%  perf-sched.wait_and_delay.count.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio
     26.33 ±  5%     -12.7%      23.00 ±  2%  perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_poll
    116.17          -100.0%       0.00        perf-sched.wait_and_delay.count.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      3938 ± 21%     -66.2%       1332 ± 60%  perf-sched.wait_and_delay.count.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
      4831 ±  5%    +102.8%       9799 ± 23%  perf-sched.wait_and_delay.count.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
     87.47 ± 20%    +823.3%     807.60 ±141%  perf-sched.wait_and_delay.max.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio
      2.81 ±  4%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
     77.99 ± 13%   +2082.7%       1702 ± 73%  perf-sched.wait_and_delay.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
    396.00 ± 16%     -34.0%     261.22 ± 33%  perf-sched.wait_and_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      3.89 ± 18%    +100.6%       7.81 ± 15%  perf-sched.wait_time.avg.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
      0.13 ±157%  +2.1e+05%     271.76 ±126%  perf-sched.wait_time.avg.ms.__cond_resched.down_write.mpage_map_and_submit_extent.ext4_do_writepages.ext4_writepages
    150.37 ±  6%     -15.2%     127.52 ±  8%  perf-sched.wait_time.avg.ms.anon_pipe_read.fifo_pipe_read.vfs_read.ksys_read
      1.23 ±  8%   +2030.8%      26.17 ± 26%  perf-sched.wait_time.avg.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio
     31.17 ± 50%    -100.0%       0.00        perf-sched.wait_time.avg.ms.kjournald2.kthread.ret_from_fork.ret_from_fork_asm
     86.09 ±  9%     +32.8%     114.34 ±  7%  perf-sched.wait_time.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      1086 ± 17%    +309.4%       4449 ± 26%  perf-sched.wait_time.max.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
      0.17 ±142%  +7.2e+05%       1196 ±119%  perf-sched.wait_time.max.ms.__cond_resched.down_write.mpage_map_and_submit_extent.ext4_do_writepages.ext4_writepages
      7.27 ± 45%   +1259.6%      98.80 ±139%  perf-sched.wait_time.max.ms.__cond_resched.process_one_work.worker_thread.kthread.ret_from_fork
    262.77 ±113%    +316.1%       1093 ± 52%  perf-sched.wait_time.max.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
     87.45 ± 20%    +823.5%     807.53 ±141%  perf-sched.wait_time.max.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio
      0.04 ± 30%    +992.0%       0.43 ± 92%  perf-sched.wait_time.max.ms.irqentry_exit_to_user_mode.asm_sysvec_reschedule_ipi.[unknown]
     31.17 ± 50%    -100.0%       0.00        perf-sched.wait_time.max.ms.kjournald2.kthread.ret_from_fork.ret_from_fork_asm
     75.85 ± 16%   +2144.2%       1702 ± 73%  perf-sched.wait_time.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
    395.95 ± 16%     -34.0%     261.15 ± 33%  perf-sched.wait_time.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread


***************************************************************************************************
lkp-icl-2sp6: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/rootfs/tbox_group/test/testcase:
  gcc-12/performance/1HDD/xfs/x86_64-rhel-9.4/debian-12-x86_64-20240206.cgz/lkp-icl-2sp6/fivestreamwrite.f/filebench

commit: 
  a2dadb7ea8 ("nfs: add support in nfs to handle multiple writeback contexts")
  2850eee23d ("writeback: set the num of writeback contexts to number of online cpus")

a2dadb7ea862d5c1 2850eee23dbc4ff9878d88625b1 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      2.06 ±  3%      +1.5        3.58        mpstat.cpu.all.iowait%
   8388855 ±  5%     +17.7%    9875928 ±  5%  numa-meminfo.node0.Dirty
      0.02 ±  5%     +48.0%       0.03 ±  3%  sched_debug.cpu.nr_uninterruptible.avg
      2.70           +72.6%       4.65        vmstat.procs.b
     97.67            -1.5%      96.17        iostat.cpu.idle
      2.04 ±  3%     +73.9%       3.55        iostat.cpu.iowait
   2094449 ±  5%     +17.8%    2468063 ±  5%  numa-vmstat.node0.nr_dirty
   2113170 ±  5%     +17.7%    2487005 ±  5%  numa-vmstat.node0.nr_zone_write_pending
      6.99 ±  3%      +0.5        7.48 ±  2%  perf-stat.i.cache-miss-rate%
      1.82            +3.2%       1.88        perf-stat.i.cpi
      0.64            -2.0%       0.62        perf-stat.i.ipc
      2.88 ±  5%      +9.5%       3.15 ±  4%  perf-stat.overall.MPKI
    464.45            +4.3%     484.58        filebench.sum_bytes_mb/s
     27873            +4.3%      29084        filebench.sum_operations
    464.51            +4.3%     484.66        filebench.sum_operations/s
     10.76            -4.2%      10.31        filebench.sum_time_ms/op
    464.67            +4.3%     484.67        filebench.sum_writes/s
  57175040            +4.2%   59565397        filebench.time.file_system_outputs
   7146880            +4.2%    7445674        proc-vmstat.nr_dirtied
   4412053            +9.1%    4815253        proc-vmstat.nr_dirty
   7485090            +5.0%    7855964        proc-vmstat.nr_file_pages
  24899858            -1.5%   24530112        proc-vmstat.nr_free_pages
  24705120            -1.5%   24343672        proc-vmstat.nr_free_pages_blocks
   6573042            +5.6%    6943969        proc-vmstat.nr_inactive_file
     34473 ±  3%      +7.5%      37072        proc-vmstat.nr_writeback
   6573042            +5.6%    6943969        proc-vmstat.nr_zone_inactive_file
   4446526            +9.1%    4852325        proc-vmstat.nr_zone_write_pending
   7963041            +3.8%    8262916        proc-vmstat.pgalloc_normal
      0.02 ± 10%     +45.0%       0.03 ±  5%  perf-sched.sch_delay.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
    317.88 ±166%    -100.0%       0.08 ± 60%  perf-sched.sch_delay.avg.ms.kthreadd.ret_from_fork.ret_from_fork_asm
    474.99 ±141%    -100.0%       0.10 ± 49%  perf-sched.sch_delay.max.ms.kthreadd.ret_from_fork.ret_from_fork_asm
     17.87 ± 13%    +125.8%      40.36 ±  4%  perf-sched.wait_and_delay.avg.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio
     47.75           +19.8%      57.20 ±  5%  perf-sched.wait_and_delay.avg.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
    517.00           -17.2%     427.83 ±  5%  perf-sched.wait_and_delay.count.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
     54.05 ±  2%    +253.0%     190.80 ± 18%  perf-sched.wait_and_delay.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
      4286 ±  4%      -8.8%       3909 ±  8%  perf-sched.wait_and_delay.max.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
     17.77 ± 13%    +126.0%      40.16 ±  4%  perf-sched.wait_time.avg.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio
     47.63           +19.8%      57.06 ±  5%  perf-sched.wait_time.avg.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
     53.95 ±  2%    +253.5%     190.70 ± 18%  perf-sched.wait_time.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
      4285 ±  4%      -8.9%       3906 ±  7%  perf-sched.wait_time.max.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      0.77 ± 15%      -0.3        0.43 ± 72%  perf-profile.calltrace.cycles-pp.enqueue_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue
      1.76 ±  9%      -0.3        1.44 ±  8%  perf-profile.calltrace.cycles-pp.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.73 ± 10%      -0.3        1.43 ±  8%  perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.85 ± 12%      -0.2        0.66 ± 14%  perf-profile.calltrace.cycles-pp.__pick_next_task.__schedule.schedule_idle.do_idle.cpu_startup_entry
      4.74 ±  6%      -0.6        4.17 ±  9%  perf-profile.children.cycles-pp.__handle_mm_fault
      4.92 ±  6%      -0.6        4.36 ±  9%  perf-profile.children.cycles-pp.handle_mm_fault
      2.85 ±  5%      -0.5        2.34 ±  8%  perf-profile.children.cycles-pp.enqueue_task
      2.60 ±  4%      -0.4        2.15 ±  8%  perf-profile.children.cycles-pp.enqueue_task_fair
      2.58 ±  6%      -0.4        2.22 ±  8%  perf-profile.children.cycles-pp.do_pte_missing
      2.04 ±  9%      -0.3        1.70 ± 10%  perf-profile.children.cycles-pp.do_read_fault
      1.88 ±  5%      -0.3        1.56 ±  7%  perf-profile.children.cycles-pp.ttwu_do_activate
      1.85 ± 10%      -0.3        1.58 ± 12%  perf-profile.children.cycles-pp.filemap_map_pages
      1.20 ±  8%      -0.2        0.98 ±  9%  perf-profile.children.cycles-pp.__flush_smp_call_function_queue
      0.49 ± 23%      -0.2        0.29 ± 23%  perf-profile.children.cycles-pp.set_next_task_fair
      0.38 ± 32%      -0.2        0.20 ± 39%  perf-profile.children.cycles-pp.strnlen_user
      0.44 ± 28%      -0.2        0.26 ± 27%  perf-profile.children.cycles-pp.set_next_entity
      0.70 ±  7%      -0.2        0.52 ±  7%  perf-profile.children.cycles-pp.folios_put_refs
      0.22 ± 20%      -0.1        0.15 ± 27%  perf-profile.children.cycles-pp.try_charge_memcg
      0.02 ±141%      +0.1        0.08 ± 44%  perf-profile.children.cycles-pp.__blk_mq_alloc_driver_tag
      0.09 ± 59%      +0.1        0.18 ± 21%  perf-profile.children.cycles-pp.irq_work_tick
      0.26 ± 22%      +0.2        0.41 ± 13%  perf-profile.children.cycles-pp.cpu_stop_queue_work
      0.34 ± 31%      +0.2        0.57 ± 27%  perf-profile.children.cycles-pp.perf_event_mmap_event
      2.68 ± 10%      +0.4        3.10 ± 12%  perf-profile.children.cycles-pp.sched_balance_domains
      0.38 ± 32%      -0.2        0.19 ± 44%  perf-profile.self.cycles-pp.strnlen_user
      0.02 ±141%      +0.1        0.08 ± 44%  perf-profile.self.cycles-pp.ahci_single_level_irq_intr
      0.22 ± 36%      +0.2        0.43 ± 21%  perf-profile.self.cycles-pp.sched_balance_rq
      0.36 ± 41%      +0.3        0.66 ± 18%  perf-profile.self.cycles-pp._find_next_and_bit

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-03 13:52         ` Anuj gupta
  2025-06-03 14:04           ` Christoph Hellwig
@ 2025-06-04  9:22           ` Kundan Kumar
  2025-06-11 15:51             ` Darrick J. Wong
  1 sibling, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-06-04  9:22 UTC (permalink / raw)
  To: Anuj gupta
  Cc: Christoph Hellwig, Anuj Gupta/Anuj Gupta, Kundan Kumar, jaegeuk,
	chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm,
	willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, djwong,
	dave, p.raghav, da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2,
	linux-nfs, linux-mm, gost.dev

> > > For xfs used this command:
> > > xfs_io -c "stat" /mnt/testfile
> > > And for ext4 used this:
> > > filefrag /mnt/testfile
> >
> > filefrag merges contiguous extents, and only counts up for discontiguous
> > mappings, while fsxattr.nextents counts all extents even if they are
> > contiguous.  So you probably want to use filefrag for both cases.
>
> Got it — thanks for the clarification. We'll switch to using filefrag
> and will share updated extent count numbers accordingly.

Using filefrag, we recorded extent counts on xfs and ext4 at three
stages:
a. Just after a 1G random write,
b. After a 30-second wait,
c. After unmounting and remounting the filesystem.

xfs
Base
a. 6251   b. 2526  c. 2526
Parallel writeback
a. 6183   b. 2326  c. 2326

ext4
Base
a. 7080   b. 7080    c. 11
Parallel writeback
a. 5961   b. 5961    c. 11

Used the same fio commandline as earlier:
fio --filename=/mnt/testfile --name=test --bs=4k --iodepth=1024
--rw=randwrite --ioengine=io_uring  --fallocate=none --numjobs=1
--size=1G --direct=0 --eta-interval=1 --eta-newline=1
--group_reporting

filefrag command:
filefrag  /mnt/testfile


* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-03 14:05             ` Christoph Hellwig
@ 2025-06-06  5:04               ` Kundan Kumar
  2025-06-09  4:00                 ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-06-06  5:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anuj gupta, Anuj Gupta/Anuj Gupta, Kundan Kumar, jaegeuk, chao,
	viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm, willy,
	mcgrof, clm, david, amir73il, axboe, ritesh.list, djwong, dave,
	p.raghav, da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2,
	linux-nfs, linux-mm, gost.dev

On Tue, Jun 3, 2025 at 7:35 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Jun 03, 2025 at 04:04:45PM +0200, Christoph Hellwig wrote:
> > On Tue, Jun 03, 2025 at 07:22:18PM +0530, Anuj gupta wrote:
> > > > A mount option is about the worst possible interface for behavior
> > > > that depends on file system implementation and possibly hardware
> > > > characteristics.  This needs to be set by the file systems, possibly
> > > > using generic helpers using hardware information.
> > >
> > > Right, that makes sense. Instead of using a mount option, we can
> > > introduce generic helpers to initialize multiple writeback contexts
> > > based on underlying hardware characteristics — e.g., number of CPUs or
> > > NUMA topology. Filesystems like XFS and EXT4 can then call these helpers
> > > during mount to opt into parallel writeback in a controlled way.
> >
> > Yes.  A mount option might still be useful to override this default,
> > but it should not be needed for the normal use case.
>
> .. actually a sysfs file on the bdi is probably the better interface
> for the override than a mount option.

Hi Christoph,

Thanks for the suggestion — I agree the default should come from a
filesystem-level helper, not a mount option.

I looked into the sysfs override idea, but one challenge is that
nr_wb_ctx must be finalized before any writes occur. That leaves only
a narrow window — after the bdi is registered but before any inodes
are dirtied — where changing it is safe.

This makes the sysfs knob a bit fragile unless we tightly guard it
(e.g., mark it read-only after init). A mount option, even just as an
override, feels simpler and more predictable, since it’s set before
the FS becomes active.
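
For illustration, here is a minimal sketch of such a guard: a store handler
for the proposed nwritebacks attribute that refuses updates once writeback
state is live. wb_ctx_in_use() is an assumed helper, not existing code.

static ssize_t nwritebacks_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
{
        struct backing_dev_info *bdi = dev_get_drvdata(dev);
        unsigned int nr;
        int err;

        err = kstrtouint(buf, 10, &nr);
        if (err)
                return err;
        if (nr < 1 || nr > num_online_cpus())
                return -EINVAL;
        /* assumed helper: true once any inode has been dirtied */
        if (wb_ctx_in_use(bdi))
                return -EBUSY;
        bdi->nr_wb_ctx = nr;
        return count;
}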


* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-06  5:04               ` Kundan Kumar
@ 2025-06-09  4:00                 ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-09  4:00 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: Christoph Hellwig, Anuj gupta, Anuj Gupta/Anuj Gupta,
	Kundan Kumar, jaegeuk, chao, viro, brauner, jack, miklos,
	agruenba, trondmy, anna, akpm, willy, mcgrof, clm, david,
	amir73il, axboe, ritesh.list, djwong, dave, p.raghav, da.gomez,
	linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev

On Fri, Jun 06, 2025 at 10:34:42AM +0530, Kundan Kumar wrote:
> Thanks for the suggestion — I agree the default should come from a
> filesystem-level helper, not a mount option.
> 
> I looked into the sysfs override idea, but one challenge is that
> nr_wb_ctx must be finalized before any writes occur. That leaves only
> a narrow window — after the bdi is registered but before any inodes
> are dirtied — where changing it is safe.
> 
> This makes the sysfs knob a bit fragile unless we tightly guard it
> (e.g., mark it read-only after init). A mount option, even just as an
> override, feels simpler and more predictable, since it’s set before
> the FS becomes active.

The mount option has a few issues:

 - the common VFS code only supports flags, not value options, so you'd
   have to wire this up in every file system
 - some file systems might not want to allow changing it
 - changing it at runtime is actually quite useful

So you'll need to quiesce writeback or maybe even do a full fs freeze
when changing it at runtime, but that seems ok for a change this invasive.
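
Roughly, as a sketch only: bdi_resize_wb_ctx() below is a made-up helper,
and the two-argument freeze_super() matches current kernels:

static int bdi_set_nr_wb_ctx(struct super_block *sb, unsigned int nr)
{
        int error;

        /* quiesce writeback and all new dirtying while we resize */
        error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
        if (error)
                return error;
        error = bdi_resize_wb_ctx(sb->s_bdi, nr);  /* made-up helper */
        thaw_super(sb, FREEZE_HOLDER_KERNEL);
        return error;
}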


* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-04  9:22           ` Kundan Kumar
@ 2025-06-11 15:51             ` Darrick J. Wong
  2025-06-24  5:59               ` Kundan Kumar
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2025-06-11 15:51 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: Anuj gupta, Christoph Hellwig, Anuj Gupta/Anuj Gupta,
	Kundan Kumar, jaegeuk, chao, viro, brauner, jack, miklos,
	agruenba, trondmy, anna, akpm, willy, mcgrof, clm, david,
	amir73il, axboe, ritesh.list, dave, p.raghav, da.gomez,
	linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev

On Wed, Jun 04, 2025 at 02:52:34PM +0530, Kundan Kumar wrote:
> > > > For xfs used this command:
> > > > xfs_io -c "stat" /mnt/testfile
> > > > And for ext4 used this:
> > > > filefrag /mnt/testfile
> > >
> > > filefrag merges contiguous extents, and only counts up for discontiguous
> > > mappings, while fsxattr.nextents counts all extents even if they are
> > > contiguous.  So you probably want to use filefrag for both cases.
> >
> > Got it — thanks for the clarification. We'll switch to using filefrag
> > and will share updated extent count numbers accordingly.
> 
> Using filefrag, we recorded extent counts on xfs and ext4 at three
> stages:
> a. Just after a 1G random write,
> b. After a 30-second wait,
> c. After unmounting and remounting the filesystem,
> 
> xfs
> Base
> a. 6251   b. 2526  c. 2526
> Parallel writeback
> a. 6183   b. 2326  c. 2326

Interesting that the mapping record count goes down...

I wonder, you said the xfs filesystem has 4 AGs and 12 cores, so I guess
wb_ctx_arr[] is 12?  I wonder, do you see a knee point in writeback
throughput when the # of wb contexts exceeds the AG count?

Though I guess for the (hopefully common) case of pure overwrites, we
don't have to do any metadata updates so we wouldn't really hit a
scaling limit due to ag count or log contention or whatever.  Does that
square with what you see?

> ext4
> Base
> a. 7080   b. 7080    c. 11
> Parallel writeback
> a. 5961   b. 5961    c. 11

Hum, that's particularly ... interesting.  I wonder what the mapping
count behaviors are when you turn off delayed allocation?

--D

> Used the same fio commandline as earlier:
> fio --filename=/mnt/testfile --name=test --bs=4k --iodepth=1024
> --rw=randwrite --ioengine=io_uring  --fallocate=none --numjobs=1
> --size=1G --direct=0 --eta-interval=1 --eta-newline=1
> --group_reporting
> 
> filefrag command:
> filefrag  /mnt/testfile
> 


* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-11 15:51             ` Darrick J. Wong
@ 2025-06-24  5:59               ` Kundan Kumar
  2025-07-02 18:44                 ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-06-24  5:59 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Anuj gupta, Christoph Hellwig, Anuj Gupta/Anuj Gupta,
	Kundan Kumar, jaegeuk, chao, viro, brauner, jack, miklos,
	agruenba, trondmy, anna, akpm, willy, mcgrof, clm, david,
	amir73il, axboe, ritesh.list, dave, p.raghav, da.gomez,
	linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev

On Wed, Jun 11, 2025 at 9:21 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Jun 04, 2025 at 02:52:34PM +0530, Kundan Kumar wrote:
> > > > > For xfs used this command:
> > > > > xfs_io -c "stat" /mnt/testfile
> > > > > And for ext4 used this:
> > > > > filefrag /mnt/testfile
> > > >
> > > > filefrag merges contiguous extents, and only counts up for discontiguous
> > > > > mappings, while fsxattr.nextents counts all extents even if they are
> > > > contiguous.  So you probably want to use filefrag for both cases.
> > >
> > > Got it — thanks for the clarification. We'll switch to using filefrag
> > > and will share updated extent count numbers accordingly.
> >
> > Using filefrag, we recorded extent counts on xfs and ext4 at three
> > stages:
> > a. Just after a 1G random write,
> > b. After a 30-second wait,
> > c. After unmounting and remounting the filesystem,
> >
> > xfs
> > Base
> > a. 6251   b. 2526  c. 2526
> > Parallel writeback
> > a. 6183   b. 2326  c. 2326
>
> Interesting that the mapping record count goes down...
>
> I wonder, you said the xfs filesystem has 4 AGs and 12 cores, so I guess
> wb_ctx_arr[] is 12?  I wonder, do you see a knee point in writeback
> throughput when the # of wb contexts exceeds the AG count?
>
> Though I guess for the (hopefully common) case of pure overwrites, we
> don't have to do any metadata updates so we wouldn't really hit a
> scaling limit due to ag count or log contention or whatever.  Does that
> square with what you see?
>

Hi Darrick,

We analyzed AG count vs. number of writeback contexts to identify any
knee point. Earlier, wb_ctx_arr[] was fixed at 12; now we varied nr_wb_ctx
and measured the impact.

We implemented a configurable number of writeback contexts to measure
throughput more easily. This feature will be exposed in the next series.
To configure it, we used: echo <nr_wb_ctx> > /sys/class/bdi/259:2/nwritebacks.

In our test, writing 1G across 12 directories, bandwidth improved steeply
up to roughly the number of allocation groups (AGs), which formed a knee
point, and gains tapered off beyond that. Overall we see a large increase
in bandwidth, about 16x from base to nr_wb_ctx = 6; a sketch of a possible
fs-side default follows the numbers below.

    Base (single threaded)                :  9799KiB/s
    Parallel Writeback (nr_wb_ctx = 1)    :  9727KiB/s
    Parallel Writeback (nr_wb_ctx = 2)    :  18.1MiB/s
    Parallel Writeback (nr_wb_ctx = 3)    :  46.4MiB/s
    Parallel Writeback (nr_wb_ctx = 4)    :  135MiB/s
    Parallel Writeback (nr_wb_ctx = 5)    :  160MiB/s
    Parallel Writeback (nr_wb_ctx = 6)    :  163MiB/s
    Parallel Writeback (nr_wb_ctx = 7)    :  162MiB/s
    Parallel Writeback (nr_wb_ctx = 8)    :  154MiB/s
    Parallel Writeback (nr_wb_ctx = 9)    :  152MiB/s
    Parallel Writeback (nr_wb_ctx = 10)   :  145MiB/s
    Parallel Writeback (nr_wb_ctx = 11)   :  145MiB/s
    Parallel Writeback (nr_wb_ctx = 12)   :  138MiB/s
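
One way to act on this, purely as a sketch (xfs_default_wb_ctx() is not
part of the series), is an fs-side default that clamps the context count
near the AG count instead of the CPU count:

static unsigned int xfs_default_wb_ctx(struct xfs_mount *mp)
{
        /*
         * The knee point above sits near the AG count; allow a little
         * headroom for overwrite-heavy workloads but never exceed the
         * number of CPUs.
         */
        return min_t(unsigned int, mp->m_sb.sb_agcount + 2,
                     num_online_cpus());
}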


System config
===========
Number of CPUs = 12
System RAM = 9G
For XFS number of AGs = 4
Used NVMe SSD of 3.84 TB (Enterprise SSD PM1733a)

Script
=====
mkfs.xfs -f /dev/nvme0n1
mount /dev/nvme0n1 /mnt
echo <num_wb_ctx> > /sys/class/bdi/259:2/nwritebacks
sync
echo 3 > /proc/sys/vm/drop_caches

for i in {1..12}; do
  mkdir -p /mnt/dir$i
done

fio job_nvme.fio

umount /mnt
echo 3 > /proc/sys/vm/drop_caches
sync

fio job
=====
[global]
bs=4k
iodepth=1
rw=randwrite
ioengine=io_uring
nrfiles=12
numjobs=1                # Each job writes to a different file
size=1g
direct=0                 # Buffered I/O to trigger writeback
group_reporting=1
create_on_open=1
name=test

[job1]
directory=/mnt/dir1

[job2]
directory=/mnt/dir2
...
...
[job12]
directory=/mnt/dir12

> > ext4
> > Base
> > a. 7080   b. 7080    c. 11
> > Parallel writeback
> > a. 5961   b. 5961    c. 11
>
> Hum, that's particularly ... interesting.  I wonder what the mapping
> count behaviors are when you turn off delayed allocation?
>
> --D
>

I attempted to disable delayed allocation by setting allocsize=4096
during mount (mount -o allocsize=4096 /dev/pmem0 /mnt), but still
observed a reduction in file fragments after a delay. Is there something
I'm overlooking?

-Kundan


* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-05-30  3:37   ` [PATCH 00/13] Parallelizing filesystem writeback Andrew Morton
@ 2025-06-25 15:44     ` Kundan Kumar
  2025-07-02 18:43       ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-06-25 15:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kundan Kumar, jaegeuk, chao, viro, brauner, jack, miklos,
	agruenba, trondmy, anna, willy, mcgrof, clm, david, amir73il,
	axboe, hch, ritesh.list, djwong, dave, p.raghav, da.gomez,
	linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev

>
> Makes sense.  It would be good to test this on a non-SMP machine, if
> you can find one ;)
>

Tested with kernel cmdline maxcpus=1. The parallel writeback falls
back to single-threaded behavior, showing no change in BW.

  - On PMEM:
    Base XFS        : 70.7 MiB/s
    Parallel Writeback XFS    : 70.5 MiB/s
    Base EXT4        : 137 MiB/s
    Parallel Writeback EXT4    : 138 MiB/s

  - On NVMe:
    Base XFS        : 45.2 MiB/s
    Parallel Writeback XFS    : 44.5 MiB/s
    Base EXT4        : 81.2 MiB/s
    Parallel Writeback EXT4    : 80.1 MiB/s

>
> Please test the performance on spinning disks, and with more filesystems?
>

On a spinning disk, random IO bandwidth remains unchanged, while sequential
IO performance declines. However, setting nr_wb_ctx = 1 via the configurable
writeback (planned in the next version) eliminates the decline.

echo 1 > /sys/class/bdi/8:16/nwritebacks

We can fetch the device queue's rotational property and allocate the BDI
with nr_wb_ctx = 1 for rotational disks; a sketch of this check follows the
numbers below. Hope this is a viable solution for spinning disks?

  - Random IO
    Base XFS        : 22.6 MiB/s
    Parallel Writeback XFS    : 22.9 MiB/s
    Base EXT4        : 22.5 MiB/s
    Parallel Writeback EXT4    : 20.9 MiB/s

  - Sequential IO
    Base XFS        : 156 MiB/s
    Parallel Writeback XFS    : 133 MiB/s (-14.7%)
    Base EXT4        : 147 MiB/s
    Parallel Writeback EXT4    : 124 MiB/s (-15.6%)
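
The rotational check mentioned above could look roughly like this;
bdi_default_nr_wb_ctx() is an assumed name, not existing code:

static unsigned int bdi_default_nr_wb_ctx(struct request_queue *q)
{
        /* rotational media: one context keeps writeback sequential */
        if (q && !blk_queue_nonrot(q))
                return 1;
        return num_online_cpus();
}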

-Kundan


* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-25 15:44     ` Kundan Kumar
@ 2025-07-02 18:43       ` Darrick J. Wong
  2025-07-03 13:05         ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2025-07-02 18:43 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: Andrew Morton, Kundan Kumar, jaegeuk, chao, viro, brauner, jack,
	miklos, agruenba, trondmy, anna, willy, mcgrof, clm, david,
	amir73il, axboe, hch, ritesh.list, dave, p.raghav, da.gomez,
	linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev

On Wed, Jun 25, 2025 at 09:14:51PM +0530, Kundan Kumar wrote:
> >
> > Makes sense.  It would be good to test this on a non-SMP machine, if
> > you can find one ;)
> >
> 
> Tested with kernel cmdline maxcpus=1. The parallel writeback falls
> back to single-threaded behavior, showing no change in BW.
> 
>   - On PMEM:
>     Base XFS        : 70.7 MiB/s
>     Parallel Writeback XFS    : 70.5 MiB/s
>     Base EXT4        : 137 MiB/s
>     Parallel Writeback EXT4    : 138 MiB/s
> 
>   - On NVMe:
>     Base XFS        : 45.2 MiB/s
>     Parallel Writeback XFS    : 44.5 MiB/s
>     Base EXT4        : 81.2 MiB/s
>     Parallel Writeback EXT4    : 80.1 MiB/s
> 
> >
> > Please test the performance on spinning disks, and with more filesystems?
> >
> 
> On a spinning disk, random IO bandwidth remains unchanged, while sequential
> IO performance declines. However, setting nr_wb_ctx = 1 via the configurable
> writeback (planned in the next version) eliminates the decline.
> 
> echo 1 > /sys/class/bdi/8:16/nwritebacks
> 
> We can fetch the device queue's rotational property and allocate the BDI
> with nr_wb_ctx = 1 for rotational disks. Hope this is a viable solution for
> spinning disks?

Sounds good to me, spinning rust isn't known for iops.

Though: What about a raid0 of spinning rust?  Do you see the same
declines for sequential IO?

--D

>   - Random IO
>     Base XFS        : 22.6 MiB/s
>     Parallel Writeback XFS    : 22.9 MiB/s
>     Base EXT4        : 22.5 MiB/s
>     Parallel Writeback EXT4    : 20.9 MiB/s
> 
>   - Sequential IO
>     Base XFS        : 156 MiB/s
>     Parallel Writeback XFS    : 133 MiB/s (-14.7%)
>     Base EXT4        : 147 MiB/s
>     Parallel Writeback EXT4    : 124 MiB/s (-15.6%)
> 
> -Kundan
> 


* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-24  5:59               ` Kundan Kumar
@ 2025-07-02 18:44                 ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-07-02 18:44 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: Anuj gupta, Christoph Hellwig, Anuj Gupta/Anuj Gupta,
	Kundan Kumar, jaegeuk, chao, viro, brauner, jack, miklos,
	agruenba, trondmy, anna, akpm, willy, mcgrof, clm, david,
	amir73il, axboe, ritesh.list, dave, p.raghav, da.gomez,
	linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev

On Tue, Jun 24, 2025 at 11:29:28AM +0530, Kundan Kumar wrote:
> On Wed, Jun 11, 2025 at 9:21 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Wed, Jun 04, 2025 at 02:52:34PM +0530, Kundan Kumar wrote:
> > > > > > For xfs used this command:
> > > > > > xfs_io -c "stat" /mnt/testfile
> > > > > > And for ext4 used this:
> > > > > > filefrag /mnt/testfile
> > > > >
> > > > > filefrag merges contiguous extents, and only counts up for discontiguous
> > > > > mappings, while fsxattr.nextents counts all extents even if they are
> > > > > contiguous.  So you probably want to use filefrag for both cases.
> > > >
> > > > Got it — thanks for the clarification. We'll switch to using filefrag
> > > > and will share updated extent count numbers accordingly.
> > >
> > > Using filefrag, we recorded extent counts on xfs and ext4 at three
> > > stages:
> > > a. Just after a 1G random write,
> > > b. After a 30-second wait,
> > > c. After unmounting and remounting the filesystem,
> > >
> > > xfs
> > > Base
> > > a. 6251   b. 2526  c. 2526
> > > Parallel writeback
> > > a. 6183   b. 2326  c. 2326
> >
> > Interesting that the mapping record count goes down...
> >
> > I wonder, you said the xfs filesystem has 4 AGs and 12 cores, so I guess
> > wb_ctx_arr[] is 12?  I wonder, do you see a knee point in writeback
> > throughput when the # of wb contexts exceeds the AG count?
> >
> > Though I guess for the (hopefully common) case of pure overwrites, we
> > don't have to do any metadata updates so we wouldn't really hit a
> > scaling limit due to ag count or log contention or whatever.  Does that
> > square with what you see?
> >
> 
> Hi Darrick,
> 
> We analyzed AG count vs. number of writeback contexts to identify any
> knee point. Earlier, wb_ctx_arr[] was fixed at 12; now we varied nr_wb_ctx
> and measured the impact.
> 
> We implemented a configurable number of writeback contexts to measure
> throughput more easily. This feature will be exposed in the next series.
> To configure it, we used: echo <nr_wb_ctx> > /sys/class/bdi/259:2/nwritebacks.
> 
> In our test, writing 1G across 12 directories, bandwidth improved steeply
> up to roughly the number of allocation groups (AGs), which formed a knee
> point, and gains tapered off beyond that. Overall we see a large increase
> in bandwidth, about 16x from base to nr_wb_ctx = 6.
> 
>     Base (single threaded)                :  9799KiB/s
>     Parallel Writeback (nr_wb_ctx = 1)    :  9727KiB/s
>     Parallel Writeback (nr_wb_ctx = 2)    :  18.1MiB/s
>     Parallel Writeback (nr_wb_ctx = 3)    :  46.4MiB/s
>     Parallel Writeback (nr_wb_ctx = 4)    :  135MiB/s
>     Parallel Writeback (nr_wb_ctx = 5)    :  160MiB/s
>     Parallel Writeback (nr_wb_ctx = 6)    :  163MiB/s

Heh, nice!

>     Parallel Writeback (nr_wb_ctx = 7)    :  162MiB/s
>     Parallel Writeback (nr_wb_ctx = 8)    :  154MiB/s
>     Parallel Writeback (nr_wb_ctx = 9)    :  152MiB/s
>     Parallel Writeback (nr_wb_ctx = 10)   :  145MiB/s
>     Parallel Writeback (nr_wb_ctx = 11)   :  145MiB/s
>     Parallel Writeback (nr_wb_ctx = 12)   :  138MiB/s
> 
> 
> System config
> ===========
> Number of CPUs = 12
> System RAM = 9G
> For XFS number of AGs = 4
> Used NVMe SSD of 3.84 TB (Enterprise SSD PM1733a)
> 
> Script
> =====
> mkfs.xfs -f /dev/nvme0n1
> mount /dev/nvme0n1 /mnt
> echo <num_wb_ctx> > /sys/class/bdi/259:2/nwritebacks
> sync
> echo 3 > /proc/sys/vm/drop_caches
> 
> for i in {1..12}; do
>   mkdir -p /mnt/dir$i
> done
> 
> fio job_nvme.fio
> 
> umount /mnt
> echo 3 > /proc/sys/vm/drop_caches
> sync
> 
> fio job
> =====
> [global]
> bs=4k
> iodepth=1
> rw=randwrite
> ioengine=io_uring
> nrfiles=12
> numjobs=1                # Each job writes to a different file
> size=1g
> direct=0                 # Buffered I/O to trigger writeback
> group_reporting=1
> create_on_open=1
> name=test
> 
> [job1]
> directory=/mnt/dir1
> 
> [job2]
> directory=/mnt/dir2
> ...
> ...
> [job12]
> directory=/mnt/dir12
> 
> > > ext4
> > > Base
> > > a. 7080   b. 7080    c. 11
> > > Parallel writeback
> > > a. 5961   b. 5961    c. 11
> >
> > Hum, that's particularly ... interesting.  I wonder what the mapping
> > count behaviors are when you turn off delayed allocation?
> >
> > --D
> >
> 
> I attempted to disable delayed allocation by setting allocsize=4096
> during mount (mount -o allocsize=4096 /dev/pmem0 /mnt), but still
> observed a reduction in file fragments after a delay. Is there something
> I'm overlooking?

Not that I know of.  Maybe we should just take the win. :)

--D

> -Kundan
> 


* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-07-02 18:43       ` Darrick J. Wong
@ 2025-07-03 13:05         ` Christoph Hellwig
  2025-07-04  7:02           ` Kundan Kumar
  2025-07-07 15:47           ` Jan Kara
  0 siblings, 2 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-07-03 13:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Kundan Kumar, Andrew Morton, Kundan Kumar, jaegeuk, chao, viro,
	brauner, jack, miklos, agruenba, trondmy, anna, willy, mcgrof,
	clm, david, amir73il, axboe, hch, ritesh.list, dave, p.raghav,
	da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs,
	linux-mm, gost.dev

On Wed, Jul 02, 2025 at 11:43:12AM -0700, Darrick J. Wong wrote:
> > On a spinning disk, random IO bandwidth remains unchanged, while sequential
> > IO performance declines. However, setting nr_wb_ctx = 1 via the configurable
> > writeback (planned in the next version) eliminates the decline.
> > 
> > echo 1 > /sys/class/bdi/8:16/nwritebacks
> > 
> > We can fetch the device queue's rotational property and allocate the BDI with
> > nr_wb_ctx = 1 for rotational disks. Hope this is a viable solution for
> > spinning disks?
> 
> Sounds good to me, spinning rust isn't known for iops.
> 
> Though: What about a raid0 of spinning rust?  Do you see the same
> declines for sequential IO?

Well, even for a raid0, multiple I/O streams will degrade performance
on a disk.  Of course many real-life workloads will have multiple
I/O streams anyway.

I think the important part is to have:

 a) sane defaults
 b) an easy way for the file system and/or user to override the default

For a) a single thread for rotational is a good default.  For file systems
that drive multiple spindles independently or do compression, multiple
threads might still make sense.

For b) one big issue is that right now the whole writeback handling is
per-bdi and not per superblock.  So maybe the first step needs to be
to move the writeback to the superblock instead of bdi?  If someone
uses partitions and multiple file systems on spinning rust these
days, reducing the number of writeback threads isn't really going to
save their day either.
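
As a purely hypothetical shape for that first step, the contexts would hang
off the superblock rather than the bdi, e.g. (field names assumed, not from
the series):

struct super_block {
        ...
        /* hypothetical: writeback contexts owned by the filesystem */
        int s_nr_wb_ctx;
        struct bdi_writeback_ctx **s_wb_ctx_arr;
        ...
};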



* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-07-03 13:05         ` Christoph Hellwig
@ 2025-07-04  7:02           ` Kundan Kumar
  2025-07-07 14:28             ` Christoph Hellwig
  2025-07-07 15:47           ` Jan Kara
  1 sibling, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-07-04  7:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Andrew Morton, Kundan Kumar, jaegeuk, chao, viro,
	Christian Brauner, jack, miklos, agruenba, Trond Myklebust, anna,
	Matthew Wilcox, mcgrof, clm, david, amir73il, Jens Axboe,
	ritesh.list, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev

On Thu, Jul 3, 2025 at 6:35 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Jul 02, 2025 at 11:43:12AM -0700, Darrick J. Wong wrote:
> > > On a spinning disk, random IO bandwidth remains unchanged, while sequential
> > > IO performance declines. However, setting nr_wb_ctx = 1 via the configurable
> > > writeback (planned in the next version) eliminates the decline.
> > >
> > > echo 1 > /sys/class/bdi/8:16/nwritebacks
> > >
> > > We can fetch the device queue's rotational property and allocate the BDI with
> > > nr_wb_ctx = 1 for rotational disks. Hope this is a viable solution for
> > > spinning disks?
> >
> > Sounds good to me, spinning rust isn't known for iops.
> >
> > Though: What about a raid0 of spinning rust?  Do you see the same
> > declines for sequential IO?
>
> Well, even for a raid0 multiple I/O streams will degrade performance
> on a disk.  Of course many real life workloads will have multiple
> I/O streams anyway.
>
> I think the important part is to have:
>
>  a) sane defaults
>  b) an easy way for the file system and/or user to override the default
>
> For a) a single thread for rotational is a good default.  For file systems
> that drive multiple spindles independently or do compression, multiple
> threads might still make sense.
>
> For b) one big issue is that right now the whole writeback handling is
> per-bdi and not per superblock.  So maybe the first step needs to be
> to move the writeback to the superblock instead of bdi?

The bdi is tied to the underlying block device, and helps with device
bandwidth specific throttling, dirty ratelimiting, etc. Making writeback
per superblock would require duplicating the device-specific throttling
and ratelimiting at the superblock level, which will be difficult.

> If someone
> uses partitions and multiple file systems on spinning rust these
> days, reducing the number of writeback threads isn't really going to
> save their day either.
>

In this case, with a single wb thread, multiple partitions/filesystems use
the same bdi and we fall back to the base case; will that not help?


* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-07-04  7:02           ` Kundan Kumar
@ 2025-07-07 14:28             ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-07-07 14:28 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: Christoph Hellwig, Darrick J. Wong, Andrew Morton, Kundan Kumar,
	jaegeuk, chao, viro, Christian Brauner, jack, miklos, agruenba,
	Trond Myklebust, anna, Matthew Wilcox, mcgrof, clm, david,
	amir73il, Jens Axboe, ritesh.list, dave, p.raghav, da.gomez,
	linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev

On Fri, Jul 04, 2025 at 12:32:51PM +0530, Kundan Kumar wrote:
> The bdi is tied to the underlying block device, and helps with device
> bandwidth specific throttling, dirty ratelimiting, etc. Making writeback
> per superblock would require duplicating the device-specific throttling
> and ratelimiting at the superblock level, which will be difficult.

Yes, but my point is that compared to actually having high-performing
writeback code, that doesn't matter.  What is the use case for actually
having production workloads (vs just a root fs and EFI partition) on
a single SSD or hard disk?

> > If someone
> > uses partitions and multiple file systems on spinning rust these
> > days, reducing the number of writeback threads isn't really going to
> > save their day either.
> >
> 
> In this case, with a single wb thread, multiple partitions/filesystems use
> the same bdi and we fall back to the base case; will that not help?

If you have multiple file systems sharing a BDI, they can have potentially
very different requirements, and they can trivially get in each other's
way.  Or in other words, we can't do anything remotely smart without fully
having the file system in charge.


* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-07-03 13:05         ` Christoph Hellwig
  2025-07-04  7:02           ` Kundan Kumar
@ 2025-07-07 15:47           ` Jan Kara
  1 sibling, 0 replies; 40+ messages in thread
From: Jan Kara @ 2025-07-07 15:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Kundan Kumar, Andrew Morton, Kundan Kumar,
	jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, willy, mcgrof, clm, david, amir73il, axboe, ritesh.list,
	dave, p.raghav, da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2,
	linux-nfs, linux-mm, gost.dev

On Thu 03-07-25 15:05:00, Christoph Hellwig wrote:
> On Wed, Jul 02, 2025 at 11:43:12AM -0700, Darrick J. Wong wrote:
> > > On a spinning disk, random IO bandwidth remains unchanged, while sequential
> > > IO performance declines. However, setting nr_wb_ctx = 1 via the configurable
> > > writeback (planned in the next version) eliminates the decline.
> > > 
> > > echo 1 > /sys/class/bdi/8:16/nwritebacks
> > > 
> > > We can fetch the device queue's rotational property and allocate the BDI with
> > > nr_wb_ctx = 1 for rotational disks. Hope this is a viable solution for
> > > spinning disks?
> > 
> > Sounds good to me, spinning rust isn't known for iops.
> > 
> > Though: What about a raid0 of spinning rust?  Do you see the same
> > declines for sequential IO?
> 
> Well, even for a raid0, multiple I/O streams will degrade performance
> on a disk.  Of course many real-life workloads will have multiple
> I/O streams anyway.
> 
> I think the important part is to have:
> 
>  a) sane defaults
>  b) an easy way for the file system and/or user to override the default
> 
> For a) a single thread for rotational is a good default.  For file systems
> that drive multiple spindles independently or do compression, multiple
> threads might still make sense.
> 
> For b) one big issue is that right now the whole writeback handling is
> per-bdi and not per superblock.  So maybe the first step needs to be
> to move the writeback to the superblock instead of bdi?  If someone
> uses partitions and multiple file systems on spinning rust these
> days, reducing the number of writeback threads isn't really going to
> save their day either.

We have had requests to move the writeback infrastructure to be per sb in the
past, mostly so that the filesystem has better control of the writeback
process (e.g. selection of inodes etc.). After some thought I tend to agree
that, today, setups where we have multiple filesystems over the same bdi and
end up doing writeback from several of them in parallel should be mostly
limited to desktops / laptops / small servers. And there you usually have
only one main data filesystem - e.g. /home/ - and you don't tend to write
that much to your / filesystem. Although there could be exceptions like
large occasional writes to /tmp, news server updates or similar. Anyway, in
these cases I'd expect the IO scheduler (BFQ for rotational disks where this
really matters) to still achieve decent IO locality, but it would be good
to verify what the impact is.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


end of thread

Thread overview: 40+ messages
     [not found] <CGME20250529113215epcas5p2edd67e7b129621f386be005fdba53378@epcas5p2.samsung.com>
2025-05-29 11:14 ` [PATCH 00/13] Parallelizing filesystem writeback Kundan Kumar
     [not found]   ` <CGME20250529113219epcas5p4d8ccb25ea910faea7120f092623f321d@epcas5p4.samsung.com>
2025-05-29 11:14     ` [PATCH 01/13] writeback: add infra for parallel writeback Kundan Kumar
     [not found]   ` <CGME20250529113224epcas5p2eea35fd0ebe445d8ad0471a144714b23@epcas5p2.samsung.com>
2025-05-29 11:14     ` [PATCH 02/13] writeback: add support to initialize and free multiple writeback ctxs Kundan Kumar
     [not found]   ` <CGME20250529113228epcas5p1db88ab42c2dac0698d715e38bd5e0896@epcas5p1.samsung.com>
2025-05-29 11:14     ` [PATCH 03/13] writeback: link bdi_writeback to its corresponding bdi_writeback_ctx Kundan Kumar
     [not found]   ` <CGME20250529113232epcas5p4e6f3b2f03d3a5f8fcaace3ddd03298d0@epcas5p4.samsung.com>
2025-05-29 11:14     ` [PATCH 04/13] writeback: affine inode to a writeback ctx within a bdi Kundan Kumar
2025-06-02 14:24       ` Christoph Hellwig
     [not found]   ` <CGME20250529113236epcas5p2049b6cc3be27d8727ac1f15697987ff5@epcas5p2.samsung.com>
2025-05-29 11:14     ` [PATCH 05/13] writeback: modify bdi_writeback search logic to search across all wb ctxs Kundan Kumar
     [not found]   ` <CGME20250529113240epcas5p295dcf9a016cc28e5c3e88d698808f645@epcas5p2.samsung.com>
2025-05-29 11:14     ` [PATCH 06/13] writeback: invoke all writeback contexts for flusher and dirtytime writeback Kundan Kumar
     [not found]   ` <CGME20250529113245epcas5p2978b77ce5ccf2d620f2a9ee5e796bee3@epcas5p2.samsung.com>
2025-05-29 11:14     ` [PATCH 07/13] writeback: modify sync related functions to iterate over all writeback contexts Kundan Kumar
     [not found]   ` <CGME20250529113249epcas5p38b29d3c6256337eadc2d1644181f9b74@epcas5p3.samsung.com>
2025-05-29 11:14     ` [PATCH 08/13] writeback: add support to collect stats for all writeback ctxs Kundan Kumar
     [not found]   ` <CGME20250529113253epcas5p1a28e77b2d9824d55f594ccb053725ece@epcas5p1.samsung.com>
2025-05-29 11:15     ` [PATCH 09/13] f2fs: add support in f2fs to handle multiple writeback contexts Kundan Kumar
2025-06-02 14:20       ` Christoph Hellwig
     [not found]   ` <CGME20250529113257epcas5p4dbaf9c8e2dc362767c8553399632c1ea@epcas5p4.samsung.com>
2025-05-29 11:15     ` [PATCH 10/13] fuse: add support for multiple writeback contexts in fuse Kundan Kumar
2025-06-02 14:21       ` Christoph Hellwig
2025-06-02 15:50         ` Bernd Schubert
2025-06-02 15:55           ` Christoph Hellwig
     [not found]   ` <CGME20250529113302epcas5p3bdae265288af32172fb7380a727383eb@epcas5p3.samsung.com>
2025-05-29 11:15     ` [PATCH 11/13] gfs2: add support in gfs2 to handle multiple writeback contexts Kundan Kumar
     [not found]   ` <CGME20250529113306epcas5p3d10606ae4ea7c3491e93bde9ae408c9f@epcas5p3.samsung.com>
2025-05-29 11:15     ` [PATCH 12/13] nfs: add support in nfs " Kundan Kumar
2025-06-02 14:22       ` Christoph Hellwig
     [not found]   ` <CGME20250529113311epcas5p3c8f1785b34680481e2126fda3ab51ad9@epcas5p3.samsung.com>
2025-05-29 11:15     ` [PATCH 13/13] writeback: set the num of writeback contexts to number of online cpus Kundan Kumar
2025-06-03 14:36       ` kernel test robot
2025-05-30  3:37   ` [PATCH 00/13] Parallelizing filesystem writeback Andrew Morton
2025-06-25 15:44     ` Kundan Kumar
2025-07-02 18:43       ` Darrick J. Wong
2025-07-03 13:05         ` Christoph Hellwig
2025-07-04  7:02           ` Kundan Kumar
2025-07-07 14:28             ` Christoph Hellwig
2025-07-07 15:47           ` Jan Kara
2025-06-02 14:19   ` Christoph Hellwig
2025-06-03  9:16     ` Anuj Gupta/Anuj Gupta
2025-06-03 13:24       ` Christoph Hellwig
2025-06-03 13:52         ` Anuj gupta
2025-06-03 14:04           ` Christoph Hellwig
2025-06-03 14:05             ` Christoph Hellwig
2025-06-06  5:04               ` Kundan Kumar
2025-06-09  4:00                 ` Christoph Hellwig
2025-06-04  9:22           ` Kundan Kumar
2025-06-11 15:51             ` Darrick J. Wong
2025-06-24  5:59               ` Kundan Kumar
2025-07-02 18:44                 ` Darrick J. Wong
