* [PATCH 00/13] Parallelizing filesystem writeback
       [not found] <CGME20250529113215epcas5p2edd67e7b129621f386be005fdba53378@epcas5p2.samsung.com>
@ 2025-05-29 11:14 ` Kundan Kumar
       [not found] ` <CGME20250529113219epcas5p4d8ccb25ea910faea7120f092623f321d@epcas5p4.samsung.com>
                     ` (14 more replies)
  0 siblings, 15 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
      anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
      ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
      gost.dev, Kundan Kumar

Currently, pagecache writeback is performed by a single thread. Inodes
are added to a dirty list, and delayed writeback is triggered. The
single writeback thread then iterates through the dirty inode list and
executes the writeback.

This series parallelizes the writeback by allowing multiple writeback
contexts per backing device (bdi). These writeback contexts are
executed as separate, independent threads, improving overall
parallelism. We would love to hear feedback in order to move this
effort forward.

Design Overview
===============
Following Jan Kara's suggestion [1], we have introduced a new bdi
writeback context within the backing_dev_info structure. Specifically,
we have created a new structure, bdi_writeback_ctx, which contains its
own set of members for each writeback context.

struct bdi_writeback_ctx {
	struct bdi_writeback wb;
	struct list_head wb_list;	/* list of all wbs */
	struct radix_tree_root cgwb_tree;
	struct rw_semaphore wb_switch_rwsem;
	wait_queue_head_t wb_waitq;
};

There can be multiple writeback contexts in a bdi, which helps in
achieving writeback parallelism.

struct backing_dev_info {
	...
	int nr_wb_ctx;
	struct bdi_writeback_ctx **wb_ctx_arr;
	...
};

FS geometry and filesystem fragmentation
========================================
The community was concerned that parallelizing writeback would impact
delayed allocation and increase filesystem fragmentation. Our analysis
of XFS delayed allocation behavior showed that extents are merged only
within a single inode. The earlier experiment with multiple writeback
contexts [2] resulted in increased fragmentation because the same
inode could be processed by different threads. To address this, we now
affine each inode to a specific writeback context, so that delayed
allocation continues to work effectively.
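As a rough illustration of the affinity rule, here is a compact
userspace model of the mapping this series uses (patch 4 implements
the real thing on top of inode_to_bdi(); the pared-down types and the
bare inode-number parameter here are our simplifications, not kernel
interfaces):

#include <stdio.h>

struct bdi_writeback_ctx {
	int id;			/* stand-in for the real per-context state */
};

struct backing_dev_info {
	int nr_wb_ctx;
	struct bdi_writeback_ctx **wb_ctx_arr;
};

/* Same rule as fetch_bdi_writeback_ctx() in patch 4: an inode always
 * maps to the same context, so one flusher thread sees all of its
 * delayed-allocation extents. */
static struct bdi_writeback_ctx *
pick_wb_ctx(struct backing_dev_info *bdi, unsigned long i_ino)
{
	return bdi->wb_ctx_arr[i_ino % bdi->nr_wb_ctx];
}

int main(void)
{
	struct bdi_writeback_ctx c[4] = { {0}, {1}, {2}, {3} };
	struct bdi_writeback_ctx *arr[4] = { &c[0], &c[1], &c[2], &c[3] };
	struct backing_dev_info bdi = { .nr_wb_ctx = 4, .wb_ctx_arr = arr };
	unsigned long ino;

	for (ino = 100; ino < 104; ino++)
		printf("inode %lu -> wb_ctx %d\n", ino,
		       pick_wb_ctx(&bdi, ino)->id);
	return 0;
}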
Number of writeback contexts
============================
The plan is to keep nr_wb_ctx at 1, preserving the default
single-threaded behavior. However, the current version sets the number
of writeback contexts equal to the number of CPUs. Later we will make
this configurable using a mount option, allowing filesystems to choose
the optimal number of writeback contexts.

IOPS and throughput
===================
We see significant improvement in IOPS across several filesystems on
both PMEM and NVMe devices.

Performance gains:
- On PMEM:
  Base XFS                : 544 MiB/s
  Parallel Writeback XFS  : 1015 MiB/s (+86%)
  Base EXT4               : 536 MiB/s
  Parallel Writeback EXT4 : 1047 MiB/s (+95%)

- On NVMe:
  Base XFS                : 651 MiB/s
  Parallel Writeback XFS  : 808 MiB/s (+24%)
  Base EXT4               : 494 MiB/s
  Parallel Writeback EXT4 : 797 MiB/s (+61%)

We also see no increase in filesystem fragmentation. Number of
extents:
- On XFS (on PMEM):
  Base XFS                : 1964
  Parallel Writeback XFS  : 1384
- On EXT4 (on PMEM):
  Base EXT4               : 21
  Parallel Writeback EXT4 : 11

[1] Jan Kara suggestion :
    https://lore.kernel.org/all/gamxtewl5yzg4xwu7lpp7obhp44xh344swvvf7tmbiknvbd3ww@jowphz4h4zmb/
[2] Writeback using unaffined N (# of CPUs) threads :
    https://lore.kernel.org/all/20250414102824.9901-1-kundan.kumar@samsung.com/

Kundan Kumar (13):
  writeback: add infra for parallel writeback
  writeback: add support to initialize and free multiple writeback ctxs
  writeback: link bdi_writeback to its corresponding bdi_writeback_ctx
  writeback: affine inode to a writeback ctx within a bdi
  writeback: modify bdi_writeback search logic to search across all wb
    ctxs
  writeback: invoke all writeback contexts for flusher and dirtytime
    writeback
  writeback: modify sync related functions to iterate over all
    writeback contexts
  writeback: add support to collect stats for all writeback ctxs
  f2fs: add support in f2fs to handle multiple writeback contexts
  fuse: add support for multiple writeback contexts in fuse
  gfs2: add support in gfs2 to handle multiple writeback contexts
  nfs: add support in nfs to handle multiple writeback contexts
  writeback: set the num of writeback contexts to number of online cpus

 fs/f2fs/node.c                   |  11 +-
 fs/f2fs/segment.h                |   7 +-
 fs/fs-writeback.c                | 146 +++++++++++++-------
 fs/fuse/file.c                   |   9 +-
 fs/gfs2/super.c                  |  11 +-
 fs/nfs/internal.h                |   4 +-
 fs/nfs/write.c                   |   5 +-
 include/linux/backing-dev-defs.h |  32 +++--
 include/linux/backing-dev.h      |  45 +++++--
 include/linux/fs.h               |   1 -
 mm/backing-dev.c                 | 225 ++++++++++++++++++++-----------
 mm/page-writeback.c              |   5 +-
 12 files changed, 333 insertions(+), 168 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 40+ messages in thread
* [PATCH 01/13] writeback: add infra for parallel writeback
       [not found] ` <CGME20250529113219epcas5p4d8ccb25ea910faea7120f092623f321d@epcas5p4.samsung.com>
@ 2025-05-29 11:14   ` Kundan Kumar
  0 siblings, 0 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
      anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
      ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
      gost.dev, Kundan Kumar, Anuj Gupta

This is a prep patch which introduces a new bdi_writeback_ctx structure
that enables us to have multiple writeback contexts for parallel
writeback. Each bdi can now have multiple writeback contexts, with each
writeback context having its own cgwb tree.

Modify all the functions/places that operate on the bdi's wb, wb_list,
cgwb_tree, wb_switch_rwsem and wb_waitq, as these fields have now been
moved to bdi_writeback_ctx.

This patch mechanically replaces bdi->wb with bdi->wb_ctx_arr[0]->wb;
there is no functional change.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
 fs/f2fs/node.c                   |   4 +-
 fs/f2fs/segment.h                |   2 +-
 fs/fs-writeback.c                |  78 +++++++++++++--------
 fs/fuse/file.c                   |   6 +-
 fs/gfs2/super.c                  |   2 +-
 fs/nfs/internal.h                |   3 +-
 fs/nfs/write.c                   |   3 +-
 include/linux/backing-dev-defs.h |  32 +++++----
 include/linux/backing-dev.h      |  41 +++++++----
 include/linux/fs.h               |   1 -
 mm/backing-dev.c                 | 113 +++++++++++++++++++------------
 mm/page-writeback.c              |   5 +-
 12 files changed, 179 insertions(+), 111 deletions(-)

diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index 5f15c224bf78..4b6568cd5bef 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -73,7 +73,7 @@ bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type) if (excess_cached_nats(sbi)) res = false; } else if (type == DIRTY_DENTS) { - if (sbi->sb->s_bdi->wb.dirty_exceeded) + if (sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded) return false; mem_size = get_pages(sbi, F2FS_DIRTY_DENTS); res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1); @@ -114,7 +114,7 @@ bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type) res = false; #endif } else { - if (!sbi->sb->s_bdi->wb.dirty_exceeded) + if (!sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded) return true; } return res; diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h index 0465dc00b349..a525ccd4cfc8 100644 --- a/fs/f2fs/segment.h +++ b/fs/f2fs/segment.h @@ -936,7 +936,7 @@ static inline bool sec_usage_check(struct f2fs_sb_info *sbi, unsigned int secno) */ static inline int nr_pages_to_skip(struct f2fs_sb_info *sbi, int type) { - if (sbi->sb->s_bdi->wb.dirty_exceeded) + if (sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded) return 0; if (type == DATA) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index cc57367fb641..0959fff46235 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -265,23 +265,26 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio) { struct backing_dev_info *bdi = inode_to_bdi(inode); struct bdi_writeback *wb = NULL; + struct bdi_writeback_ctx *bdi_writeback_ctx = bdi->wb_ctx_arr[0]; if (inode_cgwb_enabled(inode)) { struct cgroup_subsys_state *memcg_css; if (folio) { memcg_css = mem_cgroup_css_from_folio(folio); - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css, + GFP_ATOMIC); } else { /* must pin memcg_css, see wb_get_create() */
memcg_css = task_get_css(current, memory_cgrp_id); - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css, + GFP_ATOMIC); css_put(memcg_css); } } if (!wb) - wb = &bdi->wb; + wb = &bdi_writeback_ctx->wb; /* * There may be multiple instances of this function racing to @@ -307,7 +310,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode, WARN_ON_ONCE(inode->i_state & I_FREEING); inode->i_state &= ~I_SYNC_QUEUED; - if (wb != &wb->bdi->wb) + if (wb != &wb->bdi_wb_ctx->wb) list_move(&inode->i_io_list, &wb->b_attached); else list_del_init(&inode->i_io_list); @@ -382,14 +385,16 @@ struct inode_switch_wbs_context { struct inode *inodes[]; }; -static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi) +static void +bdi_down_write_wb_ctx_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx) { - down_write(&bdi->wb_switch_rwsem); + down_write(&bdi_wb_ctx->wb_switch_rwsem); } -static void bdi_up_write_wb_switch_rwsem(struct backing_dev_info *bdi) +static void +bdi_up_write_wb_ctx_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx) { - up_write(&bdi->wb_switch_rwsem); + up_write(&bdi_wb_ctx->wb_switch_rwsem); } static bool inode_do_switch_wbs(struct inode *inode, @@ -490,7 +495,8 @@ static void inode_switch_wbs_work_fn(struct work_struct *work) { struct inode_switch_wbs_context *isw = container_of(to_rcu_work(work), struct inode_switch_wbs_context, work); - struct backing_dev_info *bdi = inode_to_bdi(isw->inodes[0]); + struct bdi_writeback_ctx *bdi_wb_ctx = + fetch_bdi_writeback_ctx(isw->inodes[0]); struct bdi_writeback *old_wb = isw->inodes[0]->i_wb; struct bdi_writeback *new_wb = isw->new_wb; unsigned long nr_switched = 0; @@ -500,7 +506,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work) * If @inode switches cgwb membership while sync_inodes_sb() is * being issued, sync_inodes_sb() might miss it. Synchronize. 
*/ - down_read(&bdi->wb_switch_rwsem); + down_read(&bdi_wb_ctx->wb_switch_rwsem); /* * By the time control reaches here, RCU grace period has passed @@ -529,7 +535,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work) spin_unlock(&new_wb->list_lock); spin_unlock(&old_wb->list_lock); - up_read(&bdi->wb_switch_rwsem); + up_read(&bdi_wb_ctx->wb_switch_rwsem); if (nr_switched) { wb_wakeup(new_wb); @@ -583,6 +589,7 @@ static bool inode_prepare_wbs_switch(struct inode *inode, static void inode_switch_wbs(struct inode *inode, int new_wb_id) { struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode); struct cgroup_subsys_state *memcg_css; struct inode_switch_wbs_context *isw; @@ -609,7 +616,7 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) if (!memcg_css) goto out_free; - isw->new_wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + isw->new_wb = wb_get_create(bdi, bdi_wb_ctx, memcg_css, GFP_ATOMIC); css_put(memcg_css); if (!isw->new_wb) goto out_free; @@ -678,12 +685,14 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) for (memcg_css = wb->memcg_css->parent; memcg_css; memcg_css = memcg_css->parent) { - isw->new_wb = wb_get_create(wb->bdi, memcg_css, GFP_KERNEL); + isw->new_wb = wb_get_create(wb->bdi, wb->bdi_wb_ctx, + memcg_css, GFP_KERNEL); if (isw->new_wb) break; } + /* wb_get() is noop for bdi's wb */ if (unlikely(!isw->new_wb)) - isw->new_wb = &wb->bdi->wb; /* wb_get() is noop for bdi's wb */ + isw->new_wb = &wb->bdi_wb_ctx->wb; nr = 0; spin_lock(&wb->list_lock); @@ -994,18 +1003,19 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) * total active write bandwidth of @bdi. */ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, struct wb_writeback_work *base_work, bool skip_if_busy) { struct bdi_writeback *last_wb = NULL; - struct bdi_writeback *wb = list_entry(&bdi->wb_list, + struct bdi_writeback *wb = list_entry(&bdi_wb_ctx->wb_list, struct bdi_writeback, bdi_node); might_sleep(); restart: rcu_read_lock(); - list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) { - DEFINE_WB_COMPLETION(fallback_work_done, bdi); + list_for_each_entry_continue_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) { + DEFINE_WB_COMPLETION(fallback_work_done, bdi_wb_ctx); struct wb_writeback_work fallback_work; struct wb_writeback_work *work; long nr_pages; @@ -1103,7 +1113,7 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, * And find the associated wb. If the wb isn't there already * there's nothing to flush, don't create one. 
*/ - wb = wb_get_lookup(bdi, memcg_css); + wb = wb_get_lookup(bdi->wb_ctx_arr[0], memcg_css); if (!wb) { ret = -ENOENT; goto out_css_put; @@ -1189,8 +1199,13 @@ fs_initcall(cgroup_writeback_init); #else /* CONFIG_CGROUP_WRITEBACK */ -static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi) { } -static void bdi_up_write_wb_switch_rwsem(struct backing_dev_info *bdi) { } +static void +bdi_down_write_wb_ctx_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx) +{ } + +static void +bdi_up_write_wb_ctx_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx) +{ } static void inode_cgwb_move_to_attached(struct inode *inode, struct bdi_writeback *wb) @@ -1231,14 +1246,15 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) } static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, struct wb_writeback_work *base_work, bool skip_if_busy) { might_sleep(); - if (!skip_if_busy || !writeback_in_progress(&bdi->wb)) { + if (!skip_if_busy || !writeback_in_progress(&bdi_wb_ctx->wb)) { base_work->auto_free = 0; - wb_queue_work(&bdi->wb, base_work); + wb_queue_work(&bdi_wb_ctx->wb, base_work); } } @@ -2371,7 +2387,7 @@ static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi, if (!bdi_has_dirty_io(bdi)) return; - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) + list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, bdi_node) wb_start_writeback(wb, reason); } @@ -2427,7 +2443,8 @@ static void wakeup_dirtytime_writeback(struct work_struct *w) list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) { struct bdi_writeback *wb; - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) + list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, + bdi_node) if (!list_empty(&wb->b_dirty_time)) wb_wakeup(wb); } @@ -2729,7 +2746,7 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, enum wb_reason reason, bool skip_if_busy) { struct backing_dev_info *bdi = sb->s_bdi; - DEFINE_WB_COMPLETION(done, bdi); + DEFINE_WB_COMPLETION(done, bdi->wb_ctx_arr[0]); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_NONE, @@ -2743,7 +2760,8 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, return; WARN_ON(!rwsem_is_locked(&sb->s_umount)); - bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy); + bdi_split_work_to_wbs(sb->s_bdi, bdi->wb_ctx_arr[0], &work, + skip_if_busy); wb_wait_for_completion(&done); } @@ -2807,7 +2825,7 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb); void sync_inodes_sb(struct super_block *sb) { struct backing_dev_info *bdi = sb->s_bdi; - DEFINE_WB_COMPLETION(done, bdi); + DEFINE_WB_COMPLETION(done, bdi->wb_ctx_arr[0]); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_ALL, @@ -2828,10 +2846,10 @@ void sync_inodes_sb(struct super_block *sb) WARN_ON(!rwsem_is_locked(&sb->s_umount)); /* protect against inode wb switch, see inode_switch_wbs_work_fn() */ - bdi_down_write_wb_switch_rwsem(bdi); - bdi_split_work_to_wbs(bdi, &work, false); + bdi_down_write_wb_ctx_switch_rwsem(bdi->wb_ctx_arr[0]); + bdi_split_work_to_wbs(bdi, bdi->wb_ctx_arr[0], &work, false); wb_wait_for_completion(&done); - bdi_up_write_wb_switch_rwsem(bdi); + bdi_up_write_wb_ctx_switch_rwsem(bdi->wb_ctx_arr[0]); wait_sb_inodes(sb); } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 754378dd9f71..7817219d1599 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1853,9 +1853,9 @@ static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio) { struct 
backing_dev_info *bdi = inode_to_bdi(inode); - dec_wb_stat(&bdi->wb, WB_WRITEBACK); + dec_wb_stat(&bdi->wb_ctx_arr[0]->wb, WB_WRITEBACK); node_stat_sub_folio(folio, NR_WRITEBACK_TEMP); - wb_writeout_inc(&bdi->wb); + wb_writeout_inc(&bdi->wb_ctx_arr[0]->wb); } static void fuse_writepage_finish(struct fuse_writepage_args *wpa) @@ -2142,7 +2142,7 @@ static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struc ap->descs[folio_index].offset = 0; ap->descs[folio_index].length = PAGE_SIZE; - inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); + inc_wb_stat(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb, WB_WRITEBACK); node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP); } diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c index 44e5658b896c..dfc83bd3def3 100644 --- a/fs/gfs2/super.c +++ b/fs/gfs2/super.c @@ -457,7 +457,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc) gfs2_log_flush(GFS2_SB(inode), ip->i_gl, GFS2_LOG_HEAD_FLUSH_NORMAL | GFS2_LFC_WRITE_INODE); - if (bdi->wb.dirty_exceeded) + if (bdi->wb_ctx_arr[0]->wb.dirty_exceeded) gfs2_ail1_flush(sdp, wbc); else filemap_fdatawrite(metamapping); diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 6655e5f32ec6..fd513bf9e875 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -844,7 +844,8 @@ static inline void nfs_folio_mark_unstable(struct folio *folio, * writeback is happening on the server now. */ node_stat_mod_folio(folio, NR_WRITEBACK, nr); - wb_stat_mod(&inode_to_bdi(inode)->wb, WB_WRITEBACK, nr); + wb_stat_mod(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb, + WB_WRITEBACK, nr); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); } } diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 23df8b214474..ec48ec8c2db8 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -932,9 +932,10 @@ static void nfs_folio_clear_commit(struct folio *folio) { if (folio) { long nr = folio_nr_pages(folio); + struct inode *inode = folio->mapping->host; node_stat_mod_folio(folio, NR_WRITEBACK, -nr); - wb_stat_mod(&inode_to_bdi(folio->mapping->host)->wb, + wb_stat_mod(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb, WB_WRITEBACK, -nr); } } diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 2ad261082bba..ec0dd8df1a8c 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -75,10 +75,11 @@ struct wb_completion { * can wait for the completion of all using wb_wait_for_completion(). Work * items which are waited upon aren't freed automatically on completion. 
*/ -#define WB_COMPLETION_INIT(bdi) __WB_COMPLETION_INIT(&(bdi)->wb_waitq) +#define WB_COMPLETION_INIT(bdi_wb_ctx) \ + __WB_COMPLETION_INIT(&(bdi_wb_ctx)->wb_waitq) -#define DEFINE_WB_COMPLETION(cmpl, bdi) \ - struct wb_completion cmpl = WB_COMPLETION_INIT(bdi) +#define DEFINE_WB_COMPLETION(cmpl, bdi_wb_ctx) \ + struct wb_completion cmpl = WB_COMPLETION_INIT(bdi_wb_ctx) /* * Each wb (bdi_writeback) can perform writeback operations, is measured @@ -104,6 +105,7 @@ struct wb_completion { */ struct bdi_writeback { struct backing_dev_info *bdi; /* our parent bdi */ + struct bdi_writeback_ctx *bdi_wb_ctx; unsigned long state; /* Always use atomic bitops on this */ unsigned long last_old_flush; /* last old data flush */ @@ -160,6 +162,16 @@ struct bdi_writeback { #endif }; +struct bdi_writeback_ctx { + struct bdi_writeback wb; /* the root writeback info for this bdi */ + struct list_head wb_list; /* list of all wbs */ +#ifdef CONFIG_CGROUP_WRITEBACK + struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ + struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ +#endif + wait_queue_head_t wb_waitq; +}; + struct backing_dev_info { u64 id; struct rb_node rb_node; /* keyed by ->id */ @@ -183,15 +195,11 @@ struct backing_dev_info { */ unsigned long last_bdp_sleep; - struct bdi_writeback wb; /* the root writeback info for this bdi */ - struct list_head wb_list; /* list of all wbs */ + int nr_wb_ctx; + struct bdi_writeback_ctx **wb_ctx_arr; #ifdef CONFIG_CGROUP_WRITEBACK - struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ struct mutex cgwb_release_mutex; /* protect shutdown of wb structs */ - struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ #endif - wait_queue_head_t wb_waitq; - struct device *dev; char dev_name[64]; struct device *owner; @@ -216,7 +224,7 @@ struct wb_lock_cookie { */ static inline bool wb_tryget(struct bdi_writeback *wb) { - if (wb != &wb->bdi->wb) + if (wb != &wb->bdi_wb_ctx->wb) return percpu_ref_tryget(&wb->refcnt); return true; } @@ -227,7 +235,7 @@ static inline bool wb_tryget(struct bdi_writeback *wb) */ static inline void wb_get(struct bdi_writeback *wb) { - if (wb != &wb->bdi->wb) + if (wb != &wb->bdi_wb_ctx->wb) percpu_ref_get(&wb->refcnt); } @@ -246,7 +254,7 @@ static inline void wb_put_many(struct bdi_writeback *wb, unsigned long nr) return; } - if (wb != &wb->bdi->wb) + if (wb != &wb->bdi_wb_ctx->wb) percpu_ref_put_many(&wb->refcnt, nr); } diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index e721148c95d0..894968e98dd8 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -148,11 +148,20 @@ static inline bool mapping_can_writeback(struct address_space *mapping) return inode_to_bdi(mapping->host)->capabilities & BDI_CAP_WRITEBACK; } +static inline struct bdi_writeback_ctx * +fetch_bdi_writeback_ctx(struct inode *inode) +{ + struct backing_dev_info *bdi = inode_to_bdi(inode); + + return bdi->wb_ctx_arr[0]; +} + #ifdef CONFIG_CGROUP_WRITEBACK -struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, +struct bdi_writeback *wb_get_lookup(struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css); struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css, gfp_t gfp); void wb_memcg_offline(struct mem_cgroup *memcg); @@ -187,16 +196,18 @@ static inline bool inode_cgwb_enabled(struct inode *inode) * Must be called under rcu_read_lock() which 
protects the returend wb. * NULL if not found. */ -static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi) +static inline struct bdi_writeback * +wb_find_current(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { struct cgroup_subsys_state *memcg_css; struct bdi_writeback *wb; memcg_css = task_css(current, memory_cgrp_id); if (!memcg_css->parent) - return &bdi->wb; + return &bdi_wb_ctx->wb; - wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + wb = radix_tree_lookup(&bdi_wb_ctx->cgwb_tree, memcg_css->id); /* * %current's blkcg equals the effective blkcg of its memcg. No @@ -217,12 +228,13 @@ static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi * wb_find_current(). */ static inline struct bdi_writeback * -wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) +wb_get_create_current(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, gfp_t gfp) { struct bdi_writeback *wb; rcu_read_lock(); - wb = wb_find_current(bdi); + wb = wb_find_current(bdi, bdi_wb_ctx); if (wb && unlikely(!wb_tryget(wb))) wb = NULL; rcu_read_unlock(); @@ -231,7 +243,7 @@ wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) struct cgroup_subsys_state *memcg_css; memcg_css = task_get_css(current, memory_cgrp_id); - wb = wb_get_create(bdi, memcg_css, gfp); + wb = wb_get_create(bdi, bdi_wb_ctx, memcg_css, gfp); css_put(memcg_css); } return wb; @@ -265,7 +277,7 @@ static inline struct bdi_writeback *inode_to_wb_wbc( * If wbc does not have inode attached, it means cgroup writeback was * disabled when wbc started. Just use the default wb in that case. */ - return wbc->wb ? wbc->wb : &inode_to_bdi(inode)->wb; + return wbc->wb ? wbc->wb : &fetch_bdi_writeback_ctx(inode)->wb; } /** @@ -325,20 +337,23 @@ static inline bool inode_cgwb_enabled(struct inode *inode) return false; } -static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi) +static inline struct bdi_writeback *wb_find_current( + struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { - return &bdi->wb; + return &bdi_wb_ctx->wb; } static inline struct bdi_writeback * -wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) +wb_get_create_current(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, gfp_t gfp) { - return &bdi->wb; + return &bdi_wb_ctx->wb; } static inline struct bdi_writeback *inode_to_wb(struct inode *inode) { - return &inode_to_bdi(inode)->wb; + return &fetch_bdi_writeback_ctx(inode)->wb; } static inline struct bdi_writeback *inode_to_wb_wbc( diff --git a/include/linux/fs.h b/include/linux/fs.h index d5988867fe31..09575c399ccc 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2289,7 +2289,6 @@ struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); void (*free_inode)(struct inode *); - void (*dirty_inode) (struct inode *, int flags); int (*write_inode) (struct inode *, struct writeback_control *wbc); int (*drop_inode) (struct inode *); diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 783904d8c5ef..0efa9632011a 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -84,13 +84,14 @@ static void collect_wb_stats(struct wb_stats *stats, } #ifdef CONFIG_CGROUP_WRITEBACK + static void bdi_collect_stats(struct backing_dev_info *bdi, struct wb_stats *stats) { struct bdi_writeback *wb; rcu_read_lock(); - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) { + list_for_each_entry_rcu(wb, 
&bdi->wb_ctx_arr[0]->wb_list, bdi_node) { if (!wb_tryget(wb)) continue; @@ -103,7 +104,7 @@ static void bdi_collect_stats(struct backing_dev_info *bdi, static void bdi_collect_stats(struct backing_dev_info *bdi, struct wb_stats *stats) { - collect_wb_stats(stats, &bdi->wb); + collect_wb_stats(stats, &bdi->wb_ctx_arr[0]->wb); } #endif @@ -149,7 +150,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) stats.nr_io, stats.nr_more_io, stats.nr_dirty_time, - !list_empty(&bdi->bdi_list), bdi->wb.state); + !list_empty(&bdi->bdi_list), bdi->wb_ctx_arr[0]->wb.state); return 0; } @@ -193,14 +194,14 @@ static void wb_stats_show(struct seq_file *m, struct bdi_writeback *wb, static int cgwb_debug_stats_show(struct seq_file *m, void *v) { struct backing_dev_info *bdi = m->private; + struct bdi_writeback *wb; unsigned long background_thresh; unsigned long dirty_thresh; - struct bdi_writeback *wb; global_dirty_limits(&background_thresh, &dirty_thresh); rcu_read_lock(); - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) { + list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, bdi_node) { struct wb_stats stats = { .dirty_thresh = dirty_thresh }; if (!wb_tryget(wb)) @@ -520,6 +521,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi, memset(wb, 0, sizeof(*wb)); wb->bdi = bdi; + wb->bdi_wb_ctx = bdi->wb_ctx_arr[0]; wb->last_old_flush = jiffies; INIT_LIST_HEAD(&wb->b_dirty); INIT_LIST_HEAD(&wb->b_io); @@ -643,11 +645,12 @@ static void cgwb_release(struct percpu_ref *refcnt) queue_work(cgwb_release_wq, &wb->release_work); } -static void cgwb_kill(struct bdi_writeback *wb) +static void cgwb_kill(struct bdi_writeback *wb, + struct bdi_writeback_ctx *bdi_wb_ctx) { lockdep_assert_held(&cgwb_lock); - WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id)); + WARN_ON(!radix_tree_delete(&bdi_wb_ctx->cgwb_tree, wb->memcg_css->id)); list_del(&wb->memcg_node); list_del(&wb->blkcg_node); list_add(&wb->offline_node, &offline_cgwbs); @@ -662,6 +665,7 @@ static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) } static int cgwb_create(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css, gfp_t gfp) { struct mem_cgroup *memcg; @@ -678,9 +682,9 @@ static int cgwb_create(struct backing_dev_info *bdi, /* look up again under lock and discard on blkcg mismatch */ spin_lock_irqsave(&cgwb_lock, flags); - wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + wb = radix_tree_lookup(&bdi_wb_ctx->cgwb_tree, memcg_css->id); if (wb && wb->blkcg_css != blkcg_css) { - cgwb_kill(wb); + cgwb_kill(wb, bdi_wb_ctx); wb = NULL; } spin_unlock_irqrestore(&cgwb_lock, flags); @@ -721,12 +725,13 @@ static int cgwb_create(struct backing_dev_info *bdi, */ ret = -ENODEV; spin_lock_irqsave(&cgwb_lock, flags); - if (test_bit(WB_registered, &bdi->wb.state) && + if (test_bit(WB_registered, &bdi_wb_ctx->wb.state) && blkcg_cgwb_list->next && memcg_cgwb_list->next) { /* we might have raced another instance of this function */ - ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb); + ret = radix_tree_insert(&bdi_wb_ctx->cgwb_tree, + memcg_css->id, wb); if (!ret) { - list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list); + list_add_tail_rcu(&wb->bdi_node, &bdi_wb_ctx->wb_list); list_add(&wb->memcg_node, memcg_cgwb_list); list_add(&wb->blkcg_node, blkcg_cgwb_list); blkcg_pin_online(blkcg_css); @@ -779,16 +784,16 @@ static int cgwb_create(struct backing_dev_info *bdi, * each lookup. 
On mismatch, the existing wb is discarded and a new one is * created. */ -struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, +struct bdi_writeback *wb_get_lookup(struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css) { struct bdi_writeback *wb; if (!memcg_css->parent) - return &bdi->wb; + return &bdi_wb_ctx->wb; rcu_read_lock(); - wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + wb = radix_tree_lookup(&bdi_wb_ctx->cgwb_tree, memcg_css->id); if (wb) { struct cgroup_subsys_state *blkcg_css; @@ -813,6 +818,7 @@ struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, * create one. See wb_get_lookup() for more details. */ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css, gfp_t gfp) { @@ -821,8 +827,8 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, might_alloc(gfp); do { - wb = wb_get_lookup(bdi, memcg_css); - } while (!wb && !cgwb_create(bdi, memcg_css, gfp)); + wb = wb_get_lookup(bdi_wb_ctx, memcg_css); + } while (!wb && !cgwb_create(bdi, bdi_wb_ctx, memcg_css, gfp)); return wb; } @@ -830,36 +836,40 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, static int cgwb_bdi_init(struct backing_dev_info *bdi) { int ret; + struct bdi_writeback_ctx *bdi_wb_ctx = bdi->wb_ctx_arr[0]; - INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC); + INIT_RADIX_TREE(&bdi_wb_ctx->cgwb_tree, GFP_ATOMIC); mutex_init(&bdi->cgwb_release_mutex); - init_rwsem(&bdi->wb_switch_rwsem); + init_rwsem(&bdi_wb_ctx->wb_switch_rwsem); - ret = wb_init(&bdi->wb, bdi, GFP_KERNEL); + ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL); if (!ret) { - bdi->wb.memcg_css = &root_mem_cgroup->css; - bdi->wb.blkcg_css = blkcg_root_css; + bdi_wb_ctx->wb.memcg_css = &root_mem_cgroup->css; + bdi_wb_ctx->wb.blkcg_css = blkcg_root_css; } return ret; } -static void cgwb_bdi_unregister(struct backing_dev_info *bdi) +/* callers should create a loop and pass bdi_wb_ctx */ +static void cgwb_bdi_unregister(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { struct radix_tree_iter iter; void **slot; struct bdi_writeback *wb; - WARN_ON(test_bit(WB_registered, &bdi->wb.state)); + WARN_ON(test_bit(WB_registered, &bdi_wb_ctx->wb.state)); spin_lock_irq(&cgwb_lock); - radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) - cgwb_kill(*slot); + radix_tree_for_each_slot(slot, &bdi_wb_ctx->cgwb_tree, &iter, 0) + cgwb_kill(*slot, bdi_wb_ctx); spin_unlock_irq(&cgwb_lock); mutex_lock(&bdi->cgwb_release_mutex); spin_lock_irq(&cgwb_lock); - while (!list_empty(&bdi->wb_list)) { - wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, + while (!list_empty(&bdi_wb_ctx->wb_list)) { + wb = list_first_entry(&bdi_wb_ctx->wb_list, + struct bdi_writeback, bdi_node); spin_unlock_irq(&cgwb_lock); wb_shutdown(wb); @@ -930,7 +940,7 @@ void wb_memcg_offline(struct mem_cgroup *memcg) spin_lock_irq(&cgwb_lock); list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node) - cgwb_kill(wb); + cgwb_kill(wb, wb->bdi_wb_ctx); memcg_cgwb_list->next = NULL; /* prevent new wb's */ spin_unlock_irq(&cgwb_lock); @@ -950,15 +960,16 @@ void wb_blkcg_offline(struct cgroup_subsys_state *css) spin_lock_irq(&cgwb_lock); list_for_each_entry_safe(wb, next, list, blkcg_node) - cgwb_kill(wb); + cgwb_kill(wb, wb->bdi_wb_ctx); list->next = NULL; /* prevent new wb's */ spin_unlock_irq(&cgwb_lock); } -static void cgwb_bdi_register(struct backing_dev_info *bdi) +static void cgwb_bdi_register(struct 
backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { spin_lock_irq(&cgwb_lock); - list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list); + list_add_tail_rcu(&bdi_wb_ctx->wb.bdi_node, &bdi_wb_ctx->wb_list); spin_unlock_irq(&cgwb_lock); } @@ -981,14 +992,18 @@ subsys_initcall(cgwb_init); static int cgwb_bdi_init(struct backing_dev_info *bdi) { - return wb_init(&bdi->wb, bdi, GFP_KERNEL); + return wb_init(&bdi->wb_ctx_arr[0]->wb, bdi, GFP_KERNEL); } -static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { } +static void cgwb_bdi_unregister(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) +{ } -static void cgwb_bdi_register(struct backing_dev_info *bdi) +/* callers should create a loop and pass bdi_wb_ctx */ +static void cgwb_bdi_register(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { - list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list); + list_add_tail_rcu(&bdi_wb_ctx->wb.bdi_node, &bdi_wb_ctx->wb_list); } static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) @@ -1006,9 +1021,15 @@ int bdi_init(struct backing_dev_info *bdi) bdi->min_ratio = 0; bdi->max_ratio = 100 * BDI_RATIO_SCALE; bdi->max_prop_frac = FPROP_FRAC_BASE; + bdi->nr_wb_ctx = 1; + bdi->wb_ctx_arr = kcalloc(bdi->nr_wb_ctx, + sizeof(struct bdi_writeback_ctx *), + GFP_KERNEL); INIT_LIST_HEAD(&bdi->bdi_list); - INIT_LIST_HEAD(&bdi->wb_list); - init_waitqueue_head(&bdi->wb_waitq); + bdi->wb_ctx_arr[0] = (struct bdi_writeback_ctx *) + kzalloc(sizeof(struct bdi_writeback_ctx), GFP_KERNEL); + INIT_LIST_HEAD(&bdi->wb_ctx_arr[0]->wb_list); + init_waitqueue_head(&bdi->wb_ctx_arr[0]->wb_waitq); bdi->last_bdp_sleep = jiffies; return cgwb_bdi_init(bdi); @@ -1023,6 +1044,8 @@ struct backing_dev_info *bdi_alloc(int node_id) return NULL; if (bdi_init(bdi)) { + kfree(bdi->wb_ctx_arr[0]); + kfree(bdi->wb_ctx_arr); kfree(bdi); return NULL; } @@ -1095,11 +1118,11 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) if (IS_ERR(dev)) return PTR_ERR(dev); - cgwb_bdi_register(bdi); + cgwb_bdi_register(bdi, bdi->wb_ctx_arr[0]); + set_bit(WB_registered, &bdi->wb_ctx_arr[0]->wb.state); bdi->dev = dev; bdi_debug_register(bdi, dev_name(dev)); - set_bit(WB_registered, &bdi->wb.state); spin_lock_bh(&bdi_lock); @@ -1155,8 +1178,8 @@ void bdi_unregister(struct backing_dev_info *bdi) /* make sure nobody finds us on the bdi_list anymore */ bdi_remove_from_list(bdi); - wb_shutdown(&bdi->wb); - cgwb_bdi_unregister(bdi); + wb_shutdown(&bdi->wb_ctx_arr[0]->wb); + cgwb_bdi_unregister(bdi, bdi->wb_ctx_arr[0]); /* * If this BDI's min ratio has been set, use bdi_set_min_ratio() to @@ -1183,9 +1206,11 @@ static void release_bdi(struct kref *ref) struct backing_dev_info *bdi = container_of(ref, struct backing_dev_info, refcnt); - WARN_ON_ONCE(test_bit(WB_registered, &bdi->wb.state)); WARN_ON_ONCE(bdi->dev); - wb_exit(&bdi->wb); + WARN_ON_ONCE(test_bit(WB_registered, &bdi->wb_ctx_arr[0]->wb.state)); + wb_exit(&bdi->wb_ctx_arr[0]->wb); + kfree(bdi->wb_ctx_arr[0]); + kfree(bdi->wb_ctx_arr); kfree(bdi); } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index c81624bc3969..b27416da569b 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2049,6 +2049,7 @@ int balance_dirty_pages_ratelimited_flags(struct address_space *mapping, { struct inode *inode = mapping->host; struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode); struct bdi_writeback *wb = NULL; int ratelimit; int ret = 0; @@ -2058,9 
+2059,9 @@ int balance_dirty_pages_ratelimited_flags(struct address_space *mapping, return ret; if (inode_cgwb_enabled(inode)) - wb = wb_get_create_current(bdi, GFP_KERNEL); + wb = wb_get_create_current(bdi, bdi_wb_ctx, GFP_KERNEL); if (!wb) - wb = &bdi->wb; + wb = &bdi_wb_ctx->wb; ratelimit = current->nr_dirtied_pause; if (wb->dirty_exceeded) -- 2.25.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
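Taking stock of what this first patch establishes: the root wb and its
bookkeeping move from the bdi into a context object, and every former
bdi->wb access becomes a hop through wb_ctx_arr[0]. A minimal
self-contained model of that layering (all types here are stubs of our
own, reduced to what the indirection needs):

struct bdi_writeback {
	int dirty_exceeded;		/* stub for the real wb state */
};

struct bdi_writeback_ctx {
	struct bdi_writeback wb;	/* root wb now lives here */
};

struct backing_dev_info {
	int nr_wb_ctx;			/* still 1 after this patch */
	struct bdi_writeback_ctx **wb_ctx_arr;
};

/* old: &bdi->wb   new: &bdi->wb_ctx_arr[0]->wb
 * Same object, one extra indirection, which is why this patch is not
 * a functional change. */
static struct bdi_writeback *bdi_root_wb(struct backing_dev_info *bdi)
{
	return &bdi->wb_ctx_arr[0]->wb;
}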
* [PATCH 02/13] writeback: add support to initialize and free multiple writeback ctxs [not found] ` <CGME20250529113224epcas5p2eea35fd0ebe445d8ad0471a144714b23@epcas5p2.samsung.com> @ 2025-05-29 11:14 ` Kundan Kumar 0 siblings, 0 replies; 40+ messages in thread From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw) To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, p.raghav, da.gomez Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Kundan Kumar, Anuj Gupta Introduce a new macro for_each_bdi_wb_ctx to iterate over multiple writeback ctxs. Added logic for allocation, init, free, registration and unregistration of multiple writeback contexts within a bdi. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- include/linux/backing-dev.h | 4 ++ mm/backing-dev.c | 79 +++++++++++++++++++++++++++---------- 2 files changed, 62 insertions(+), 21 deletions(-) diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 894968e98dd8..fbccb483e59c 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -148,6 +148,10 @@ static inline bool mapping_can_writeback(struct address_space *mapping) return inode_to_bdi(mapping->host)->capabilities & BDI_CAP_WRITEBACK; } +#define for_each_bdi_wb_ctx(bdi, wb_ctx) \ + for (int __i = 0; __i < (bdi)->nr_wb_ctx \ + && ((wb_ctx) = (bdi)->wb_ctx_arr[__i]) != NULL; __i++) + static inline struct bdi_writeback_ctx * fetch_bdi_writeback_ctx(struct inode *inode) { diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 0efa9632011a..adf87b036827 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -836,16 +836,19 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, static int cgwb_bdi_init(struct backing_dev_info *bdi) { int ret; - struct bdi_writeback_ctx *bdi_wb_ctx = bdi->wb_ctx_arr[0]; + struct bdi_writeback_ctx *bdi_wb_ctx; - INIT_RADIX_TREE(&bdi_wb_ctx->cgwb_tree, GFP_ATOMIC); - mutex_init(&bdi->cgwb_release_mutex); - init_rwsem(&bdi_wb_ctx->wb_switch_rwsem); + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + INIT_RADIX_TREE(&bdi_wb_ctx->cgwb_tree, GFP_ATOMIC); + mutex_init(&bdi->cgwb_release_mutex); + init_rwsem(&bdi_wb_ctx->wb_switch_rwsem); - ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL); - if (!ret) { - bdi_wb_ctx->wb.memcg_css = &root_mem_cgroup->css; - bdi_wb_ctx->wb.blkcg_css = blkcg_root_css; + ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL); + if (!ret) { + bdi_wb_ctx->wb.memcg_css = &root_mem_cgroup->css; + bdi_wb_ctx->wb.blkcg_css = blkcg_root_css; + } else + return ret; } return ret; } @@ -992,7 +995,16 @@ subsys_initcall(cgwb_init); static int cgwb_bdi_init(struct backing_dev_info *bdi) { - return wb_init(&bdi->wb_ctx_arr[0]->wb, bdi, GFP_KERNEL); + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + int ret; + + ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL); + if (ret) + return ret; + } + return 0; } static void cgwb_bdi_unregister(struct backing_dev_info *bdi, @@ -1026,10 +1038,19 @@ int bdi_init(struct backing_dev_info *bdi) sizeof(struct bdi_writeback_ctx *), GFP_KERNEL); INIT_LIST_HEAD(&bdi->bdi_list); - bdi->wb_ctx_arr[0] = (struct bdi_writeback_ctx *) - kzalloc(sizeof(struct bdi_writeback_ctx), GFP_KERNEL); - INIT_LIST_HEAD(&bdi->wb_ctx_arr[0]->wb_list); - init_waitqueue_head(&bdi->wb_ctx_arr[0]->wb_waitq); + for (int i = 0; i < bdi->nr_wb_ctx; i++) { + bdi->wb_ctx_arr[i] = (struct 
bdi_writeback_ctx *) + kzalloc(sizeof(struct bdi_writeback_ctx), GFP_KERNEL); + if (!bdi->wb_ctx_arr[i]) { + pr_err("Failed to allocate %d", i); + while (--i >= 0) + kfree(bdi->wb_ctx_arr[i]); + kfree(bdi->wb_ctx_arr); + return -ENOMEM; + } + INIT_LIST_HEAD(&bdi->wb_ctx_arr[i]->wb_list); + init_waitqueue_head(&bdi->wb_ctx_arr[i]->wb_waitq); + } bdi->last_bdp_sleep = jiffies; return cgwb_bdi_init(bdi); @@ -1038,13 +1059,16 @@ int bdi_init(struct backing_dev_info *bdi) struct backing_dev_info *bdi_alloc(int node_id) { struct backing_dev_info *bdi; + struct bdi_writeback_ctx *bdi_wb_ctx; bdi = kzalloc_node(sizeof(*bdi), GFP_KERNEL, node_id); if (!bdi) return NULL; if (bdi_init(bdi)) { - kfree(bdi->wb_ctx_arr[0]); + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + kfree(bdi_wb_ctx); + } kfree(bdi->wb_ctx_arr); kfree(bdi); return NULL; @@ -1109,6 +1133,7 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) { struct device *dev; struct rb_node *parent, **p; + struct bdi_writeback_ctx *bdi_wb_ctx; if (bdi->dev) /* The driver needs to use separate queues per device */ return 0; @@ -1118,8 +1143,11 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) if (IS_ERR(dev)) return PTR_ERR(dev); - cgwb_bdi_register(bdi, bdi->wb_ctx_arr[0]); - set_bit(WB_registered, &bdi->wb_ctx_arr[0]->wb.state); + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + cgwb_bdi_register(bdi, bdi_wb_ctx); + set_bit(WB_registered, &bdi_wb_ctx->wb.state); + } + bdi->dev = dev; bdi_debug_register(bdi, dev_name(dev)); @@ -1174,12 +1202,17 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi) void bdi_unregister(struct backing_dev_info *bdi) { + struct bdi_writeback_ctx *bdi_wb_ctx; + timer_delete_sync(&bdi->laptop_mode_wb_timer); /* make sure nobody finds us on the bdi_list anymore */ bdi_remove_from_list(bdi); - wb_shutdown(&bdi->wb_ctx_arr[0]->wb); - cgwb_bdi_unregister(bdi, bdi->wb_ctx_arr[0]); + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + wb_shutdown(&bdi_wb_ctx->wb); + cgwb_bdi_unregister(bdi, bdi_wb_ctx); + } /* * If this BDI's min ratio has been set, use bdi_set_min_ratio() to @@ -1205,11 +1238,15 @@ static void release_bdi(struct kref *ref) { struct backing_dev_info *bdi = container_of(ref, struct backing_dev_info, refcnt); + struct bdi_writeback_ctx *bdi_wb_ctx; WARN_ON_ONCE(bdi->dev); - WARN_ON_ONCE(test_bit(WB_registered, &bdi->wb_ctx_arr[0]->wb.state)); - wb_exit(&bdi->wb_ctx_arr[0]->wb); - kfree(bdi->wb_ctx_arr[0]); + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + WARN_ON_ONCE(test_bit(WB_registered, &bdi_wb_ctx->wb.state)); + wb_exit(&bdi_wb_ctx->wb); + kfree(bdi_wb_ctx); + } kfree(bdi->wb_ctx_arr); kfree(bdi); } -- 2.25.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
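A userspace model of the allocation and iteration pattern this patch
introduces; the macro body is copied from the patch (modulo kernel
types), the rest is a sketch with names of our own. Note that the
macro stops at the first NULL slot, so it assumes wb_ctx_arr is fully
populated, which bdi_init()'s all-or-nothing unwind guarantees:

#include <stdio.h>
#include <stdlib.h>

struct bdi_writeback_ctx { int id; };

struct backing_dev_info {
	int nr_wb_ctx;
	struct bdi_writeback_ctx **wb_ctx_arr;
};

#define for_each_bdi_wb_ctx(bdi, wb_ctx) \
	for (int __i = 0; __i < (bdi)->nr_wb_ctx \
	     && ((wb_ctx) = (bdi)->wb_ctx_arr[__i]) != NULL; __i++)

static int bdi_init_ctxs(struct backing_dev_info *bdi, int nr)
{
	bdi->nr_wb_ctx = nr;
	bdi->wb_ctx_arr = calloc(nr, sizeof(*bdi->wb_ctx_arr));
	if (!bdi->wb_ctx_arr)
		return -1;
	for (int i = 0; i < nr; i++) {
		bdi->wb_ctx_arr[i] = calloc(1, sizeof(struct bdi_writeback_ctx));
		if (!bdi->wb_ctx_arr[i]) {
			/* unwind the partial allocation, as bdi_init() does */
			while (--i >= 0)
				free(bdi->wb_ctx_arr[i]);
			free(bdi->wb_ctx_arr);
			return -1;
		}
		bdi->wb_ctx_arr[i]->id = i;
	}
	return 0;
}

int main(void)
{
	struct backing_dev_info bdi;
	struct bdi_writeback_ctx *wb_ctx;

	if (bdi_init_ctxs(&bdi, 4))
		return 1;
	for_each_bdi_wb_ctx(&bdi, wb_ctx)
		printf("ctx %d\n", wb_ctx->id);
	return 0;
}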
* [PATCH 03/13] writeback: link bdi_writeback to its corresponding bdi_writeback_ctx
       [not found] ` <CGME20250529113228epcas5p1db88ab42c2dac0698d715e38bd5e0896@epcas5p1.samsung.com>
@ 2025-05-29 11:14   ` Kundan Kumar
  0 siblings, 0 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
      anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
      ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
      gost.dev, Kundan Kumar, Anuj Gupta

Link each bdi_writeback to the writeback context it belongs to by
passing the context into wb_init(), instead of assuming wb_ctx_arr[0].
This makes it possible to fetch the owning writeback context from a
bdi_writeback.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 mm/backing-dev.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c index adf87b036827..5479e2d34160 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -513,15 +513,16 @@ static void wb_update_bandwidth_workfn(struct work_struct *work) */ #define INIT_BW (100 << (20 - PAGE_SHIFT)) -static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi, - gfp_t gfp) +static int wb_init(struct bdi_writeback *wb, + struct bdi_writeback_ctx *bdi_wb_ctx, + struct backing_dev_info *bdi, gfp_t gfp) { int err; memset(wb, 0, sizeof(*wb)); wb->bdi = bdi; - wb->bdi_wb_ctx = bdi->wb_ctx_arr[0]; + wb->bdi_wb_ctx = bdi_wb_ctx; wb->last_old_flush = jiffies; INIT_LIST_HEAD(&wb->b_dirty); INIT_LIST_HEAD(&wb->b_io); @@ -698,7 +699,7 @@ static int cgwb_create(struct backing_dev_info *bdi, goto out_put; } - ret = wb_init(wb, bdi, gfp); + ret = wb_init(wb, bdi_wb_ctx, bdi, gfp); if (ret) goto err_free; @@ -843,7 +844,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi) mutex_init(&bdi->cgwb_release_mutex); init_rwsem(&bdi_wb_ctx->wb_switch_rwsem); - ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL); + ret = wb_init(&bdi_wb_ctx->wb, bdi_wb_ctx, bdi, GFP_KERNEL); if (!ret) { bdi_wb_ctx->wb.memcg_css = &root_mem_cgroup->css; bdi_wb_ctx->wb.blkcg_css = blkcg_root_css; @@ -1000,7 +1001,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi) for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { int ret; - ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL); + ret = wb_init(&bdi_wb_ctx->wb, bdi_wb_ctx, bdi, GFP_KERNEL); if (ret) return ret; } -- 2.25.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
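What the back-link buys, reduced to a self-contained sketch: the "is
this the root wb of its context?" test (used by wb_tryget()/wb_get()
since patch 1) no longer needs to reach through the bdi at all. Types
are pared down, and wb_is_ctx_root() is our illustrative name; the
kernel open-codes the comparison:

struct bdi_writeback_ctx;

struct bdi_writeback {
	struct bdi_writeback_ctx *bdi_wb_ctx;	/* owning context */
};

struct bdi_writeback_ctx {
	struct bdi_writeback wb;		/* embedded root wb */
};

/* The embedded root wb is not refcounted; cgroup wbs are. The
 * back-pointer alone distinguishes the two. */
static int wb_is_ctx_root(struct bdi_writeback *wb)
{
	return wb == &wb->bdi_wb_ctx->wb;
}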
* [PATCH 04/13] writeback: affine inode to a writeback ctx within a bdi
       [not found] ` <CGME20250529113232epcas5p4e6f3b2f03d3a5f8fcaace3ddd03298d0@epcas5p4.samsung.com>
@ 2025-05-29 11:14   ` Kundan Kumar
  2025-06-02 14:24     ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
      anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
      ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
      gost.dev, Kundan Kumar, Anuj Gupta

Affine an inode to a writeback context. This helps in minimizing
filesystem fragmentation caused by the same inode being processed by
different threads.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/fs-writeback.c           | 3 ++-
 include/linux/backing-dev.h | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 0959fff46235..9529e16c9b66 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -265,7 +265,8 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio) { struct backing_dev_info *bdi = inode_to_bdi(inode); struct bdi_writeback *wb = NULL; - struct bdi_writeback_ctx *bdi_writeback_ctx = bdi->wb_ctx_arr[0]; + struct bdi_writeback_ctx *bdi_writeback_ctx = + fetch_bdi_writeback_ctx(inode); if (inode_cgwb_enabled(inode)) { struct cgroup_subsys_state *memcg_css; diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index fbccb483e59c..30a812fbd488 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -157,7 +157,7 @@ fetch_bdi_writeback_ctx(struct inode *inode) { struct backing_dev_info *bdi = inode_to_bdi(inode); - return bdi->wb_ctx_arr[0]; + return bdi->wb_ctx_arr[inode->i_ino % bdi->nr_wb_ctx]; } #ifdef CONFIG_CGROUP_WRITEBACK -- 2.25.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 04/13] writeback: affine inode to a writeback ctx within a bdi
  2025-05-29 11:14   ` [PATCH 04/13] writeback: affine inode to a writeback ctx within a bdi Kundan Kumar
@ 2025-06-02 14:24     ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-02 14:24 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
      anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
      ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
      linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Anuj Gupta

On Thu, May 29, 2025 at 04:44:55PM +0530, Kundan Kumar wrote:
> @@ -157,7 +157,7 @@ fetch_bdi_writeback_ctx(struct inode *inode)
>  {
>  	struct backing_dev_info *bdi = inode_to_bdi(inode);
>
> -	return bdi->wb_ctx_arr[0];
> +	return bdi->wb_ctx_arr[inode->i_ino % bdi->nr_wb_ctx];

Most modern file systems use 64-bit inode numbers, while i_ino sadly
still is only an ino_t that can be 32 bits wide.  So we'll either need
an ugly fs hook here, or maybe convince Linus that it finally is time
for a 64-bit i_ino (which would also clean up a lot of mess in the
file systems and this constant source of confusion).

^ permalink raw reply	[flat|nested] 40+ messages in thread
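A tiny demonstration of the width problem raised above: once a 64-bit
inode number is squeezed through a 32-bit ino_t, inodes that are
distinct in 64 bits can collapse onto the same value, and the spread
over contexts differs from what the full number would give. The inode
number and context count below are made up purely for illustration:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* hypothetical 64-bit inode number with bits set above 2^32 */
	uint64_t ino = 0x100000000ULL;
	uint32_t truncated = (uint32_t)ino;	/* what a 32-bit ino_t keeps */
	int nr_wb_ctx = 3;	/* any non-power-of-two count makes it visible */

	printf("full ino      -> ctx %d\n", (int)(ino % nr_wb_ctx));	   /* 1 */
	printf("truncated ino -> ctx %d\n", (int)(truncated % nr_wb_ctx)); /* 0 */
	return 0;
}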
* [PATCH 05/13] writeback: modify bdi_writeback search logic to search across all wb ctxs
       [not found] ` <CGME20250529113236epcas5p2049b6cc3be27d8727ac1f15697987ff5@epcas5p2.samsung.com>
@ 2025-05-29 11:14   ` Kundan Kumar
  0 siblings, 0 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
      anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
      ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
      gost.dev, Kundan Kumar, Anuj Gupta

There are now multiple cgwb trees per bdi, one embedded in each
writeback context, so iterate over all of them to find the associated
writeback.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/fs-writeback.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 9529e16c9b66..72b73c3353fe 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1091,6 +1091,7 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, struct backing_dev_info *bdi; struct cgroup_subsys_state *memcg_css; struct bdi_writeback *wb; + struct bdi_writeback_ctx *bdi_wb_ctx; struct wb_writeback_work *work; unsigned long dirty; int ret; @@ -1114,7 +1115,11 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, * And find the associated wb. If the wb isn't there already * there's nothing to flush, don't create one. */ - wb = wb_get_lookup(bdi->wb_ctx_arr[0], memcg_css); + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + wb = wb_get_lookup(bdi_wb_ctx, memcg_css); + if (wb) + break; + } if (!wb) { ret = -ENOENT; goto out_css_put; -- 2.25.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
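The shape of the new search, as a self-contained sketch: each context
owns a private cgwb tree, so a bdi-wide lookup walks the contexts and
returns the first hit. All names here are our reductions of the real
structures; note that a memcg may have cgwbs in several contexts, and,
like the patch, this returns only the first one found:

#include <stddef.h>

struct bdi_writeback { int memcg_id; };

struct bdi_writeback_ctx {
	struct bdi_writeback *wbs;	/* stand-in for the cgwb radix tree */
	int nr_wbs;
};

struct backing_dev_info {
	int nr_wb_ctx;
	struct bdi_writeback_ctx **wb_ctx_arr;
};

/* stand-in for wb_get_lookup() on one context's tree */
static struct bdi_writeback *
ctx_lookup(struct bdi_writeback_ctx *ctx, int memcg_id)
{
	int i;

	for (i = 0; i < ctx->nr_wbs; i++)
		if (ctx->wbs[i].memcg_id == memcg_id)
			return &ctx->wbs[i];
	return NULL;
}

/* mirrors the loop this patch adds to cgroup_writeback_by_id() */
static struct bdi_writeback *
bdi_lookup(struct backing_dev_info *bdi, int memcg_id)
{
	int i;

	for (i = 0; i < bdi->nr_wb_ctx; i++) {
		struct bdi_writeback *wb =
			ctx_lookup(bdi->wb_ctx_arr[i], memcg_id);
		if (wb)
			return wb;
	}
	return NULL;
}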
* [PATCH 06/13] writeback: invoke all writeback contexts for flusher and dirtytime writeback [not found] ` <CGME20250529113240epcas5p295dcf9a016cc28e5c3e88d698808f645@epcas5p2.samsung.com> @ 2025-05-29 11:14 ` Kundan Kumar 0 siblings, 0 replies; 40+ messages in thread From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw) To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, p.raghav, da.gomez Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Kundan Kumar, Anuj Gupta Modify flusher and dirtytime logic to iterate through all the writeback contexts. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- fs/fs-writeback.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 72b73c3353fe..9b0940a6fe78 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -2389,12 +2389,14 @@ static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi, enum wb_reason reason) { struct bdi_writeback *wb; + struct bdi_writeback_ctx *bdi_wb_ctx; if (!bdi_has_dirty_io(bdi)) return; - list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, bdi_node) - wb_start_writeback(wb, reason); + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) + list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) + wb_start_writeback(wb, reason); } void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi, @@ -2444,15 +2446,17 @@ static DECLARE_DELAYED_WORK(dirtytime_work, wakeup_dirtytime_writeback); static void wakeup_dirtytime_writeback(struct work_struct *w) { struct backing_dev_info *bdi; + struct bdi_writeback_ctx *bdi_wb_ctx; rcu_read_lock(); list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) { struct bdi_writeback *wb; - list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, - bdi_node) - if (!list_empty(&wb->b_dirty_time)) - wb_wakeup(wb); + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) + list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, + bdi_node) + if (!list_empty(&wb->b_dirty_time)) + wb_wakeup(wb); } rcu_read_unlock(); schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ); -- 2.25.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 07/13] writeback: modify sync related functions to iterate over all writeback contexts [not found] ` <CGME20250529113245epcas5p2978b77ce5ccf2d620f2a9ee5e796bee3@epcas5p2.samsung.com> @ 2025-05-29 11:14 ` Kundan Kumar 0 siblings, 0 replies; 40+ messages in thread From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw) To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, p.raghav, da.gomez Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Kundan Kumar, Anuj Gupta Modify sync related functions to iterate over all writeback contexts. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- fs/fs-writeback.c | 66 +++++++++++++++++++++++++++++++---------------- 1 file changed, 44 insertions(+), 22 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 9b0940a6fe78..7558b8a33fe0 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -2752,11 +2752,13 @@ static void wait_sb_inodes(struct super_block *sb) mutex_unlock(&sb->s_sync_lock); } -static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, - enum wb_reason reason, bool skip_if_busy) +static void __writeback_inodes_sb_nr_ctx(struct super_block *sb, + unsigned long nr, + enum wb_reason reason, + bool skip_if_busy, + struct bdi_writeback_ctx *bdi_wb_ctx) { - struct backing_dev_info *bdi = sb->s_bdi; - DEFINE_WB_COMPLETION(done, bdi->wb_ctx_arr[0]); + DEFINE_WB_COMPLETION(done, bdi_wb_ctx); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_NONE, @@ -2766,13 +2768,23 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, .reason = reason, }; + bdi_split_work_to_wbs(sb->s_bdi, bdi_wb_ctx, &work, skip_if_busy); + wb_wait_for_completion(&done); +} + +static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, + enum wb_reason reason, bool skip_if_busy) +{ + struct backing_dev_info *bdi = sb->s_bdi; + struct bdi_writeback_ctx *bdi_wb_ctx; + if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) return; WARN_ON(!rwsem_is_locked(&sb->s_umount)); - bdi_split_work_to_wbs(sb->s_bdi, bdi->wb_ctx_arr[0], &work, - skip_if_busy); - wb_wait_for_completion(&done); + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) + __writeback_inodes_sb_nr_ctx(sb, nr, reason, skip_if_busy, + bdi_wb_ctx); } /** @@ -2825,17 +2837,11 @@ void try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason) } EXPORT_SYMBOL(try_to_writeback_inodes_sb); -/** - * sync_inodes_sb - sync sb inode pages - * @sb: the superblock - * - * This function writes and waits on any dirty inode belonging to this - * super_block. 
- */ -void sync_inodes_sb(struct super_block *sb) +static void sync_inodes_bdi_wb_ctx(struct super_block *sb, + struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { - struct backing_dev_info *bdi = sb->s_bdi; - DEFINE_WB_COMPLETION(done, bdi->wb_ctx_arr[0]); + DEFINE_WB_COMPLETION(done, bdi_wb_ctx); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_ALL, @@ -2846,6 +2852,25 @@ void sync_inodes_sb(struct super_block *sb) .for_sync = 1, }; + /* protect against inode wb switch, see inode_switch_wbs_work_fn() */ + bdi_down_write_wb_ctx_switch_rwsem(bdi_wb_ctx); + bdi_split_work_to_wbs(bdi, bdi_wb_ctx, &work, false); + wb_wait_for_completion(&done); + bdi_up_write_wb_ctx_switch_rwsem(bdi_wb_ctx); +} + +/** + * sync_inodes_sb - sync sb inode pages + * @sb: the superblock + * + * This function writes and waits on any dirty inode belonging to this + * super_block. + */ +void sync_inodes_sb(struct super_block *sb) +{ + struct backing_dev_info *bdi = sb->s_bdi; + struct bdi_writeback_ctx *bdi_wb_ctx; + /* * Can't skip on !bdi_has_dirty() because we should wait for !dirty * inodes under writeback and I_DIRTY_TIME inodes ignored by @@ -2855,11 +2880,8 @@ void sync_inodes_sb(struct super_block *sb) return; WARN_ON(!rwsem_is_locked(&sb->s_umount)); - /* protect against inode wb switch, see inode_switch_wbs_work_fn() */ - bdi_down_write_wb_ctx_switch_rwsem(bdi->wb_ctx_arr[0]); - bdi_split_work_to_wbs(bdi, bdi->wb_ctx_arr[0], &work, false); - wb_wait_for_completion(&done); - bdi_up_write_wb_ctx_switch_rwsem(bdi->wb_ctx_arr[0]); + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) + sync_inodes_bdi_wb_ctx(sb, bdi, bdi_wb_ctx); wait_sb_inodes(sb); } -- 2.25.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 08/13] writeback: add support to collect stats for all writeback ctxs [not found] ` <CGME20250529113249epcas5p38b29d3c6256337eadc2d1644181f9b74@epcas5p3.samsung.com> @ 2025-05-29 11:14 ` Kundan Kumar 0 siblings, 0 replies; 40+ messages in thread From: Kundan Kumar @ 2025-05-29 11:14 UTC (permalink / raw) To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, p.raghav, da.gomez Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Kundan Kumar, Anuj Gupta Modified stats collection to collect stats for all the writeback contexts within a bdi. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- mm/backing-dev.c | 72 ++++++++++++++++++++++++++++-------------------- 1 file changed, 42 insertions(+), 30 deletions(-) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 5479e2d34160..d416122e2914 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -50,6 +50,7 @@ struct wb_stats { unsigned long nr_written; unsigned long dirty_thresh; unsigned long wb_thresh; + unsigned long state; }; static struct dentry *bdi_debug_root; @@ -81,6 +82,7 @@ static void collect_wb_stats(struct wb_stats *stats, stats->nr_dirtied += wb_stat(wb, WB_DIRTIED); stats->nr_written += wb_stat(wb, WB_WRITTEN); stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh); + stats->state |= wb->state; } #ifdef CONFIG_CGROUP_WRITEBACK @@ -89,22 +91,27 @@ static void bdi_collect_stats(struct backing_dev_info *bdi, struct wb_stats *stats) { struct bdi_writeback *wb; + struct bdi_writeback_ctx *bdi_wb_ctx; rcu_read_lock(); - list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, bdi_node) { - if (!wb_tryget(wb)) - continue; + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) + list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) { + if (!wb_tryget(wb)) + continue; - collect_wb_stats(stats, wb); - wb_put(wb); - } + collect_wb_stats(stats, wb); + wb_put(wb); + } rcu_read_unlock(); } #else static void bdi_collect_stats(struct backing_dev_info *bdi, struct wb_stats *stats) { - collect_wb_stats(stats, &bdi->wb_ctx_arr[0]->wb); + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) + collect_wb_stats(stats, &bdi_wb_ctx->wb); } #endif @@ -150,7 +157,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) stats.nr_io, stats.nr_more_io, stats.nr_dirty_time, - !list_empty(&bdi->bdi_list), bdi->wb_ctx_arr[0]->wb.state); + !list_empty(&bdi->bdi_list), stats.state); return 0; } @@ -195,35 +202,40 @@ static int cgwb_debug_stats_show(struct seq_file *m, void *v) { struct backing_dev_info *bdi = m->private; struct bdi_writeback *wb; + struct bdi_writeback_ctx *bdi_wb_ctx; unsigned long background_thresh; unsigned long dirty_thresh; + struct wb_stats stats; global_dirty_limits(&background_thresh, &dirty_thresh); + stats.dirty_thresh = dirty_thresh; rcu_read_lock(); - list_for_each_entry_rcu(wb, &bdi->wb_ctx_arr[0]->wb_list, bdi_node) { - struct wb_stats stats = { .dirty_thresh = dirty_thresh }; - - if (!wb_tryget(wb)) - continue; - - collect_wb_stats(&stats, wb); - - /* - * Calculate thresh of wb in writeback cgroup which is min of - * thresh in global domain and thresh in cgroup domain. Drop - * rcu lock because cgwb_calc_thresh may sleep in - * cgroup_rstat_flush. We can do so here because we have a ref. 
- */ - if (mem_cgroup_wb_domain(wb)) { - rcu_read_unlock(); - stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb)); - rcu_read_lock(); + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) { + if (!wb_tryget(wb)) + continue; + + collect_wb_stats(&stats, wb); + + /* + * Calculate thresh of wb in writeback cgroup which is + * min of thresh in global domain and thresh in cgroup + * domain. Drop rcu lock because cgwb_calc_thresh may + * sleep in cgroup_rstat_flush. We can do so here + * because we have a ref. + */ + if (mem_cgroup_wb_domain(wb)) { + rcu_read_unlock(); + stats.wb_thresh = min(stats.wb_thresh, + cgwb_calc_thresh(wb)); + rcu_read_lock(); + } + + wb_stats_show(m, wb, &stats); + + wb_put(wb); } - - wb_stats_show(m, wb, &stats); - - wb_put(wb); } rcu_read_unlock(); -- 2.25.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
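One detail worth flagging in the cgwb_debug_stats_show() hunk above: `struct wb_stats stats` is now declared once per call with only .dirty_thresh assigned, while collect_wb_stats() accumulates into its fields with += (and |= for state), so each wb line in the debug file looks like it would report first uninitialized and then cumulative values. A sketch of the inner loop that keeps the per-wb initialization of the old code (cgroup-threshold clamping elided here, it stays as in the patch):

        for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
                list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) {
                        /* re-zero the accumulators for every wb, as before */
                        struct wb_stats stats = { .dirty_thresh = dirty_thresh };

                        if (!wb_tryget(wb))
                                continue;
                        collect_wb_stats(&stats, wb);
                        /* clamp stats.wb_thresh for cgwbs as in the patch */
                        wb_stats_show(m, wb, &stats);
                        wb_put(wb);
                }
        }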
* [PATCH 09/13] f2fs: add support in f2fs to handle multiple writeback contexts [not found] ` <CGME20250529113253epcas5p1a28e77b2d9824d55f594ccb053725ece@epcas5p1.samsung.com> @ 2025-05-29 11:15 ` Kundan Kumar 2025-06-02 14:20 ` Christoph Hellwig 0 siblings, 1 reply; 40+ messages in thread From: Kundan Kumar @ 2025-05-29 11:15 UTC (permalink / raw) To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, p.raghav, da.gomez Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Kundan Kumar, Anuj Gupta Add support to handle multiple writeback contexts and check for dirty_exceeded across all the writeback contexts. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- fs/f2fs/node.c | 11 +++++++---- fs/f2fs/segment.h | 7 +++++-- 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index 4b6568cd5bef..19f208d6c6d3 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -50,6 +50,7 @@ bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type) unsigned long avail_ram; unsigned long mem_size = 0; bool res = false; + struct bdi_writeback_ctx *bdi_wb_ctx; if (!nm_i) return true; @@ -73,8 +74,9 @@ bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type) if (excess_cached_nats(sbi)) res = false; } else if (type == DIRTY_DENTS) { - if (sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded) - return false; + for_each_bdi_wb_ctx(sbi->sb->s_bdi, bdi_wb_ctx) + if (bdi_wb_ctx->wb.dirty_exceeded) + return false; mem_size = get_pages(sbi, F2FS_DIRTY_DENTS); res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1); } else if (type == INO_ENTRIES) { @@ -114,8 +116,9 @@ bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type) res = false; #endif } else { - if (!sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded) - return true; + for_each_bdi_wb_ctx(sbi->sb->s_bdi, bdi_wb_ctx) + if (bdi_wb_ctx->wb.dirty_exceeded) + return false; } return res; } diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h index a525ccd4cfc8..2eea08549d73 100644 --- a/fs/f2fs/segment.h +++ b/fs/f2fs/segment.h @@ -936,8 +936,11 @@ static inline bool sec_usage_check(struct f2fs_sb_info *sbi, unsigned int secno) */ static inline int nr_pages_to_skip(struct f2fs_sb_info *sbi, int type) { - if (sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded) - return 0; + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(sbi->sb->s_bdi, bdi_wb_ctx) + if (bdi_wb_ctx->wb.dirty_exceeded) + return 0; if (type == DATA) return BLKS_PER_SEG(sbi); -- 2.25.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 09/13] f2fs: add support in f2fs to handle multiple writeback contexts
  2025-05-29 11:15 ` [PATCH 09/13] f2fs: add support in f2fs to handle multiple writeback contexts Kundan Kumar
@ 2025-06-02 14:20   ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-02 14:20 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Anuj Gupta

> 	} else if (type == DIRTY_DENTS) {
> -		if (sbi->sb->s_bdi->wb_ctx_arr[0]->wb.dirty_exceeded)
> -			return false;
> +		for_each_bdi_wb_ctx(sbi->sb->s_bdi, bdi_wb_ctx)
> +			if (bdi_wb_ctx->wb.dirty_exceeded)
> +				return false;

I think we need to figure out what the dirty_exceeded check here and in
the other places in f2fs and gfs2 is trying to do and factor that into
well-documented core helpers instead of adding these loops in places
that should not really poke into writeback internals.

^ permalink raw reply	[flat|nested] 40+ messages in thread
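A core helper of the kind suggested here could be as small as the sketch below; the name and placement are illustrative only, nothing like it exists in the tree. It would also sidestep a subtle inversion in the final hunk of the patch: the old else-branch in f2fs_available_free_memory() returned true when the dirty limit was not exceeded, while the posted loop now falls through to `return res` (false by default) in that case.

/*
 * Illustrative sketch, not an existing API: true if any writeback
 * context of the bdi has exceeded its dirty limits.
 */
static inline bool bdi_dirty_exceeded(struct backing_dev_info *bdi)
{
        struct bdi_writeback_ctx *bdi_wb_ctx;

        for_each_bdi_wb_ctx(bdi, bdi_wb_ctx)
                if (bdi_wb_ctx->wb.dirty_exceeded)
                        return true;
        return false;
}

With such a helper the f2fs else-branch could keep its original one-liner shape: if (!bdi_dirty_exceeded(sbi->sb->s_bdi)) return true;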
* [PATCH 10/13] fuse: add support for multiple writeback contexts in fuse [not found] ` <CGME20250529113257epcas5p4dbaf9c8e2dc362767c8553399632c1ea@epcas5p4.samsung.com> @ 2025-05-29 11:15 ` Kundan Kumar 2025-06-02 14:21 ` Christoph Hellwig 0 siblings, 1 reply; 40+ messages in thread From: Kundan Kumar @ 2025-05-29 11:15 UTC (permalink / raw) To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, p.raghav, da.gomez Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Kundan Kumar, Anuj Gupta Fetch writeback context to which an inode is affined. Use it to perform writeback related operations. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- fs/fuse/file.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 7817219d1599..803359b02383 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1851,11 +1851,11 @@ static void fuse_writepage_free(struct fuse_writepage_args *wpa) static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio) { - struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode); - dec_wb_stat(&bdi->wb_ctx_arr[0]->wb, WB_WRITEBACK); + dec_wb_stat(&bdi_wb_ctx->wb, WB_WRITEBACK); node_stat_sub_folio(folio, NR_WRITEBACK_TEMP); - wb_writeout_inc(&bdi->wb_ctx_arr[0]->wb); + wb_writeout_inc(&bdi_wb_ctx->wb); } static void fuse_writepage_finish(struct fuse_writepage_args *wpa) @@ -2134,6 +2134,7 @@ static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struc struct folio *tmp_folio, uint32_t folio_index) { struct inode *inode = folio->mapping->host; + struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode); struct fuse_args_pages *ap = &wpa->ia.ap; folio_copy(tmp_folio, folio); @@ -2142,7 +2143,7 @@ static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struc ap->descs[folio_index].offset = 0; ap->descs[folio_index].length = PAGE_SIZE; - inc_wb_stat(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb, WB_WRITEBACK); + inc_wb_stat(&bdi_wb_ctx->wb, WB_WRITEBACK); node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP); } -- 2.25.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
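fetch_bdi_writeback_ctx() is added by the inode-affinity patch earlier in the series and is not shown in this excerpt. A plausible minimal form, assuming inodes are spread over the contexts by inode number (the series' actual keying may differ), is:

/* Sketch only: pick the writeback context an inode is affined to. */
static inline struct bdi_writeback_ctx *
fetch_bdi_writeback_ctx(struct inode *inode)
{
        struct backing_dev_info *bdi = inode_to_bdi(inode);

        return bdi->wb_ctx_arr[inode->i_ino % bdi->nr_wb_ctx];
}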
* Re: [PATCH 10/13] fuse: add support for multiple writeback contexts in fuse
  2025-05-29 11:15 ` [PATCH 10/13] fuse: add support for multiple writeback contexts in fuse Kundan Kumar
@ 2025-06-02 14:21   ` Christoph Hellwig
  2025-06-02 15:50     ` Bernd Schubert
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-02 14:21 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Anuj Gupta

> static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio)
> {
> -	struct backing_dev_info *bdi = inode_to_bdi(inode);
> +	struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode);
>
> -	dec_wb_stat(&bdi->wb_ctx_arr[0]->wb, WB_WRITEBACK);
> +	dec_wb_stat(&bdi_wb_ctx->wb, WB_WRITEBACK);
> 	node_stat_sub_folio(folio, NR_WRITEBACK_TEMP);
> -	wb_writeout_inc(&bdi->wb_ctx_arr[0]->wb);
> +	wb_writeout_inc(&bdi_wb_ctx->wb);
> }

There's nothing fuse-specific here except that nothing but fuse uses
NR_WRITEBACK_TEMP. Can we try to move this into the core first so that
the patches don't have to touch file system code?

> -	inc_wb_stat(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb, WB_WRITEBACK);
> +	inc_wb_stat(&bdi_wb_ctx->wb, WB_WRITEBACK);
> 	node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP);

Same here. One pattern is that fuse and nfs both touch the node stat
and the wb stat, and having a common helper doing both would probably
also be very helpful.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 10/13] fuse: add support for multiple writeback contexts in fuse
  2025-06-02 14:21   ` Christoph Hellwig
@ 2025-06-02 15:50     ` Bernd Schubert
  2025-06-02 15:55       ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Bernd Schubert @ 2025-06-02 15:50 UTC (permalink / raw)
  To: Christoph Hellwig, Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe,
	ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Anuj Gupta,
	Joanne Koong

On 6/2/25 16:21, Christoph Hellwig wrote:
>> static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio)
>> {
>> -	struct backing_dev_info *bdi = inode_to_bdi(inode);
>> +	struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode);
>>
>> -	dec_wb_stat(&bdi->wb_ctx_arr[0]->wb, WB_WRITEBACK);
>> +	dec_wb_stat(&bdi_wb_ctx->wb, WB_WRITEBACK);
>> 	node_stat_sub_folio(folio, NR_WRITEBACK_TEMP);
>> -	wb_writeout_inc(&bdi->wb_ctx_arr[0]->wb);
>> +	wb_writeout_inc(&bdi_wb_ctx->wb);
>> }
>
> There's nothing fuse-specific here except that nothing but fuse uses
> NR_WRITEBACK_TEMP. Can we try to move this into the core first so that
> the patches don't have to touch file system code?
>
>> -	inc_wb_stat(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb, WB_WRITEBACK);
>> +	inc_wb_stat(&bdi_wb_ctx->wb, WB_WRITEBACK);
>> 	node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP);
>
> Same here. One pattern is that fuse and nfs both touch the node stat
> and the wb stat, and having a common helper doing both would probably
> also be very helpful.

Note that Miklos' PR for 6.16 removes NR_WRITEBACK_TEMP through
patches from Joanne, i.e. only

dec_wb_stat(&bdi->wb, WB_WRITEBACK);

is left over.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 10/13] fuse: add support for multiple writeback contexts in fuse
  2025-06-02 15:50     ` Bernd Schubert
@ 2025-06-02 15:55       ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-02 15:55 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Christoph Hellwig, Kundan Kumar, jaegeuk, chao, viro, brauner,
	jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm,
	david, amir73il, axboe, ritesh.list, djwong, dave, p.raghav,
	da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs,
	linux-mm, gost.dev, Anuj Gupta, Joanne Koong

On Mon, Jun 02, 2025 at 05:50:48PM +0200, Bernd Schubert wrote:
> > Same here. One pattern is that fuse and nfs both touch the node stat
> > and the wb stat, and having a common helper doing both would probably
> > also be very helpful.
>
> Note that Miklos' PR for 6.16 removes NR_WRITEBACK_TEMP through
> patches from Joanne, i.e. only
>
> dec_wb_stat(&bdi->wb, WB_WRITEBACK);
>
> is left over.

That'll make it even easier to consolidate with the NFS version.

^ permalink raw reply	[flat|nested] 40+ messages in thread
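A common helper of the sort discussed in this sub-thread might look like the sketch below (the name is an assumption; nothing like it exists in the tree). Both the fuse paths above and the nfs paths in patch 12 could then call it with positive or negative page counts:

/*
 * Illustrative only: update the per-node folio counter and the
 * per-wb counter together, as fuse and nfs both do by hand today.
 */
static inline void wb_node_stat_mod_folio(struct folio *folio,
                                          enum node_stat_item node_item,
                                          struct bdi_writeback *wb,
                                          enum wb_stat_item wb_item,
                                          long nr)
{
        node_stat_mod_folio(folio, node_item, nr);
        wb_stat_mod(wb, wb_item, nr);
}

For example, nfs_folio_clear_commit() would then shrink to a single wb_node_stat_mod_folio(folio, NR_WRITEBACK, &bdi_wb_ctx->wb, WB_WRITEBACK, -folio_nr_pages(folio)) call.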
* [PATCH 11/13] gfs2: add support in gfs2 to handle multiple writeback contexts
       [not found] ` <CGME20250529113302epcas5p3bdae265288af32172fb7380a727383eb@epcas5p3.samsung.com>
@ 2025-05-29 11:15   ` Kundan Kumar
  0 siblings, 0 replies; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:15 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

Add support to handle multiple writeback contexts and check for
dirty_exceeded across all the writeback contexts.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/gfs2/super.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index dfc83bd3def3..d4fdab4a4201 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -450,6 +450,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	struct address_space *metamapping = gfs2_glock2aspace(ip->i_gl);
 	struct backing_dev_info *bdi = inode_to_bdi(metamapping->host);
+	struct bdi_writeback_ctx *bdi_wb_ctx;
 	int ret = 0;
 	bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
 
@@ -457,10 +458,12 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
 		gfs2_log_flush(GFS2_SB(inode), ip->i_gl,
 			       GFS2_LOG_HEAD_FLUSH_NORMAL |
 			       GFS2_LFC_WRITE_INODE);
-	if (bdi->wb_ctx_arr[0]->wb.dirty_exceeded)
-		gfs2_ail1_flush(sdp, wbc);
-	else
-		filemap_fdatawrite(metamapping);
+
+	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx)
+		if (bdi_wb_ctx->wb.dirty_exceeded)
+			gfs2_ail1_flush(sdp, wbc);
+		else
+			filemap_fdatawrite(metamapping);
 	if (flush_all)
 		ret = filemap_fdatawait(metamapping);
 	if (ret)
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 40+ messages in thread
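One observable difference from the old code in the hunk above: the decision is now made once per writeback context, so gfs2_ail1_flush() or filemap_fdatawrite() can run up to nr_wb_ctx times per gfs2_write_inode() call. With an any-context helper along the lines of the bdi_dirty_exceeded() sketched under the patch 09 review above, the decision would stay a single branch:

        /*
         * Sketch only; bdi_dirty_exceeded() is the hypothetical helper
         * sketched in the patch 09 discussion, not an existing API.
         */
        if (bdi_dirty_exceeded(bdi))
                gfs2_ail1_flush(sdp, wbc);
        else
                filemap_fdatawrite(metamapping);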
* [PATCH 12/13] nfs: add support in nfs to handle multiple writeback contexts [not found] ` <CGME20250529113306epcas5p3d10606ae4ea7c3491e93bde9ae408c9f@epcas5p3.samsung.com> @ 2025-05-29 11:15 ` Kundan Kumar 2025-06-02 14:22 ` Christoph Hellwig 0 siblings, 1 reply; 40+ messages in thread From: Kundan Kumar @ 2025-05-29 11:15 UTC (permalink / raw) To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, p.raghav, da.gomez Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Kundan Kumar, Anuj Gupta Fetch writeback context to which an inode is affined. Use it to perform writeback related operations. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- fs/nfs/internal.h | 5 +++-- fs/nfs/write.c | 6 +++--- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index fd513bf9e875..a7cacaf484c9 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -838,14 +838,15 @@ static inline void nfs_folio_mark_unstable(struct folio *folio, { if (folio && !cinfo->dreq) { struct inode *inode = folio->mapping->host; + struct bdi_writeback_ctx *bdi_wb_ctx = + fetch_bdi_writeback_ctx(inode); long nr = folio_nr_pages(folio); /* This page is really still in write-back - just that the * writeback is happening on the server now. */ node_stat_mod_folio(folio, NR_WRITEBACK, nr); - wb_stat_mod(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb, - WB_WRITEBACK, nr); + wb_stat_mod(&bdi_wb_ctx->wb, WB_WRITEBACK, nr); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); } } diff --git a/fs/nfs/write.c b/fs/nfs/write.c index ec48ec8c2db8..ca0823debce7 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -932,11 +932,11 @@ static void nfs_folio_clear_commit(struct folio *folio) { if (folio) { long nr = folio_nr_pages(folio); - struct inode *inode = folio->mapping->host; + struct bdi_writeback_ctx *bdi_wb_ctx = + fetch_bdi_writeback_ctx(folio->mapping->host); node_stat_mod_folio(folio, NR_WRITEBACK, -nr); - wb_stat_mod(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb, - WB_WRITEBACK, -nr); + wb_stat_mod(&bdi_wb_ctx->wb, WB_WRITEBACK, -nr); } } -- 2.25.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 12/13] nfs: add support in nfs to handle multiple writeback contexts
  2025-05-29 11:15 ` [PATCH 12/13] nfs: add support in nfs " Kundan Kumar
@ 2025-06-02 14:22   ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-02 14:22 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, Anuj Gupta

On Thu, May 29, 2025 at 04:45:03PM +0530, Kundan Kumar wrote:
> 	if (folio && !cinfo->dreq) {
> 		struct inode *inode = folio->mapping->host;
> +		struct bdi_writeback_ctx *bdi_wb_ctx =
> +				fetch_bdi_writeback_ctx(inode);
> 		long nr = folio_nr_pages(folio);
>
> 		/* This page is really still in write-back - just that the
> 		 * writeback is happening on the server now.
> 		 */
> 		node_stat_mod_folio(folio, NR_WRITEBACK, nr);
> -		wb_stat_mod(&inode_to_bdi(inode)->wb_ctx_arr[0]->wb,
> -			    WB_WRITEBACK, nr);
> +		wb_stat_mod(&bdi_wb_ctx->wb, WB_WRITEBACK, nr);

Similar comments to fuse here as well, except that nfs also really
should be using the node stat helpers automatically counting the
number of pages in a folio instead of duplicating the logic.

^ permalink raw reply	[flat|nested] 40+ messages in thread
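The folio-aware node stat helpers referred to here derive the page count from the folio themselves; a sketch of the nfs_folio_mark_unstable() body using them (keeping the series' bdi_wb_ctx plumbing, for which no folio-aware wb_stat counterpart exists yet) would be:

        /* node_stat_add_folio() adds folio_nr_pages() internally */
        node_stat_add_folio(folio, NR_WRITEBACK);
        wb_stat_mod(&bdi_wb_ctx->wb, WB_WRITEBACK, folio_nr_pages(folio));
        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);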
* [PATCH 13/13] writeback: set the num of writeback contexts to number of online cpus
       [not found] ` <CGME20250529113311epcas5p3c8f1785b34680481e2126fda3ab51ad9@epcas5p3.samsung.com>
@ 2025-05-29 11:15   ` Kundan Kumar
  2025-06-03 14:36     ` kernel test robot
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-05-29 11:15 UTC (permalink / raw)
  To: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
	anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
	ritesh.list, djwong, dave, p.raghav, da.gomez
  Cc: linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev, Kundan Kumar, Anuj Gupta

We create N writeback contexts, where N is the number of online CPUs.
Inodes get distributed across the different writeback contexts,
enabling parallel writeback.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 mm/backing-dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d416122e2914..55c07c9be4cd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1046,7 +1046,7 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100 * BDI_RATIO_SCALE;
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
-	bdi->nr_wb_ctx = 1;
+	bdi->nr_wb_ctx = num_online_cpus();
 	bdi->wb_ctx_arr = kcalloc(bdi->nr_wb_ctx,
 				  sizeof(struct bdi_writeback_ctx *),
 				  GFP_KERNEL);
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 40+ messages in thread
* Re: [PATCH 13/13] writeback: set the num of writeback contexts to number of online cpus 2025-05-29 11:15 ` [PATCH 13/13] writeback: set the num of writeback contexts to number of online cpus Kundan Kumar @ 2025-06-03 14:36 ` kernel test robot 0 siblings, 0 replies; 40+ messages in thread From: kernel test robot @ 2025-06-03 14:36 UTC (permalink / raw) To: Kundan Kumar Cc: oe-lkp, lkp, Anuj Gupta, linux-mm, jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, gost.dev, Kundan Kumar, oliver.sang Hello, kernel test robot noticed a 53.9% improvement of fsmark.files_per_sec on: commit: 2850eee23dbc4ff9878d88625b1f84965eefcce6 ("[PATCH 13/13] writeback: set the num of writeback contexts to number of online cpus") url: https://github.com/intel-lab-lkp/linux/commits/Kundan-Kumar/writeback-add-infra-for-parallel-writeback/20250529-193523 base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all patch link: https://lore.kernel.org/all/20250529111504.89912-14-kundan.kumar@samsung.com/ patch subject: [PATCH 13/13] writeback: set the num of writeback contexts to number of online cpus testcase: fsmark config: x86_64-rhel-9.4 compiler: gcc-12 test machine: 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz (Cascade Lake) with 176G memory parameters: iterations: 1x nr_threads: 32t disk: 1SSD fs: ext4 filesize: 16MB test_size: 60G sync_method: NoSync nr_directories: 16d nr_files_per_directory: 256fpd cpufreq_governor: performance In addition to that, the commit also has significant impact on the following tests: +------------------+------------------------------------------------------------------------------------------------+ | testcase: change | filebench: filebench.sum_operations/s 4.3% improvement | | test machine | 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory | | test parameters | cpufreq_governor=performance | | | disk=1HDD | | | fs=xfs | | | test=fivestreamwrite.f | +------------------+------------------------------------------------------------------------------------------------+ Details are as below: --------------------------------------------------------------------------------------------------> The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20250603/202506032246.89ddc1a2-lkp@intel.com ========================================================================================= compiler/cpufreq_governor/disk/filesize/fs/iterations/kconfig/nr_directories/nr_files_per_directory/nr_threads/rootfs/sync_method/tbox_group/test_size/testcase: gcc-12/performance/1SSD/16MB/ext4/1x/x86_64-rhel-9.4/16d/256fpd/32t/debian-12-x86_64-20240206.cgz/NoSync/lkp-csl-2sp10/60G/fsmark commit: a2dadb7ea8 ("nfs: add support in nfs to handle multiple writeback contexts") 2850eee23d ("writeback: set the num of writeback contexts to number of online cpus") a2dadb7ea862d5c1 2850eee23dbc4ff9878d88625b1 ---------------- --------------------------- %stddev %change %stddev \ | \ 1641480 +13.3% 1860148 ± 2% cpuidle..usage 302.00 ± 8% +14.1% 344.67 ± 7% perf-c2c.HITM.remote 24963 ± 4% -13.3% 21647 ± 6% uptime.idle 91.64 -22.2% 71.26 ± 7% iostat.cpu.idle 7.34 ± 4% +275.7% 27.59 ± 19% iostat.cpu.iowait 0.46 ±141% +2.2 2.63 ± 66% perf-profile.calltrace.cycles-pp.setlocale 0.46 ±141% +2.2 2.63 ± 66% 
perf-profile.children.cycles-pp.setlocale 194019 -7.5% 179552 fsmark.app_overhead 108.10 ± 8% +53.9% 166.40 ± 10% fsmark.files_per_sec 43295 ± 7% +35.8% 58787 ± 5% fsmark.time.voluntary_context_switches 19970922 -10.5% 17867270 ± 2% meminfo.Dirty 493817 +13.1% 558422 meminfo.SUnreclaim 141708 +1439.0% 2180863 ± 14% meminfo.Writeback 4993428 -10.3% 4480219 ± 2% proc-vmstat.nr_dirty 34285 +5.8% 36262 proc-vmstat.nr_kernel_stack 123504 +13.1% 139636 proc-vmstat.nr_slab_unreclaimable 36381 ± 4% +1403.0% 546810 ± 14% proc-vmstat.nr_writeback 91.54 -22.1% 71.32 ± 7% vmstat.cpu.id 7.22 ± 4% +280.4% 27.47 ± 19% vmstat.cpu.wa 14.58 ± 4% +537.4% 92.92 ± 8% vmstat.procs.b 6140 ± 2% +90.1% 11673 ± 9% vmstat.system.cs 91.46 -20.9 70.56 ± 7% mpstat.cpu.all.idle% 7.52 ± 4% +20.8 28.29 ± 19% mpstat.cpu.all.iowait% 0.12 ± 7% +0.0 0.14 ± 7% mpstat.cpu.all.irq% 0.35 ± 6% +0.1 0.43 ± 5% mpstat.cpu.all.sys% 11.24 ± 8% +20.7% 13.56 ± 3% mpstat.max_utilization_pct 34947 ± 5% +14.3% 39928 ± 4% numa-vmstat.node0.nr_slab_unreclaimable 9001 ± 14% +1553.7% 148860 ± 19% numa-vmstat.node0.nr_writeback 1329092 ± 4% -20.9% 1051569 ± 9% numa-vmstat.node1.nr_dirty 10019 ± 7% +1490.0% 159311 ± 14% numa-vmstat.node1.nr_writeback 2808522 ± 8% -17.7% 2310216 ± 2% numa-vmstat.node2.nr_file_pages 2638799 ± 3% -12.7% 2304024 ± 2% numa-vmstat.node2.nr_inactive_file 7810 ± 9% +1035.8% 88707 ± 16% numa-vmstat.node2.nr_writeback 2638797 ± 3% -12.7% 2304025 ± 2% numa-vmstat.node2.nr_zone_inactive_file 29952 ± 3% +13.4% 33964 ± 4% numa-vmstat.node3.nr_slab_unreclaimable 10686 ± 9% +1351.3% 155091 ± 12% numa-vmstat.node3.nr_writeback 139656 ± 5% +14.2% 159539 ± 4% numa-meminfo.node0.SUnreclaim 35586 ± 13% +1565.8% 592799 ± 18% numa-meminfo.node0.Writeback 5304285 ± 4% -20.8% 4198452 ± 10% numa-meminfo.node1.Dirty 40011 ± 5% +1484.2% 633862 ± 14% numa-meminfo.node1.Writeback 11211668 ± 7% -17.7% 9222157 ± 2% numa-meminfo.node2.FilePages 10532776 ± 3% -12.7% 9197387 ± 2% numa-meminfo.node2.Inactive 10532776 ± 3% -12.7% 9197387 ± 2% numa-meminfo.node2.Inactive(file) 12378624 ± 7% -15.0% 10520827 ± 2% numa-meminfo.node2.MemUsed 29574 ± 9% +1087.0% 351055 ± 16% numa-meminfo.node2.Writeback 119679 ± 3% +13.4% 135718 ± 4% numa-meminfo.node3.SUnreclaim 41446 ± 10% +1380.1% 613443 ± 11% numa-meminfo.node3.Writeback 23.38 ± 2% -6.7 16.72 perf-stat.i.cache-miss-rate% 38590732 ± 3% +53.9% 59394561 ± 5% perf-stat.i.cache-references 5973 ± 2% +96.3% 11729 ± 10% perf-stat.i.context-switches 0.92 +7.4% 0.99 perf-stat.i.cpi 7.023e+09 ± 3% +12.5% 7.898e+09 ± 4% perf-stat.i.cpu-cycles 237.41 ± 2% +393.6% 1171 ± 20% perf-stat.i.cpu-migrations 1035 ± 3% +9.3% 1132 ± 4% perf-stat.i.cycles-between-cache-misses 1.15 -5.7% 1.09 perf-stat.i.ipc 25.41 ± 2% -7.4 18.03 ± 2% perf-stat.overall.cache-miss-rate% 0.94 +7.2% 1.01 perf-stat.overall.cpi 1.06 -6.8% 0.99 perf-stat.overall.ipc 38042801 ± 3% +54.0% 58576659 ± 5% perf-stat.ps.cache-references 5897 ± 2% +96.3% 11577 ± 10% perf-stat.ps.context-switches 6.925e+09 ± 3% +12.4% 7.787e+09 ± 4% perf-stat.ps.cpu-cycles 234.11 ± 2% +394.2% 1156 ± 20% perf-stat.ps.cpu-migrations 5.892e+11 +1.4% 5.977e+11 perf-stat.total.instructions 0.08 ± 6% -36.2% 0.05 ± 33% perf-sched.sch_delay.avg.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_noprof.__filemap_get_folio 0.01 ± 73% +159.8% 0.04 ± 13% perf-sched.sch_delay.avg.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part 0.07 ± 10% -33.0% 0.05 ± 44% perf-sched.sch_delay.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64 0.01 ± 73% +695.6% 0.09 
± 25% perf-sched.sch_delay.avg.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio 0.03 ± 30% -45.4% 0.02 ± 24% perf-sched.sch_delay.avg.ms.irqentry_exit_to_user_mode.asm_sysvec_reschedule_ipi.[unknown] 0.06 ± 21% -100.0% 0.00 perf-sched.sch_delay.avg.ms.kjournald2.kthread.ret_from_fork.ret_from_fork_asm 0.05 ± 45% +305.0% 0.19 ± 66% perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm 0.01 ±210% +8303.3% 0.85 ±183% perf-sched.sch_delay.max.ms.__cond_resched.down_write.mpage_map_and_submit_extent.ext4_do_writepages.ext4_writepages 0.05 ± 49% +92.4% 0.10 ± 35% perf-sched.sch_delay.max.ms.__cond_resched.process_one_work.worker_thread.kthread.ret_from_fork 0.03 ± 76% +165.2% 0.07 ± 28% perf-sched.sch_delay.max.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part 0.07 ± 46% +83423.5% 59.86 ±146% perf-sched.sch_delay.max.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio 0.06 ± 21% -100.0% 0.00 perf-sched.sch_delay.max.ms.kjournald2.kthread.ret_from_fork.ret_from_fork_asm 0.09 ± 7% +19.7% 0.10 ± 8% perf-sched.sch_delay.max.ms.rcu_gp_kthread.kthread.ret_from_fork.ret_from_fork_asm 0.10 ± 13% +3587.7% 3.60 ±172% perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread 18911 +109.0% 39526 ± 25% perf-sched.total_wait_and_delay.count.ms 4.42 ± 25% +77.8% 7.86 ± 15% perf-sched.wait_and_delay.avg.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity 150.39 ± 6% -14.9% 127.97 ± 8% perf-sched.wait_and_delay.avg.ms.anon_pipe_read.fifo_pipe_read.vfs_read.ksys_read 8.03 ± 89% -74.5% 2.05 ±143% perf-sched.wait_and_delay.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe 1.24 ± 8% +2018.4% 26.26 ± 26% perf-sched.wait_and_delay.avg.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio 0.83 ± 2% -100.0% 0.00 perf-sched.wait_and_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone 86.14 ± 9% +33.0% 114.52 ± 7% perf-sched.wait_and_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm 1047 ± 6% -8.2% 960.83 perf-sched.wait_and_delay.count.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity 171.50 ± 7% -69.1% 53.00 ±141% perf-sched.wait_and_delay.count.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call 162.50 ± 7% -69.3% 49.83 ±141% perf-sched.wait_and_delay.count.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe 3635 ± 6% +451.1% 20036 ± 35% perf-sched.wait_and_delay.count.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio 26.33 ± 5% -12.7% 23.00 ± 2% perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_poll 116.17 -100.0% 0.00 perf-sched.wait_and_delay.count.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone 3938 ± 21% -66.2% 1332 ± 60% perf-sched.wait_and_delay.count.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags 4831 ± 5% +102.8% 9799 ± 23% perf-sched.wait_and_delay.count.worker_thread.kthread.ret_from_fork.ret_from_fork_asm 87.47 ± 20% +823.3% 807.60 ±141% perf-sched.wait_and_delay.max.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio 2.81 ± 4% -100.0% 0.00 perf-sched.wait_and_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone 77.99 ± 13% +2082.7% 1702 ± 73% 
perf-sched.wait_and_delay.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags 396.00 ± 16% -34.0% 261.22 ± 33% perf-sched.wait_and_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread 3.89 ± 18% +100.6% 7.81 ± 15% perf-sched.wait_time.avg.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity 0.13 ±157% +2.1e+05% 271.76 ±126% perf-sched.wait_time.avg.ms.__cond_resched.down_write.mpage_map_and_submit_extent.ext4_do_writepages.ext4_writepages 150.37 ± 6% -15.2% 127.52 ± 8% perf-sched.wait_time.avg.ms.anon_pipe_read.fifo_pipe_read.vfs_read.ksys_read 1.23 ± 8% +2030.8% 26.17 ± 26% perf-sched.wait_time.avg.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio 31.17 ± 50% -100.0% 0.00 perf-sched.wait_time.avg.ms.kjournald2.kthread.ret_from_fork.ret_from_fork_asm 86.09 ± 9% +32.8% 114.34 ± 7% perf-sched.wait_time.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm 1086 ± 17% +309.4% 4449 ± 26% perf-sched.wait_time.max.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity 0.17 ±142% +7.2e+05% 1196 ±119% perf-sched.wait_time.max.ms.__cond_resched.down_write.mpage_map_and_submit_extent.ext4_do_writepages.ext4_writepages 7.27 ± 45% +1259.6% 98.80 ±139% perf-sched.wait_time.max.ms.__cond_resched.process_one_work.worker_thread.kthread.ret_from_fork 262.77 ±113% +316.1% 1093 ± 52% perf-sched.wait_time.max.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe 87.45 ± 20% +823.5% 807.53 ±141% perf-sched.wait_time.max.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio 0.04 ± 30% +992.0% 0.43 ± 92% perf-sched.wait_time.max.ms.irqentry_exit_to_user_mode.asm_sysvec_reschedule_ipi.[unknown] 31.17 ± 50% -100.0% 0.00 perf-sched.wait_time.max.ms.kjournald2.kthread.ret_from_fork.ret_from_fork_asm 75.85 ± 16% +2144.2% 1702 ± 73% perf-sched.wait_time.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags 395.95 ± 16% -34.0% 261.15 ± 33% perf-sched.wait_time.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread *************************************************************************************************** lkp-icl-2sp6: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory ========================================================================================= compiler/cpufreq_governor/disk/fs/kconfig/rootfs/tbox_group/test/testcase: gcc-12/performance/1HDD/xfs/x86_64-rhel-9.4/debian-12-x86_64-20240206.cgz/lkp-icl-2sp6/fivestreamwrite.f/filebench commit: a2dadb7ea8 ("nfs: add support in nfs to handle multiple writeback contexts") 2850eee23d ("writeback: set the num of writeback contexts to number of online cpus") a2dadb7ea862d5c1 2850eee23dbc4ff9878d88625b1 ---------------- --------------------------- %stddev %change %stddev \ | \ 2.06 ± 3% +1.5 3.58 mpstat.cpu.all.iowait% 8388855 ± 5% +17.7% 9875928 ± 5% numa-meminfo.node0.Dirty 0.02 ± 5% +48.0% 0.03 ± 3% sched_debug.cpu.nr_uninterruptible.avg 2.70 +72.6% 4.65 vmstat.procs.b 97.67 -1.5% 96.17 iostat.cpu.idle 2.04 ± 3% +73.9% 3.55 iostat.cpu.iowait 2094449 ± 5% +17.8% 2468063 ± 5% numa-vmstat.node0.nr_dirty 2113170 ± 5% +17.7% 2487005 ± 5% numa-vmstat.node0.nr_zone_write_pending 6.99 ± 3% +0.5 7.48 ± 2% perf-stat.i.cache-miss-rate% 1.82 +3.2% 1.88 perf-stat.i.cpi 0.64 -2.0% 0.62 perf-stat.i.ipc 2.88 ± 5% +9.5% 3.15 ± 4% perf-stat.overall.MPKI 464.45 +4.3% 
484.58 filebench.sum_bytes_mb/s 27873 +4.3% 29084 filebench.sum_operations 464.51 +4.3% 484.66 filebench.sum_operations/s 10.76 -4.2% 10.31 filebench.sum_time_ms/op 464.67 +4.3% 484.67 filebench.sum_writes/s 57175040 +4.2% 59565397 filebench.time.file_system_outputs 7146880 +4.2% 7445674 proc-vmstat.nr_dirtied 4412053 +9.1% 4815253 proc-vmstat.nr_dirty 7485090 +5.0% 7855964 proc-vmstat.nr_file_pages 24899858 -1.5% 24530112 proc-vmstat.nr_free_pages 24705120 -1.5% 24343672 proc-vmstat.nr_free_pages_blocks 6573042 +5.6% 6943969 proc-vmstat.nr_inactive_file 34473 ± 3% +7.5% 37072 proc-vmstat.nr_writeback 6573042 +5.6% 6943969 proc-vmstat.nr_zone_inactive_file 4446526 +9.1% 4852325 proc-vmstat.nr_zone_write_pending 7963041 +3.8% 8262916 proc-vmstat.pgalloc_normal 0.02 ± 10% +45.0% 0.03 ± 5% perf-sched.sch_delay.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe 317.88 ±166% -100.0% 0.08 ± 60% perf-sched.sch_delay.avg.ms.kthreadd.ret_from_fork.ret_from_fork_asm 474.99 ±141% -100.0% 0.10 ± 49% perf-sched.sch_delay.max.ms.kthreadd.ret_from_fork.ret_from_fork_asm 17.87 ± 13% +125.8% 40.36 ± 4% perf-sched.wait_and_delay.avg.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio 47.75 +19.8% 57.20 ± 5% perf-sched.wait_and_delay.avg.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags 517.00 -17.2% 427.83 ± 5% perf-sched.wait_and_delay.count.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags 54.05 ± 2% +253.0% 190.80 ± 18% perf-sched.wait_and_delay.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags 4286 ± 4% -8.8% 3909 ± 8% perf-sched.wait_and_delay.max.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm 17.77 ± 13% +126.0% 40.16 ± 4% perf-sched.wait_time.avg.ms.io_schedule.blk_mq_get_tag.__blk_mq_alloc_requests.blk_mq_submit_bio 47.63 +19.8% 57.06 ± 5% perf-sched.wait_time.avg.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags 53.95 ± 2% +253.5% 190.70 ± 18% perf-sched.wait_time.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags 4285 ± 4% -8.9% 3906 ± 7% perf-sched.wait_time.max.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm 0.77 ± 15% -0.3 0.43 ± 72% perf-profile.calltrace.cycles-pp.enqueue_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue 1.76 ± 9% -0.3 1.44 ± 8% perf-profile.calltrace.cycles-pp.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe 1.73 ± 10% -0.3 1.43 ± 8% perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.85 ± 12% -0.2 0.66 ± 14% perf-profile.calltrace.cycles-pp.__pick_next_task.__schedule.schedule_idle.do_idle.cpu_startup_entry 4.74 ± 6% -0.6 4.17 ± 9% perf-profile.children.cycles-pp.__handle_mm_fault 4.92 ± 6% -0.6 4.36 ± 9% perf-profile.children.cycles-pp.handle_mm_fault 2.85 ± 5% -0.5 2.34 ± 8% perf-profile.children.cycles-pp.enqueue_task 2.60 ± 4% -0.4 2.15 ± 8% perf-profile.children.cycles-pp.enqueue_task_fair 2.58 ± 6% -0.4 2.22 ± 8% perf-profile.children.cycles-pp.do_pte_missing 2.04 ± 9% -0.3 1.70 ± 10% perf-profile.children.cycles-pp.do_read_fault 1.88 ± 5% -0.3 1.56 ± 7% perf-profile.children.cycles-pp.ttwu_do_activate 1.85 ± 10% -0.3 1.58 ± 12% perf-profile.children.cycles-pp.filemap_map_pages 1.20 ± 8% -0.2 0.98 ± 9% 
perf-profile.children.cycles-pp.__flush_smp_call_function_queue 0.49 ± 23% -0.2 0.29 ± 23% perf-profile.children.cycles-pp.set_next_task_fair 0.38 ± 32% -0.2 0.20 ± 39% perf-profile.children.cycles-pp.strnlen_user 0.44 ± 28% -0.2 0.26 ± 27% perf-profile.children.cycles-pp.set_next_entity 0.70 ± 7% -0.2 0.52 ± 7% perf-profile.children.cycles-pp.folios_put_refs 0.22 ± 20% -0.1 0.15 ± 27% perf-profile.children.cycles-pp.try_charge_memcg 0.02 ±141% +0.1 0.08 ± 44% perf-profile.children.cycles-pp.__blk_mq_alloc_driver_tag 0.09 ± 59% +0.1 0.18 ± 21% perf-profile.children.cycles-pp.irq_work_tick 0.26 ± 22% +0.2 0.41 ± 13% perf-profile.children.cycles-pp.cpu_stop_queue_work 0.34 ± 31% +0.2 0.57 ± 27% perf-profile.children.cycles-pp.perf_event_mmap_event 2.68 ± 10% +0.4 3.10 ± 12% perf-profile.children.cycles-pp.sched_balance_domains 0.38 ± 32% -0.2 0.19 ± 44% perf-profile.self.cycles-pp.strnlen_user 0.02 ±141% +0.1 0.08 ± 44% perf-profile.self.cycles-pp.ahci_single_level_irq_intr 0.22 ± 36% +0.2 0.43 ± 21% perf-profile.self.cycles-pp.sched_balance_rq 0.36 ± 41% +0.3 0.66 ± 18% perf-profile.self.cycles-pp._find_next_and_bit Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback 2025-05-29 11:14 ` [PATCH 00/13] Parallelizing filesystem writeback Kundan Kumar ` (12 preceding siblings ...) [not found] ` <CGME20250529113311epcas5p3c8f1785b34680481e2126fda3ab51ad9@epcas5p3.samsung.com> @ 2025-05-30 3:37 ` Andrew Morton 2025-06-25 15:44 ` Kundan Kumar 2025-06-02 14:19 ` Christoph Hellwig 14 siblings, 1 reply; 40+ messages in thread From: Andrew Morton @ 2025-05-30 3:37 UTC (permalink / raw) To: Kundan Kumar Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev On Thu, 29 May 2025 16:44:51 +0530 Kundan Kumar <kundan.kumar@samsung.com> wrote: > Currently, pagecache writeback is performed by a single thread. Inodes > are added to a dirty list, and delayed writeback is triggered. The single > writeback thread then iterates through the dirty inode list, and executes > the writeback. > > This series parallelizes the writeback by allowing multiple writeback > contexts per backing device (bdi). These writebacks contexts are executed > as separate, independent threads, improving overall parallelism. > > Would love to hear feedback in-order to move this effort forward. > > Design Overview > ================ > Following Jan Kara's suggestion [1], we have introduced a new bdi > writeback context within the backing_dev_info structure. Specifically, > we have created a new structure, bdi_writeback_context, which contains > its own set of members for each writeback context. > > struct bdi_writeback_ctx { > struct bdi_writeback wb; > struct list_head wb_list; /* list of all wbs */ > struct radix_tree_root cgwb_tree; > struct rw_semaphore wb_switch_rwsem; > wait_queue_head_t wb_waitq; > }; > > There can be multiple writeback contexts in a bdi, which helps in > achieving writeback parallelism. > > struct backing_dev_info { > ... > int nr_wb_ctx; > struct bdi_writeback_ctx **wb_ctx_arr; I don't think the "_arr" adds value. bdi->wb_contexts[i]? > ... > }; > > FS geometry and filesystem fragmentation > ======================================== > The community was concerned that parallelizing writeback would impact > delayed allocation and increase filesystem fragmentation. > Our analysis of XFS delayed allocation behavior showed that merging of > extents occurs within a specific inode. Earlier experiments with multiple > writeback contexts [2] resulted in increased fragmentation due to the > same inode being processed by different threads. > > To address this, we now affine an inode to a specific writeback context > ensuring that delayed allocation works effectively. > > Number of writeback contexts > =========================== > The plan is to keep the nr_wb_ctx as 1, ensuring default single threaded > behavior. However, we set the number of writeback contexts equal to > number of CPUs in the current version. Makes sense. It would be good to test this on a non-SMP machine, if you can find one ;) > Later we will make it configurable > using a mount option, allowing filesystems to choose the optimal number > of writeback contexts. > > IOPS and throughput > =================== > We see significant improvement in IOPS across several filesystem on both > PMEM and NVMe devices. 
> > Performance gains: > - On PMEM: > Base XFS : 544 MiB/s > Parallel Writeback XFS : 1015 MiB/s (+86%) > Base EXT4 : 536 MiB/s > Parallel Writeback EXT4 : 1047 MiB/s (+95%) > > - On NVMe: > Base XFS : 651 MiB/s > Parallel Writeback XFS : 808 MiB/s (+24%) > Base EXT4 : 494 MiB/s > Parallel Writeback EXT4 : 797 MiB/s (+61%) > > We also see that there is no increase in filesystem fragmentation > # of extents: > - On XFS (on PMEM): > Base XFS : 1964 > Parallel Writeback XFS : 1384 > > - On EXT4 (on PMEM): > Base EXT4 : 21 > Parallel Writeback EXT4 : 11 Please test the performance on spinning disks, and with more filesystems? ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-05-30 3:37 ` [PATCH 00/13] Parallelizing filesystem writeback Andrew Morton
@ 2025-06-25 15:44   ` Kundan Kumar
  2025-07-02 18:43     ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-06-25 15:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kundan Kumar, jaegeuk, chao, viro, brauner, jack, miklos,
	agruenba, trondmy, anna, willy, mcgrof, clm, david, amir73il,
	axboe, hch, ritesh.list, djwong, dave, p.raghav, da.gomez,
	linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev

>
> Makes sense. It would be good to test this on a non-SMP machine, if
> you can find one ;)
>

Tested with kernel cmdline maxcpus=1. The parallel writeback falls back
to single-threaded behavior, showing no change in BW.

- On PMEM:
  Base XFS                : 70.7 MiB/s
  Parallel Writeback XFS  : 70.5 MiB/s
  Base EXT4               : 137 MiB/s
  Parallel Writeback EXT4 : 138 MiB/s

- On NVMe:
  Base XFS                : 45.2 MiB/s
  Parallel Writeback XFS  : 44.5 MiB/s
  Base EXT4               : 81.2 MiB/s
  Parallel Writeback EXT4 : 80.1 MiB/s

>
> Please test the performance on spinning disks, and with more filesystems?
>

On a spinning disk, random IO bandwidth remains unchanged, while sequential
IO performance declines. However, setting nr_wb_ctx = 1 via configurable
writeback (planned in next version) eliminates the decline.

echo 1 > /sys/class/bdi/8:16/nwritebacks

We can fetch the device queue's rotational property and allocate BDI with
nr_wb_ctx = 1 for rotational disks. Hope this is a viable solution for
spinning disks?

- Random IO
  Base XFS                : 22.6 MiB/s
  Parallel Writeback XFS  : 22.9 MiB/s
  Base EXT4               : 22.5 MiB/s
  Parallel Writeback EXT4 : 20.9 MiB/s

- Sequential IO
  Base XFS                : 156 MiB/s
  Parallel Writeback XFS  : 133 MiB/s (-14.7%)
  Base EXT4               : 147 MiB/s
  Parallel Writeback EXT4 : 124 MiB/s (-15.6%)

-Kundan

^ permalink raw reply	[flat|nested] 40+ messages in thread
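The rotational-aware default proposed above could be as small as the sketch below, assuming the bdi initialization path can reach the owning request queue (where exactly to hook this is the open question in the thread):

        /*
         * Sketch only: one writeback context for rotational media,
         * per-CPU contexts otherwise.  Assumes a struct request_queue
         * *q is reachable when the bdi is set up.
         */
        if (q && !blk_queue_nonrot(q))
                bdi->nr_wb_ctx = 1;
        else
                bdi->nr_wb_ctx = num_online_cpus();

Either default would still be overridable through the nwritebacks knob shown above.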
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-25 15:44   ` Kundan Kumar
@ 2025-07-02 18:43     ` Darrick J. Wong
  2025-07-03 13:05       ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2025-07-02 18:43 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: Andrew Morton, Kundan Kumar, jaegeuk, chao, viro, brauner, jack,
	miklos, agruenba, trondmy, anna, willy, mcgrof, clm, david,
	amir73il, axboe, hch, ritesh.list, dave, p.raghav, da.gomez,
	linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs, linux-mm,
	gost.dev

On Wed, Jun 25, 2025 at 09:14:51PM +0530, Kundan Kumar wrote:
> >
> > Makes sense. It would be good to test this on a non-SMP machine, if
> > you can find one ;)
> >
>
> Tested with kernel cmdline maxcpus=1. The parallel writeback falls back
> to single-threaded behavior, showing no change in BW.
>
> - On PMEM:
>   Base XFS                : 70.7 MiB/s
>   Parallel Writeback XFS  : 70.5 MiB/s
>   Base EXT4               : 137 MiB/s
>   Parallel Writeback EXT4 : 138 MiB/s
>
> - On NVMe:
>   Base XFS                : 45.2 MiB/s
>   Parallel Writeback XFS  : 44.5 MiB/s
>   Base EXT4               : 81.2 MiB/s
>   Parallel Writeback EXT4 : 80.1 MiB/s
>
> >
> > Please test the performance on spinning disks, and with more filesystems?
> >
>
> On a spinning disk, random IO bandwidth remains unchanged, while sequential
> IO performance declines. However, setting nr_wb_ctx = 1 via configurable
> writeback (planned in next version) eliminates the decline.
>
> echo 1 > /sys/class/bdi/8:16/nwritebacks
>
> We can fetch the device queue's rotational property and allocate BDI with
> nr_wb_ctx = 1 for rotational disks. Hope this is a viable solution for
> spinning disks?

Sounds good to me, spinning rust isn't known for iops.

Though: What about a raid0 of spinning rust? Do you see the same
declines for sequential IO?

--D

> - Random IO
>   Base XFS                : 22.6 MiB/s
>   Parallel Writeback XFS  : 22.9 MiB/s
>   Base EXT4               : 22.5 MiB/s
>   Parallel Writeback EXT4 : 20.9 MiB/s
>
> - Sequential IO
>   Base XFS                : 156 MiB/s
>   Parallel Writeback XFS  : 133 MiB/s (-14.7%)
>   Base EXT4               : 147 MiB/s
>   Parallel Writeback EXT4 : 124 MiB/s (-15.6%)
>
> -Kundan

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-07-02 18:43     ` Darrick J. Wong
@ 2025-07-03 13:05       ` Christoph Hellwig
  2025-07-04 7:02          ` Kundan Kumar
  2025-07-07 15:47         ` Jan Kara
  0 siblings, 2 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-07-03 13:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Kundan Kumar, Andrew Morton, Kundan Kumar, jaegeuk, chao, viro,
	brauner, jack, miklos, agruenba, trondmy, anna, willy, mcgrof,
	clm, david, amir73il, axboe, hch, ritesh.list, dave, p.raghav,
	da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs,
	linux-mm, gost.dev

On Wed, Jul 02, 2025 at 11:43:12AM -0700, Darrick J. Wong wrote:
> > On a spinning disk, random IO bandwidth remains unchanged, while sequential
> > IO performance declines. However, setting nr_wb_ctx = 1 via configurable
> > writeback (planned in next version) eliminates the decline.
> >
> > echo 1 > /sys/class/bdi/8:16/nwritebacks
> >
> > We can fetch the device queue's rotational property and allocate BDI with
> > nr_wb_ctx = 1 for rotational disks. Hope this is a viable solution for
> > spinning disks?
>
> Sounds good to me, spinning rust isn't known for iops.
>
> Though: What about a raid0 of spinning rust? Do you see the same
> declines for sequential IO?

Well, even for a raid0, multiple I/O streams will degrade performance
on a disk. Of course many real life workloads will have multiple
I/O streams anyway.

I think the important part is to have:

 a) sane defaults
 b) an easy way for the file system and/or user to override the default

For a), a single thread for rotational devices is a good default. For
file systems that drive multiple spindles independently or do
compression, multiple threads might still make sense.

For b), one big issue is that right now the whole writeback handling is
per-bdi and not per superblock. So maybe the first step needs to be
to move the writeback to the superblock instead of bdi? If someone
uses partitions and multiple file systems on spinning rust these days,
reducing the number of writeback threads isn't really going to save
their day either.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-07-03 13:05       ` Christoph Hellwig
@ 2025-07-04 7:02          ` Kundan Kumar
  2025-07-07 14:28           ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-07-04 7:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Andrew Morton, Kundan Kumar, jaegeuk, chao,
	viro, Christian Brauner, jack, miklos, agruenba, Trond Myklebust,
	anna, Matthew Wilcox, mcgrof, clm, david, amir73il, Jens Axboe,
	ritesh.list, dave, p.raghav, da.gomez, linux-f2fs-devel,
	linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev

On Thu, Jul 3, 2025 at 6:35 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Jul 02, 2025 at 11:43:12AM -0700, Darrick J. Wong wrote:
> > > On a spinning disk, random IO bandwidth remains unchanged, while sequential
> > > IO performance declines. However, setting nr_wb_ctx = 1 via configurable
> > > writeback (planned in next version) eliminates the decline.
> > >
> > > echo 1 > /sys/class/bdi/8:16/nwritebacks
> > >
> > > We can fetch the device queue's rotational property and allocate BDI with
> > > nr_wb_ctx = 1 for rotational disks. Hope this is a viable solution for
> > > spinning disks?
> >
> > Sounds good to me, spinning rust isn't known for iops.
> >
> > Though: What about a raid0 of spinning rust? Do you see the same
> > declines for sequential IO?
>
> Well, even for a raid0, multiple I/O streams will degrade performance
> on a disk. Of course many real life workloads will have multiple
> I/O streams anyway.
>
> I think the important part is to have:
>
>  a) sane defaults
>  b) an easy way for the file system and/or user to override the default
>
> For a), a single thread for rotational devices is a good default. For
> file systems that drive multiple spindles independently or do
> compression, multiple threads might still make sense.
>
> For b), one big issue is that right now the whole writeback handling is
> per-bdi and not per superblock. So maybe the first step needs to be
> to move the writeback to the superblock instead of bdi?

The bdi is tied to the underlying block device, and helps with
device-bandwidth-specific throttling, dirty ratelimiting, etc. Making it
per superblock will need duplicating the device-specific throttling and
ratelimiting in the superblock, which will be difficult.

> If someone
> uses partitions and multiple file systems on spinning rust these days,
> reducing the number of writeback threads isn't really going to save
> their day either.
>

In this case, with a single wb thread, multiple partitions/filesystems
use the same bdi and we fall back to the base case; will that not help?

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-07-04  7:02                 ` Kundan Kumar
@ 2025-07-07 14:28                   ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-07-07 14:28 UTC (permalink / raw)
To: Kundan Kumar
Cc: Christoph Hellwig, Darrick J. Wong, Andrew Morton, Kundan Kumar,
    jaegeuk, chao, viro, Christian Brauner, jack, miklos, agruenba,
    Trond Myklebust, anna, Matthew Wilcox, mcgrof, clm, david, amir73il,
    Jens Axboe, ritesh.list, dave, p.raghav, da.gomez, linux-f2fs-devel,
    linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev

On Fri, Jul 04, 2025 at 12:32:51PM +0530, Kundan Kumar wrote:
> The bdi is tied to the underlying block device, and it handles the
> device-specific bandwidth throttling, dirty ratelimiting, etc. Making
> writeback per-superblock would mean duplicating that device-specific
> throttling and ratelimiting at the superblock level, which will be
> difficult.

Yes, but my point is that, compared to actually having high-performing
writeback code, that doesn't matter. What is the use case for actually
having production workloads (vs just a root fs and EFI partition) on a
single SSD or hard disk?

> > If someone
> > uses partitions and multiple file systems on spinning rust these
> > days, reducing the number of writeback threads isn't really going to
> > save their day either.
>
> In this case, with a single wb thread, multiple partitions/filesystems
> use the same bdi and we fall back to the base case; will that not
> help?

If you have multiple file systems sharing a BDI, they can have
potentially very different requirements, and they can trivially get in
each other's way. Or in other words, we can't do anything remotely
smart without the file system being fully in charge.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-07-03 13:05               ` Christoph Hellwig
  2025-07-04  7:02                 ` Kundan Kumar
@ 2025-07-07 15:47                 ` Jan Kara
  1 sibling, 0 replies; 40+ messages in thread
From: Jan Kara @ 2025-07-07 15:47 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Darrick J. Wong, Kundan Kumar, Andrew Morton, Kundan Kumar,
    jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
    anna, willy, mcgrof, clm, david, amir73il, axboe, ritesh.list,
    dave, p.raghav, da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2,
    linux-nfs, linux-mm, gost.dev

On Thu 03-07-25 15:05:00, Christoph Hellwig wrote:
> On Wed, Jul 02, 2025 at 11:43:12AM -0700, Darrick J. Wong wrote:
> > > On a spinning disk, random IO bandwidth remains unchanged, while
> > > sequential IO performance declines. However, setting nr_wb_ctx = 1
> > > via configurable writeback (planned in the next version) eliminates
> > > the decline.
> > >
> > > echo 1 > /sys/class/bdi/8:16/nwritebacks
> > >
> > > We can fetch the device queue's rotational property and allocate
> > > the BDI with nr_wb_ctx = 1 for rotational disks. Hope this is a
> > > viable solution for spinning disks?
> >
> > Sounds good to me, spinning rust isn't known for iops.
> >
> > Though: What about a raid0 of spinning rust? Do you see the same
> > declines for sequential IO?
>
> Well, even for a raid0, multiple I/O streams will degrade performance
> on a disk. Of course many real-life workloads will have multiple I/O
> streams anyway.
>
> I think the important part is to have:
>
> a) sane defaults
> b) an easy way for the file system and/or user to override the default
>
> For a), a single thread for rotational devices is a good default. For
> file systems that drive multiple spindles independently or do
> compression, multiple threads might still make sense.
>
> For b), one big issue is that right now the whole writeback handling is
> per-bdi and not per-superblock. So maybe the first step needs to be to
> move the writeback to the superblock instead of the bdi? If someone
> uses partitions and multiple file systems on spinning rust these days,
> reducing the number of writeback threads isn't really going to save
> their day either.

We have had requests to move the writeback infrastructure to be per-sb
in the past, mostly so that the filesystem has better control of the
writeback process (e.g. selection of inodes etc.). After some thought I
tend to agree that, today, setups where we have multiple filesystems
over the same bdi and end up doing writeback to several of them in
parallel should be mostly limited to desktops / laptops / small
servers. And there you usually have only one main data filesystem -
e.g. /home/ - and you don't tend to write that much to your /
filesystem. Although there could be exceptions like occasional large
writes to /tmp, news server updates or similar. Anyway, in these cases
I'd expect the IO scheduler (BFQ for rotational disks, where this
really matters) to still achieve decent IO locality, but it would be
good to verify what the impact is.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-05-29 11:14 ` [PATCH 00/13] Parallelizing filesystem writeback Kundan Kumar
                     ` (13 preceding siblings ...)
  2025-05-30  3:37 ` [PATCH 00/13] Parallelizing filesystem writeback Andrew Morton
@ 2025-06-02 14:19 ` Christoph Hellwig
  2025-06-03  9:16   ` Anuj Gupta/Anuj Gupta
  14 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-02 14:19 UTC (permalink / raw)
To: Kundan Kumar
Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
    anna, akpm, willy, mcgrof, clm, david, amir73il, axboe, hch,
    ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
    linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev

On Thu, May 29, 2025 at 04:44:51PM +0530, Kundan Kumar wrote:
> Number of writeback contexts
> ===========================
> The plan is to keep the nr_wb_ctx as 1, ensuring default single threaded
> behavior. However, we set the number of writeback contexts equal to
> number of CPUs in the current version. Later we will make it configurable
> using a mount option, allowing filesystems to choose the optimal number
> of writeback contexts.

Well, the proper thing would be to figure out a good default and not
just keep things as-is, no?

> IOPS and throughput
> ===================
> We see significant improvement in IOPS across several filesystem on both
> PMEM and NVMe devices.
>
> Performance gains:
> - On PMEM:
>   Base XFS                : 544 MiB/s
>   Parallel Writeback XFS  : 1015 MiB/s (+86%)
>   Base EXT4               : 536 MiB/s
>   Parallel Writeback EXT4 : 1047 MiB/s (+95%)
>
> - On NVMe:
>   Base XFS                : 651 MiB/s
>   Parallel Writeback XFS  : 808 MiB/s (+24%)
>   Base EXT4               : 494 MiB/s
>   Parallel Writeback EXT4 : 797 MiB/s (+61%)

What workload was this?

How many CPU cores did the system have, how many AGs/BGs did the file
systems have? What SSD/Pmem was this? Did this change the write amp as
measured by the media writes on the NVMe SSD?

Also I'd be really curious to see numbers on hard drives.

> We also see that there is no increase in filesystem fragmentation
> # of extents:
> - On XFS (on PMEM):
>   Base XFS                : 1964
>   Parallel Writeback XFS  : 1384
>
> - On EXT4 (on PMEM):
>   Base EXT4               : 21
>   Parallel Writeback EXT4 : 11

How were the numbers of extents counted, given that they look so wildly
different?

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-02 14:19 ` Christoph Hellwig
@ 2025-06-03  9:16   ` Anuj Gupta/Anuj Gupta
  2025-06-03 13:24     ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Anuj Gupta/Anuj Gupta @ 2025-06-03 9:16 UTC (permalink / raw)
To: Christoph Hellwig, Kundan Kumar
Cc: jaegeuk, chao, viro, brauner, jack, miklos, agruenba, trondmy,
    anna, akpm, willy, mcgrof, clm, david, amir73il, axboe,
    ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
    linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev, anuj1072538,
    kundanthebest

On 6/2/2025 7:49 PM, Christoph Hellwig wrote:
> On Thu, May 29, 2025 at 04:44:51PM +0530, Kundan Kumar wrote:
> Well, the proper thing would be to figure out a good default and not
> just keep things as-is, no?

We observed that some filesystems, such as Btrfs, don't benefit from
this infra due to their distinct writeback architecture. To preserve
current behavior and avoid unintended changes for such filesystems, we
have kept nr_wb_ctx = 1 as the default. Filesystems that can take
advantage of parallel writeback (xfs, ext4) can opt in via a mount
option. We also wanted to reduce risk during the initial integration,
and hence kept it as opt-in.

>
>> IOPS and throughput
>> ===================
>> We see significant improvement in IOPS across several filesystem on both
>> PMEM and NVMe devices.
>>
>> Performance gains:
>> - On PMEM:
>>   Base XFS                : 544 MiB/s
>>   Parallel Writeback XFS  : 1015 MiB/s (+86%)
>>   Base EXT4               : 536 MiB/s
>>   Parallel Writeback EXT4 : 1047 MiB/s (+95%)
>>
>> - On NVMe:
>>   Base XFS                : 651 MiB/s
>>   Parallel Writeback XFS  : 808 MiB/s (+24%)
>>   Base EXT4               : 494 MiB/s
>>   Parallel Writeback EXT4 : 797 MiB/s (+61%)
>
> What workload was this?

Number of CPUs = 12
System RAM = 16G
For XFS number of AGs = 4
For EXT4 BG count = 28616
Used PMEM of 6G and NVMe SSD of 3.84 TB

fio command line:
fio --directory=/mnt --name=test --bs=4k --iodepth=1024 --rw=randwrite
--ioengine=io_uring --time_based=1 --runtime=60 --numjobs=12
--size=450M --direct=0 --eta-interval=1 --eta-newline=1
--group_reporting

Will measure the write-amp and share.

> How many CPU cores did the system have, how many AGs/BGs did the file
> systems have? What SSD/Pmem was this? Did this change the write amp as
> measured by the media writes on the NVMe SSD?
>
> Also I'd be really curious to see numbers on hard drives.
>
>> We also see that there is no increase in filesystem fragmentation
>> # of extents:
>> - On XFS (on PMEM):
>>   Base XFS                : 1964
>>   Parallel Writeback XFS  : 1384
>>
>> - On EXT4 (on PMEM):
>>   Base EXT4               : 21
>>   Parallel Writeback EXT4 : 11
>
> How were the numbers of extents counted, given that they look so
> wildly different?

Issued a 1G random write using fio with fallocate=none and then
measured the number of extents after a delay of 30 secs:
fio --filename=/mnt/testfile --name=test --bs=4k --iodepth=1024
--rw=randwrite --ioengine=io_uring --fallocate=none --numjobs=1
--size=1G --direct=0 --eta-interval=1 --eta-newline=1 --group_reporting

For xfs used this command:
xfs_io -c "stat" /mnt/testfile
And for ext4 used this:
filefrag /mnt/testfile

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-03  9:16   ` Anuj Gupta/Anuj Gupta
@ 2025-06-03 13:24     ` Christoph Hellwig
  2025-06-03 13:52       ` Anuj gupta
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-03 13:24 UTC (permalink / raw)
To: Anuj Gupta/Anuj Gupta
Cc: Christoph Hellwig, Kundan Kumar, jaegeuk, chao, viro, brauner,
    jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm,
    david, amir73il, axboe, ritesh.list, djwong, dave, p.raghav,
    da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs,
    linux-mm, gost.dev, anuj1072538, kundanthebest

On Tue, Jun 03, 2025 at 02:46:20PM +0530, Anuj Gupta/Anuj Gupta wrote:
> On 6/2/2025 7:49 PM, Christoph Hellwig wrote:
> > On Thu, May 29, 2025 at 04:44:51PM +0530, Kundan Kumar wrote:
> > Well, the proper thing would be to figure out a good default and not
> > just keep things as-is, no?
>
> We observed that some filesystems, such as Btrfs, don't benefit from
> this infra due to their distinct writeback architecture. To preserve
> current behavior and avoid unintended changes for such filesystems, we
> have kept nr_wb_ctx = 1 as the default. Filesystems that can take
> advantage of parallel writeback (xfs, ext4) can opt in via a mount
> option. We also wanted to reduce risk during the initial integration,
> and hence kept it as opt-in.

A mount option is about the worst possible interface for behavior that
depends on the file system implementation and possibly hardware
characteristics. This needs to be set by the file systems, possibly
using generic helpers based on hardware information.

> Used PMEM of 6G

battery/capacitor backed DRAM, or optane?

> and NVMe SSD of 3.84 TB

Consumer drive, enterprise drive?

> For xfs used this command:
> xfs_io -c "stat" /mnt/testfile
> And for ext4 used this:
> filefrag /mnt/testfile

filefrag merges contiguous extents, and only counts up for
discontiguous mappings, while fsxattr.nextents counts all extents even
if they are contiguous. So you probably want to use filefrag for both
cases.

^ permalink raw reply	[flat|nested] 40+ messages in thread
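For reference, filefrag's merged count can be reproduced directly with
the FIEMAP ioctl that filefrag itself uses. Below is a rough userspace
sketch, written for this discussion rather than taken from the thread,
that reports both the raw mapping count and a filefrag-style count that
only increments on physically discontiguous extents:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	unsigned int raw = 0, merged = 0;
	__u64 start = 0, prev_phys_end = 0;
	int fd, last = 0;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	/* room for the header plus a batch of 32 extent records */
	fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));

	while (!last) {
		fm->fm_start = start;
		fm->fm_length = ~0ULL;
		fm->fm_flags = FIEMAP_FLAG_SYNC;
		fm->fm_extent_count = 32;
		if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0 ||
		    !fm->fm_mapped_extents)
			break;
		for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
			struct fiemap_extent *e = &fm->fm_extents[i];

			raw++;
			/* filefrag only counts discontiguous mappings */
			if (e->fe_physical != prev_phys_end)
				merged++;
			prev_phys_end = e->fe_physical + e->fe_length;
			start = e->fe_logical + e->fe_length;
			if (e->fe_flags & FIEMAP_EXTENT_LAST)
				last = 1;
		}
	}
	printf("raw extents: %u, filefrag-style: %u\n", raw, merged);
	close(fd);
	return 0;
}

This would give the same methodology on xfs and ext4, avoiding the
filefrag-vs-fsxattr.nextents mismatch pointed out above.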
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-03 13:24     ` Christoph Hellwig
@ 2025-06-03 13:52       ` Anuj gupta
  2025-06-03 14:04         ` Christoph Hellwig
  2025-06-04  9:22         ` Kundan Kumar
  0 siblings, 2 replies; 40+ messages in thread
From: Anuj gupta @ 2025-06-03 13:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Anuj Gupta/Anuj Gupta, Kundan Kumar, jaegeuk, chao, viro, brauner,
    jack, miklos, agruenba, trondmy, anna, akpm, willy, mcgrof, clm,
    david, amir73il, axboe, ritesh.list, djwong, dave, p.raghav,
    da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2, linux-nfs,
    linux-mm, gost.dev, kundanthebest

On Tue, Jun 3, 2025 at 6:54 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Jun 03, 2025 at 02:46:20PM +0530, Anuj Gupta/Anuj Gupta wrote:
> > On 6/2/2025 7:49 PM, Christoph Hellwig wrote:
> > > On Thu, May 29, 2025 at 04:44:51PM +0530, Kundan Kumar wrote:
> > > Well, the proper thing would be to figure out a good default and
> > > not just keep things as-is, no?
> >
> > We observed that some filesystems, such as Btrfs, don't benefit from
> > this infra due to their distinct writeback architecture. To preserve
> > current behavior and avoid unintended changes for such filesystems,
> > we have kept nr_wb_ctx = 1 as the default. Filesystems that can take
> > advantage of parallel writeback (xfs, ext4) can opt in via a mount
> > option. We also wanted to reduce risk during the initial
> > integration, and hence kept it as opt-in.
>
> A mount option is about the worst possible interface for behavior that
> depends on the file system implementation and possibly hardware
> characteristics. This needs to be set by the file systems, possibly
> using generic helpers based on hardware information.

Right, that makes sense. Instead of using a mount option, we can
introduce generic helpers to initialize multiple writeback contexts
based on underlying hardware characteristics, e.g., the number of CPUs
or the NUMA topology. Filesystems like XFS and EXT4 can then call these
helpers during mount to opt into parallel writeback in a controlled
way.

>
> > Used PMEM of 6G
>
> battery/capacitor backed DRAM, or optane?

We emulated PMEM using DRAM by following the steps here:
https://www.intel.com/content/www/us/en/developer/articles/training/how-to-emulate-persistent-memory-on-an-intel-architecture-server.html

>
> > and NVMe SSD of 3.84 TB
>
> Consumer drive, enterprise drive?

It's an enterprise-grade drive, a Samsung PM1733.

>
> > For xfs used this command:
> > xfs_io -c "stat" /mnt/testfile
> > And for ext4 used this:
> > filefrag /mnt/testfile
>
> filefrag merges contiguous extents, and only counts up for
> discontiguous mappings, while fsxattr.nextents counts all extents even
> if they are contiguous. So you probably want to use filefrag for both
> cases.

Got it, thanks for the clarification. We'll switch to using filefrag
and will share updated extent count numbers accordingly.

^ permalink raw reply	[flat|nested] 40+ messages in thread
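One possible shape for such a mount-time helper, as a sketch only: the
function name super_setup_bdi_parallel() and the cap value are invented
for illustration, while nr_wb_ctx is the field from the posted series
and num_online_cpus() is an existing kernel symbol.

#include <linux/backing-dev.h>
#include <linux/cpumask.h>

/*
 * Hypothetical generic helper a filesystem could call from its
 * fill_super path to opt in: derive the context count from hardware
 * topology instead of a mount option.
 */
static int super_setup_bdi_parallel(struct super_block *sb)
{
	struct backing_dev_info *bdi = sb->s_bdi;

	/* scale with CPUs, but keep a sane upper bound */
	bdi->nr_wb_ctx = min_t(unsigned int, num_online_cpus(), 16);
	return 0;
}

A NUMA-aware variant might scale with num_online_nodes() instead;
either way the default lives in one place, and a filesystem like btrfs
that cannot use it simply never calls the helper.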
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-03 13:52       ` Anuj gupta
@ 2025-06-03 14:04         ` Christoph Hellwig
  2025-06-03 14:05           ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-03 14:04 UTC (permalink / raw)
To: Anuj gupta
Cc: Christoph Hellwig, Anuj Gupta/Anuj Gupta, Kundan Kumar, jaegeuk,
    chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm,
    willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, djwong,
    dave, p.raghav, da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2,
    linux-nfs, linux-mm, gost.dev, kundanthebest

On Tue, Jun 03, 2025 at 07:22:18PM +0530, Anuj gupta wrote:
> > A mount option is about the worst possible interface for behavior
> > that depends on the file system implementation and possibly hardware
> > characteristics. This needs to be set by the file systems, possibly
> > using generic helpers based on hardware information.
>
> Right, that makes sense. Instead of using a mount option, we can
> introduce generic helpers to initialize multiple writeback contexts
> based on underlying hardware characteristics, e.g., the number of CPUs
> or the NUMA topology. Filesystems like XFS and EXT4 can then call
> these helpers during mount to opt into parallel writeback in a
> controlled way.

Yes. A mount option might still be useful to override this default,
but it should not be needed for the normal use case.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-03 14:04         ` Christoph Hellwig
@ 2025-06-03 14:05           ` Christoph Hellwig
  2025-06-06  5:04             ` Kundan Kumar
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-03 14:05 UTC (permalink / raw)
To: Anuj gupta
Cc: Christoph Hellwig, Anuj Gupta/Anuj Gupta, Kundan Kumar, jaegeuk,
    chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm,
    willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, djwong,
    dave, p.raghav, da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2,
    linux-nfs, linux-mm, gost.dev, kundanthebest

On Tue, Jun 03, 2025 at 04:04:45PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 03, 2025 at 07:22:18PM +0530, Anuj gupta wrote:
> > > A mount option is about the worst possible interface for behavior
> > > that depends on the file system implementation and possibly
> > > hardware characteristics. This needs to be set by the file
> > > systems, possibly using generic helpers based on hardware
> > > information.
> >
> > Right, that makes sense. Instead of using a mount option, we can
> > introduce generic helpers to initialize multiple writeback contexts
> > based on underlying hardware characteristics, e.g., the number of
> > CPUs or the NUMA topology. Filesystems like XFS and EXT4 can then
> > call these helpers during mount to opt into parallel writeback in a
> > controlled way.
>
> Yes. A mount option might still be useful to override this default,
> but it should not be needed for the normal use case.

... actually a sysfs file on the bdi is probably the better interface
for the override than a mount option.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-03 14:05           ` Christoph Hellwig
@ 2025-06-06  5:04             ` Kundan Kumar
  2025-06-09  4:00               ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-06-06 5:04 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Anuj gupta, Anuj Gupta/Anuj Gupta, Kundan Kumar, jaegeuk, chao,
    viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm, willy,
    mcgrof, clm, david, amir73il, axboe, ritesh.list, djwong, dave,
    p.raghav, da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2,
    linux-nfs, linux-mm, gost.dev

On Tue, Jun 3, 2025 at 7:35 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Jun 03, 2025 at 04:04:45PM +0200, Christoph Hellwig wrote:
> > On Tue, Jun 03, 2025 at 07:22:18PM +0530, Anuj gupta wrote:
> > > > A mount option is about the worst possible interface for behavior
> > > > that depends on the file system implementation and possibly
> > > > hardware characteristics. This needs to be set by the file
> > > > systems, possibly using generic helpers based on hardware
> > > > information.
> > >
> > > Right, that makes sense. Instead of using a mount option, we can
> > > introduce generic helpers to initialize multiple writeback contexts
> > > based on underlying hardware characteristics, e.g., the number of
> > > CPUs or the NUMA topology. Filesystems like XFS and EXT4 can then
> > > call these helpers during mount to opt into parallel writeback in a
> > > controlled way.
> >
> > Yes. A mount option might still be useful to override this default,
> > but it should not be needed for the normal use case.
>
> ... actually a sysfs file on the bdi is probably the better interface
> for the override than a mount option.

Hi Christoph,

Thanks for the suggestion - I agree the default should come from a
filesystem-level helper, not a mount option.

I looked into the sysfs override idea, but one challenge is that
nr_wb_ctx must be finalized before any writes occur. That leaves only a
narrow window - after the bdi is registered but before any inodes are
dirtied - where changing it is safe.

This makes the sysfs knob a bit fragile unless we tightly guard it
(e.g., mark it read-only after init). A mount option, even just as an
override, feels simpler and more predictable, since it's set before the
FS becomes active.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-06  5:04             ` Kundan Kumar
@ 2025-06-09  4:00               ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-09 4:00 UTC (permalink / raw)
To: Kundan Kumar
Cc: Christoph Hellwig, Anuj gupta, Anuj Gupta/Anuj Gupta, Kundan
    Kumar, jaegeuk, chao, viro, brauner, jack, miklos, agruenba,
    trondmy, anna, akpm, willy, mcgrof, clm, david, amir73il, axboe,
    ritesh.list, djwong, dave, p.raghav, da.gomez, linux-f2fs-devel,
    linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev

On Fri, Jun 06, 2025 at 10:34:42AM +0530, Kundan Kumar wrote:
> Thanks for the suggestion - I agree the default should come from a
> filesystem-level helper, not a mount option.
>
> I looked into the sysfs override idea, but one challenge is that
> nr_wb_ctx must be finalized before any writes occur. That leaves only
> a narrow window - after the bdi is registered but before any inodes
> are dirtied - where changing it is safe.
>
> This makes the sysfs knob a bit fragile unless we tightly guard it
> (e.g., mark it read-only after init). A mount option, even just as an
> override, feels simpler and more predictable, since it's set before
> the FS becomes active.

The mount option has a few issues:

 - the common VFS code only supports flags, not value options, so you'd
   have to wire this up in every file system
 - some file systems might not want to allow changing it
 - changing it at runtime is actually quite useful

So you'll need to quiesce writeback or maybe even do a full fs freeze
when changing it at runtime, but that seems ok for a change this
invasive.

^ permalink raw reply	[flat|nested] 40+ messages in thread
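A sketch of what that sysfs override could look like, assuming the
quiesce-first approach suggested above. The bdi device-attribute
pattern (dev_get_drvdata() on the bdi device) and kstrtouint() are
existing kernel idioms; bdi_quiesce_writeback(), bdi_resize_wb_ctx(),
and bdi_resume_writeback() are invented names standing in for the real
quiesce/resize plumbing this would need:

/* hypothetical bdi sysfs attribute, not from the posted series */
static ssize_t nwritebacks_store(struct device *dev,
				 struct device_attribute *attr,
				 const char *buf, size_t count)
{
	struct backing_dev_info *bdi = dev_get_drvdata(dev);
	unsigned int nr;
	int err;

	err = kstrtouint(buf, 10, &nr);
	if (err)
		return err;
	if (!nr || nr > num_online_cpus())
		return -EINVAL;

	/* hypothetical: flush and park all wb contexts first ... */
	err = bdi_quiesce_writeback(bdi);
	if (err)
		return err;
	err = bdi_resize_wb_ctx(bdi, nr);	/* ... then resize */
	bdi_resume_writeback(bdi);
	return err ? err : count;
}

The key property is that the resize only happens with no writeback in
flight, which removes the narrow-window fragility raised earlier.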
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-03 13:52       ` Anuj gupta
  2025-06-03 14:04         ` Christoph Hellwig
@ 2025-06-04  9:22         ` Kundan Kumar
  2025-06-11 15:51           ` Darrick J. Wong
  1 sibling, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-06-04 9:22 UTC (permalink / raw)
To: Anuj gupta
Cc: Christoph Hellwig, Anuj Gupta/Anuj Gupta, Kundan Kumar, jaegeuk,
    chao, viro, brauner, jack, miklos, agruenba, trondmy, anna, akpm,
    willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, djwong,
    dave, p.raghav, da.gomez, linux-f2fs-devel, linux-fsdevel, gfs2,
    linux-nfs, linux-mm, gost.dev

> > > For xfs used this command:
> > > xfs_io -c "stat" /mnt/testfile
> > > And for ext4 used this:
> > > filefrag /mnt/testfile
> >
> > filefrag merges contiguous extents, and only counts up for
> > discontiguous mappings, while fsxattr.nextents counts all extents
> > even if they are contiguous. So you probably want to use filefrag
> > for both cases.
>
> Got it, thanks for the clarification. We'll switch to using filefrag
> and will share updated extent count numbers accordingly.

Using filefrag, we recorded extent counts on xfs and ext4 at three
stages:
a. Just after a 1G random write
b. After a 30-second wait
c. After unmounting and remounting the filesystem

xfs
Base
a. 6251  b. 2526  c. 2526
Parallel writeback
a. 6183  b. 2326  c. 2326

ext4
Base
a. 7080  b. 7080  c. 11
Parallel writeback
a. 5961  b. 5961  c. 11

Used the same fio command line as earlier:
fio --filename=/mnt/testfile --name=test --bs=4k --iodepth=1024
--rw=randwrite --ioengine=io_uring --fallocate=none --numjobs=1
--size=1G --direct=0 --eta-interval=1 --eta-newline=1 --group_reporting

filefrag command:
filefrag /mnt/testfile

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-04  9:22         ` Kundan Kumar
@ 2025-06-11 15:51           ` Darrick J. Wong
  2025-06-24  5:59             ` Kundan Kumar
  1 sibling, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2025-06-11 15:51 UTC (permalink / raw)
To: Kundan Kumar
Cc: Anuj gupta, Christoph Hellwig, Anuj Gupta/Anuj Gupta, Kundan
    Kumar, jaegeuk, chao, viro, brauner, jack, miklos, agruenba,
    trondmy, anna, akpm, willy, mcgrof, clm, david, amir73il, axboe,
    ritesh.list, dave, p.raghav, da.gomez, linux-f2fs-devel,
    linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev

On Wed, Jun 04, 2025 at 02:52:34PM +0530, Kundan Kumar wrote:
> > > > For xfs used this command:
> > > > xfs_io -c "stat" /mnt/testfile
> > > > And for ext4 used this:
> > > > filefrag /mnt/testfile
> > >
> > > filefrag merges contiguous extents, and only counts up for
> > > discontiguous mappings, while fsxattr.nextents counts all extents
> > > even if they are contiguous. So you probably want to use filefrag
> > > for both cases.
> >
> > Got it, thanks for the clarification. We'll switch to using filefrag
> > and will share updated extent count numbers accordingly.
>
> Using filefrag, we recorded extent counts on xfs and ext4 at three
> stages:
> a. Just after a 1G random write
> b. After a 30-second wait
> c. After unmounting and remounting the filesystem
>
> xfs
> Base
> a. 6251  b. 2526  c. 2526
> Parallel writeback
> a. 6183  b. 2326  c. 2326

Interesting that the mapping record count goes down...

I wonder, you said the xfs filesystem has 4 AGs and 12 cores, so I
guess wb_ctx_arr[] is 12? I wonder, do you see a knee point in
writeback throughput when the # of wb contexts exceeds the AG count?

Though I guess for the (hopefully common) case of pure overwrites, we
don't have to do any metadata updates so we wouldn't really hit a
scaling limit due to ag count or log contention or whatever. Does that
square with what you see?

> ext4
> Base
> a. 7080  b. 7080  c. 11
> Parallel writeback
> a. 5961  b. 5961  c. 11

Hum, that's particularly ... interesting. I wonder what the mapping
count behaviors are when you turn off delayed allocation?

--D

> Used the same fio command line as earlier:
> fio --filename=/mnt/testfile --name=test --bs=4k --iodepth=1024
> --rw=randwrite --ioengine=io_uring --fallocate=none --numjobs=1
> --size=1G --direct=0 --eta-interval=1 --eta-newline=1
> --group_reporting
>
> filefrag command:
> filefrag /mnt/testfile

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-11 15:51           ` Darrick J. Wong
@ 2025-06-24  5:59             ` Kundan Kumar
  2025-07-02 18:44               ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Kundan Kumar @ 2025-06-24 5:59 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Anuj gupta, Christoph Hellwig, Anuj Gupta/Anuj Gupta, Kundan
    Kumar, jaegeuk, chao, viro, brauner, jack, miklos, agruenba,
    trondmy, anna, akpm, willy, mcgrof, clm, david, amir73il, axboe,
    ritesh.list, dave, p.raghav, da.gomez, linux-f2fs-devel,
    linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev

On Wed, Jun 11, 2025 at 9:21 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Jun 04, 2025 at 02:52:34PM +0530, Kundan Kumar wrote:
> > Using filefrag, we recorded extent counts on xfs and ext4 at three
> > stages:
> > a. Just after a 1G random write
> > b. After a 30-second wait
> > c. After unmounting and remounting the filesystem
> >
> > xfs
> > Base
> > a. 6251  b. 2526  c. 2526
> > Parallel writeback
> > a. 6183  b. 2326  c. 2326
>
> Interesting that the mapping record count goes down...
>
> I wonder, you said the xfs filesystem has 4 AGs and 12 cores, so I
> guess wb_ctx_arr[] is 12? I wonder, do you see a knee point in
> writeback throughput when the # of wb contexts exceeds the AG count?
>
> Though I guess for the (hopefully common) case of pure overwrites, we
> don't have to do any metadata updates so we wouldn't really hit a
> scaling limit due to ag count or log contention or whatever. Does
> that square with what you see?
>

Hi Darrick,

We analyzed AG count vs. number of writeback contexts to identify any
knee point. Earlier, wb_ctx_arr[] was fixed at 12; now we varied
nr_wb_ctx and measured the impact.

We implemented a configurable number of writeback contexts to measure
throughput more easily. This feature will be exposed in the next
series. To configure, we used:
echo <nr_wb_ctx> > /sys/class/bdi/259:2/nwritebacks

In our test, writing 1G across 12 directories, bandwidth improved up
to the number of allocation groups (AGs), with a clear knee point
there, and gains tapered off beyond it. We also see a large increase
in bandwidth, about 16x, from base to nr_wb_ctx = 6.
Base (single threaded)              : 9799KiB/s
Parallel Writeback (nr_wb_ctx = 1)  : 9727KiB/s
Parallel Writeback (nr_wb_ctx = 2)  : 18.1MiB/s
Parallel Writeback (nr_wb_ctx = 3)  : 46.4MiB/s
Parallel Writeback (nr_wb_ctx = 4)  : 135MiB/s
Parallel Writeback (nr_wb_ctx = 5)  : 160MiB/s
Parallel Writeback (nr_wb_ctx = 6)  : 163MiB/s
Parallel Writeback (nr_wb_ctx = 7)  : 162MiB/s
Parallel Writeback (nr_wb_ctx = 8)  : 154MiB/s
Parallel Writeback (nr_wb_ctx = 9)  : 152MiB/s
Parallel Writeback (nr_wb_ctx = 10) : 145MiB/s
Parallel Writeback (nr_wb_ctx = 11) : 145MiB/s
Parallel Writeback (nr_wb_ctx = 12) : 138MiB/s

System config
===========
Number of CPUs = 12
System RAM = 9G
For XFS number of AGs = 4
Used NVMe SSD of 3.84 TB (Enterprise SSD PM1733a)

Script
=====
mkfs.xfs -f /dev/nvme0n1
mount /dev/nvme0n1 /mnt
echo <num_wb_ctx> > /sys/class/bdi/259:2/nwritebacks
sync
echo 3 > /proc/sys/vm/drop_caches

for i in {1..12}; do
    mkdir -p /mnt/dir$i
done

fio job_nvme.fio

umount /mnt
echo 3 > /proc/sys/vm/drop_caches
sync

fio job
=====
[global]
bs=4k
iodepth=1
rw=randwrite
ioengine=io_uring
nrfiles=12
numjobs=1        # Each job writes to a different file
size=1g
direct=0         # Buffered I/O to trigger writeback
group_reporting=1
create_on_open=1
name=test

[job1]
directory=/mnt/dir1

[job2]
directory=/mnt/dir2
...
...
[job12]
directory=/mnt/dir12

> > ext4
> > Base
> > a. 7080  b. 7080  c. 11
> > Parallel writeback
> > a. 5961  b. 5961  c. 11
>
> Hum, that's particularly ... interesting. I wonder what the mapping
> count behaviors are when you turn off delayed allocation?
>
> --D
>

I attempted to disable delayed allocation by setting allocsize=4096
during mount (mount -o allocsize=4096 /dev/pmem0 /mnt), but still
observed a reduction in file fragments after a delay. Is there
something I'm overlooking?

-Kundan

^ permalink raw reply	[flat|nested] 40+ messages in thread
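Given the knee near the AG count in the numbers above, one hedged
possibility (not part of the posted series) is for XFS to feed an
AG-derived cap into whatever default helper eventually lands:

/*
 * Hypothetical XFS-side default: scale writeback contexts with the
 * allocation group count, since the gains above flatten once the
 * context count passes the number of AGs. The wb-ctx helper name is
 * invented; mp->m_sb.sb_agcount is XFS's real AG count field.
 */
static unsigned int xfs_default_nr_wb_ctx(struct xfs_mount *mp)
{
	return min_t(unsigned int, mp->m_sb.sb_agcount,
		     num_online_cpus());
}

Note, though, that the measured peak here sits slightly above the AG
count (5-6 contexts for 4 AGs), so a small headroom factor may be worth
experimenting with before settling on an exact formula.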
* Re: [PATCH 00/13] Parallelizing filesystem writeback
  2025-06-24  5:59             ` Kundan Kumar
@ 2025-07-02 18:44               ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-07-02 18:44 UTC (permalink / raw)
To: Kundan Kumar
Cc: Anuj gupta, Christoph Hellwig, Anuj Gupta/Anuj Gupta, Kundan
    Kumar, jaegeuk, chao, viro, brauner, jack, miklos, agruenba,
    trondmy, anna, akpm, willy, mcgrof, clm, david, amir73il, axboe,
    ritesh.list, dave, p.raghav, da.gomez, linux-f2fs-devel,
    linux-fsdevel, gfs2, linux-nfs, linux-mm, gost.dev

On Tue, Jun 24, 2025 at 11:29:28AM +0530, Kundan Kumar wrote:
> On Wed, Jun 11, 2025 at 9:21 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > I wonder, you said the xfs filesystem has 4 AGs and 12 cores, so I
> > guess wb_ctx_arr[] is 12? I wonder, do you see a knee point in
> > writeback throughput when the # of wb contexts exceeds the AG count?
> >
> > Though I guess for the (hopefully common) case of pure overwrites,
> > we don't have to do any metadata updates so we wouldn't really hit a
> > scaling limit due to ag count or log contention or whatever. Does
> > that square with what you see?
>
> Hi Darrick,
>
> We analyzed AG count vs. number of writeback contexts to identify any
> knee point. Earlier, wb_ctx_arr[] was fixed at 12; now we varied
> nr_wb_ctx and measured the impact.
>
> We implemented a configurable number of writeback contexts to measure
> throughput more easily. This feature will be exposed in the next
> series. To configure, we used:
> echo <nr_wb_ctx> > /sys/class/bdi/259:2/nwritebacks
>
> In our test, writing 1G across 12 directories, bandwidth improved up
> to the number of allocation groups (AGs), with a clear knee point
> there, and gains tapered off beyond it. We also see a large increase
> in bandwidth, about 16x, from base to nr_wb_ctx = 6.
>
> Base (single threaded)              : 9799KiB/s
> Parallel Writeback (nr_wb_ctx = 1)  : 9727KiB/s
> Parallel Writeback (nr_wb_ctx = 2)  : 18.1MiB/s
> Parallel Writeback (nr_wb_ctx = 3)  : 46.4MiB/s
> Parallel Writeback (nr_wb_ctx = 4)  : 135MiB/s
> Parallel Writeback (nr_wb_ctx = 5)  : 160MiB/s
> Parallel Writeback (nr_wb_ctx = 6)  : 163MiB/s

Heh, nice!
> Parallel Writeback (nr_wb_ctx = 7)  : 162MiB/s
> Parallel Writeback (nr_wb_ctx = 8)  : 154MiB/s
> Parallel Writeback (nr_wb_ctx = 9)  : 152MiB/s
> Parallel Writeback (nr_wb_ctx = 10) : 145MiB/s
> Parallel Writeback (nr_wb_ctx = 11) : 145MiB/s
> Parallel Writeback (nr_wb_ctx = 12) : 138MiB/s
>
> System config
> ===========
> Number of CPUs = 12
> System RAM = 9G
> For XFS number of AGs = 4
> Used NVMe SSD of 3.84 TB (Enterprise SSD PM1733a)
>
> Script
> =====
> mkfs.xfs -f /dev/nvme0n1
> mount /dev/nvme0n1 /mnt
> echo <num_wb_ctx> > /sys/class/bdi/259:2/nwritebacks
> sync
> echo 3 > /proc/sys/vm/drop_caches
>
> for i in {1..12}; do
>     mkdir -p /mnt/dir$i
> done
>
> fio job_nvme.fio
>
> umount /mnt
> echo 3 > /proc/sys/vm/drop_caches
> sync
>
> fio job
> =====
> [global]
> bs=4k
> iodepth=1
> rw=randwrite
> ioengine=io_uring
> nrfiles=12
> numjobs=1        # Each job writes to a different file
> size=1g
> direct=0         # Buffered I/O to trigger writeback
> group_reporting=1
> create_on_open=1
> name=test
>
> [job1]
> directory=/mnt/dir1
>
> [job2]
> directory=/mnt/dir2
> ...
> ...
> [job12]
> directory=/mnt/dir12
>
> > > ext4
> > > Base
> > > a. 7080  b. 7080  c. 11
> > > Parallel writeback
> > > a. 5961  b. 5961  c. 11
> >
> > Hum, that's particularly ... interesting. I wonder what the mapping
> > count behaviors are when you turn off delayed allocation?
> >
> > --D
>
> I attempted to disable delayed allocation by setting allocsize=4096
> during mount (mount -o allocsize=4096 /dev/pmem0 /mnt), but still
> observed a reduction in file fragments after a delay. Is there
> something I'm overlooking?

Not that I know of. Maybe we should just take the win. :)

--D

> -Kundan

^ permalink raw reply	[flat|nested] 40+ messages in thread