linux-mm.kvack.org archive mirror
* [RFC 0/1] writeback: add sysfs to config the number of writeback contexts
@ 2025-08-25 12:29 ` wangyufei
  2025-08-25 12:29   ` [RFC 1/1] " wangyufei
                     ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: wangyufei @ 2025-08-25 12:29 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Wilcox (Oracle),
	Stephen Rothwell, wangyufei, open list,
	open list:MEMORY MANAGEMENT - MISC, open list:PAGE CACHE
  Cc: kundan.kumar, anuj20.g, hch, bernd, djwong, jack, linux-kernel,
	linux-mm, linux-fsdevel, opensource.kernel

Hi everyone,

We've been interested in this patch about parallelizing writeback [1] 
and have been following its discussion and development. Our testing in 
several application scenarios on mobile devices has shown significant 
performance improvements.

We're currently focusing on how the number of writeback contexts affects
performance on different filesystems and storage workloads. We noticed
the previous discussion about making the number of writeback contexts an
opt-in configuration to adapt to different filesystems [2]. At present it
can only be set via a sysfs interface at system initialization. We'd like
to discuss the possibility of supporting dynamic runtime configuration of
the number of writeback contexts.

We have developed a mechanism that allows the number of writeback contexts 
to be configured at runtime via a sysfs interface. To configure, use: 
echo <nr_wb_ctx> > /sys/class/bdi/<dev>/nwritebacks.
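
For illustration, here is a minimal userspace equivalent of the echo
above (a sketch only; the bdi name "8:0" is an assumed example, and the
target value 16 is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/class/bdi/8:0/nwritebacks";
	const char *req = "16\n";	/* requested nr_wb_ctx, arbitrary */
	char buf[16] = "";
	int fd = open(path, O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Read back the current number of writeback contexts. */
	if (read(fd, buf, sizeof(buf) - 1) > 0)
		printf("current: %s", buf);
	/* Request an increase; values below the current count fail. */
	if (pwrite(fd, req, strlen(req), 0) < 0)
		perror("write");
	close(fd);
	return 0;
}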

Our implementation supports *increasing* the number of writeback contexts.
This is achieved by dynamically allocating new writeback contexts, then
replacing bdi->wb_ctx_arr and updating bdi->nr_wb_ctx. However, we have
not yet solved the problem of safely *reducing* bdi->nr_wb_ctx.

Several challenges remain (a rough sketch of a possible reduction path
follows this list):
 - How should we safely handle ongoing I/Os when contexts are removed?
 - What is the correct way to migrate pending writeback tasks and related 
   resources to other writeback contexts?
 - Should this be a per-device or global setting?
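
For discussion, here is a purely illustrative sketch of what a reduction
path might look like, mirroring the grow path and reusing the
bdi_wb_ctx_exit() helper added in the patch. It assumes readers of
bdi->wb_ctx_arr can be made RCU-safe; migrating pending work off the
dying contexts is exactly the unsolved part, so it is only marked TODO:

static int __bdi_shrink_wb_ctx(struct backing_dev_info *bdi,
			       int nwritebacks)
{
	struct bdi_writeback_ctx **new_ctx_arr, **old_ctx_arr;
	int i, old_nr = bdi->nr_wb_ctx;

	new_ctx_arr = kcalloc(nwritebacks, sizeof(*new_ctx_arr), GFP_KERNEL);
	if (!new_ctx_arr)
		return -ENOMEM;

	/* Keep the first nwritebacks contexts as they are. */
	for (i = 0; i < nwritebacks; i++)
		new_ctx_arr[i] = bdi->wb_ctx_arr[i];

	/*
	 * TODO: migrate pending writeback work and dirty inodes from
	 * contexts [nwritebacks, old_nr) to the surviving contexts.
	 */

	old_ctx_arr = bdi->wb_ctx_arr;
	spin_lock_bh(&bdi_lock);
	bdi->wb_ctx_arr = new_ctx_arr;
	bdi->nr_wb_ctx = nwritebacks;
	spin_unlock_bh(&bdi_lock);

	/* Wait for lookups that may still reference the old array. */
	synchronize_rcu();

	for (i = nwritebacks; i < old_nr; i++) {
		bdi_wb_ctx_exit(bdi, old_ctx_arr[i]);
		kfree(old_ctx_arr[i]);
	}
	kfree(old_ctx_arr);
	return 0;
}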

We're sharing this early implementation to gather feedback on:
 1. Is runtime configurability of writeback contexts a worthwhile goal?
 2. How should we handle synchronization and migration when dynamically
    changing bdi->nr_wb_ctx, particularly when removing active writeback
    contexts? (A reader-side sketch follows this list.)
 3. Are there better tests to validate the stability of this approach?
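
As a concrete starting point for question 2, one direction is RCU on the
reader side. A purely illustrative sketch, assuming bdi->wb_ctx_arr can
be annotated __rcu and with bdi_get_wb_ctx() as a hypothetical lookup
helper (it does not exist in the series):

static struct bdi_writeback_ctx *
bdi_get_wb_ctx(struct backing_dev_info *bdi, unsigned long hash)
{
	struct bdi_writeback_ctx **arr;
	struct bdi_writeback_ctx *ctx;
	int nr;

	rcu_read_lock();
	arr = rcu_dereference(bdi->wb_ctx_arr);
	nr = READ_ONCE(bdi->nr_wb_ctx);
	ctx = arr[hash % nr];
	rcu_read_unlock();

	/*
	 * Note: ctx itself may be freed once the read section ends
	 * unless its lifetime is pinned some other way -- which is
	 * exactly the removal problem described above.
	 */
	return ctx;
}

On the writer side, rcu_assign_pointer() would then replace the bare
smp_wmb() when publishing the new array.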

We look forward to feedback and suggestions for further improvements.

[1] Parallelizing filesystem writeback:
https://lore.kernel.org/linux-fsdevel/20250529111504.89912-1-kundan.kumar@samsung.com/
[2] Discussion on configuring the number of writeback contexts:
https://lore.kernel.org/linux-fsdevel/20250609040056.GA26101@lst.de/

wangyufei (1):
  writeback: add sysfs to config the number of writeback contexts

 include/linux/backing-dev.h |  3 ++
 mm/backing-dev.c            | 59 +++++++++++++++++++++++++++++++++++
 mm/page-writeback.c         | 61 +++++++++++++++++++++++++++++++++++++
 3 files changed, 123 insertions(+)

-- 
2.39.0




* [RFC 1/1] writeback: add sysfs to config the number of writeback contexts
  2025-08-25 12:29 ` [RFC 0/1] writeback: add sysfs to config the number of writeback contexts wangyufei
@ 2025-08-25 12:29   ` wangyufei
  2025-08-25 14:46   ` [RFC 0/1] " David Hildenbrand
  2025-08-29  8:59   ` Kundan Kumar
  2 siblings, 0 replies; 6+ messages in thread
From: wangyufei @ 2025-08-25 12:29 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Wilcox (Oracle),
	Stephen Rothwell, wangyufei, open list,
	open list:MEMORY MANAGEMENT - MISC, open list:PAGE CACHE
  Cc: kundan.kumar, anuj20.g, hch, bernd, djwong, jack, linux-kernel,
	linux-mm, linux-fsdevel, opensource.kernel

The number of writeback contexts is set to the number of CPUs by
default. To test how the number of writeback contexts affects the
writeback performance of filesystems, we introduce a sysfs interface
'nwritebacks' for adjusting bdi->wb_ctx_arr at runtime. For now,
only increasing bdi->nr_wb_ctx is supported; support for reducing
it is still under development.

Signed-off-by: wangyufei <wangyufei@vivo.com>
---
 include/linux/backing-dev.h |  3 ++
 mm/backing-dev.c            | 59 +++++++++++++++++++++++++++++++++++
 mm/page-writeback.c         | 61 +++++++++++++++++++++++++++++++++++++
 3 files changed, 123 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 30a812fbd..c59578c25 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -112,6 +112,7 @@ int bdi_set_max_ratio_no_scale(struct backing_dev_info *bdi, unsigned int max_ra
 int bdi_set_min_bytes(struct backing_dev_info *bdi, u64 min_bytes);
 int bdi_set_max_bytes(struct backing_dev_info *bdi, u64 max_bytes);
 int bdi_set_strict_limit(struct backing_dev_info *bdi, unsigned int strict_limit);
+int bdi_set_nwritebacks(struct backing_dev_info *bdi, int nwritebacks);
 
 /*
  * Flags in backing_dev_info::capability
@@ -128,6 +129,8 @@ int bdi_set_strict_limit(struct backing_dev_info *bdi, unsigned int strict_limit
 extern struct backing_dev_info noop_backing_dev_info;
 
 int bdi_init(struct backing_dev_info *bdi);
+int bdi_wb_ctx_init(struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx);
+void bdi_wb_ctx_exit(struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx);
 
 /**
  * writeback_in_progress - determine whether there is writeback in progress
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index a5b44dd79..44b24c1e4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -469,6 +469,34 @@ static ssize_t strict_limit_show(struct device *dev,
 }
 static DEVICE_ATTR_RW(strict_limit);
 
+static ssize_t nwritebacks_show(struct device *dev,
+			      struct device_attribute *attr,
+			      char *buf)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+
+	return sysfs_emit(buf, "%d\n", bdi->nr_wb_ctx);
+}
+
+static ssize_t nwritebacks_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	int nr;
+	ssize_t ret;
+
+	ret = kstrtoint(buf, 10, &nr);
+	if (ret < 0)
+		return ret;
+
+	ret = bdi_set_nwritebacks(bdi, nr);
+	if (!ret)
+		ret = count;
+
+	return ret;
+}
+static DEVICE_ATTR_RW(nwritebacks);
+
 static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_read_ahead_kb.attr,
 	&dev_attr_min_ratio.attr,
@@ -479,6 +507,7 @@ static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_max_bytes.attr,
 	&dev_attr_stable_pages_required.attr,
 	&dev_attr_strict_limit.attr,
+	&dev_attr_nwritebacks.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(bdi_dev);
@@ -1004,6 +1033,22 @@ static int __init cgwb_init(void)
 }
 subsys_initcall(cgwb_init);
 
+int bdi_wb_ctx_init(struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx)
+{
+	int ret;
+
+	INIT_RADIX_TREE(&bdi_wb_ctx->cgwb_tree, GFP_ATOMIC);
+	mutex_init(&bdi->cgwb_release_mutex);
+	init_rwsem(&bdi_wb_ctx->wb_switch_rwsem);
+
+	ret = wb_init(&bdi_wb_ctx->wb, bdi_wb_ctx, bdi, GFP_KERNEL);
+	if (!ret) {
+		bdi_wb_ctx->wb.memcg_css = &root_mem_cgroup->css;
+		bdi_wb_ctx->wb.blkcg_css = blkcg_root_css;
+	}
+	return ret;
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static int cgwb_bdi_init(struct backing_dev_info *bdi)
@@ -1292,3 +1337,17 @@ const char *bdi_dev_name(struct backing_dev_info *bdi)
 	return bdi->dev_name;
 }
 EXPORT_SYMBOL_GPL(bdi_dev_name);
+
+int bdi_wb_ctx_init(struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx)
+{
+	return wb_init(&bdi_wb_ctx->wb, bdi_wb_ctx, bdi, GFP_KERNEL);
+}
+
+void bdi_wb_ctx_exit(struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx)
+{
+	wb_shutdown(&bdi_wb_ctx->wb);
+	cgwb_bdi_unregister(bdi, bdi_wb_ctx);
+
+	WARN_ON_ONCE(test_bit(WB_registered, &bdi_wb_ctx->wb.state));
+	wb_exit(&bdi_wb_ctx->wb);
+}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 6f283a777..87c77004b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -740,6 +740,59 @@ static int __bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ra
 	return ret;
 }
 
+static int __bdi_set_wb_ctx(struct backing_dev_info *bdi, int nwritebacks)
+{
+	struct bdi_writeback_ctx **new_ctx_arr, **old_ctx_arr;
+	int i, ret;
+
+	new_ctx_arr = kcalloc(nwritebacks, sizeof(struct bdi_writeback_ctx *), GFP_KERNEL);
+	if (!new_ctx_arr)
+		return -ENOMEM;
+
+	for (i = 0; i < min(bdi->nr_wb_ctx, nwritebacks); i++)
+		new_ctx_arr[i] = bdi->wb_ctx_arr[i];
+
+	for (i = bdi->nr_wb_ctx; i < nwritebacks; i++) {
+		new_ctx_arr[i] = kzalloc(sizeof(struct bdi_writeback_ctx),
+					 GFP_KERNEL);
+		if (!new_ctx_arr[i]) {
+			pr_err("Failed to allocate writeback context %d\n", i);
+			while (--i >= bdi->nr_wb_ctx)
+				kfree(new_ctx_arr[i]);
+			kfree(new_ctx_arr);
+			return -ENOMEM;
+		}
+		INIT_LIST_HEAD(&new_ctx_arr[i]->wb_list);
+		init_waitqueue_head(&new_ctx_arr[i]->wb_waitq);
+	}
+
+	for (i = bdi->nr_wb_ctx; i < nwritebacks; i++) {
+		ret = bdi_wb_ctx_init(bdi, new_ctx_arr[i]);
+		if (ret) {
+			while (--i >= bdi->nr_wb_ctx) {
+				bdi_wb_ctx_exit(bdi, new_ctx_arr[i]);
+				kfree(new_ctx_arr[i]);
+			}
+			kfree(new_ctx_arr);
+			return ret;
+		}
+		list_add_tail_rcu(&new_ctx_arr[i]->wb.bdi_node, &new_ctx_arr[i]->wb_list);
+		set_bit(WB_registered, &new_ctx_arr[i]->wb.state);
+	}
+
+	/* Make sure the initialization is done before publishing the new array */
+	smp_wmb();
+
+	old_ctx_arr = bdi->wb_ctx_arr;
+	spin_lock_bh(&bdi_lock);
+	bdi->wb_ctx_arr = new_ctx_arr;
+	bdi->nr_wb_ctx = nwritebacks;
+	spin_unlock_bh(&bdi_lock);
+
+	kfree(old_ctx_arr);
+	return 0;
+}
+
 int bdi_set_min_ratio_no_scale(struct backing_dev_info *bdi, unsigned int min_ratio)
 {
 	return __bdi_set_min_ratio(bdi, min_ratio);
@@ -818,6 +871,14 @@ int bdi_set_strict_limit(struct backing_dev_info *bdi, unsigned int strict_limit
 	return 0;
 }
 
+int bdi_set_nwritebacks(struct backing_dev_info *bdi, int nwritebacks)
+{
+	if (nwritebacks < bdi->nr_wb_ctx)
+		return -EINVAL;
+
+	return __bdi_set_wb_ctx(bdi, nwritebacks);
+}
+
 static unsigned long dirty_freerun_ceiling(unsigned long thresh,
 					   unsigned long bg_thresh)
 {
-- 
2.39.0




* Re: [RFC 0/1] writeback: add sysfs to config the number of writeback contexts
  2025-08-25 12:29 ` [RFC 0/1] writeback: add sysfs to config the number of writeback contexts wangyufei
  2025-08-25 12:29   ` [RFC 1/1] " wangyufei
@ 2025-08-25 14:46   ` David Hildenbrand
  2025-08-25 16:15     ` Matthew Wilcox
  2025-08-29  8:59   ` Kundan Kumar
  2 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand @ 2025-08-25 14:46 UTC (permalink / raw)
  To: wangyufei, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Matthew Wilcox (Oracle), Stephen Rothwell, open list,
	open list:MEMORY MANAGEMENT - MISC, open list:PAGE CACHE
  Cc: kundan.kumar, anuj20.g, hch, bernd, djwong, jack,
	opensource.kernel

On 25.08.25 14:29, wangyufei wrote:
> Hi everyone,
> 
> We've been interested in this patch about parallelizing writeback [1]
> and have been following its discussion and development. Our testing in
> several application scenarios on mobile devices has shown significant
> performance improvements.
> 
> We're currently focusing on how the number of writeback contexts affects
> performance on different filesystems and storage workloads. We noticed
> the previous discussion about making the number of writeback contexts an
> opt-in configuration to adapt to different filesystems [2]. At present it
> can only be set via a sysfs interface at system initialization. We'd like
> to discuss the possibility of supporting dynamic runtime configuration of
> the number of writeback contexts.
> 
> We have developed a mechanism that allows the number of writeback contexts
> to be configured at runtime via a sysfs interface. To configure, use:
> echo <nr_wb_ctx> > /sys/class/bdi/<dev>/nwritebacks.

What's the target use case for updating it dynamically?

If it's mostly for debugging/testing (find out what works, what 
doesn't), it might better go into debugfs or just carried out of tree.

If it's about setting sane default based on specific filesystems, maybe 
it could be optimized from within the kernel, without the need to expose 
this to an admin?

-- 
Cheers

David / dhildenb




* Re: [RFC 0/1] writeback: add sysfs to config the number of writeback contexts
  2025-08-25 14:46   ` [RFC 0/1] " David Hildenbrand
@ 2025-08-25 16:15     ` Matthew Wilcox
  0 siblings, 0 replies; 6+ messages in thread
From: Matthew Wilcox @ 2025-08-25 16:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: wangyufei, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Stephen Rothwell, open list, open list:MEMORY MANAGEMENT - MISC,
	open list:PAGE CACHE, kundan.kumar, anuj20.g, hch, bernd, djwong,
	jack, opensource.kernel

On Mon, Aug 25, 2025 at 04:46:46PM +0200, David Hildenbrand wrote:
> On 25.08.25 14:29, wangyufei wrote:
> > Hi everyone,
> > 
> > We've been interested in this patch about parallelizing writeback [1]
> > and have been following its discussion and development. Our testing in
> > several application scenarios on mobile devices has shown significant
> > performance improvements.
> > 
> > We're currently focusing on how the number of writeback contexts affects
> > performance on different filesystems and storage workloads. We noticed
> > the previous discussion about making the number of writeback contexts an
> > opt-in configuration to adapt to different filesystems [2]. At present it
> > can only be set via a sysfs interface at system initialization. We'd like
> > to discuss the possibility of supporting dynamic runtime configuration of
> > the number of writeback contexts.
> > 
> > We have developed a mechanism that allows the number of writeback contexts
> > to be configured at runtime via a sysfs interface. To configure, use:
> > echo <nr_wb_ctx> > /sys/class/bdi/<dev>/nwritebacks.
> 
> What's the target use case for updating it dynamically?
> 
> If it's mostly for debugging/testing (find out what works, what doesn't), it
> might better go into debugfs or just carried out of tree.
> 
> If it's about setting sane default based on specific filesystems, maybe it
> could be optimized from within the kernel, without the need to expose this
> to an admin?

I was assuming that this patch is for people who are experimenting to
gather data more effectively.  I'd NAK it being included, but it's good
to have it out on the list so other people don't have to reinvent it.



* Re: [RFC 0/1] writeback: add sysfs to config the number of writeback contexts
  2025-08-25 12:29 ` [RFC 0/1] writeback: add sysfs to config the number of writeback contexts wangyufei
  2025-08-25 12:29   ` [RFC 1/1] " wangyufei
  2025-08-25 14:46   ` [RFC 0/1] " David Hildenbrand
@ 2025-08-29  8:59   ` Kundan Kumar
  2025-09-02 11:19     ` wangyufei
  2 siblings, 1 reply; 6+ messages in thread
From: Kundan Kumar @ 2025-08-29  8:59 UTC (permalink / raw)
  To: wangyufei, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Wilcox (Oracle),
	Stephen Rothwell, open list, open list:MEMORY MANAGEMENT - MISC,
	open list:PAGE CACHE
  Cc: anuj20.g, hch, bernd, djwong, jack, opensource.kernel

On 8/25/2025 5:59 PM, wangyufei wrote:
> Hi everyone,
> 
> We've been interested in this patch about parallelizing writeback [1]
> and have been following its discussion and development. Our testing in
> several application scenarios on mobile devices has shown significant
> performance improvements.
> 

Hi,

Thanks for sharing this work.

Could you clarify a few details about your test setup?

- Which filesystem did you run these experiments on?
- What were the specifics of the workload (number of threads, block size,
   I/O size)?
- If you are using fio, can you please share the fio command.
- How much RAM was available on the test system?
- Can you share the performance improvement numbers you observed?

That would help in understanding the impact of parallel writeback.

I made similar modifications to dynamically configure the number of
writeback threads in this experimental patch. Refer to patches 14 and 15:
https://lore.kernel.org/all/20250807045706.2848-1-kundan.kumar@samsung.com/
The key difference is that this change also enables a reduction in the
number of writeback threads.

Thanks,
Kundan





* Re: [RFC 0/1] writeback: add sysfs to config the number of writeback contexts
  2025-08-29  8:59   ` Kundan Kumar
@ 2025-09-02 11:19     ` wangyufei
  0 siblings, 0 replies; 6+ messages in thread
From: wangyufei @ 2025-09-02 11:19 UTC (permalink / raw)
  To: Kundan Kumar, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Wilcox (Oracle),
	Stephen Rothwell, open list, open list:MEMORY MANAGEMENT - MISC,
	open list:PAGE CACHE
  Cc: anuj20.g, hch, bernd, djwong, jack, opensource.kernel


On 8/29/2025 4:59 PM, Kundan Kumar wrote:
> On 8/25/2025 5:59 PM, wangyufei wrote:
>> Hi everyone,
>>
>> We've been interested in this patch about parallelizing writeback [1]
>> and have been following its discussion and development. Our testing in
>> several application scenarios on mobile devices has shown significant
>> performance improvements.
>>
> Hi,
>
> Thanks for sharing this work.
>
> Could you clarify a few details about your test setup?
>
> - Which filesystem did you run these experiments on?
> - What were the specifics of the workload (number of threads, block size,
>     I/O size)?
> - If you are using fio, can you please share the fio command.
> - How much RAM was available on the test system?
> - Can you share the performance improvement numbers you observed?
>
> That would help in understanding the impact of parallel writeback.
Hi Kundan,

We tested this patch mostly on mobile devices. The test platform
setup is as follows:

- filesystem: F2FS

- system config:
Number of CPUs = 8
System RAM = 11 GB

- workload & fio: We used the same fio command as mentioned in your
patch. fio command line:
fio --directory=/mnt --name=test --bs=4k --iodepth=1024 --rw=randwrite
--ioengine=io_uring --time_based=1 --runtime=60 --numjobs=8 --size=450M
--direct=0 --eta-interval=1 --eta-newline=1 --group_reporting

- Performance gains:
Base F2FS               :  973 MiB/s
Parallel Writeback F2FS : 1237 MiB/s (+27%)
>
> I made similar modifications to dynamically configure the number of
> writeback threads in this experimental patch. Refer to patches 14 and 15:
> https://lore.kernel.org/all/20250807045706.2848-1-kundan.kumar@samsung.com/
> The key difference is that this change also enables a reduction in the
> number of writeback threads.
Thanks for sharing the patch. I have a few questions:
- The current approach freezes the filesystem and reallocates all 
writeback_ctx structures. Could this introduce latency? In some cases, I 
think the existing bdi_writeback_ctx structures could be reused instead.
- Are there other use cases for dynamic thread tuning besides 
initialization and testing?
- What methods are used to test the stability of this functionality?

Finally, are there any remaining problems or optimization directions
worth discussing for parallelizing filesystem writeback?


Thanks,

yufei




