[PATCH v1] fuse: enable large folios

* [PATCH v1] fuse: enable large folios
@ 2026-06-24  1:21 Joanne Koong
  2026-06-24  4:34 ` Jingbo Xu
  2026-06-24  6:16 ` Horst Birthelmer
  0 siblings, 2 replies; 7+ messages in thread
From: Joanne Koong @ 2026-06-24  1:21 UTC (permalink / raw)
  To: miklos; +Cc: jefflexu, horst, fuse-devel

Enable large folios, capping the max order at the largest request fuse
can issue, so a folio always fits within a single request. The order
range minimum is 0, so under memory pressure the allocator falls back to
smaller folios.

Benchmarks (libfuse passthrough_hp, buffered fio, single job, 4 GiB
file, medians, NUMA-pinned, performance governor, strictlimiting on by
default):

tmpfs backing (page-cache bound):
  workload               bs     large folios off   large folios on   delta
  seq read, cold         128k   3110 MiB/s         4514 MiB/s         +45%
  seq read, cold         1M     3079 MiB/s         5181 MiB/s         +68%
  seq read, warm         128k   2438 MiB/s         4486 MiB/s         +84%
  seq read, warm         1M     2403 MiB/s         5123 MiB/s        +113%
  writeback write, seq   128k   1211 MiB/s         1699 MiB/s         +40%
  writeback write, seq   1M     1462 MiB/s         2208 MiB/s         +51%
  writeback write, rand  128k   1101 MiB/s         1757 MiB/s         +60% +
  writeback write, rand  1M     1284 MiB/s         2228 MiB/s         +74% +

xfs on NVMe backing (device bound for cold I/O):
  workload               bs     large folios off   large folios on   delta
  seq read, cold         128k   2030 MiB/s         2172 MiB/s          +7% *
  seq read, cold         1M     1999 MiB/s         2181 MiB/s          +9% *
  seq read, warm         128k   2451 MiB/s         4939 MiB/s        +101%
  seq read, warm         1M     2340 MiB/s         5639 MiB/s        +141%
  writeback write, seq   128k    637 MiB/s          747 MiB/s         +17% *
  writeback write, seq   1M      694 MiB/s          833 MiB/s         +20% *
  writeback write, rand  128k   1004 MiB/s         1648 MiB/s         +64% +
  writeback write, rand  1M     1171 MiB/s         2055 MiB/s         +75% +

(*) device-bandwidth bound. Not much throughput gain but system cpu
utilization was roughly halved
(+) random write was tested as an overwrite of a hot region (under
writeback, this is page-cache bound, so the gain comes from lower
per-folio cpu overhead rather than higher backing-device throughput)

Random reads (4k and 128k) and writethrough writes were neutral with
no regression (no read-modify-write or read-amplification penalty from
large folios)

More information about the benchmark setup and results is in
https://github.com/joannekoong/linux/commits/fuse_large_folios_benchmarks/

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
This has a dependency on the iomap uptodate helpers that were submitted to
Christian's vfs tree [1]. If it's easier to route this patch through
Christian's tree, I can resubmit this.

[1] https://lore.kernel.org/linux-fsdevel/20260623202843.2064992-1-joannelkoong@gmail.com/

 fs/fuse/file.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index cb8da4c06d17..3c9be6d8ede1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3136,4 +3136,14 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
 
 	if (IS_ENABLED(CONFIG_FUSE_DAX))
 		fuse_dax_inode_init(inode, flags);
+
+	if (!FUSE_IS_DAX(inode)) {
+		unsigned int max_pages = min(min(fc->max_write,
+						 fc->max_read) >> PAGE_SHIFT,
+					     fc->max_pages);
+
+		if (max_pages)
+			mapping_set_folio_order_range(inode->i_mapping, 0,
+						      ilog2(max_pages));
+	}
 }
-- 
2.52.0
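
For readers following along, the order computation in the hunk above can be tried in userspace. This is a minimal sketch, not the kernel code: floor_log2() and max_folio_order() are hypothetical helpers standing in for ilog2() and the inline logic in fuse_init_file_inode(), and PAGE_SHIFT is assumed to be 12 (4 KiB pages).

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical userspace stand-in for ilog2(): floor(log2(n)), n must be > 0. */
static unsigned int floor_log2(unsigned int n)
{
	unsigned int order = 0;

	assert(n > 0);
	while (n >>= 1)
		order++;
	return order;
}

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Mirrors the patch: the largest request is bounded by max_write, max_read
 * (both in bytes) and max_pages (in pages). Returns the maximum folio order,
 * or -1 when not even one page fits, in which case the patch skips the
 * mapping_set_folio_order_range() call entirely.
 */
static int max_folio_order(unsigned int max_write, unsigned int max_read,
			   unsigned int max_pages_limit)
{
	unsigned int max_pages = min_u(min_u(max_write, max_read) >> PAGE_SHIFT,
				       max_pages_limit);

	if (!max_pages)
		return -1;
	return (int)floor_log2(max_pages);
}
```

Under these assumptions, 1 MiB read and write limits with a 256-page cap give max_folio_order(1 << 20, 1 << 20, 256) == 8, so the largest folio is 256 pages (1 MiB). A 48-page cap rounds down to order 5 (32 pages), which is what keeps a folio within a single request.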



* Re: [PATCH v1] fuse: enable large folios
  2026-06-24  1:21 [PATCH v1] fuse: enable large folios Joanne Koong
@ 2026-06-24  4:34 ` Jingbo Xu
  2026-06-24  6:10   ` Horst Birthelmer
  2026-06-24  6:16 ` Horst Birthelmer
  1 sibling, 1 reply; 7+ messages in thread
From: Jingbo Xu @ 2026-06-24  4:34 UTC (permalink / raw)
  To: Joanne Koong, miklos; +Cc: horst, fuse-devel



On 6/24/26 9:21 AM, Joanne Koong wrote:
> Enable large folios, capping the max order at the largest request fuse
> can issue, so a folio always fits within a single request. The order
> range minimum is 0, so under memory pressure the allocator falls back to
> smaller folios.
> 
> Benchmarks (libfuse passthrough_hp, buffered fio, single job, 4 GiB
> file, medians, NUMA-pinned, performance governor, strictlimiting on by
> default):
> 
> tmpfs backing (page-cache bound):
>   workload          bs      large folios off   on        delta
>   seq read,  cold,  128k    3110 MiB/s    4514 MiB/s     +45%
>   seq read,  cold,  1M      3079 MiB/s    5181 MiB/s     +68%
>   seq read,  warm,  128k    2438 MiB/s    4486 MiB/s     +84%
>   seq read,  warm,  1M      2403 MiB/s    5123 MiB/s    +113%
>   writeback write, seq,128k 1211 MiB/s    1699 MiB/s     +40%
>   writeback write, seq, 1M  1462 MiB/s    2208 MiB/s     +51%
>   writeback write, rand,128k 1101 MiB/s   1757 MiB/s     +60% +
>   writeback write, rand, 1M 1284 MiB/s    2228 MiB/s     +74% +
> 
> xfs on NVMe backing (device bound for cold I/O):
>   workload          bs      large folios off   on        delta
>   seq read,  cold,  128k    2030 MiB/s    2172 MiB/s      +7% *
>   seq read,  cold,  1M      1999 MiB/s    2181 MiB/s      +9% *
>   seq read,  warm,  128k    2451 MiB/s    4939 MiB/s    +101%
>   seq read,  warm,  1M      2340 MiB/s    5639 MiB/s    +141%
>   writeback write, seq,128k  637 MiB/s     747 MiB/s     +17% *
>   writeback write, seq, 1M   694 MiB/s     833 MiB/s     +20% *
>   writeback write, rand,128k 1004 MiB/s   1648 MiB/s     +64% +
>   writeback write, rand, 1M 1171 MiB/s    2055 MiB/s     +75% +
> 
> (*) device-bandwidth bound. Not much throughput gain but system cpu
> utilization was roughly halved
> (+) random write was tested as an overwrite of a hot region (under
> writeback, this is page-cache bound, so the gain comes from lower
> per-folio cpu overhead rather than higher backing-device throughput)
> 
> Random reads (4k and 128k) and writethrough writes were neutral with
> no regression (no read-modify-write or read-amplification penalty from
> large folios)
> 
> More information about the benchmark setup and results are in
> https://github.com/joannekoong/linux/commits/fuse_large_folios_benchmarks/
> 
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> This has a dependency on the iomap uptodate helpers that were submitted to
> Christian's vfs tree [1]. If it's easier to route this patch through
> Christian's tree, I can resubmit this.
> 
> [1] https://lore.kernel.org/linux-fsdevel/20260623202843.2064992-1-joannelkoong@gmail.com/
> 
>  fs/fuse/file.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index cb8da4c06d17..3c9be6d8ede1 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3136,4 +3136,14 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
>  
>  	if (IS_ENABLED(CONFIG_FUSE_DAX))
>  		fuse_dax_inode_init(inode, flags);
> +
> +	if (!FUSE_IS_DAX(inode)) {
> +		unsigned int max_pages = min(min(fc->max_write,
> +						 fc->max_read) >> PAGE_SHIFT,
> +					     fc->max_pages);
> +
> +		if (max_pages)
> +			mapping_set_folio_order_range(inode->i_mapping, 0,
> +						      ilog2(max_pages));
> +	}
>  }

mapping_set_folio_order_range(..., 0, 0) seems harmless even when
max_pages is 0.

Anyway

Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>



-- 
Thanks,
Jingbo



* Re: Re: [PATCH v1] fuse: enable large folios
  2026-06-24  4:34 ` Jingbo Xu
@ 2026-06-24  6:10   ` Horst Birthelmer
  2026-06-24  7:28     ` Jingbo Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Horst Birthelmer @ 2026-06-24  6:10 UTC (permalink / raw)
  To: Jingbo Xu; +Cc: Joanne Koong, miklos, fuse-devel

On Wed, Jun 24, 2026 at 12:34:16PM +0800, Jingbo Xu wrote:
> 
> 
> On 6/24/26 9:21 AM, Joanne Koong wrote:
> > Enable large folios, capping the max order at the largest request fuse
> > can issue, so a folio always fits within a single request. The order
> > range minimum is 0, so under memory pressure the allocator falls back to
> > smaller folios.
> > 
> > Benchmarks (libfuse passthrough_hp, buffered fio, single job, 4 GiB
> > file, medians, NUMA-pinned, performance governor, strictlimiting on by
> > default):
> > 
> > tmpfs backing (page-cache bound):
> >   workload          bs      large folios off   on        delta
> >   seq read,  cold,  128k    3110 MiB/s    4514 MiB/s     +45%
> >   seq read,  cold,  1M      3079 MiB/s    5181 MiB/s     +68%
> >   seq read,  warm,  128k    2438 MiB/s    4486 MiB/s     +84%
> >   seq read,  warm,  1M      2403 MiB/s    5123 MiB/s    +113%
> >   writeback write, seq,128k 1211 MiB/s    1699 MiB/s     +40%
> >   writeback write, seq, 1M  1462 MiB/s    2208 MiB/s     +51%
> >   writeback write, rand,128k 1101 MiB/s   1757 MiB/s     +60% +
> >   writeback write, rand, 1M 1284 MiB/s    2228 MiB/s     +74% +
> > 
> > xfs on NVMe backing (device bound for cold I/O):
> >   workload          bs      large folios off   on        delta
> >   seq read,  cold,  128k    2030 MiB/s    2172 MiB/s      +7% *
> >   seq read,  cold,  1M      1999 MiB/s    2181 MiB/s      +9% *
> >   seq read,  warm,  128k    2451 MiB/s    4939 MiB/s    +101%
> >   seq read,  warm,  1M      2340 MiB/s    5639 MiB/s    +141%
> >   writeback write, seq,128k  637 MiB/s     747 MiB/s     +17% *
> >   writeback write, seq, 1M   694 MiB/s     833 MiB/s     +20% *
> >   writeback write, rand,128k 1004 MiB/s   1648 MiB/s     +64% +
> >   writeback write, rand, 1M 1171 MiB/s    2055 MiB/s     +75% +
> > 
> > (*) device-bandwidth bound. Not much throughput gain but system cpu
> > utilization was roughly halved
> > (+) random write was tested as an overwrite of a hot region (under
> > writeback, this is page-cache bound, so the gain comes from lower
> > per-folio cpu overhead rather than higher backing-device throughput)
> > 
> > Random reads (4k and 128k) and writethrough writes were neutral with
> > no regression (no read-modify-write or read-amplification penalty from
> > large folios)
> > 
> > More information about the benchmark setup and results are in
> > https://github.com/joannekoong/linux/commits/fuse_large_folios_benchmarks/
> > 
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > This has a dependency on the iomap uptodate helpers that were submitted to
> > Christian's vfs tree [1]. If it's easier to route this patch through
> > Christian's tree, I can resubmit this.
> > 
> > [1] https://lore.kernel.org/linux-fsdevel/20260623202843.2064992-1-joannelkoong@gmail.com/
> > 
> >  fs/fuse/file.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index cb8da4c06d17..3c9be6d8ede1 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -3136,4 +3136,14 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
> >  
> >  	if (IS_ENABLED(CONFIG_FUSE_DAX))
> >  		fuse_dax_inode_init(inode, flags);
> > +
> > +	if (!FUSE_IS_DAX(inode)) {
> > +		unsigned int max_pages = min(min(fc->max_write,
> > +						 fc->max_read) >> PAGE_SHIFT,
> > +					     fc->max_pages);
> > +
> > +		if (max_pages)
> > +			mapping_set_folio_order_range(inode->i_mapping, 0,
> > +						      ilog2(max_pages));
> > +	}
> >  }
> 
> mapping_set_folio_order_range(..., 0, 0) seems harmless even when
> max_pages is 0.
> 
> Anyway
> 
> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> 

I was just about to suggest something like

mapping_set_folio_order_range(inode->i_mapping, 0,
			      ilog2(max_pages ?: 1));

but you are right, it's not a problem.
> 
> -- 
> Thanks,
> Jingbo
> 

This was actually one of the problems I ran into when I enabled large folios
in Linux 6.17, since it would create problems for readahead.

Reviewed-by: Horst Birthelmer <hbirthelmer@ddn.com>

---
Thanks,
Horst


* Re: [PATCH v1] fuse: enable large folios
  2026-06-24  1:21 [PATCH v1] fuse: enable large folios Joanne Koong
  2026-06-24  4:34 ` Jingbo Xu
@ 2026-06-24  6:16 ` Horst Birthelmer
  2026-06-24 17:52   ` Joanne Koong
  1 sibling, 1 reply; 7+ messages in thread
From: Horst Birthelmer @ 2026-06-24  6:16 UTC (permalink / raw)
  To: Joanne Koong; +Cc: miklos, jefflexu, fuse-devel

On Tue, Jun 23, 2026 at 06:21:32PM -0700, Joanne Koong wrote:
> Enable large folios, capping the max order at the largest request fuse
> can issue, so a folio always fits within a single request. The order
> range minimum is 0, so under memory pressure the allocator falls back to
> smaller folios.
> 
> Benchmarks (libfuse passthrough_hp, buffered fio, single job, 4 GiB
> file, medians, NUMA-pinned, performance governor, strictlimiting on by
> default):
> 
> tmpfs backing (page-cache bound):
>   workload          bs      large folios off   on        delta
>   seq read,  cold,  128k    3110 MiB/s    4514 MiB/s     +45%
>   seq read,  cold,  1M      3079 MiB/s    5181 MiB/s     +68%
>   seq read,  warm,  128k    2438 MiB/s    4486 MiB/s     +84%
>   seq read,  warm,  1M      2403 MiB/s    5123 MiB/s    +113%
>   writeback write, seq,128k 1211 MiB/s    1699 MiB/s     +40%
>   writeback write, seq, 1M  1462 MiB/s    2208 MiB/s     +51%
>   writeback write, rand,128k 1101 MiB/s   1757 MiB/s     +60% +
>   writeback write, rand, 1M 1284 MiB/s    2228 MiB/s     +74% +
> 
> xfs on NVMe backing (device bound for cold I/O):
>   workload          bs      large folios off   on        delta
>   seq read,  cold,  128k    2030 MiB/s    2172 MiB/s      +7% *
>   seq read,  cold,  1M      1999 MiB/s    2181 MiB/s      +9% *
>   seq read,  warm,  128k    2451 MiB/s    4939 MiB/s    +101%
>   seq read,  warm,  1M      2340 MiB/s    5639 MiB/s    +141%
>   writeback write, seq,128k  637 MiB/s     747 MiB/s     +17% *
>   writeback write, seq, 1M   694 MiB/s     833 MiB/s     +20% *
>   writeback write, rand,128k 1004 MiB/s   1648 MiB/s     +64% +
>   writeback write, rand, 1M 1171 MiB/s    2055 MiB/s     +75% +
> 

Hi Joanne,

Just out of curiosity, did you disable bdi strict limiting for this?
In my tests, especially the large writes run into throttling pretty
fast, so that it effectively writes pagewise, which was not the target
of the test.

Thanks,
Horst


* Re: [PATCH v1] fuse: enable large folios
  2026-06-24  6:10   ` Horst Birthelmer
@ 2026-06-24  7:28     ` Jingbo Xu
  0 siblings, 0 replies; 7+ messages in thread
From: Jingbo Xu @ 2026-06-24  7:28 UTC (permalink / raw)
  To: Horst Birthelmer; +Cc: Joanne Koong, miklos, fuse-devel



On 6/24/26 2:10 PM, Horst Birthelmer wrote:
> On Wed, Jun 24, 2026 at 12:34:16PM +0800, Jingbo Xu wrote:
>>
>>
>> On 6/24/26 9:21 AM, Joanne Koong wrote:
>>> Enable large folios, capping the max order at the largest request fuse
>>> can issue, so a folio always fits within a single request. The order
>>> range minimum is 0, so under memory pressure the allocator falls back to
>>> smaller folios.
>>>
>>> Benchmarks (libfuse passthrough_hp, buffered fio, single job, 4 GiB
>>> file, medians, NUMA-pinned, performance governor, strictlimiting on by
>>> default):
>>>
>>> tmpfs backing (page-cache bound):
>>>   workload          bs      large folios off   on        delta
>>>   seq read,  cold,  128k    3110 MiB/s    4514 MiB/s     +45%
>>>   seq read,  cold,  1M      3079 MiB/s    5181 MiB/s     +68%
>>>   seq read,  warm,  128k    2438 MiB/s    4486 MiB/s     +84%
>>>   seq read,  warm,  1M      2403 MiB/s    5123 MiB/s    +113%
>>>   writeback write, seq,128k 1211 MiB/s    1699 MiB/s     +40%
>>>   writeback write, seq, 1M  1462 MiB/s    2208 MiB/s     +51%
>>>   writeback write, rand,128k 1101 MiB/s   1757 MiB/s     +60% +
>>>   writeback write, rand, 1M 1284 MiB/s    2228 MiB/s     +74% +
>>>
>>> xfs on NVMe backing (device bound for cold I/O):
>>>   workload          bs      large folios off   on        delta
>>>   seq read,  cold,  128k    2030 MiB/s    2172 MiB/s      +7% *
>>>   seq read,  cold,  1M      1999 MiB/s    2181 MiB/s      +9% *
>>>   seq read,  warm,  128k    2451 MiB/s    4939 MiB/s    +101%
>>>   seq read,  warm,  1M      2340 MiB/s    5639 MiB/s    +141%
>>>   writeback write, seq,128k  637 MiB/s     747 MiB/s     +17% *
>>>   writeback write, seq, 1M   694 MiB/s     833 MiB/s     +20% *
>>>   writeback write, rand,128k 1004 MiB/s   1648 MiB/s     +64% +
>>>   writeback write, rand, 1M 1171 MiB/s    2055 MiB/s     +75% +
>>>
>>> (*) device-bandwidth bound. Not much throughput gain but system cpu
>>> utilization was roughly halved
>>> (+) random write was tested as an overwrite of a hot region (under
>>> writeback, this is page-cache bound, so the gain comes from lower
>>> per-folio cpu overhead rather than higher backing-device throughput)
>>>
>>> Random reads (4k and 128k) and writethrough writes were neutral with
>>> no regression (no read-modify-write or read-amplification penalty from
>>> large folios)
>>>
>>> More information about the benchmark setup and results are in
>>> https://github.com/joannekoong/linux/commits/fuse_large_folios_benchmarks/
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> ---
>>> This has a dependency on the iomap uptodate helpers that were submitted to
>>> Christian's vfs tree [1]. If it's easier to route this patch through
>>> Christian's tree, I can resubmit this.
>>>
>>> [1] https://lore.kernel.org/linux-fsdevel/20260623202843.2064992-1-joannelkoong@gmail.com/
>>>
>>>  fs/fuse/file.c | 10 ++++++++++
>>>  1 file changed, 10 insertions(+)
>>>
>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>> index cb8da4c06d17..3c9be6d8ede1 100644
>>> --- a/fs/fuse/file.c
>>> +++ b/fs/fuse/file.c
>>> @@ -3136,4 +3136,14 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
>>>  
>>>  	if (IS_ENABLED(CONFIG_FUSE_DAX))
>>>  		fuse_dax_inode_init(inode, flags);
>>> +
>>> +	if (!FUSE_IS_DAX(inode)) {
>>> +		unsigned int max_pages = min(min(fc->max_write,
>>> +						 fc->max_read) >> PAGE_SHIFT,
>>> +					     fc->max_pages);
>>> +
>>> +		if (max_pages)
>>> +			mapping_set_folio_order_range(inode->i_mapping, 0,
>>> +						      ilog2(max_pages));
>>> +	}
>>>  }
>>
>> mapping_set_folio_order_range(..., 0, 0) seems harmless even when
>> max_pages is 0.
>>
>> Anyway
>>
>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>
> 
> I was just about to suggest something like
> 
> mapping_set_folio_order_range(inode->i_mapping, 0,
> 			      ilog2(max_pages ?: 1));
> 
> but you are right, it's not a problem.
>>
>> -- 
>> Thanks,
>> Jingbo
>>
> 
> This was actually one of the problems I ran into when I enabled large folios
> in Linux 6.17, since it would create problems for readahead.
> 
> Reviewed-by: Horst Birthelmer <hbirthelmer@ddn.com>
> 

Okay, I got it: ilog2(0) does not equal 0.


-- 
Thanks,
Jingbo
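
The point about ilog2(0) can be illustrated with a small userspace model. This is a sketch under a stated assumption: that the kernel's runtime ilog2() path behaves like fls(n) - 1, so an input of 0 yields -1. ilog2_model() and guarded_order() are hypothetical names, and the guard is written in standard C rather than with the GNU `?:` shorthand quoted above.

```c
#include <assert.h>

/*
 * Hypothetical model of ilog2() as fls(n) - 1 (assumption, see lead-in):
 * counts the bit position of the highest set bit, starting from -1, so
 * n == 0 returns -1 rather than 0.
 */
static int ilog2_model(unsigned int n)
{
	int order = -1;

	while (n) {
		n >>= 1;
		order++;
	}
	return order;
}

/*
 * Equivalent of ilog2(max_pages ?: 1): a zero page count is treated as one
 * page, so the resulting order is never negative.
 */
static int guarded_order(unsigned int max_pages)
{
	int order = ilog2_model(max_pages ? max_pages : 1);

	assert(order >= 0);	/* holds because the argument is never 0 */
	return order;
}
```

In this model, ilog2_model(0) returns -1, which would turn into a very large value if passed as an unsigned maximum order. Both the `if (max_pages)` check in the patch and the `?: 1` guard avoid that case; guarded_order(0) yields 0 and guarded_order(256) yields 8.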



* Re: [PATCH v1] fuse: enable large folios
  2026-06-24  6:16 ` Horst Birthelmer
@ 2026-06-24 17:52   ` Joanne Koong
  2026-06-25  7:17     ` Horst Birthelmer
  0 siblings, 1 reply; 7+ messages in thread
From: Joanne Koong @ 2026-06-24 17:52 UTC (permalink / raw)
  To: Horst Birthelmer; +Cc: miklos, jefflexu, fuse-devel

On Tue, Jun 23, 2026 at 11:16 PM Horst Birthelmer <horst@birthelmer.de> wrote:
>
> On Tue, Jun 23, 2026 at 06:21:32PM -0700, Joanne Koong wrote:
> > Enable large folios, capping the max order at the largest request fuse
> > can issue, so a folio always fits within a single request. The order
> > range minimum is 0, so under memory pressure the allocator falls back to
> > smaller folios.
> >
> > Benchmarks (libfuse passthrough_hp, buffered fio, single job, 4 GiB
> > file, medians, NUMA-pinned, performance governor, strictlimiting on by
> > default):
> >
> > tmpfs backing (page-cache bound):
> >   workload          bs      large folios off   on        delta
> >   seq read,  cold,  128k    3110 MiB/s    4514 MiB/s     +45%
> >   seq read,  cold,  1M      3079 MiB/s    5181 MiB/s     +68%
> >   seq read,  warm,  128k    2438 MiB/s    4486 MiB/s     +84%
> >   seq read,  warm,  1M      2403 MiB/s    5123 MiB/s    +113%
> >   writeback write, seq,128k 1211 MiB/s    1699 MiB/s     +40%
> >   writeback write, seq, 1M  1462 MiB/s    2208 MiB/s     +51%
> >   writeback write, rand,128k 1101 MiB/s   1757 MiB/s     +60% +
> >   writeback write, rand, 1M 1284 MiB/s    2228 MiB/s     +74% +
> >
> > xfs on NVMe backing (device bound for cold I/O):
> >   workload          bs      large folios off   on        delta
> >   seq read,  cold,  128k    2030 MiB/s    2172 MiB/s      +7% *
> >   seq read,  cold,  1M      1999 MiB/s    2181 MiB/s      +9% *
> >   seq read,  warm,  128k    2451 MiB/s    4939 MiB/s    +101%
> >   seq read,  warm,  1M      2340 MiB/s    5639 MiB/s    +141%
> >   writeback write, seq,128k  637 MiB/s     747 MiB/s     +17% *
> >   writeback write, seq, 1M   694 MiB/s     833 MiB/s     +20% *
> >   writeback write, rand,128k 1004 MiB/s   1648 MiB/s     +64% +
> >   writeback write, rand, 1M 1171 MiB/s    2055 MiB/s     +75% +
> >
>
> Hi Joanne,
>
> just out of curiosity, did you disable bdi strict limiting for this?

Hi Horst,

Those results are with strictlimiting on. After commit 494d2f508883
('fuse: use default writeback accounting') [1], I didn't see any
performance regressions anymore with large folios + strictlimiting on.
More information on why that commit fixed the issue is in [2].

When I ran the benchmarks last week with strictlimiting off, I saw roughly:
    tmpfs:
      seq,  128k    1174 -> 1648 MiB/s    +40%
      seq,  1M      1261 -> 1845 MiB/s    +46%
      rand, 128k    1148 -> 1638 MiB/s    +43%
      rand, 1M      1273 -> 2065 MiB/s    +62%

    xfs on NVMe:
      seq,  128k     621 ->  740 MiB/s    +19%
      seq,  1M       649 ->  776 MiB/s    +20%
      rand, 128k    1020 -> 1515 MiB/s    +49%
      rand, 1M      1125 -> 1895 MiB/s    +68%

Strict limiting on actually had better performance here, which I think
is because with the small dirty limit, the dirtying and the writeback
happen in parallel instead of more dirty pages accumulating and then
writeback getting kicked off. Because the backing device is so fast,
it didn't cost throughput for the dirtying and the writeback to happen
concurrently. The benchmarks were run with fsync, so everything had to
be flushed before the fio run returned. If the backing device were slow
and writes were bursty and fsync wasn't enforced, I think there'd
probably be better performance with strictlimiting off than on, since
the writer would be throttled to the speed of the backing device with
strictlimiting on.

> In my tests, especially the large writes run into throttling pretty
> fast, so that it effectively writes pagewise, which was not the target
> of the test.

Are you seeing this on your system with strictlimiting on or off? Is
that with commit 494d2f508883 in your tree? What test and server were
you running? Do you know what the speed of the backing device is?

Thanks,
Joanne

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/fuse/file.c?id=494d2f508883a6e5c4530e5c6b3c8b2bbfb7318d
[2] https://lore.kernel.org/linux-fsdevel/CAJnrk1ZSaNRr-HWw-hbo2=LmbZiNGZveb0MwxZbPtBDFgg2icQ@mail.gmail.com/
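
The delta columns in this thread read as the relative throughput change from large folios off to on. A minimal sketch of that arithmetic follows, assuming rounding to the nearest whole percent; percent_delta() is a hypothetical helper, and the posted figures may have been derived from unrounded medians, so a row can differ by one point.

```c
#include <assert.h>

/*
 * Relative change from off to on, in whole percent, rounded to nearest.
 * Both arguments are throughputs in the same unit (MiB/s in this thread).
 */
static long percent_delta(double off_mibs, double on_mibs)
{
	double pct;

	assert(off_mibs > 0.0);
	pct = (on_mibs - off_mibs) * 100.0 / off_mibs;
	/* Round half away from zero without depending on libm. */
	return (long)(pct >= 0.0 ? pct + 0.5 : pct - 0.5);
}
```

For example, percent_delta(3110, 4514) gives 45 and percent_delta(1125, 1895) gives 68, matching the corresponding rows above.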


* Re: Re: [PATCH v1] fuse: enable large folios
  2026-06-24 17:52   ` Joanne Koong
@ 2026-06-25  7:17     ` Horst Birthelmer
  0 siblings, 0 replies; 7+ messages in thread
From: Horst Birthelmer @ 2026-06-25  7:17 UTC (permalink / raw)
  To: Joanne Koong; +Cc: miklos, jefflexu, fuse-devel

On Wed, Jun 24, 2026 at 10:52:11AM -0700, Joanne Koong wrote:
> On Tue, Jun 23, 2026 at 11:16 PM Horst Birthelmer <horst@birthelmer.de> wrote:
> >
> > On Tue, Jun 23, 2026 at 06:21:32PM -0700, Joanne Koong wrote:
> > > Enable large folios, capping the max order at the largest request fuse
> > > can issue, so a folio always fits within a single request. The order
> > > range minimum is 0, so under memory pressure the allocator falls back to
> > > smaller folios.
> > >
> > > Benchmarks (libfuse passthrough_hp, buffered fio, single job, 4 GiB
> > > file, medians, NUMA-pinned, performance governor, strictlimiting on by
> > > default):
> > >
> > > tmpfs backing (page-cache bound):
> > >   workload          bs      large folios off   on        delta
> > >   seq read,  cold,  128k    3110 MiB/s    4514 MiB/s     +45%
> > >   seq read,  cold,  1M      3079 MiB/s    5181 MiB/s     +68%
> > >   seq read,  warm,  128k    2438 MiB/s    4486 MiB/s     +84%
> > >   seq read,  warm,  1M      2403 MiB/s    5123 MiB/s    +113%
> > >   writeback write, seq,128k 1211 MiB/s    1699 MiB/s     +40%
> > >   writeback write, seq, 1M  1462 MiB/s    2208 MiB/s     +51%
> > >   writeback write, rand,128k 1101 MiB/s   1757 MiB/s     +60% +
> > >   writeback write, rand, 1M 1284 MiB/s    2228 MiB/s     +74% +
> > >
> > > xfs on NVMe backing (device bound for cold I/O):
> > >   workload          bs      large folios off   on        delta
> > >   seq read,  cold,  128k    2030 MiB/s    2172 MiB/s      +7% *
> > >   seq read,  cold,  1M      1999 MiB/s    2181 MiB/s      +9% *
> > >   seq read,  warm,  128k    2451 MiB/s    4939 MiB/s    +101%
> > >   seq read,  warm,  1M      2340 MiB/s    5639 MiB/s    +141%
> > >   writeback write, seq,128k  637 MiB/s     747 MiB/s     +17% *
> > >   writeback write, seq, 1M   694 MiB/s     833 MiB/s     +20% *
> > >   writeback write, rand,128k 1004 MiB/s   1648 MiB/s     +64% +
> > >   writeback write, rand, 1M 1171 MiB/s    2055 MiB/s     +75% +
> > >
> >
> > Hi Joanne,
> >
> > just out of curiosity, did you disable bdi strict limiting for this?
> 
> Hi Horst,
> 
> Those results are with strictlimiting on. After commit 494d2f508883
> ('fuse: use default writeback accounting') [1], I didn't see any
> performance regressions anymore with large folios + strictlimiting on.
> More information on why that commit fixed the issue is in [2].
> 
> When I ran the benchmarks last week with strictlimiting off, I saw roughly:
>     tmpfs:
>       seq,  128k    1174 -> 1648 MiB/s    +40%
>       seq,  1M      1261 -> 1845 MiB/s    +46%
>       rand, 128k    1148 -> 1638 MiB/s    +43%
>       rand, 1M      1273 -> 2065 MiB/s    +62%
> 
>     xfs on NVMe:
>       seq,  128k     621 ->  740 MiB/s    +19%
>       seq,  1M       649 ->  776 MiB/s    +20%
>       rand, 128k    1020 -> 1515 MiB/s    +49%
>       rand, 1M      1125 -> 1895 MiB/s    +68%
> 
> Strict limiting on actually had better performance here, which I think
> is because with the small dirty limit, the dirtying and the writeback
> happen in parallel instead of more dirty pages accumulating and then
> writeback getting kicked off. Because the backing device is so fast,
> it didn't cost throughput for the dirtying and the writeback to happen
> concurrently. The benchmarks were run with fsync, so everything had to
> be flushed before the fio run returned. If the backing device was slow
> and writes were bursty and fsync wasn't enforced, I think there'd
> probably be better performance with strictlimiting off than on, since
> the writer would be throttled to the speed of the backing device with
> strictlimiting on.
> 
> > In my tests, especially the large writes run into throttling pretty
> > fast, so that it effectively writes pagewise, which was not the target
> > of the test.
> 
> Are you seeing this on your system with strictlimiting on or off? Is
> that with commit 494d2f508883 in your tree? What test and server were
> you running? Do you know what the speed of the backing device is?

Without your patch, when I enable strict limiting, most FUSE_WRITE
requests arriving at the fuse server are 4k. With it disabled, the
writes are issued with larger sizes.
Since bdi min_ratio is 0 by default, I just assumed that as soon as the
cache goes over the limit, a write of that page is triggered.

I am not that familiar with the page cache code, so I can't point to
the exact culprit where writing of the dirty page is triggered.

> 
> Thanks,
> Joanne
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/fuse/file.c?id=494d2f508883a6e5c4530e5c6b3c8b2bbfb7318d
> [2] https://lore.kernel.org/linux-fsdevel/CAJnrk1ZSaNRr-HWw-hbo2=LmbZiNGZveb0MwxZbPtBDFgg2icQ@mail.gmail.com/
> 

Thanks,
Horst
