* [PATCH] fuse: disable default bdi strictlimiting
@ 2025-10-08 20:41 Joanne Koong
  2025-10-09 14:16 ` Miklos Szeredi
  0 siblings, 1 reply; 20+ messages in thread
From: Joanne Koong @ 2025-10-08 20:41 UTC (permalink / raw)
  To: miklos, linux-fsdevel; +Cc: kernel-team

Commit 5a53748568f7 ("mm/page-writeback.c: add strictlimit feature")
enabled strictlimiting by default on all fuse bdis to address the lack
of writeback accounting for temporary writeback pages.

Commit 0c58a97f919c ("fuse: remove tmp folio for writebacks and internal
rb tree") eliminated the use of temporary writeback pages and commit
494d2f508883 ("fuse: use default writeback accounting") switched fuse to
use the standard writeback accounting logic provided by the mm layer.

Since fuse now uses proper writeback accounting without temporary pages,
strictlimiting is no longer needed. Additionally, for fuse large folio
buffered writes, strictlimiting is overly conservative and causes
suboptimal performance due to excessive IO throttling.

Administrators can still enable strictlimiting for specific fuse servers
via /sys/class/bdi/*/strict_limit. If needed in the future,
strictlimiting for all unprivileged fuse servers could be enabled
through a sysctl.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 fs/fuse/inode.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 6fcfa15da868..87cb2c2bbc7b 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1591,8 +1591,6 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
 	if (err)
 		return err;
 
-	sb->s_bdi->capabilities |= BDI_CAP_STRICTLIMIT;
-
 	/*
 	 * For a single fuse filesystem use max 1% of dirty +
 	 * writeback threshold.
-- 
2.47.3



* Re: [PATCH] fuse: disable default bdi strictlimiting
  2025-10-08 20:41 [PATCH] fuse: disable default bdi strictlimiting Joanne Koong
@ 2025-10-09 14:16 ` Miklos Szeredi
  2025-10-09 18:36   ` Joanne Koong
  0 siblings, 1 reply; 20+ messages in thread
From: Miklos Szeredi @ 2025-10-09 14:16 UTC (permalink / raw)
  To: Joanne Koong; +Cc: linux-fsdevel, kernel-team

On Wed, 8 Oct 2025 at 22:42, Joanne Koong <joannelkoong@gmail.com> wrote:

> Since fuse now uses proper writeback accounting without temporary pages,
> strictlimiting is no longer needed. Additionally, for fuse large folio
> buffered writes, strictlimiting is overly conservative and causes
> suboptimal performance due to excessive IO throttling.

I don't quite get this part.  Is this a fuse specific limitation of
strictlimit vs. large folios?

Or is it the case that other filesystems are also affected, but
strictlimit is never used outside of fuse?

> Administrators can still enable strictlimiting for specific fuse servers
> via /sys/class/bdi/*/strict_limit. If needed in the future,

What's the issue with doing the opposite: leaving strictlimit the
default and disabling strictlimit for specific servers?

Thanks,
Miklos


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2025-10-09 14:16 ` Miklos Szeredi
@ 2025-10-09 18:36   ` Joanne Koong
  2025-10-10 15:01     ` Darrick J. Wong
  2025-10-27 22:38     ` Joanne Koong
  0 siblings, 2 replies; 20+ messages in thread
From: Joanne Koong @ 2025-10-09 18:36 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, kernel-team

On Thu, Oct 9, 2025 at 7:17 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, 8 Oct 2025 at 22:42, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > Since fuse now uses proper writeback accounting without temporary pages,
> > strictlimiting is no longer needed. Additionally, for fuse large folio
> > buffered writes, strictlimiting is overly conservative and causes
> > suboptimal performance due to excessive IO throttling.
>
> I don't quite get this part.  Is this a fuse specific limitation of
> strictlimit vs. large folios?
>
> Or is it the case that other filesystems are also affected, but
> strictlimit is never used outside of fuse?

It's the combination of fuse doing strictlimiting and setting the bdi
max ratio to 1%.

I don't think this is fuse-specific. I ran the same fio job [1]
locally on xfs and with setting the bdi max ratio to 1%, saw
performance drops between strictlimiting off vs. on

[1] fio --name=write --ioengine=sync --rw=write --bs=256K --size=1G
--numjobs=2 --ramp_time=30 --group_reporting=1
>
> > Administrators can still enable strictlimiting for specific fuse servers
> > via /sys/class/bdi/*/strict_limit. If needed in the future,
>
> What's the issue with doing the opposite: leaving strictlimit the
> default and disabling strictlimit for specific servers?

If we do that, then we can't enable large folios for servers that use
the writeback cache. I don't think we can just turn on large folios if
an admin later on disables strictlimiting for the server, because I
don't think mapping_set_folio_order_range() can be called after the
inode has been initialized (not 100% sure about this), which means
we'd also need to add some mount option for servers to disable
strictlimiting.

Thanks,
Joanne
>
> Thanks,
> Miklos


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2025-10-09 18:36   ` Joanne Koong
@ 2025-10-10 15:01     ` Darrick J. Wong
  2025-10-10 15:07       ` Matthew Wilcox
  2025-10-10 23:14       ` Joanne Koong
  2025-10-27 22:38     ` Joanne Koong
  1 sibling, 2 replies; 20+ messages in thread
From: Darrick J. Wong @ 2025-10-10 15:01 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Miklos Szeredi, linux-fsdevel, kernel-team, Matthew Wilcox

[cc willy in case he has opinions about dynamically changing the
pagecache order range]

On Thu, Oct 09, 2025 at 11:36:30AM -0700, Joanne Koong wrote:
> On Thu, Oct 9, 2025 at 7:17 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Wed, 8 Oct 2025 at 22:42, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > > Since fuse now uses proper writeback accounting without temporary pages,
> > > strictlimiting is no longer needed. Additionally, for fuse large folio
> > > buffered writes, strictlimiting is overly conservative and causes
> > > suboptimal performance due to excessive IO throttling.
> >
> > I don't quite get this part.  Is this a fuse specific limitation of
> > strictlimit vs. large folios?
> >
> > Or is it the case that other filesystems are also affected, but
> > strictlimit is never used outside of fuse?
> 
> It's the combination of fuse doing strictlimiting and setting the bdi
> max ratio to 1%.
> 
> I don't think this is fuse-specific. I ran the same fio job [1]
> locally on xfs and with setting the bdi max ratio to 1%, saw
> performance drops between strictlimiting off vs. on
> 
> [1] fio --name=write --ioengine=sync --rw=write --bs=256K --size=1G
> --numjobs=2 --ramp_time=30 --group_reporting=1

Er... what kind of numbers? :)

> >
> > > Administrators can still enable strictlimiting for specific fuse servers
> > > via /sys/class/bdi/*/strict_limit. If needed in the future,
> >
> > What's the issue with doing the opposite: leaving strictlimit the
> > default and disabling strictlimit for specific servers?
> 
> If we do that, then we can't enable large folios for servers that use
> the writeback cache. I don't think we can just turn on large folios if

What's the limitation on strictlimit && large_folios?  Is it just the
throttling problem because dirtying a single byte in a 2M folio charges
the process with all 2M?  Or something else?

> an admin later on disables strictlimiting for the server, because I
> don't think mapping_set_folio_order_range() can be called after the
> inode has been initialized (not 100% sure about this), which means
> we'd also need to add some mount option for servers to disable
> strictlimiting.

I think it's ok to increase the folio order range at runtime because
you're merely expanding the range of valid folio sizes in the mapping.

Decreasing the range probably won't work unless you take the inode and
mapping locks exclusively and purge the pagecache.

--D

> Thanks,
> Joanne
> >
> > Thanks,
> > Miklos
> 


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2025-10-10 15:01     ` Darrick J. Wong
@ 2025-10-10 15:07       ` Matthew Wilcox
  2025-10-10 23:14       ` Joanne Koong
  1 sibling, 0 replies; 20+ messages in thread
From: Matthew Wilcox @ 2025-10-10 15:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Joanne Koong, Miklos Szeredi, linux-fsdevel, kernel-team

On Fri, Oct 10, 2025 at 08:01:13AM -0700, Darrick J. Wong wrote:
> [cc willy in case he has opinions about dynamically changing the
> pagecache order range]

It's not designed for that.  mapping_set_folio_order_range() accesses
mapping->flags without any locking/atomicity, so we can overwrite
other changes to mapping->flags, like setting AS_EIO.  It really
is supposed to be "the filesystem supports folios of these sizes",
not "we've made some runtime change to the filesystem and now we'd
prefer the MM uses folios of these sizes instead of those".
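
A minimal userspace sketch of the lost-update hazard described above. This
is not kernel code: the DEMO_* bit positions and the demo_* helpers are
made up for illustration, and the two "CPUs" are interleaved by hand so the
outcome is deterministic. It only shows why a plain read-modify-write of a
shared flags word can drop a bit that another context set in between.

```c
/* Hypothetical bit layout, for illustration only (not the kernel's). */
#define DEMO_AS_EIO       (1UL << 0)
#define DEMO_ORDER_SHIFT  16
#define DEMO_ORDER_MASK   (0x1fUL << DEMO_ORDER_SHIFT)

/* Compute the new flags word from a previously loaded snapshot. */
unsigned long demo_order_prepare(unsigned long snapshot, unsigned int max_order)
{
	return (snapshot & ~DEMO_ORDER_MASK) |
	       ((unsigned long)max_order << DEMO_ORDER_SHIFT);
}

/*
 * Replay one bad interleaving by hand.
 * Returns 1 if the error bit survived, 0 if it was overwritten.
 */
int demo_error_bit_survives(void)
{
	unsigned long flags = 0;
	unsigned long snapshot;

	snapshot = flags;                          /* CPU A: load flags      */
	flags |= DEMO_AS_EIO;                      /* CPU B: record an error */
	flags = demo_order_prepare(snapshot, 9);   /* CPU A: store stale word */

	return (flags & DEMO_AS_EIO) != 0;
}
```

In this replay demo_error_bit_survives() returns 0: the store from "CPU A"
is built from a snapshot taken before the error bit was set, so the bit is
gone afterwards. Doing the order update once, before anything else can
touch the flags word, avoids the interleaving altogether.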


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2025-10-10 15:01     ` Darrick J. Wong
  2025-10-10 15:07       ` Matthew Wilcox
@ 2025-10-10 23:14       ` Joanne Koong
  1 sibling, 0 replies; 20+ messages in thread
From: Joanne Koong @ 2025-10-10 23:14 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Miklos Szeredi, linux-fsdevel, kernel-team, Matthew Wilcox

On Fri, Oct 10, 2025 at 8:01 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> [cc willy in case he has opinions about dynamically changing the
> pagecache order range]
>
> On Thu, Oct 09, 2025 at 11:36:30AM -0700, Joanne Koong wrote:
> > On Thu, Oct 9, 2025 at 7:17 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Wed, 8 Oct 2025 at 22:42, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > > Since fuse now uses proper writeback accounting without temporary pages,
> > > > strictlimiting is no longer needed. Additionally, for fuse large folio
> > > > buffered writes, strictlimiting is overly conservative and causes
> > > > suboptimal performance due to excessive IO throttling.
> > >
> > > I don't quite get this part.  Is this a fuse specific limitation of
> > > strictlimit vs. large folios?
> > >
> > > Or is it the case that other filesystems are also affected, but
> > > strictlimit is never used outside of fuse?
> >
> > It's the combination of fuse doing strictlimiting and setting the bdi
> > max ratio to 1%.
> >
> > I don't think this is fuse-specific. I ran the same fio job [1]
> > locally on xfs and with setting the bdi max ratio to 1%, saw
> > performance drops between strictlimiting off vs. on
> >
> > [1] fio --name=write --ioengine=sync --rw=write --bs=256K --size=1G
> > --numjobs=2 --ramp_time=30 --group_reporting=1
>
> Er... what kind of numbers? :)
>

When I tested it earlier this week it was on a VM, but on an actual
machine this is what I'm seeing:

 echo 4294967296 > /proc/sys/vm/dirty_bytes # 4GB
 echo 2147483648 > /proc/sys/vm/dirty_background_bytes # 2GB

 fio --name=write --ioengine=sync --rw=write --bs=512K --size=2G
--numjobs=2 --ramp_time=30 --group_reporting=1

default (no strictlimiting and max_ratio set to 100):
         around 1600 to 1800 MiB/s
strictlimiting on and max_ratio set to 1:
         around 1050 MiB/s


On systems with a lot of RAM where /proc/sys/vm/dirty_bytes is high
enough, we don't see the performance drop. But 4 GB seemed like a
reasonable value for /proc/sys/vm/dirty_bytes as that implies 20 GB of
RAM (as I understand it, the default /proc/sys/vm/dirty_ratio value is
usually set to 20% of system ram).

> > >
> > > > Administrators can still enable strictlimiting for specific fuse servers
> > > > via /sys/class/bdi/*/strict_limit. If needed in the future,
> > >
> > > What's the issue with doing the opposite: leaving strictlimit the
> > > default and disabling strictlimit for specific servers?
> >
> > If we do that, then we can't enable large folios for servers that use
> > the writeback cache. I don't think we can just turn on large folios if
>
> What's the limitation on strictlimit && large_folios?  Is it just the
> throttling problem because dirtying a single byte in a 2M folio charges
> the process with all 2M?  Or something else?

With strictlimiting on, the throttling threshold is a lot more
conservative. When large folios are used, a larger number of pages are
dirtied per write at once and not incrementally balanced, which causes
the logic in balance_dirty_pages() to schedule io waits, whereas small
folios don't have this issue because they incrementally balance pages
as they write them back. This thread has a lot more context:
https://lore.kernel.org/linux-fsdevel/Z1N505RCcH1dXlLZ@casper.infradead.org/T/#m9e3dd273aa202f9f4e12eb9c96602b5fec2d383d

The problem of dirtying a single byte in a 2M folio and being charged
for all 2M should also imo be addressed, eg through the followup to [1],
but when strictlimiting is off, this is much less of an issue since the
threshold is higher.

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/5qgjrq6l627byybxjs6vzouspeqj6hdrx2ohqbxqkkjy65mtz5@zp6pimrpeu4e/T/#med8769e865e98960b1f504375cb1c0c2c3bdea51


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2025-10-09 18:36   ` Joanne Koong
  2025-10-10 15:01     ` Darrick J. Wong
@ 2025-10-27 22:38     ` Joanne Koong
  2026-05-08  9:42       ` Miklos Szeredi
  1 sibling, 1 reply; 20+ messages in thread
From: Joanne Koong @ 2025-10-27 22:38 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, kernel-team

On Thu, Oct 9, 2025 at 11:36 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Oct 9, 2025 at 7:17 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Wed, 8 Oct 2025 at 22:42, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > > Since fuse now uses proper writeback accounting without temporary pages,
> > > strictlimiting is no longer needed. Additionally, for fuse large folio
> > > buffered writes, strictlimiting is overly conservative and causes
> > > suboptimal performance due to excessive IO throttling.
> >
> > I don't quite get this part.  Is this a fuse specific limitation of
> > strictlimit vs. large folios?
> >
> > Or is it the case that other filesystems are also affected, but
> > strictlimit is never used outside of fuse?
>
> It's the combination of fuse doing strictlimiting and setting the bdi
> max ratio to 1%.
>
> I don't think this is fuse-specific. I ran the same fio job [1]
> locally on xfs and with setting the bdi max ratio to 1%, saw
> performance drops between strictlimiting off vs. on
>
> [1] fio --name=write --ioengine=sync --rw=write --bs=256K --size=1G
> --numjobs=2 --ramp_time=30 --group_reporting=1
> >
> > > Administrators can still enable strictlimiting for specific fuse servers
> > > via /sys/class/bdi/*/strict_limit. If needed in the future,
> >
> > What's the issue with doing the opposite: leaving strictlimit the
> > default and disabling strictlimit for specific servers?
>
> If we do that, then we can't enable large folios for servers that use
> the writeback cache. I don't think we can just turn on large folios if
> an admin later on disables strictlimiting for the server, because I
> don't think mapping_set_folio_order_range() can be called after the
> inode has been initialized (not 100% sure about this), which means
> we'd also need to add some mount option for servers to disable
> strictlimiting.

Miklos, could you share your thoughts on this? Are you in favor of
disabling default strictlimiting? Or do you prefer to have it kept
enabled by default, with some mount option or sysctl added for
privileged servers to be able to disable strictlimiting + enable large
folios if they use the writeback cache?

Thanks,
Joanne

>
> Thanks,
> Joanne
> >
> > Thanks,
> > Miklos


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2025-10-27 22:38     ` Joanne Koong
@ 2026-05-08  9:42       ` Miklos Szeredi
  2026-05-08 11:54         ` Horst Birthelmer
  2026-05-12 20:56         ` Joanne Koong
  0 siblings, 2 replies; 20+ messages in thread
From: Miklos Szeredi @ 2026-05-08  9:42 UTC (permalink / raw)
  To: Joanne Koong; +Cc: linux-fsdevel, kernel-team, fuse-devel

On Mon, 27 Oct 2025 at 23:39, Joanne Koong <joannelkoong@gmail.com> wrote:
> Miklos, could you share your thoughts on this? Are you in favor of
> disabling default strictlimiting? Or do you prefer to have it kept
> enabled by default, with some mount option or sysctl added for
> privileged servers to be able to disable strictlimiting + enable large
> folios if they use the writeback cache?

So what I think we should do is implement some sort of slow writer
test, and see what happens with and without strictlimit.

Tried to ask claude to do this for me, but not getting very far.

So if I take this maintainership role seriously and not let myself
drown in the details, then the logical thing to do is to delegate ;)
Which is hard (for me at least) but I'll give it a try...

Could you please check how things change if there's limited writeback
rate and we disable strictlimit?  And what happens if there are
several such instances running in parallel?

Thanks,
Miklos


* Re: Re: [PATCH] fuse: disable default bdi strictlimiting
  2026-05-08  9:42       ` Miklos Szeredi
@ 2026-05-08 11:54         ` Horst Birthelmer
  2026-05-12 20:56         ` Joanne Koong
  1 sibling, 0 replies; 20+ messages in thread
From: Horst Birthelmer @ 2026-05-08 11:54 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Joanne Koong, linux-fsdevel, kernel-team, fuse-devel

On Fri, May 08, 2026 at 11:42:10AM +0200, Miklos Szeredi wrote:
> On Mon, 27 Oct 2025 at 23:39, Joanne Koong <joannelkoong@gmail.com> wrote:
> > Miklos, could you share your thoughts on this? Are you in favor of
> > disabling default strictlimiting? Or do you prefer to have it kept
> > enabled by default, with some mount option or sysctl added for
> > privileged servers to be able to disable strictlimiting + enable large
> > folios if they use the writeback cache?
> 
> So what I think we should do is implement some sort of slow writer
> test, and see what happens with and without strictlimit.
> 
> Tried to ask claude to do this for me, but not getting very far.
> 
> So if I take this maintainership role seriously and not let myself
> drown in the details, then the logical thing to do is to delegate ;)
> Which is hard (for me at least) but I'll give it a try...
> 
> Could you please check how things change if there's limited writeback
> rate and we disable strictlimit?  And what happens if there are
> several such instances running in parallel?

We have run all kinds of workloads and tests (xfstest, too) with 
writeback enabled and strictlimiting off.

I have not noticed any problems, but we have not done any systematic
tests regarding this. We were always testing something else.
(usually performance impacts)

Thanks,
Horst


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2026-05-08  9:42       ` Miklos Szeredi
  2026-05-08 11:54         ` Horst Birthelmer
@ 2026-05-12 20:56         ` Joanne Koong
  2026-05-27  1:42           ` Joanne Koong
  1 sibling, 1 reply; 20+ messages in thread
From: Joanne Koong @ 2026-05-12 20:56 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, kernel-team, fuse-devel

On Fri, May 8, 2026 at 2:42 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 27 Oct 2025 at 23:39, Joanne Koong <joannelkoong@gmail.com> wrote:
> > Miklos, could you share your thoughts on this? Are you in favor of
> > disabling default strictlimiting? Or do you prefer to have it kept
> > enabled by default, with some mount option or sysctl added for
> > privileged servers to be able to disable strictlimiting + enable large
> > folios if they use the writeback cache?
>
> So what I think we should do is implement some sort of slow writer
> test, and see what happens with and without strictlimit.
>
> Tried to ask claude to do this for me, but not getting very far.
>
> So if I take this maintainership role seriously and not let myself
> drown in the details, then the logical thing to do is to delegate ;)
> Which is hard (for me at least) but I'll give it a try...
>
> Could you please check how things change if there's limited writeback
> rate and we disable strictlimit?  And what happens if there are
> several such instances running in parallel?

I think for unprivileged fuse servers, strictlimiting will always need
to be enabled or else a malicious user can launch tons of unprivileged
servers and eat up the global dirty page budget / starve writeback for
the rest of the system. Similarly for privileged servers, it could be
unintentionally slow or buggy and eat up the dirty page budget. I'll
read through the writeback throttling code to verify this and run some
local tests.

I think the question is whether we want to let admins opt out of
strictlimit when they're confident their server is well-behaved eg
through a sysctl an admin can set to disable strictlimiting for all
servers. Otherwise, large folios will always have to be off for any
server that runs with writeback caching.

Thanks,
Joanne
>
> Thanks,
> Miklos


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2026-05-12 20:56         ` Joanne Koong
@ 2026-05-27  1:42           ` Joanne Koong
  2026-05-27  5:57             ` Horst Birthelmer
                               ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Joanne Koong @ 2026-05-27  1:42 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, kernel-team, fuse-devel, Jan Kara, Jingbo Xu

On Tue, May 12, 2026 at 1:56 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, May 8, 2026 at 2:42 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Mon, 27 Oct 2025 at 23:39, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > Miklos, could you share your thoughts on this? Are you in favor of
> > > disabling default strictlimiting? Or do you prefer to have it kept
> > > enabled by default, with some mount option or sysctl added for
> > > privileged servers to be able to disable strictlimiting + enable large
> > > folios if they use the writeback cache?
> >
> > So what I think we should do is implement some sort of slow writer
> > test, and see what happens with and without strictlimit.
> >
> > Tried to ask claude to do this for me, but not getting very far.
> >
> > So if I take this maintainership role seriously and not let myself
> > drown in the details, then the logical thing to do is to delegate ;)
> > Which is hard (for me at least) but I'll give it a try...
> >
> > Could you please check how things change if there's limited writeback
> > rate and we disable strictlimit?  And what happens if there are
> > several such instances running in parallel?
>
> I think for unprivileged fuse servers, strictlimiting will always need
> to be enabled or else a malicious user can launch tons of unprivileged
> servers and eat up the global dirty page budget / starve writeback for
> the rest of the system. Similarly for privileged servers, it could be
> unintentionally slow or buggy and eat up the dirty page budget. I'll
> read through the writeback throttling code to verify this and run some
> local tests.

I read through the writeback throttling code and re-read Jan's very
helpful comments from this thread last year [1]. So for unprivileged
servers, I think we definitely cannot remove strictlimiting. If the
fuse server is slow or unresponsive with writing back the pages, it
will take up too much of the global dirty budget which will degrade
write throughput for other filesystems (their throttling will be
computed against the global dirty page count, eg the freerunning check
in balance_dirty_pages() and the pos_ratio calculation "pos_ratio =
pos_ratio_polynom(setpoint, dtc->dirty, limit)" (dtc->dirty is the
global dirty page count)) and any fuse stuck dirty pages are
essentially unreclaimable. Without strictlimiting, there will be no
hard cap on how many dirty pages a misbehaving server can accumulate.

With strictlimiting on and large folios enabled, the problem is that
the large folio size can potentially dwarf the server's dirty budget,
which can lead to excessive throttling. When I ran my benchmarks last
year, I and independently Jingbo saw severe performance regressions
for buffered writes with large folios (eg 2 GB/s BW w/o, and 200 MB/s
BW w/) [2] but I think that might have been because the machines had
limited RAM, resulting in a very small dirty budget. Fuse sets the max
ratio of the bdi to 1% of the global dirty threshold, so running
through some napkin math:

On a 64 GB machine:
  - DirtyThresh = 20% of 64 GB = 12.8 GB
  - BdiDirtyThresh = 12.8 GB / 100 = 128 MB
  - 128 MB / 2 MB folio = 64 dirty folios

On a 32 GB machine:
  - BdiDirtyThresh = 64 MB
  - 32 dirty folios

On an 8 GB machine:
  - BdiDirtyThresh = 16 MB
  - 8 dirty folios

On an 8 GB machine (assuming vm.dirty_ratio=20% and
vm.dirty_background_ratio=10%), we get 12 MB of freerun, 4 MB of
proportional throttling, and then full throttling starts at 16 MB.
With 2 MB folios, the 4 MB zone between freerun and full throttling
doesn't leave that much room for the balance_dirty_pages() logic to
adjust the dirtier's speed, which I think causes the writes to
oscillate between freerunning and then being fully (overly) throttled.
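
For reference, a rough sketch of the napkin math above. It is only
arithmetic, not kernel code: struct bdi_budget and both functions are
hypothetical names, 1 GB is taken as 1000 MB to match the rounded figures,
total RAM stands in for dirtyable memory, and the bdi is assumed to sit
right at its max_ratio cap. The freerun ceiling is taken as the midpoint
of the dirty and background thresholds.

```c
#include <stdio.h>

/* Sizes are in MB, with 1 GB = 1000 MB to match the rounded figures. */
struct bdi_budget {
	unsigned long thresh_mb;     /* per-bdi dirty threshold (full throttling) */
	unsigned long bg_thresh_mb;  /* per-bdi background threshold */
	unsigned long freerun_mb;    /* below this, writers are not throttled */
	unsigned long ramp_mb;       /* proportional-throttling zone */
	unsigned long folios;        /* whole folios that fit under thresh */
};

/* Estimate the per-bdi budget from RAM size and the ratio knobs (in %). */
struct bdi_budget bdi_budget_estimate(unsigned long ram_gb,
				      unsigned long dirty_ratio,
				      unsigned long bg_ratio,
				      unsigned long bdi_max_ratio,
				      unsigned long folio_mb)
{
	struct bdi_budget b;
	unsigned long ram_mb = ram_gb * 1000;

	b.thresh_mb = ram_mb * dirty_ratio / 100 * bdi_max_ratio / 100;
	b.bg_thresh_mb = ram_mb * bg_ratio / 100 * bdi_max_ratio / 100;
	b.freerun_mb = (b.thresh_mb + b.bg_thresh_mb) / 2;
	b.ramp_mb = b.thresh_mb - b.freerun_mb;
	b.folios = b.thresh_mb / folio_mb;
	return b;
}

/* Print one row using the defaults assumed above: 20%, 10%, 1%, 2 MB. */
void bdi_budget_print(unsigned long ram_gb)
{
	struct bdi_budget b = bdi_budget_estimate(ram_gb, 20, 10, 1, 2);

	printf("%lu GB: thresh=%lu MB freerun=%lu MB ramp=%lu MB folios=%lu\n",
	       ram_gb, b.thresh_mb, b.freerun_mb, b.ramp_mb, b.folios);
}
```

Calling bdi_budget_print() for 64, 32 and 8 gives thresholds of 128, 64
and 16 MB, and for the 8 GB case a 12 MB freerun ceiling with a 4 MB ramp,
ie only two 2 MB folios between "no throttling" and "full throttling".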

I think this is also going to be a problem for cgroups with large
folios since they also, as I understand it, are constrained with a
limited / tight dirty budget. I ran some initial benchmarks with
cgroup memory constraints on NVMe and saw similar instability (a
single writer in an 8 GB cgroup had max write latencies of 6 seconds vs
15 ms without the cgroup, with the balance_dirty_pages() throttling
oscillating rather than settling near the set point).

I think this problem gets untenable for random writes with large
folios, since dirtying just a few bytes will charge the whole folio
size to the dirty budget. I have a patchset from last year for adding
more granular dirty/writeback tracking [3], I'm going to pick this
series back up. I think it will be useful generically, not just for
fuse.
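
To put a number on the random-write case, here is a small sketch under
stated assumptions: every write lands in a different, previously clean
folio, and the whole folio is charged to the dirty budget regardless of
how many bytes were written. The function names are hypothetical and this
is plain arithmetic, not the kernel's accounting code.

```c
/*
 * How many bytes of real payload have been written by the time the
 * dirty budget is fully charged, if each write dirties a new folio.
 */
unsigned long long payload_at_limit(unsigned long long budget_bytes,
				    unsigned long long folio_bytes,
				    unsigned long long write_bytes)
{
	unsigned long long folios = budget_bytes / folio_bytes;

	return folios * write_bytes;
}

/* Ratio between what is charged and what was actually written. */
unsigned long long charge_amplification(unsigned long long folio_bytes,
					unsigned long long write_bytes)
{
	return folio_bytes / write_bytes;
}
```

With a 16 MB budget, 2 MB folios and 4 KB random writes,
payload_at_limit() comes out at 32 KB: eight small writes are enough to
reach the full-throttling threshold, a 512x amplification of the charge.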

For getting this to work on fuse servers with strictlimiting, I think
the next steps are to
a) as Jan had suggested in [1], come up with some heuristic to
constrain the max order supported for large folios for these fuse
servers if they're running with the writeback cache enabled
b) benchmark ^ and if there are still regressions, then we should
probably just turn large folios off for these servers
c) add the granular writeback/dirty accounting for large folios
d) look into improving the balance_dirty_pages() throttling logic to
handle narrow gaps between the freerun and full throttling zones
better and reduce over-throttling

Does this sound like a reasonable way forward?

For privileged servers, I still think it makes sense to remove the
strictlimiting requirement or at the least, let admins opt out of that
if they are confident their server is well-behaved.


Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/tglgxjxcs3wpm4msgxlvzk3hebzcguhuu752hs3eefku6wj4zv@2ixuho7rxbah/
[2] https://lore.kernel.org/linux-fsdevel/f9b63a41-ced7-4176-8f40-6cba8fce7a4c@linux.alibaba.com/
[3] https://lore.kernel.org/linux-fsdevel/20250829233942.3607248-1-joannelkoong@gmail.com/

>
> I think the question is whether we want to let admins opt out of
> strictlimit when they're confident their server is well-behaved eg
> through a sysctl an admin can set to disable strictlimiting for all
> servers. Otherwise, large folios will always have to be off for any
> server that runs with writeback caching.
>
> Thanks,
> Joanne
> >
> > Thanks,
> > Miklos


* Re: Re: [PATCH] fuse: disable default bdi strictlimiting
  2026-05-27  1:42           ` Joanne Koong
@ 2026-05-27  5:57             ` Horst Birthelmer
  2026-05-27 10:59               ` Amir Goldstein
  2026-05-27 12:25             ` Miklos Szeredi
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Horst Birthelmer @ 2026-05-27  5:57 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, linux-fsdevel, kernel-team, fuse-devel, Jan Kara,
	Jingbo Xu

On Tue, May 26, 2026 at 06:42:35PM -0700, Joanne Koong wrote:
> On Tue, May 12, 2026 at 1:56 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Fri, May 8, 2026 at 2:42 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Mon, 27 Oct 2025 at 23:39, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > Miklos, could you share your thoughts on this? Are you in favor of
> > > > disabling default strictlimiting? Or do you prefer to have it kept
> > > > enabled by default, with some mount option or sysctl added for
> > > > privileged servers to be able to disable strictlimiting + enable large
> > > > folios if they use the writeback cache?
> > >
> > > So what I think we should do is implement some sort of slow writer
> > > test, and see what happens with and without strictlimit.
> > >
> > > Tried to ask claude to do this for me, but not getting very far.
> > >
> > > So if I take this maintainership role seriously and not let myself
> > > drown in the details, then the logical thing to do is to delegate ;)
> > > Which is hard (for me at least) but I'll give it a try...
> > >
> > > Could you please check how things change if there's limited writeback
> > > rate and we disable strictlimit?  And what happens if there are
> > > several such instances running in parallel?
> >
> > I think for unprivileged fuse servers, strictlimiting will always need
> > to be enabled or else a malicious user can launch tons of unprivileged
> > servers and eat up the global dirty page budget / starve writeback for
> > the rest of the system. Similarly for privileged servers, it could be
> > unintentionally slow or buggy and eat up the dirty page budget. I'll
> > read through the writeback throttling code to verify this and run some
> > local tests.
> 
> I read through the writeback throttling code and re-read Jan's very
> helpful comments from this thread last year [1]. So for unprivileged
> servers, I think we definitely cannot remove strictlimiting. If the
> fuse server is slow or unresponsive with writing back the pages, it
> will take up too much of the global dirty budget which will degrade
> write throughput for other filesystems (their throttling will be
> computed against the global dirty page count, eg the freerunning check
> in balance_dirty_pages() and the pos_ratio calculation "pos_ratio =
> pos_ratio_polynom(setpoint, dtc->dirty, limit)" (dtc->dirty is the
> global dirty page count)) and any fuse stuck dirty pages are
> essentially unreclaimable. Without strictlimiting, there will be no
> hard cap on how many dirty pages a misbehaving server can accumulate.
> 
> With strictlimiting on and large folios enabled, the problem is that
> the large folio size can potentially dwarf the server's dirty budget,
> which can lead to excessive throttling. When I ran my benchmarks last
> year, I and independently Jingbo saw severe performance regressions
> for buffered writes with large folios (eg 2 GB/s BW w/o, and 200 MB/s
> BW w/) [2] but I think that might have been because the machines had
> limited RAM, resulting in a very small dirty budget. Fuse sets the max
> ratio of the bdi to 1% of the global dirty threshold, so running
> through some napkin math:
> 
> On a 64 GB machine:
>   - DirtyThresh = 20% of 64 GB = 12.8 GB
>   - BdiDirtyThresh = 12.8 GB / 100 = 128 MB
>   - 128 MB / 2 MB folio = 64 dirty folios
> 
> On a 32 GB machine:
>   - BdiDirtyThresh = 64 MB
>   - 32 dirty folios
> 
> On an 8 GB machine:
>   - BdiDirtyThresh = 16 MB
>   - 8 dirty folios
> 
> > On an 8 GB machine (assuming vm.dirty_ratio=20% and
> > vm.dirty_background_ratio=10%), we get 12 MB of freerun, 4 MB of
> > proportional throttling, and then full throttling starts at 16 MB.
> > With 2 MB folios, the 4 MB zone between freerun and full throttling
> doesn't leave that much room for the balance_dirty_pages() logic to
> adjust the dirtier's speed, which I think causes the writes to
> oscillate between freerunning and then being fully (overly) throttled.
> 
> I think this is also going to be a problem for cgroups with large
> folios since they also, as I understand it, are constrained with a
> limited / tight dirty budget. I ran some initial benchmarks with
> cgroup memory constraints on NVMe and saw similar instability (a
> single writer in a 8 GB cgroup had max write latencies of 6 seconds vs
> 15 ms without the cgroup, with the balance_dirty_pages() throttling
> oscillating rather than settling near the set point).
> 
> I think this problem gets untenable for random writes with large
> folios, since dirtying just a few bytes will charge the whole folio
> size to the dirty budget. I have a patchset from last year for adding
> more granular dirty/writeback tracking [3], I'm going to pick this
> series back up. I think it will be useful generically, not just for
> fuse.
> 
> For getting this to work on fuse servers with strictlimiting, I think
> the next steps are to
> a) as Jan had suggested in [1], come up with some heuristic to
> constrain the max order supported for large folios for these fuse
> servers if they're running with the writeback cache enabled
> b) benchmark ^ and if there are still regressions, then we should
> probably just turn large folios off for these servers
> c) add the granular writeback/dirty accounting for large folios
> d) look into improving the balance_dirty_pages() throttling logic to
> handle narrow gaps between the freerun and full throttling zones
> better and reduce over-throttling
> 
> Does this sound like a reasonable way forward?

Sounds good to me, since we have seen pretty much the same when we enabled
large folios for testing.

> 
> For privileged servers, I still think it makes sense to remove the
> strictlimiting requirement or at the least, let admins opt out of that
> if they are confident their server is well-behaved.
> 

Here I'm not really sure what the most logical and sane way would be.
I really don't like limits for no reason but I understand the necessity
to have limits enabled for unprivileged servers.

Do you think a module parameter is the right way to go here?
The connection parameter might be a problem since an admin would have
to set it for a large number of mounts.

Horst

> 
> Thanks,
> Joanne
> 
> [1] https://lore.kernel.org/linux-fsdevel/tglgxjxcs3wpm4msgxlvzk3hebzcguhuu752hs3eefku6wj4zv@2ixuho7rxbah/
> [2] https://lore.kernel.org/linux-fsdevel/f9b63a41-ced7-4176-8f40-6cba8fce7a4c@linux.alibaba.com/
> [3] https://lore.kernel.org/linux-fsdevel/20250829233942.3607248-1-joannelkoong@gmail.com/
> 
> >
> > I think the question is whether we want to let admins opt out of
> > strictlimit when they're confident their server is well-behaved eg
> > through a sysctl an admin can set to disable strictlimiting for all
> > servers. Otherwise, large folios will always have to be off for any
> > server that runs with writeback caching.
> >
> > Thanks,
> > Joanne
> > >
> > > Thanks,
> > > Miklos
> 


* Re: Re: [PATCH] fuse: disable default bdi strictlimiting
  2026-05-27  5:57             ` Horst Birthelmer
@ 2026-05-27 10:59               ` Amir Goldstein
  2026-05-27 22:40                 ` Joanne Koong
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2026-05-27 10:59 UTC (permalink / raw)
  To: Horst Birthelmer
  Cc: Joanne Koong, Miklos Szeredi, linux-fsdevel, kernel-team,
	fuse-devel, Jan Kara, Jingbo Xu

On Wed, May 27, 2026 at 8:06 AM Horst Birthelmer <horst@birthelmer.de> wrote:
>
> On Tue, May 26, 2026 at 06:42:35PM -0700, Joanne Koong wrote:
> > On Tue, May 12, 2026 at 1:56 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Fri, May 8, 2026 at 2:42 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > On Mon, 27 Oct 2025 at 23:39, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > Miklos, could you share your thoughts on this? Are you in favor of
> > > > > disabling default strictlimiting? Or do you prefer to have it kept
> > > > > enabled by default, with some mount option or sysctl added for
> > > > > privileged servers to be able to disable strictlimiting + enable large
> > > > > folios if they use the writeback cache?
> > > >
> > > > So what I think we should do is implement some sort of slow writer
> > > > test, and see what happens with and without strictlimit.
> > > >
> > > > Tried to ask claude to do this for me, but not getting very far.
> > > >
> > > > So if I take this maintainership role seriously and not let myself
> > > > drown in the details, then the logical thing to do is to delegate ;)
> > > > Which is hard (for me at least) but I'll give it a try...
> > > >
> > > > Could you please check how things change if there's limited writeback
> > > > rate and we disable strictlimit?  And what happens if there are
> > > > several such instances running in parallel?
> > >
> > > > I think for unprivileged fuse servers, strictlimiting will always need
> > > to be enabled or else a malicious user can launch tons of unprivileged
> > > servers and eat up the global dirty page budget / starve writeback for
> > > the rest of the system. Similarly for privileged servers, it could be
> > > unintentionally slow or buggy and eat up the dirty page budget. I'll
> > > read through the writeback throttling code to verify this and run some
> > > local tests.
> >
> > I read through the writeback throttling code and re-read Jan's very
> > helpful comments from this thread last year [1]. So for unprivileged
> > servers, I think we definitely cannot remove strictlimiting. If the
> > fuse server is slow or unresponsive with writing back the pages, it
> > will take up too much of the global dirty budget which will degrade
> > write throughput for other filesystems (their throttling will be
> > computed against the global dirty page count, eg the freerunning check
> > in balance_dirty_pages() and the pos_ratio calculation "pos_ratio =
> > pos_ratio_polynom(setpoint, dtc->dirty, limit)" (dtc->dirty is the
> > global dirty page count)) and any fuse stuck dirty pages are
> > essentially unreclaimable. Without strictlimiting, there will be no
> > hard cap on how many dirty pages a misbehaving server can accumulate.
> >
> > With strictlimiting on and large folios enabled, the problem is that
> > the large folio size can potentially dwarf the server's dirty budget,
> > which can lead to excessive throttling. When I ran my benchmarks last
> > year, I and independently Jingbo saw severe performance regressions
> > for buffered writes with large folios (eg 2 GB/s BW w/o, and 200 MB/s
> > BW w/) [2] but I think that might have been because the machines had
> > limited RAM, resulting in a very small dirty budget. Fuse sets the max
> > ratio of the bdi to 1% of the global dirty threshold, so running
> > through some napkin math:
> >
> > On a 64 GB machine:
> >   - DirtyThresh = 20% of 64 GB = 12.8 GB
> >   - BdiDirtyThresh = 12.8 GB / 100 = 128 MB
> >   - 128 MB / 2 MB folio = 64 dirty folios
> >
> > On a 32 GB machine:
> >   - BdiDirtyThresh = 64 MB
> >   - 32 dirty folios
> >
> > On an 8 GB machine:
> >   - BdiDirtyThresh = 16 MB
> >   - 8 dirty folios
> >
> > > On an 8 GB machine (assuming vm.dirty_ratio=20% and
> > > vm.dirty_background_ratio=10%), we get 12 MB of freerun, 4 MB of
> > > proportional throttling, and then full throttling starts at 16 MB.
> > > With 2 MB folios, the 4 MB zone between freerun and full throttling
> > doesn't leave that much room for the balance_dirty_pages() logic to
> > adjust the dirtier's speed, which I think causes the writes to
> > oscillate between freerunning and then being fully (overly) throttled.
> >
> > I think this is also going to be a problem for cgroups with large
> > folios since they also, as I understand it, are constrained with a
> > limited / tight dirty budget. I ran some initial benchmarks with
> > cgroup memory constraints on NVMe and saw similar instability (a
> > single writer in a 8 GB cgroup had max write latencies of 6 seconds vs
> > 15 ms without the cgroup, with the balance_dirty_pages() throttling
> > oscillating rather than settling near the set point).
> >
> > I think this problem gets untenable for random writes with large
> > folios, since dirtying just a few bytes will charge the whole folio
> > size to the dirty budget. I have a patchset from last year for adding
> > more granular dirty/writeback tracking [3], I'm going to pick this
> > series back up. I think it will be useful generically, not just for
> > fuse.
> >
> > For getting this to work on fuse servers with strictlimiting, I think
> > the next steps are to
> > a) as Jan had suggested in [1], come up with some heuristic to
> > constrain the max order supported for large folios for these fuse
> > servers if they're running with the writeback cache enabled
> > b) benchmark ^ and if there are still regressions, then we should
> > probably just turn large folios off for these servers
> > c) add the granular writeback/dirty accounting for large folios
> > d) look into improving the balance_dirty_pages() throttling logic to
> > handle narrow gaps between the freerun and full throttling zones
> > better and reduce over-throttling
> >
> > Does this sound like a reasonable way forward?
>
> Sounds good to me, since we have seen pretty much the same when we enabled
> large folios for testing.
>
> >
> > For privileged servers, I still think it makes sense to remove the
> > strictlimiting requirement or at the least, let admins opt out of that
> > if they are confident their server is well-behaved.
> >
>
> Here I'm not really sure what the most logical and sane way would be.
> I really don't like limits for no reason but I understand the necessity
> to have limits enabled for unprivileged servers.
>
> Do you think a module parameter is the right way to go here?
> The connection parameter might be a problem since an admin would have
> to set it for a large number of mounts.
>

Not saying no to a module option, but I am curious.
This admin opt-in per mount could easily be implemented in userspace
in libfuse and/or a mount helper.

For such a niche ("I know what I am doing") feature, why does it matter
to do this in the kernel module params?

Thanks,
Amir.


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2026-05-27  1:42           ` Joanne Koong
  2026-05-27  5:57             ` Horst Birthelmer
@ 2026-05-27 12:25             ` Miklos Szeredi
  2026-05-27 23:32               ` Joanne Koong
  2026-05-28 12:34             ` Jan Kara
  2026-05-30  2:15             ` Joanne Koong
  3 siblings, 1 reply; 20+ messages in thread
From: Miklos Szeredi @ 2026-05-27 12:25 UTC (permalink / raw)
  To: Joanne Koong; +Cc: linux-fsdevel, kernel-team, fuse-devel, Jan Kara, Jingbo Xu

On Wed, 27 May 2026 at 03:42, Joanne Koong <joannelkoong@gmail.com> wrote:

> On an 8 GB machine (assuming vm.dirty_ratio=20% and
> vm.dirty_background_ratio=10%), we get 12 MB of freerun, 4 MB of
> proportional throttling, and then full throttling starts at 16 MB.
> With 2 MB folios, the 4 MB zone between freerun and full throttling
> doesn't leave that much room for the balance_dirty_pages() logic to
> adjust the dirtier's speed, which I think causes the writes to
> oscillate between freerunning and then being fully (overly) throttled.

Some handwaving: would it be possible for the folio alloc logic to
modulate the folio size based on how close we are to the dirty
threshold?

E.g. if the number of folios to reach the dirty limit is less than 10,
then start allocating smaller folios.
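
A rough sketch of that idea, in Python rather than kernel C, purely to
make the arithmetic concrete. pick_folio_order(), its parameters, and
the 4 KB base page size are hypothetical; only the threshold of 10
folios comes from the example above. Nothing here corresponds to an
existing kernel interface.

```python
PAGE_SIZE = 4096  # assumed base page size in bytes
MB = 1 << 20


def pick_folio_order(dirty_bytes, limit_bytes, max_order, min_folios=10):
    """Return the largest folio order (<= max_order) for which at least
    `min_folios` folios of that size still fit below the dirty limit.

    Hypothetical helper: it illustrates the heuristic only.
    """
    headroom = max(limit_bytes - dirty_bytes, 0)
    order = max_order
    # Halve the folio size until enough folios fit in the headroom,
    # bottoming out at order 0 (a single base page).
    while order > 0 and headroom // (PAGE_SIZE << order) < min_folios:
        order -= 1
    return order


# 16 MB limit, nothing dirty yet: only 8 folios of 2 MB fit, so drop to 1 MB.
print(pick_folio_order(0, 16 * MB, max_order=9))        # 8
# 12 MB already dirty: 4 MB of headroom, so drop to 256 KB folios.
print(pick_folio_order(12 * MB, 16 * MB, max_order=9))  # 6
```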

> I think this problem gets untenable for random writes with large
> folios, since dirtying just a few bytes will charge the whole folio
> size to the dirty budget. I have a patchset from last year for adding
> more granular dirty/writeback tracking [3], I'm going to pick this
> series back up. I think it will be useful generically, not just for
> fuse.

Tracking partial dirty is not going to help free up those large
folios, unless the dirty part is migrated to a smaller folio.  Not
sure if it's worth the added complexity, maybe...

> For getting this to work on fuse servers with strictlimiting, I think
> the next steps are to
> a) as Jan had suggested in [1], come up with some heuristic to
> constrain the max order supported for large folios for these fuse
> servers if they're running with the writeback cache enabled
> b) benchmark ^ and if there are still regressions, then we should
> probably just turn large folios off for these servers

Definitely worth exploring the characteristics of the current system.

> c) add the granular writeback/dirty accounting for large folios
> d) look into improving the balance_dirty_pages() throttling logic to
> handle narrow gaps between the freerun and full throttling zones
> better and reduce over-throttling
>
> Does this sound like a reasonable way forward?

Yeah, makes sense.

> For privileged servers, I still think it makes sense to remove the
> strictlimiting requirement or at the least, let admins opt out of that
> if they are confident their server is well-behaved.

This is already possible via the /sys/class/bdi interface.

I'm fine with introducing a config option / module option to turn it
off by default.

Thanks,
Miklos


* Re: Re: [PATCH] fuse: disable default bdi strictlimiting
  2026-05-27 10:59               ` Amir Goldstein
@ 2026-05-27 22:40                 ` Joanne Koong
  0 siblings, 0 replies; 20+ messages in thread
From: Joanne Koong @ 2026-05-27 22:40 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Horst Birthelmer, Miklos Szeredi, linux-fsdevel, kernel-team,
	fuse-devel, Jan Kara, Jingbo Xu

On Wed, May 27, 2026 at 3:59 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, May 27, 2026 at 8:06 AM Horst Birthelmer <horst@birthelmer.de> wrote:
> >
> > On Tue, May 26, 2026 at 06:42:35PM -0700, Joanne Koong wrote:
> > >
> > > For getting this to work on fuse servers with strictlimiting, I think
> > > the next steps are to
> > > a) as Jan had suggested in [1], come up with some heuristic to
> > > constrain the max order supported for large folios for these fuse
> > > servers if they're running with the writeback cache enabled
> > > b) benchmark ^ and if there are still regressions, then we should
> > > probably just turn large folios off for these servers
> > > c) add the granular writeback/dirty accounting for large folios
> > > d) look into improving the balance_dirty_pages() throttling logic to
> > > handle narrow gaps between the freerun and full throttling zones
> > > better and reduce over-throttling
> > >
> > > Does this sound like a reasonable way forward?
> >
> > Sounds good to me, since we have seen pretty much the same when we enabled
> > large folios for testing.

Awesome, glad to hear we've been seeing the same things.

> >
> > >
> > > For privileged servers, I still think it makes sense to remove the
> > > strictlimiting requirement or at the least, let admins opt out of that
> > > if they are confident their server is well-behaved.
> > >
> >
> > Here I'm not really sure what the most logical and sane way would be.
> > I really don't like limits for no reason but I understand the necessity
> > to have limits enabled for unprivileged servers.
> >
> > Do you think a module parameter is the right way to go here?
> > The connection parameter might be a problem since an admin would have
> > to set it for a large number of mounts.
> >
>
> Not saying no to a module option, but I am curious.
> This admin opt-in per mount could easily be implemented in userspace
> in libfuse and/or a mount helper.
>
> For such a niche ("I know what I am doing") feature, why does it matter
> to do this in the kernel module params?
>

I'm hoping we're able to keep things as simple as possible by just
getting rid of strictlimiting by default for privileged servers on the
kernel side, without needing any module parameters or libfuse changes,
but I'm not sure if Miklos is comfortable with that. imo if the server
is privileged, it can be trusted to not misbehave and should be given
the same kind of capabilities as a local filesystem. But I guess there
might be the possibility of user regressions if it turns out people's
privileged fuse servers are accidentally buggy.

I think the module parameter makes sense if admins trust unprivileged
servers enough to turn off strictlimiting globally. Otherwise, I think
Miklos just mentioned there's already a /sys/class/bdi interface that
lets admins disable strictlimit on a per-mount basis, so I think it
would be cleaner (though more annoying on the userspace side :)) if
libfuse used this to turn it off by default. The max supported folio
order is set per-inode when the inode is first looked up, not at mount
time, so the admin can disable strictlimit via sysfs after mount
(which is when the bdi sysfs entry gets created) and new inodes will
pick up large folio support from there.
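
To make the sysfs route concrete, here is a minimal sketch of what a
mount helper could do after mounting. The helper name and its
parameters are made up for illustration. It assumes the bdi directory
of a plain fuse mount is named "<major>:<minor>" after the mount's
device number (fuseblk mounts may be named differently), and that
writing 0 to strict_limit clears strictlimiting. On a real system this
needs root.

```python
import os


def disable_strict_limit(mountpoint, sysfs_bdi_root="/sys/class/bdi"):
    """Write 0 to the strict_limit attribute of the bdi backing `mountpoint`.

    Hypothetical helper. Assumes the bdi is named "<major>:<minor>"
    after the device number reported by stat() on the mountpoint.
    Returns the sysfs path that was written.
    """
    dev = os.stat(mountpoint).st_dev
    bdi_name = f"{os.major(dev)}:{os.minor(dev)}"
    path = os.path.join(sysfs_bdi_root, bdi_name, "strict_limit")
    with open(path, "w") as f:
        f.write("0")
    return path
```

Usage would be something like disable_strict_limit("/mnt/myfuse") right
after the mount completes, so that inodes looked up afterwards see the
bdi without strictlimit.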

Thanks,
Joanne

> Thanks,
> Amir.


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2026-05-27 12:25             ` Miklos Szeredi
@ 2026-05-27 23:32               ` Joanne Koong
  0 siblings, 0 replies; 20+ messages in thread
From: Joanne Koong @ 2026-05-27 23:32 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, kernel-team, fuse-devel, Jan Kara, Jingbo Xu

On Wed, May 27, 2026 at 5:25 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, 27 May 2026 at 03:42, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > On an 8 GB machine (assuming vm.dirty_ratio=20% and
> > vm.dirty_background_ratio=10%), we get 12 MB of freerun, 4 MB of
> > proportional throttling, and then full throttling starts at 16 MB.
> > With 2 MB folios, the 4 MB zone between freerun and full throttling
> > doesn't leave that much room for the balance_dirty_pages() logic to
> > adjust the dirtier's speed, which I think causes the writes to
> > oscillate between freerunning and then being fully (overly) throttled.
>
> Some handwaving: would it be possible for the folio alloc logic to
> modulate the folio size based on how close we are to the dirty
> threshold?
>
> E.g. if the number of folios to reach the dirty limit is less than 10,
> then start allocating smaller folios.

I don't think this would help for the case where the writes are to
existing large folios already resident in the page cache, but I can do
some experimentation with this.

>
> > I think this problem gets untenable for random writes with large
> > folios, since dirtying just a few bytes will charge the whole folio
> > size to the dirty budget. I have a patchset from last year for adding
> > more granular dirty/writeback tracking [3], I'm going to pick this
> > series back up. I think it will be useful generically, not just for
> > fuse.
>
> Tracking partial dirty is not going to help free up those large
> folios, unless the dirty part is migrated to a smaller folio.  Not
> sure if it's worth the added complexity, maybe...

That's a good point... the whole point of throttling is to cap the
number of pages pinned in RAM...

>
> > For getting this to work on fuse servers with strictlimiting, I think
> > the next steps are to
> > a) as Jan had suggested in [1], come up with some heuristic to
> > constrain the max order supported for large folios for these fuse
> > servers if they're running with the writeback cache enabled
> > b) benchmark ^ and if there are still regressions, then we should
> > probably just turn large folios off for these servers
>
> Definitely worth exploring the characteristics of the current system.
>
> > c) add the granular writeback/dirty accounting for large folios
> > d) look into improving the balance_dirty_pages() throttling logic to
> > handle narrow gaps between the freerun and full throttling zones
> > better and reduce over-throttling
> >
> > Does this sound like a reasonable way forward?
>
> Yeah, makes sense.
>
> > For privileged servers, I still think it makes sense to remove the
> > strictlimiting requirement or at the least, let admins opt out of that
> > if they are confident their server is well-behaved.
>
> This is already possible via the /sys/class/bdi interface.

Nice, I think in that case we can just have libfuse disable this on
behalf of the privileged server.

Thanks,
Joanne

>
> I'm fine with introducing a config option / module option to turn it
> off by default.
>
> Thanks,
> Miklos


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2026-05-27  1:42           ` Joanne Koong
  2026-05-27  5:57             ` Horst Birthelmer
  2026-05-27 12:25             ` Miklos Szeredi
@ 2026-05-28 12:34             ` Jan Kara
  2026-05-28 22:11               ` Joanne Koong
  2026-05-30  2:15             ` Joanne Koong
  3 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2026-05-28 12:34 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, linux-fsdevel, kernel-team, fuse-devel, Jan Kara,
	Jingbo Xu

On Tue 26-05-26 18:42:35, Joanne Koong wrote:
> On Tue, May 12, 2026 at 1:56 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Fri, May 8, 2026 at 2:42 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Mon, 27 Oct 2025 at 23:39, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > Miklos, could you share your thoughts on this? Are you in favor of
> > > > disabling default strictlimiting? Or do you prefer to have it kept
> > > > enabled by default, with some mount option or sysctl added for
> > > > privileged servers to be able to disable strictlimiting + enable large
> > > > folios if they use the writeback cache?
> > >
> > > So what I think we should do is implement some sort of slow writer
> > > test, and see what happens with and without strictlimit.
> > >
> > > Tried to ask claude to do this for me, but not getting very far.
> > >
> > > So if I take this maintainership role seriously and not let myself
> > > drown in the details, then the logical thing to do is to delegate ;)
> > > Which is hard (for me at least) but I'll give it a try...
> > >
> > > Could you please check how things change if there's limited writeback
> > > rate and we disable strictlimit?  And what happens if there are
> > > several such instances running in parallel?
> >
> > > I think for unprivileged fuse servers, strictlimiting will always need
> > to be enabled or else a malicious user can launch tons of unprivileged
> > servers and eat up the global dirty page budget / starve writeback for
> > the rest of the system. Similarly for privileged servers, it could be
> > unintentionally slow or buggy and eat up the dirty page budget. I'll
> > read through the writeback throttling code to verify this and run some
> > local tests.
> 
> I read through the writeback throttling code and re-read Jan's very
> helpful comments from this thread last year [1]. So for unprivileged
> servers, I think we definitely cannot remove strictlimiting. If the
> fuse server is slow or unresponsive with writing back the pages, it
> will take up too much of the global dirty budget which will degrade
> write throughput for other filesystems (their throttling will be
> computed against the global dirty page count, eg the freerunning check
> in balance_dirty_pages() and the pos_ratio calculation "pos_ratio =
> pos_ratio_polynom(setpoint, dtc->dirty, limit)" (dtc->dirty is the
> global dirty page count)) and any fuse stuck dirty pages are
> essentially unreclaimable. Without strictlimiting, there will be no
> hard cap on how many dirty pages a misbehaving server can accumulate.

Yes.

> With strictlimiting on and large folios enabled, the problem is that
> the large folio size can potentially dwarf the server's dirty budget,
> which can lead to excessive throttling. When I ran my benchmarks last
> year, I and independently Jingbo saw severe performance regressions
> for buffered writes with large folios (eg 2 GB/s BW w/o, and 200 MB/s
> BW w/) [2] but I think that might have been because the machines had
> limited RAM, resulting in a very small dirty budget. Fuse sets the max
> ratio of the bdi to 1% of the global dirty threshold, so running
> through some napkin math:
> 
> On a 64 GB machine:
>   - DirtyThresh = 20% of 64 GB = 12.8 GB
>   - BdiDirtyThresh = 12.8 GB / 100 = 128 MB
>   - 128 MB / 2 MB folio = 64 dirty folios
> 
> On a 32 GB machine:
>   - BdiDirtyThresh = 64 MB
>   - 32 dirty folios
> 
> On an 8 GB machine:
>   - BdiDirtyThresh = 16 MB
>   - 8 dirty folios
> 
> On an 8 GB machine (assuming vm.dirty_ratio=20% and
> vm.dirty_background_ratio=10%), we get 12 MB of freerun, 4 MB of
> proportional throttling, and then full throttling starts at 16 MB.
> With 2 MB folios, the 4 MB zone between freerun and full throttling
> doesn't leave that much room for the balance_dirty_pages() logic to
> adjust the dirtier's speed, which I think causes the writes to
> oscillate between freerunning and then being fully (overly) throttled.

Thanks for spelling this out. Even with 64 dirty folios effective limit I
think the throttling code will be too unstable. In general I think we
should aim at having at least ~512 (gut feeling estimate ;) folios in a
dirty limit to have a decent wiggle room for the throttling algorithm.
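
For illustration, the napkin math above and the ~512 folio target can
be put side by side. This is only a sketch: the function names are
made up, 1 GB is taken as 1000 MB to match the numbers quoted above,
a 4 KB base page is assumed, and the real threshold is derived from
dirtyable memory rather than total RAM.

```python
PAGE_KB = 4  # assumed base page size in KB


def bdi_dirty_thresh_mb(ram_gb, dirty_ratio=20, bdi_max_ratio=1):
    """Napkin-math bdi dirty threshold in MB (1 GB taken as 1000 MB)."""
    global_thresh_mb = ram_gb * 1000 * dirty_ratio // 100
    return global_thresh_mb * bdi_max_ratio // 100


def max_order_for(thresh_mb, min_folios=512):
    """Largest folio order such that `min_folios` folios fit in thresh_mb."""
    max_folio_kb = thresh_mb * 1024 // min_folios
    if max_folio_kb < PAGE_KB:
        return 0
    # floor(log2(pages per folio)) without going through floats
    return (max_folio_kb // PAGE_KB).bit_length() - 1


for ram_gb in (64, 32, 8):
    thresh = bdi_dirty_thresh_mb(ram_gb)
    print(ram_gb, "GB:", thresh, "MB,", thresh // 2, "folios of 2 MB,",
          "max order for 512 folios:", max_order_for(thresh))
```

Under these assumptions a 128 MB limit would allow at most order 6
(256 KB folios), and a 16 MB limit at most order 3 (32 KB folios).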

> I think this is also going to be a problem for cgroups with large
> folios since they also, as I understand it, are constrained with a
> limited / tight dirty budget. I ran some initial benchmarks with
> cgroup memory constraints on NVMe and saw similar instability (a
> single writer in a 8 GB cgroup had max write latencies of 6 seconds vs
> 15 ms without the cgroup, with the balance_dirty_pages() throttling
> oscillating rather than settling near the set point).

Ok, so this is more folios (819) than my 512 gut feeling estimate :) What
was the write throughput of the NVMe drive? The high drive throughput also
requires more dirty data to keep the drive saturated so that writeback
throughput doesn't oscillate too much.
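
The 819 figure presumably comes from applying a 20% dirty ratio to the
8 GB memcg and dividing by the 2 MB folio size, with 1 GB taken as
1024 MB. A quick check under that assumption (the 20% ratio is the
value assumed earlier in the thread, not something measured here):

```python
memcg_mb = 8 * 1024                    # 8 GB cgroup, in MB
dirty_limit_mb = memcg_mb * 20 / 100   # assumed vm.dirty_ratio=20%
folios = int(dirty_limit_mb // 2)      # whole 2 MB folios in the limit
print(folios)                          # 819
```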

> I think this problem gets untenable for random writes with large
> folios, since dirtying just a few bytes will charge the whole folio
> size to the dirty budget. I have a patchset from last year for adding
> more granular dirty/writeback tracking [3], I'm going to pick this
> series back up. I think it will be useful generically, not just for
> fuse.
> 
> For getting this to work on fuse servers with strictlimiting, I think
> the next steps are to
> a) as Jan had suggested in [1], come up with some heuristic to
> constrain the max order supported for large folios for these fuse
> servers if they're running with the writeback cache enabled
> b) benchmark ^ and if there are still regressions, then we should
> probably just turn large folios off for these servers

Yes, as we briefly discussed at LSF, picking a proper max folio order is
useful not only for dirty throttling but also for memory reclaim in
general in memory-constrained environments. Strictlimit is just one of the
cases where the limit is very small and so the problems with the memory
overhead of larger folios become very apparent.

So although we may decide on a FUSE-specific heuristic for setting the maximum
folio order, you rightly point out that memory-limited memcgs will see similar
issues, so I think a more general solution for selecting a sensible max order
would be better.

> c) add the granular writeback/dirty accounting for large folios

As Miklos pointed out (and I think we also discussed this last year), since
the full folio is pinned in memory by a single dirty fs block (or whatever
your unit is going to be), finer-grained accounting alone doesn't help you
with the narrow limits. We'd also have to add logic to decide when to
split large folios.
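
A small sketch of the gap between what granular accounting would charge
and what actually stays pinned. The function name and the 4 KB tracking
unit are assumptions for illustration; the 2 MB folio size is the one
used in the napkin math above.

```python
FOLIO_BYTES = 2 * 1024 * 1024   # 2 MB large folio
BLOCK_BYTES = 4096              # assumed tracking granularity


def dirty_footprint(write_offsets):
    """Return (granular_bytes, pinned_bytes) for small writes at the
    given file offsets, each dirtying a single block.

    granular_bytes: what block-level dirty accounting would charge.
    pinned_bytes:   memory held by the folios containing those blocks.
    """
    blocks = {off // BLOCK_BYTES for off in write_offsets}
    folios = {off // FOLIO_BYTES for off in write_offsets}
    return len(blocks) * BLOCK_BYTES, len(folios) * FOLIO_BYTES


# Eight tiny writes, each landing in a different 2 MB folio.
offsets = [i * FOLIO_BYTES for i in range(8)]
print(dirty_footprint(offsets))  # (32768, 16777216)
```

So 32 KB of accounted dirty data can still pin 16 MB of page cache,
which is the whole bdi budget in the 8 GB example.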

> d) look into improving the balance_dirty_pages() throttling logic to
> handle narrow gaps between the freerun and full throttling zones
> better and reduce over-throttling

Sure, this is always a possibility but this is easier said than done :) If
you get very coarse-grained signals, there's only so much you can do to
smooth out the behavior for the application. But the 8 GB memcg case is
certainly something I'd expect to work reasonably smoothly, so investigating
the erratic behavior there might be a good start.
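
To put a number on "coarse-grained": a simplified model of the zones,
taking the freerun ceiling as the midpoint of the background and hard
thresholds. The function names are hypothetical and the real
balance_dirty_pages() logic has more inputs than this.

```python
def throttle_zones_mb(thresh_mb, bg_thresh_mb):
    """Return (freerun_ceiling, proportional_zone) in MB.

    Simplified: freerun ends at the midpoint of the background and
    hard thresholds, and proportional throttling covers the rest.
    """
    freerun = (thresh_mb + bg_thresh_mb) / 2
    return freerun, thresh_mb - freerun


def steps_in_zone(zone_mb, folio_mb):
    """Number of whole folios that fit in the proportional zone."""
    return int(zone_mb // folio_mb)


freerun, zone = throttle_zones_mb(16, 8)   # 8 GB machine, 1% bdi limit
print(freerun, zone)                       # 12.0 4.0
print(steps_in_zone(zone, 2))              # 2 steps with 2 MB folios
print(steps_in_zone(zone, 4 / 1024))       # 1024 steps with 4 KB pages
```

With only two folio-sized steps between freerun and the hard limit,
the throttling code gets very few intermediate data points to react to.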

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2026-05-28 12:34             ` Jan Kara
@ 2026-05-28 22:11               ` Joanne Koong
  2026-05-30 11:04                 ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Joanne Koong @ 2026-05-28 22:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Miklos Szeredi, linux-fsdevel, kernel-team, fuse-devel, Jingbo Xu

On Thu, May 28, 2026 at 5:34 AM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 26-05-26 18:42:35, Joanne Koong wrote:
> > On Tue, May 12, 2026 at 1:56 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Fri, May 8, 2026 at 2:42 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > On Mon, 27 Oct 2025 at 23:39, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > Miklos, could you share your thoughts on this? Are you in favor of
> > > > > disabling default strictlimiting? Or do you prefer to have it kept
> > > > > enabled by default, with some mount option or sysctl added for
> > > > > privileged servers to be able to disable strictlimiting + enable large
> > > > > folios if they use the writeback cache?
> > > >
> > > > So what I think we should do is implement some sort of slow writer
> > > > test, and see what happens with and without strictlimit.
> > > >
> > > > Tried to ask claude to do this for me, but not getting very far.
> > > >
> > > > So if I take this maintainership role seriously and not let myself
> > > > drown in the details, then the logical thing to do is to delegate ;)
> > > > Which is hard (for me at least) but I'll give it a try...
> > > >
> > > > Could you please check how things change if there's limited writeback
> > > > rate and we disable strictlimit?  And what happens if there are
> > > > several such instances running in parallel?
> > >
> > > > I think for unprivileged fuse servers, strictlimiting will always need
> > > to be enabled or else a malicious user can launch tons of unprivileged
> > > servers and eat up the global dirty page budget / starve writeback for
> > > the rest of the system. Similarly for privileged servers, it could be
> > > unintentionally slow or buggy and eat up the dirty page budget. I'll
> > > read through the writeback throttling code to verify this and run some
> > > local tests.
> >
> > I read through the writeback throttling code and re-read Jan's very
> > helpful comments from this thread last year [1]. So for unprivileged
> > servers, I think we definitely cannot remove strictlimiting. If the
> > fuse server is slow or unresponsive with writing back the pages, it
> > will take up too much of the global dirty budget which will degrade
> > write throughput for other filesystems (their throttling will be
> > computed against the global dirty page count, eg the freerunning check
> > in balance_dirty_pages() and the pos_ratio calculation "pos_ratio =
> > pos_ratio_polynom(setpoint, dtc->dirty, limit)" (dtc->dirty is the
> > global dirty page count)) and any fuse stuck dirty pages are
> > essentially unreclaimable. Without strictlimiting, there will be no
> > hard cap on how many dirty pages a misbehaving server can accumulate.
>
> Yes.
>
> > With strictlimiting on and large folios enabled, the problem is that
> > the large folio size can potentially dwarf the server's dirty budget,
> > which can lead to excessive throttling. When I ran my benchmarks last
> > year, I and independently Jingbo saw severe performance regressions
> > for buffered writes with large folios (eg 2 GB/s BW w/o, and 200 MB/s
> > BW w/) [2] but I think that might have been because the machines had
> > limited RAM, resulting in a very small dirty budget. Fuse sets the max
> > ratio of the bdi to 1% of the global dirty threshold, so running
> > through some napkin math:
> >
> > On a 64 GB machine:
> >   - DirtyThresh = 20% of 64 GB = 12.8 GB
> >   - BdiDirtyThresh = 12.8 GB / 100 = 128 MB
> >   - 128 MB / 2 MB folio = 64 dirty folios
> >
> > On a 32 GB machine:
> >   - BdiDirtyThresh = 64 MB
> >   - 32 dirty folios
> >
> > On an 8 GB machine:
> >   - BdiDirtyThresh = 16 MB
> >   - 8 dirty folios
> >
> > On an 8 GB machine (assuming vm.dirty_ratio=20% and
> > vm.dirty_background_ratio=10%), we get 12 MB of freerun, 4 MB of
> > proportional throttling, and then full throttling starts at 16 MB.
> > With 2 MB folios, the 4 MB zone between freerun and full throttling
> > doesn't leave that much room for the balance_dirty_pages() logic to
> > adjust the dirtier's speed, which I think causes the writes to
> > oscillate between freerunning and then being fully (overly) throttled.
>
> Thanks for spelling this out. Even with 64 dirty folios effective limit I
> think the throttling code will be too unstable. In general I think we
> should aim at having at least ~512 (gut feeling estimate ;) folios in a
> dirty limit to have a decent wiggle room for the throttling algorithm.

Awesome, I'll do some benchmarking on this and gather some data.
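
For reference, the napkin math quoted above can be reproduced with a
small sketch. It is not kernel code. It assumes vm.dirty_ratio=20,
vm.dirty_background_ratio=10, the 1% fuse bdi max ratio, 2 MB folios,
the thread's rounding of 1 GB = 1000 MB, and that the ratios apply to
total RAM (the kernel actually applies them to dirtyable memory). The
freerun ceiling is taken as the midpoint of the background and dirty
thresholds. The function name bdi_budget() is made up for illustration.

```python
def bdi_budget(ram_gb, dirty_ratio=20, background_ratio=10,
               bdi_max_ratio=1, folio_mb=2):
    """Rough per-bdi dirty budget under strictlimit, in MB and folios."""
    ram_mb = ram_gb * 1000  # thread's rounding: 1 GB = 1000 MB
    # Per-bdi share of the global dirty and background thresholds.
    thresh_mb = ram_mb * dirty_ratio * bdi_max_ratio / 10000
    bg_mb = ram_mb * background_ratio * bdi_max_ratio / 10000
    # Below this point the dirtier is not throttled at all.
    freerun_mb = (thresh_mb + bg_mb) / 2
    return {
        "bdi_thresh_mb": thresh_mb,
        "freerun_mb": freerun_mb,
        "throttle_zone_mb": thresh_mb - freerun_mb,
        "folios": int(thresh_mb // folio_mb),
    }


for ram_gb in (64, 32, 8):
    print(ram_gb, "GB:", bdi_budget(ram_gb))

# RAM needed so that the bdi budget holds ~512 folios of 2 MB each,
# i.e. the gut-feeling target mentioned above.
target_mb = 512 * 2
print("RAM for ~512 folios:", target_mb * 10000 / (20 * 1) / 1000, "GB")
```

Under these assumptions the 8 GB case gives a 16 MB threshold with only
a 4 MB proportional-throttling zone, i.e. two 2 MB folios of headroom.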

>
> > I think this is also going to be a problem for cgroups with large
> > folios since they also, as I understand it, are constrained with a
> > limited / tight dirty budget. I ran some initial benchmarks with
> > cgroup memory constraints on NVMe and saw similar instability (a
> > single writer in a 8 GB cgroup had max write latencies of 6 seconds vs
> > 15 ms without the cgroup, with the balance_dirty_pages() throttling
> > oscillating rather than settling near the set point).
>
> Ok, so this is more folios (819) than my 512 gut feeling estimate :) What
> was the write throughput of the NVMe drive? The high drive throughput also
> requires more dirty data to keep the drive saturated so that writeback
> throughput doesn't oscillate too much.

The write throughput of the NVMe drive I was using was ~1.1 GB/s
(measured by running direct I/O). With the 1.6 GB dirty budget, the
math works out to the device draining it in ~1.5 secs. The performance
I was seeing with the initial cgroup benchmarks was with 4k pages (no
large folios enabled) on btrfs.
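
The numbers above can be checked with a short sketch. It assumes the
1.6 GB budget is 20% of the 8 GB cgroup, and uses binary units for the
unit counts (which is what yields the 819 figure for 2 MB folios). The
helper names are made up for illustration.

```python
def drain_seconds(dirty_budget_gb, write_bw_gb_per_s):
    """Time for the device to write out a full dirty budget."""
    return dirty_budget_gb / write_bw_gb_per_s


def units_in_budget(dirty_budget_gb, unit_kb):
    """How many pages/folios of unit_kb fit in the dirty budget."""
    return int(dirty_budget_gb * 1024 * 1024 // unit_kb)


budget_gb = 8 * 20 / 100  # 20% of an 8 GB cgroup = 1.6 GB

print("drain time (s):", round(drain_seconds(budget_gb, 1.1), 2))
print("2 MB folios in budget:", units_in_budget(budget_gb, 2048))
print("4 KB pages in budget:", units_in_budget(budget_gb, 4))
```

Since the benchmark ran with 4k pages, the budget held several hundred
thousand accounting units, so the instability seen there was not caused
by the budget being only a few folios wide.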

>
> > I think this problem gets untenable for random writes with large
> > folios, since dirtying just a few bytes will charge the whole folio
> > size to the dirty budget. I have a patchset from last year for adding
> > more granular dirty/writeback tracking [3], I'm going to pick this
> > series back up. I think it will be useful generically, not just for
> > fuse.
> >
> > For getting this to work on fuse servers with strictlimiting, I think
> > the next steps are to
> > a) as Jan had suggested in [1], come up with some heuristic to
> > constrain the max order supported for large folios for these fuse
> > servers if they're running with the writeback cache enabled
> > b) benchmark ^ and if there are still regressions, then we should
> > probably just turn large folios off for these servers
>
> Yes, as we briefly discussed at LSF, picking proper max folio order is
> useful not only for dirty throttling but also for memory reclaim in
> general in memory-constrained environments. Strictlimit is just one of the
> cases where the limit is very small and so the problems with the memory
> overhead of larger folios become very apparent.
>
> So although we may decide for a FUSE specific heuristic setting the maximum
> folio order, you properly point out memory limited memcgs will see similar
> issues so I think a more general solution for selecting sensible max order
> would be better.

This sounds great. I'll run some more benchmarks to get more data and
then come up with a general solution and submit that as an rfc.

>
> > c) add the granular writeback/dirty accounting for large folios
>
> As Miklos pointed out and I think we've discussed that also last year since
> the full folio is pinned in memory by a single dirty fs block (or whatever
> is going to be your unit), finer grained accounting alone doesn't help you
> with the narrow limits. We'd also have to add logic to decide about
> splitting large folios.

If I'm inferring this right (and please let me know if I'm not), it
seems like it would be extremely useful to get the infrastructure
working better for splitting file-backed large folios. That seems like
the proper long-term solution if I'm understanding it correctly (eg
instead of capping the folio order statically across the board, larger
folios can be used until there's memory pressure that splits it or if
at write time we detect the dirty budget is tight, that will allow the
clean parts to be reclaimable and reduce memory pressure and
throttling)?

From what I see in __folio_split() in mm/huge_memory.c, it's not
currently possible to split dirty file-backed folios (eg
iomap_release_folio() will not release the folio if any part of it is
dirty), but I think we can get this working in the iomap layer for
supporting the per-folio dirty/uptodate state tracking for the split
folio instead of rejecting the release. Iomap already tracks which
subfolios are dirty vs not, so we have the information upfront as well
about how to split the dirty folio optimally. Do you think this would
be useful to get working?
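
To make the tradeoff concrete, here is a toy sketch (not kernel or
iomap code) comparing what gets charged or pinned for one partially
dirty folio under three schemes: whole-folio accounting, per-block
accounting, and splitting the folio into smaller chunks. The block
size, chunk size, dirty block indices and function names are all
hypothetical.

```python
def dirty_charge(folio_bytes, block_bytes, dirty_blocks):
    """Return (whole-folio charge, per-block charge) in bytes."""
    nblocks = folio_bytes // block_bytes
    assert all(0 <= b < nblocks for b in dirty_blocks)
    whole = folio_bytes if dirty_blocks else 0
    granular = len(set(dirty_blocks)) * block_bytes
    return whole, granular


def dirty_after_split(block_bytes, dirty_blocks, chunk_bytes):
    """Bytes still held by dirty folios after splitting into chunks."""
    blocks_per_chunk = chunk_bytes // block_bytes
    # A chunk stays dirty (and unreclaimable) if it has any dirty block.
    dirty_chunks = {b // blocks_per_chunk for b in dirty_blocks}
    return len(dirty_chunks) * chunk_bytes


FOLIO = 2 * 1024 * 1024   # 2 MB folio
BLOCK = 4 * 1024          # 4 KB tracking unit
dirty = [0, 300]          # two small random writes

whole, granular = dirty_charge(FOLIO, BLOCK, dirty)
print("whole-folio charge:", whole)
print("per-block charge:  ", granular)
print("pinned after 64 KB split:", dirty_after_split(BLOCK, dirty, 64 * 1024))
```

Per-block accounting shrinks the charged amount, but the full 2 MB
stays pinned until writeback unless the folio is actually split, which
is the point raised above about finer accounting alone not helping.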

Thanks,
Joanne


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2026-05-27  1:42           ` Joanne Koong
                               ` (2 preceding siblings ...)
  2026-05-28 12:34             ` Jan Kara
@ 2026-05-30  2:15             ` Joanne Koong
  3 siblings, 0 replies; 20+ messages in thread
From: Joanne Koong @ 2026-05-30  2:15 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, kernel-team, fuse-devel, Jan Kara, Jingbo Xu

On Tue, May 26, 2026 at 6:42 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> I read through the writeback throttling code and re-read Jan's very
> helpful comments from this thread last year [1]. So for unprivileged
> servers, I think we definitely cannot remove strictlimiting. If the
> fuse server is slow or unresponsive with writing back the pages, it
> will take up too much of the global dirty budget which will degrade
> write throughput for other filesystems (their throttling will be
> computed against the global dirty page count, eg the freerunning check
> in balance_dirty_pages() and the pos_ratio calculation "pos_ratio =
> pos_ratio_polynom(setpoint, dtc->dirty, limit)" (dtc->dirty is the
> global dirty page count)) and any fuse stuck dirty pages are
> essentially unreclaimable. Without strictlimiting, there will be no
> hard cap on how many dirty pages a misbehaving server can accumulate.
>
> With strictlimiting on and large folios enabled, the problem is that
> the large folio size can potentially dwarf the server's dirty budget,
> which can lead to excessive throttling. When I ran my benchmarks last
> year, I and independently Jingbo saw severe performance regressions
> for buffered writes with large folios (eg 2 GB/s BW w/o, and 200 MB/s
> BW w/) [2] but I think that might have been because the machines had
> limited RAM, resulting in a very small dirty budget.
>

Update: this is no longer a problem. The large folio buffered write
performance regressions I saw last year were fixed by commit
494d2f508883 ('fuse: use default writeback accounting') [1], which
landed last August.

Prior to that commit, on a 2 GB VM, I was seeing for buffered writes:
No large folios:  840 MB/s
Large folios: 270 MB/s

After that commit, I now see around a 30% improvement compared to baseline:
Large folios: 1130 MB/s

Prior to commit 494d2f508883, fuse was doing its own custom writeback
accounting, but that accounting wasn't updating wb->writeback_inodes
(which tracks when fuse inodes enter/finish writeback) or
wb->avg_write_bandwidth (which estimates the device write bandwidth).
As a result, the balance_dirty_pages() throttling logic thought the
device was super slow, since it was using stale, underestimated values
for how fast the device can drain dirty pages. This caused
overthrottling based on the wrong estimates. With large folios this
became a noticeably big issue because each folio charges more dirty
pages, and the balance_dirty_pages() logic thinks it will take a super
long time to drain that many dirty pages.

Commit 494d2f508883 switched fuse to use the mm-internal writeback
accounting, which properly updates wb->writeback_inodes and
wb->avg_write_bandwidth. That fixed the incorrect bandwidth
estimations, which in turn fixed the overthrottling.
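
A deliberately simplified toy model of the effect: treat the pause
imposed on a dirtier as pages_dirtied divided by a rate limit that
scales with the estimated write bandwidth. This ignores the pause
clamping, smoothing and per-task logic in the real
balance_dirty_pages(), and the bandwidth numbers are hypothetical, so
take it only as an illustration of why a stale low estimate hurts large
folios more.

```python
PAGE_KB = 4


def toy_pause_ms(pages_dirtied, est_bw_mb_per_s, pos_ratio=1.0):
    """Toy pause: pages dirtied / (estimated bandwidth * pos_ratio)."""
    ratelimit_pages_per_s = est_bw_mb_per_s * 1024 / PAGE_KB * pos_ratio
    return 1000.0 * pages_dirtied / ratelimit_pages_per_s


FOLIO_PAGES = 512  # one 2 MB folio charges 512 pages of 4 KB

for label, bw in (("accurate estimate", 1000), ("stale estimate", 10)):
    print(label,
          "| 4k page:", round(toy_pause_ms(1, bw), 3), "ms",
          "| 2 MB folio:", round(toy_pause_ms(FOLIO_PAGES, bw), 3), "ms")
```

With the stale estimate, every 2 MB folio dirtied asks for a pause that
is 100x longer than with the accurate one, and 512x longer than a
single 4k page at the same estimate.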

Overall, what I'm seeing is that enabling large folios improves
performance across the board for fuse, even with strictlimiting
enabled on memory-constrained systems. I'm going to do a closer audit
of the fuse code to see if we're missing any gaps that will be needed
for turning on large folios, and then will do some more testing. I'll
also rerun benchmarks on NVMe to make sure large folios don't lead to
any performance regressions on fast storage. After that, I think we
are good to go for turning on large folios.

I'm still planning to look into the cgroup dirty throttling
instability on NVMe I was seeing earlier and do some benchmarking /
experimentation with the folio order cap. I'll post updates on that
when I have them.

Thanks,
Joanne

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/fuse/file.c?id=494d2f508883a6e5c4530e5c6b3c8b2bbfb7318d


* Re: [PATCH] fuse: disable default bdi strictlimiting
  2026-05-28 22:11               ` Joanne Koong
@ 2026-05-30 11:04                 ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2026-05-30 11:04 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Jan Kara, Miklos Szeredi, linux-fsdevel, kernel-team, fuse-devel,
	Jingbo Xu

On Thu 28-05-26 15:11:18, Joanne Koong wrote:
> On Thu, May 28, 2026 at 5:34 AM Jan Kara <jack@suse.cz> wrote:
> > > I think this is also going to be a problem for cgroups with large
> > > folios since they also, as I understand it, are constrained with a
> > > limited / tight dirty budget. I ran some initial benchmarks with
> > > cgroup memory constraints on NVMe and saw similar instability (a
> > > single writer in a 8 GB cgroup had max write latencies of 6 seconds vs
> > > 15 ms without the cgroup, with the balance_dirty_pages() throttling
> > > oscillating rather than settling near the set point).
> >
> > Ok, so this is more folios (819) than my 512 gut feeling estimate :) What
> > was the write throughput of the NVMe drive? The high drive throughput also
> > requires more dirty data to keep the drive saturated so that writeback
> > throughput doesn't oscillate too much.
> 
> The write throughput of the NVMe drive I was using was around ~1.1
> GB/s (measured by running direct I/O). I think with the 1.6 GB dirty
> budget, the math for that ends up being that the device drains it in
> ~1.5 secs. The performance I was seeing with the initial cgroup
> benchmarks was with 4k pages (no large folios enabled) on btrfs.

OK, you might want to experiment with some other filesystem (I suggest xfs
or ext4) as well. Btrfs writeback behavior is a bit special with its data
checksum computations etc. and thus latency of starting writeback. It could
contribute to the erratic behavior with the relatively low dirty limits.

> > > I think this problem gets untenable for random writes with large
> > > folios, since dirtying just a few bytes will charge the whole folio
> > > size to the dirty budget. I have a patchset from last year for adding
> > > more granular dirty/writeback tracking [3], I'm going to pick this
> > > series back up. I think it will be useful generically, not just for
> > > fuse.
> > >
> > > For getting this to work on fuse servers with strictlimiting, I think
> > > the next steps are to
> > > a) as Jan had suggested in [1], come up with some heuristic to
> > > constrain the max order supported for large folios for these fuse
> > > servers if they're running with the writeback cache enabled
> > > b) benchmark ^ and if there are still regressions, then we should
> > > probably just turn large folios off for these servers
> >
> > Yes, as we briefly discussed at LSF, picking proper max folio order is
> > useful not only for dirty throttling but also for memory reclaim in
> > general in memory-constrained environments. Strictlimit is just one of the
> > cases where the limit is very small and so the problems with the memory
> > overhead of larger folios become very apparent.
> >
> > So although we may decide for a FUSE specific heuristic setting the maximum
> > folio order, you properly point out memory limited memcgs will see similar
> > issues so I think a more general solution for selecting sensible max order
> > would be better.
> 
> This sounds great. I'll run some more benchmarks to get more data and
> then come up with a general solution and submit that as an rfc.
> 
> > > c) add the granular writeback/dirty accounting for large folios
> >
> > As Miklos pointed out and I think we've discussed that also last year since
> > the full folio is pinned in memory by a single dirty fs block (or whatever
> > is going to be your unit), finer grained accounting alone doesn't help you
> > with the narrow limits. We'd also have to add logic to decide about
> > splitting large folios.
> 
> If I'm inferring this right (and please let me know if I'm not), it
> seems like it would be extremely useful to get the infrastructure
> working better for splitting file-backed large folios. That seems like
> the proper long-term solution if I'm understanding it correctly (eg
> instead of capping the folio order statically across the board, larger
> folios can be used until there's memory pressure that splits it or if
> at write time we detect the dirty budget is tight, that will allow the
> clean parts to be reclaimable and reduce memory pressure and
> throttling)?
> 
> From what I see in __folio_split() in mm/huge_memory.c, it's not
> currently possible to split dirty file-backed folios (eg
> iomap_release_folio() will not release the folio if any part of it is
> dirty), but I think we can get this working in the iomap layer for
> supporting the per-folio dirty/uptodate state tracking for the split
> folio instead of rejecting the release. Iomap already tracks which
> subfolios are dirty vs not, so we have the information upfront as well
> about how to split the dirty folio optimally. Do you think this would
> be useful to get working?

I don't think people have reached consensus on what would be the best way
to address the issues with coarse granularity of large folios (even within
myself I haven't quite decided about the best solution :)). So we are just
experimenting and seeing what could work and what not. Limiting maximum
folio order is kind of a dumb solution but it is relatively easy to
implement and might be good enough.

Splitting of folios is definitely more complicated, requires support from
filesystems, can fail for various reasons (folio is pinned, you need to
allocate memory to split a folio, ...), and is generally more expensive.
So tuning that properly and making it work reliably under various
conditions is an order of magnitude harder task and I think at this point
you'd have a hard time justifying all the complexity without showing why
the dumb solution isn't good enough... So we might get to splitting folios
at some point in some shape or form, but I don't think we're there yet.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


end of thread, other threads:[~2026-05-30 11:04 UTC | newest]

Thread overview: 20+ messages
2025-10-08 20:41 [PATCH] fuse: disable default bdi strictlimiting Joanne Koong
2025-10-09 14:16 ` Miklos Szeredi
2025-10-09 18:36   ` Joanne Koong
2025-10-10 15:01     ` Darrick J. Wong
2025-10-10 15:07       ` Matthew Wilcox
2025-10-10 23:14       ` Joanne Koong
2025-10-27 22:38     ` Joanne Koong
2026-05-08  9:42       ` Miklos Szeredi
2026-05-08 11:54         ` Horst Birthelmer
2026-05-12 20:56         ` Joanne Koong
2026-05-27  1:42           ` Joanne Koong
2026-05-27  5:57             ` Horst Birthelmer
2026-05-27 10:59               ` Amir Goldstein
2026-05-27 22:40                 ` Joanne Koong
2026-05-27 12:25             ` Miklos Szeredi
2026-05-27 23:32               ` Joanne Koong
2026-05-28 12:34             ` Jan Kara
2026-05-28 22:11               ` Joanne Koong
2026-05-30 11:04                 ` Jan Kara
2026-05-30  2:15             ` Joanne Koong
