linux-fsdevel.vger.kernel.org archive mirror
* [HELP] FUSE writeback performance bottleneck
@ 2024-06-03  6:17 Jingbo Xu
  2024-06-03 14:43 ` Bernd Schubert
  0 siblings, 1 reply; 29+ messages in thread
From: Jingbo Xu @ 2024-06-03  6:17 UTC (permalink / raw)
  To: Miklos Szeredi, linux-fsdevel@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, lege.wang

Hi, Miklos,

We spotted a performance bottleneck for FUSE writeback in which the
writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
used for copy_page().

fuse_writepages_fill
  alloc tmp_page
  copy_highpage

This is because of the FUSE writeback design (see commit 3be5a52b30aa
("fuse: support writable mmap")), which allocates a new temp page for
each dirty page to be written back, copies the content of the dirty
page to the temp page, and then writes back the temp page instead.
This special design is intentional, to avoid potential deadlocks due
to a buggy or even malicious fuse user daemon.

There was a proposal to remove this constraint for virtiofs [1], which
is reasonable as the users of virtiofs and the virtiofs daemon don't
run on the same OS, and the virtiofs daemon is usually offered by
cloud vendors that should not be malicious.  For the normal /dev/fuse
interface, however, I don't think removing the constraint is acceptable.


Coming back to the writeback performance bottleneck: another important
factor is that (IIUC) only one kworker at a time is allowed to do
writeback for each filesystem instance (if cgroup writeback is not
enabled).  The kworker is scheduled upon sb->s_bdi->wb.dwork, and the
workqueue infrastructure guarantees that at most one *running* worker
is allowed for one specific work item (sb->s_bdi->wb.dwork) at any
time.  Thus the writeback is constrained to one CPU for each
filesystem instance.

I'm not sure whether offloading the page copying, and then the sending
of the FUSE requests, to another worker (once a bunch of dirty pages
has been collected) is a good idea or not, e.g.

```
fuse_writepages_fill
  if fuse_writepage_need_send:
    # schedule a work

# the worker
for each dirty page in ap->pages[]:
    copy_page
fuse_writepages_send
```

Any suggestion?



This issue can be reproduced by:

1 ./libfuse/build/example/passthrough_ll -o cache=always -o writeback -o
source=/run/ /mnt
("/run/" is a tmpfs mount)

2 fio --name=write_test --ioengine=psync --iodepth=1 --rw=write --bs=1M
--direct=0 --size=1G --numjobs=2 --group_reporting --directory=/mnt
(at least two threads are needed; fio shows ~1800MiB/s buffered write
bandwidth)


[1]
https://lore.kernel.org/all/20231228123528.705-1-lege.wang@jaguarmicro.com/


-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-03  6:17 [HELP] FUSE writeback performance bottleneck Jingbo Xu
@ 2024-06-03 14:43 ` Bernd Schubert
  2024-06-03 15:19   ` Miklos Szeredi
  0 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2024-06-03 14:43 UTC (permalink / raw)
  To: Jingbo Xu, Miklos Szeredi, linux-fsdevel@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, lege.wang



On 6/3/24 08:17, Jingbo Xu wrote:
> Hi, Miklos,
> 
> We spotted a performance bottleneck for FUSE writeback in which the
> writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
> used for copy_page().
> 
> fuse_writepages_fill
>   alloc tmp_page
>   copy_highpage
> 
> This is because of the FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), which allocates a new temp page for
> each dirty page to be written back, copies the content of the dirty
> page to the temp page, and then writes back the temp page instead.
> This special design is intentional, to avoid potential deadlocks due
> to a buggy or even malicious fuse user daemon.

I also noticed that and I admit that I don't understand it yet. The commit says

<quote>
    The basic problem is that there can be no guarantee about the time in which
    the userspace filesystem will complete a write.  It may be buggy or even
    malicious, and fail to complete WRITE requests.  We don't want unrelated parts
    of the system to grind to a halt in such cases.
</quote>


Timing - NFS/cifs/etc. have the same issue? Even a local file system has no guarantees
about how fast storage is?

Buggy - hmm yeah, but then is it splice related only? I think the splice feature was
not yet introduced when fuse got mmap and writeback in 2008?
Without splice the pages are just copied into a userspace buffer, so what can
userspace do wrong with its copy?

Failure - why can't it do what nfs_mapping_set_error() does?

I guess I'm missing something, but so far I don't understand what it is.


> 
> There was a proposal to remove this constraint for virtiofs [1], which
> is reasonable as the users of virtiofs and the virtiofs daemon don't
> run on the same OS, and the virtiofs daemon is usually offered by
> cloud vendors that should not be malicious.  For the normal /dev/fuse
> interface, however, I don't think removing the constraint is acceptable.
> 
> 
> Coming back to the writeback performance bottleneck: another important
> factor is that (IIUC) only one kworker at a time is allowed to do
> writeback for each filesystem instance (if cgroup writeback is not
> enabled).  The kworker is scheduled upon sb->s_bdi->wb.dwork, and the
> workqueue infrastructure guarantees that at most one *running* worker
> is allowed for one specific work item (sb->s_bdi->wb.dwork) at any
> time.  Thus the writeback is constrained to one CPU for each
> filesystem instance.
> 
> I'm not sure whether offloading the page copying, and then the sending
> of the FUSE requests, to another worker (once a bunch of dirty pages
> has been collected) is a good idea or not, e.g.
> 
> ```
> fuse_writepages_fill
>   if fuse_writepage_need_send:
>     # schedule a work
> 
> # the worker
> for each dirty page in ap->pages[]:
>     copy_page
> fuse_writepages_send
> ```
> 
> Any suggestion?
> 
> 
> 
> This issue can be reproduced by:
> 
> 1 ./libfuse/build/example/passthrough_ll -o cache=always -o writeback -o
> source=/run/ /mnt
> ("/run/" is a tmpfs mount)
> 
> 2 fio --name=write_test --ioengine=psync --iodepth=1 --rw=write --bs=1M
> --direct=0 --size=1G --numjobs=2 --group_reporting --directory=/mnt
> (at least two threads are needed; fio shows ~1800MiB/s buffered write
> bandwidth)


That should quickly run out of tmpfs memory. I need to find time to improve
this a bit, but this should give you an easier test:

https://github.com/libfuse/libfuse/pull/807

> 
> 
> [1]
> https://lore.kernel.org/all/20231228123528.705-1-lege.wang@jaguarmicro.com/
> 
> 


Thanks,
Bernd


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-03 14:43 ` Bernd Schubert
@ 2024-06-03 15:19   ` Miklos Szeredi
  2024-06-03 15:32     ` Bernd Schubert
                       ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Miklos Szeredi @ 2024-06-03 15:19 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Jingbo Xu, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang

On Mon, 3 Jun 2024 at 16:43, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>
>
>
> On 6/3/24 08:17, Jingbo Xu wrote:
> > Hi, Miklos,
> >
> > We spotted a performance bottleneck for FUSE writeback in which the
> > writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
> > used for copy_page().
> >
> > fuse_writepages_fill
> >   alloc tmp_page
> >   copy_highpage
> >
> > This is because of the FUSE writeback design (see commit 3be5a52b30aa
> > ("fuse: support writable mmap")), which allocates a new temp page for
> > each dirty page to be written back, copies the content of the dirty
> > page to the temp page, and then writes back the temp page instead.
> > This special design is intentional, to avoid potential deadlocks due
> > to a buggy or even malicious fuse user daemon.
>
> I also noticed that and I admit that I don't understand it yet. The commit says
>
> <quote>
>     The basic problem is that there can be no guarantee about the time in which
>     the userspace filesystem will complete a write.  It may be buggy or even
>     malicious, and fail to complete WRITE requests.  We don't want unrelated parts
>     of the system to grind to a halt in such cases.
> </quote>
>
>
> Timing - NFS/cifs/etc. have the same issue? Even a local file system has no guarantees
> about how fast storage is?

I don't have the details but it boils down to the fact that the
allocation context provided by GFP_NOFS (PF_MEMALLOC_NOFS) cannot be
used by the unprivileged userspace server (and even if it could,
there's no guarantee that it would).

When this mechanism was introduced, the deadlock was a real
possibility.  I'm not sure that it can still happen, but proving that
it cannot might be difficult.

Thanks,
Miklos


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-03 15:19   ` Miklos Szeredi
@ 2024-06-03 15:32     ` Bernd Schubert
  2024-06-03 22:10     ` Dave Chinner
  2024-06-04  1:57     ` Jingbo Xu
  2 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2024-06-03 15:32 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jingbo Xu, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang



On 6/3/24 17:19, Miklos Szeredi wrote:
> On Mon, 3 Jun 2024 at 16:43, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>
>>
>>
>> On 6/3/24 08:17, Jingbo Xu wrote:
>>> Hi, Miklos,
>>>
>>> We spotted a performance bottleneck for FUSE writeback in which the
>>> writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
>>> used for copy_page().
>>>
>>> fuse_writepages_fill
>>>   alloc tmp_page
>>>   copy_highpage
>>>
>>> This is because of the FUSE writeback design (see commit 3be5a52b30aa
>>> ("fuse: support writable mmap")), which allocates a new temp page for
>>> each dirty page to be written back, copies the content of the dirty
>>> page to the temp page, and then writes back the temp page instead.
>>> This special design is intentional, to avoid potential deadlocks due
>>> to a buggy or even malicious fuse user daemon.
>>
>> I also noticed that and I admit that I don't understand it yet. The commit says
>>
>> <quote>
>>     The basic problem is that there can be no guarantee about the time in which
>>     the userspace filesystem will complete a write.  It may be buggy or even
>>     malicious, and fail to complete WRITE requests.  We don't want unrelated parts
>>     of the system to grind to a halt in such cases.
>> </quote>
>>
>>
>> Timing - NFS/cifs/etc. have the same issue? Even a local file system has no guarantees
>> about how fast storage is?
> 
> I don't have the details but it boils down to the fact that the
> allocation context provided by GFP_NOFS (PF_MEMALLOC_NOFS) cannot be
> used by the unprivileged userspace server (and even if it could,
> there's no guarantee that it would).
> 
> When this mechanism was introduced, the deadlock was a real
> possibility.  I'm not sure that it can still happen, but proving that
> it cannot might be difficult.

Thanks Miklos!
I need to go through all of the GFP_NOFS allocations, but I wonder if
we could introduce cached allocations and fall back to the slow path
if that doesn't work.


Thanks,
Bernd


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-03 15:19   ` Miklos Szeredi
  2024-06-03 15:32     ` Bernd Schubert
@ 2024-06-03 22:10     ` Dave Chinner
  2024-06-04  7:20       ` Miklos Szeredi
  2024-06-04  1:57     ` Jingbo Xu
  2 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2024-06-03 22:10 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, Jingbo Xu, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang

On Mon, Jun 03, 2024 at 05:19:44PM +0200, Miklos Szeredi wrote:
> On Mon, 3 Jun 2024 at 16:43, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> >
> >
> >
> > On 6/3/24 08:17, Jingbo Xu wrote:
> > > Hi, Miklos,
> > >
> > > We spotted a performance bottleneck for FUSE writeback in which the
> > > writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
> > > used for copy_page().
> > >
> > > fuse_writepages_fill
> > >   alloc tmp_page
> > >   copy_highpage
> > >
> > > This is because of the FUSE writeback design (see commit 3be5a52b30aa
> > > ("fuse: support writable mmap")), which allocates a new temp page for
> > > each dirty page to be written back, copies the content of the dirty
> > > page to the temp page, and then writes back the temp page instead.
> > > This special design is intentional, to avoid potential deadlocks due
> > > to a buggy or even malicious fuse user daemon.
> >
> > I also noticed that and I admit that I don't understand it yet. The commit says
> >
> > <quote>
> >     The basic problem is that there can be no guarantee about the time in which
> >     the userspace filesystem will complete a write.  It may be buggy or even
> >     malicious, and fail to complete WRITE requests.  We don't want unrelated parts
> >     of the system to grind to a halt in such cases.
> > </quote>
> >
> >
> > Timing - NFS/cifs/etc. have the same issue? Even a local file system has no guarantees
> > about how fast storage is?
> 
> I don't have the details but it boils down to the fact that the
> allocation context provided by GFP_NOFS (PF_MEMALLOC_NOFS) cannot be
> used by the unprivileged userspace server (and even if it could,
> there's no guarantee that it would).

I thought we had PR_SET_IO_FLUSHER for that. It requires
CAP_SYS_RESOURCE but no other privileges; the userspace
server will then always operate in the PF_MEMALLOC_NOIO |
PF_LOCAL_THROTTLE memory allocation context.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-03 15:19   ` Miklos Szeredi
  2024-06-03 15:32     ` Bernd Schubert
  2024-06-03 22:10     ` Dave Chinner
@ 2024-06-04  1:57     ` Jingbo Xu
  2024-06-04  7:27       ` Miklos Szeredi
  2 siblings, 1 reply; 29+ messages in thread
From: Jingbo Xu @ 2024-06-04  1:57 UTC (permalink / raw)
  To: Miklos Szeredi, Bernd Schubert
  Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	lege.wang

Hi Bernd and Miklos,

On 6/3/24 11:19 PM, Miklos Szeredi wrote:
> On Mon, 3 Jun 2024 at 16:43, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>
>>
>>
>> On 6/3/24 08:17, Jingbo Xu wrote:
>>> Hi, Miklos,
>>>
>>> We spotted a performance bottleneck for FUSE writeback in which the
>>> writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
>>> used for copy_page().
>>>
>>> fuse_writepages_fill
>>>   alloc tmp_page
>>>   copy_highpage
>>>
>>> This is because of the FUSE writeback design (see commit 3be5a52b30aa
>>> ("fuse: support writable mmap")), which allocates a new temp page for
>>> each dirty page to be written back, copies the content of the dirty
>>> page to the temp page, and then writes back the temp page instead.
>>> This special design is intentional, to avoid potential deadlocks due
>>> to a buggy or even malicious fuse user daemon.
>>
>> I also noticed that and I admit that I don't understand it yet. The commit says
>>
>> <quote>
>>     The basic problem is that there can be no guarantee about the time in which
>>     the userspace filesystem will complete a write.  It may be buggy or even
>>     malicious, and fail to complete WRITE requests.  We don't want unrelated parts
>>     of the system to grind to a halt in such cases.
>> </quote>
>>
>>
>> Timing - NFS/cifs/etc. have the same issue? Even a local file system has no guarantees
>> about how fast storage is?
> 
> I don't have the details but it boils down to the fact that the
> allocation context provided by GFP_NOFS (PF_MEMALLOC_NOFS) cannot be
> used by the unprivileged userspace server (and even if it could,
> there's no guarantee that it would).
> 
> When this mechanism was introduced, the deadlock was a real
> possibility.  I'm not sure that it can still happen, but proving that
> it cannot might be difficult.

IIUC, there are two sources that may cause deadlock:
1) the fuse server needs a memory allocation when processing FUSE_WRITE
requests, which in turn triggers direct memory reclaim, and then FUSE
writeback - deadlock here
2) a process that triggers direct memory reclaim or calls sync(2) may
hang there forever, if the fuse server is buggy or malicious and thus
hangs when processing FUSE_WRITE requests

Thus the temp page design was introduced to avoid the above potential
issues.

I think case 1 may be fixable, but I don't know how case 2 can be
avoided, as anyone could run a fuse server in unprivileged mode.  Or
does case 2 really matter?  Please correct me if I miss something.

-- 
Thanks,
Jingbo


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-03 22:10     ` Dave Chinner
@ 2024-06-04  7:20       ` Miklos Szeredi
  0 siblings, 0 replies; 29+ messages in thread
From: Miklos Szeredi @ 2024-06-04  7:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Bernd Schubert, Jingbo Xu, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang

On Tue, 4 Jun 2024 at 00:10, Dave Chinner <david@fromorbit.com> wrote:

> I thought we had PR_SET_IO_FLUSHER for that. It requires
> CAP_SYS_RESOURCE but no other privileges; the userspace
> server will then always operate in the PF_MEMALLOC_NOIO |
> PF_LOCAL_THROTTLE memory allocation context.

There could be any number of services that are being used while
serving a fuse request.  There is no well-defined "fuse server
process", contrary to what many people seem to think.  Any approach
that depends on somehow marking the fuse server as a special entity
will fail.

Thanks,
Miklos


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04  1:57     ` Jingbo Xu
@ 2024-06-04  7:27       ` Miklos Szeredi
  2024-06-04  7:36         ` Jingbo Xu
  2024-09-11  9:32         ` Jingbo Xu
  0 siblings, 2 replies; 29+ messages in thread
From: Miklos Szeredi @ 2024-06-04  7:27 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Bernd Schubert, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang

On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> IIUC, there are two sources that may cause deadlock:
> 1) the fuse server needs a memory allocation when processing FUSE_WRITE
> requests, which in turn triggers direct memory reclaim, and then FUSE
> writeback - deadlock here

Yep, see the folio_wait_writeback() call deep in the guts of direct
reclaim, which sleeps until the PG_writeback flag is cleared.  If that
happens to be triggered by the writeback in question, then that's a
deadlock.

> 2) a process that triggers direct memory reclaim or calls sync(2) may
> hang there forever, if the fuse server is buggy or malicious and thus
> hangs when processing FUSE_WRITE requests

Ah, yes, sync(2) is also an interesting case.   We don't want unpriv
fuse servers to be able to block sync(2), which means that sync(2)
won't actually guarantee a synchronization of fuse's dirty pages.  I
don't think there's even a theoretical solution to that, but
apparently nobody cares...

Thanks,
Miklos

>
> Thus the temp page design was introduced to avoid the above potential
> issues.
>
> I think case 1 may be fixable, but I don't know how case 2 can be
> avoided, as anyone could run a fuse server in unprivileged mode.  Or
> does case 2 really matter?  Please correct me if I miss something.
>
> --
> Thanks,
> Jingbo


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04  7:27       ` Miklos Szeredi
@ 2024-06-04  7:36         ` Jingbo Xu
  2024-06-04  9:32           ` Bernd Schubert
  2024-09-11  9:32         ` Jingbo Xu
  1 sibling, 1 reply; 29+ messages in thread
From: Jingbo Xu @ 2024-06-04  7:36 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang



On 6/4/24 3:27 PM, Miklos Szeredi wrote:
> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> 
>> IIUC, there are two sources that may cause deadlock:
>> 1) the fuse server needs a memory allocation when processing FUSE_WRITE
>> requests, which in turn triggers direct memory reclaim, and then FUSE
>> writeback - deadlock here
> 
> Yep, see the folio_wait_writeback() call deep in the guts of direct
> reclaim, which sleeps until the PG_writeback flag is cleared.  If that
> happens to be triggered by the writeback in question, then that's a
> deadlock.
> 
>> 2) a process that triggers direct memory reclaim or calls sync(2) may
>> hang there forever, if the fuse server is buggy or malicious and thus
>> hangs when processing FUSE_WRITE requests
> 
> Ah, yes, sync(2) is also an interesting case.   We don't want unpriv
> fuse servers to be able to block sync(2), which means that sync(2)
> won't actually guarantee a synchronization of fuse's dirty pages.  I
> don't think there's even a theoretical solution to that, but
> apparently nobody cares...

Okay, if the temp page design is unavoidable, then I don't know of any
approach (in the FUSE or VFS layer) that helps offload the page copy.
At least we don't want the writeback performance to be limited by the
single writeback kworker.  That was also the initial point of this thread.

-- 
Thanks,
Jingbo


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04  7:36         ` Jingbo Xu
@ 2024-06-04  9:32           ` Bernd Schubert
  2024-06-04 10:02             ` Miklos Szeredi
  2024-06-04 12:24             ` Jingbo Xu
  0 siblings, 2 replies; 29+ messages in thread
From: Bernd Schubert @ 2024-06-04  9:32 UTC (permalink / raw)
  To: Jingbo Xu, Miklos Szeredi
  Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	lege.wang, Matthew Wilcox (Oracle), linux-mm@kvack.org



On 6/4/24 09:36, Jingbo Xu wrote:
> 
> 
> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>>> IIUC, there are two sources that may cause deadlock:
>>> 1) the fuse server needs a memory allocation when processing FUSE_WRITE
>>> requests, which in turn triggers direct memory reclaim, and then FUSE
>>> writeback - deadlock here
>>
>> Yep, see the folio_wait_writeback() call deep in the guts of direct
>> reclaim, which sleeps until the PG_writeback flag is cleared.  If that
>> happens to be triggered by the writeback in question, then that's a
>> deadlock.
>>
>>> 2) a process that triggers direct memory reclaim or calls sync(2) may
>>> hang there forever, if the fuse server is buggy or malicious and thus
>>> hangs when processing FUSE_WRITE requests
>>
>> Ah, yes, sync(2) is also an interesting case.   We don't want unpriv
>> fuse servers to be able to block sync(2), which means that sync(2)
>> won't actually guarantee a synchronization of fuse's dirty pages.  I
>> don't think there's even a theoretical solution to that, but
>> apparently nobody cares...
> 
> Okay, if the temp page design is unavoidable, then I don't know of any
> approach (in the FUSE or VFS layer) that helps offload the page copy.
> At least we don't want the writeback performance to be limited by the
> single writeback kworker.  That was also the initial point of this thread.
> 

Offloading it to another thread is just a workaround, though maybe a
temporary solution.

Back to the background for the copy: it copies pages to avoid
blocking on memory reclaim. Yet with that allocation it in fact increases
memory pressure even more. Isn't the right solution to mark those pages
as not reclaimable and to avoid blocking on them? That is what the tmp
pages do, just not in a beautiful way.


Thanks,
Bernd


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04  9:32           ` Bernd Schubert
@ 2024-06-04 10:02             ` Miklos Szeredi
  2024-06-04 14:13               ` Bernd Schubert
                                 ` (2 more replies)
  2024-06-04 12:24             ` Jingbo Xu
  1 sibling, 3 replies; 29+ messages in thread
From: Miklos Szeredi @ 2024-06-04 10:02 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Jingbo Xu, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang, Matthew Wilcox (Oracle),
	linux-mm@kvack.org

On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:

> Back to the background for the copy: it copies pages to avoid
> blocking on memory reclaim. Yet with that allocation it in fact increases
> memory pressure even more. Isn't the right solution to mark those pages
> as not reclaimable and to avoid blocking on them? That is what the tmp
> pages do, just not in a beautiful way.

Copying to the tmp page is the same as marking the pages as
non-reclaimable and non-syncable.

Conceptually it would be nice to only copy when there's something
actually waiting for writeback on the page.

Note: normally the WRITE request would be copied to userspace along
with the contents of the pages very soon after starting writeback.
After this the contents of the page no longer matter, and we can just
clear writeback without doing the copy.

But if the request gets stuck in the input queue before being copied
to userspace, then deadlock can still happen if the server blocks on
direct reclaim and won't continue with processing the queue.   And
sync(2) will also block in that case.

So we'd somehow need to handle stuck WRITE requests.   I don't see an
easy way to do this "on demand", when something actually starts
waiting on PG_writeback.  Alternatively the page copy could be done
after a timeout, which is ugly, but much easier to implement.

Also splice from the fuse dev would need to copy those pages, but that
shouldn't be a problem, since it's just moving the copy from one place
to another.

Thanks,
Miklos


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04  9:32           ` Bernd Schubert
  2024-06-04 10:02             ` Miklos Szeredi
@ 2024-06-04 12:24             ` Jingbo Xu
  1 sibling, 0 replies; 29+ messages in thread
From: Jingbo Xu @ 2024-06-04 12:24 UTC (permalink / raw)
  To: Bernd Schubert, Miklos Szeredi
  Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	lege.wang, Matthew Wilcox (Oracle), linux-mm@kvack.org



On 6/4/24 5:32 PM, Bernd Schubert wrote:
> 
> 
> On 6/4/24 09:36, Jingbo Xu wrote:
>>
>>
>> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
>>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>>> IIUC, there are two sources that may cause deadlock:
>>>> 1) the fuse server needs a memory allocation when processing FUSE_WRITE
>>>> requests, which in turn triggers direct memory reclaim, and then FUSE
>>>> writeback - deadlock here
>>>
>>> Yep, see the folio_wait_writeback() call deep in the guts of direct
>>> reclaim, which sleeps until the PG_writeback flag is cleared.  If that
>>> happens to be triggered by the writeback in question, then that's a
>>> deadlock.
>>>
>>>> 2) a process that triggers direct memory reclaim or calls sync(2) may
>>>> hang there forever, if the fuse server is buggy or malicious and thus
>>>> hangs when processing FUSE_WRITE requests
>>>
>>> Ah, yes, sync(2) is also an interesting case.   We don't want unpriv
>>> fuse servers to be able to block sync(2), which means that sync(2)
>>> won't actually guarantee a synchronization of fuse's dirty pages.  I
>>> don't think there's even a theoretical solution to that, but
>>> apparently nobody cares...
>>
>> Okay, if the temp page design is unavoidable, then I don't know of any
>> approach (in the FUSE or VFS layer) that helps offload the page copy.
>> At least we don't want the writeback performance to be limited by the
>> single writeback kworker.  That was also the initial point of this thread.
>>
> 
> Offloading it to another thread is just a workaround, though maybe a
> temporary solution.

If we could break the limit of one single (writeback) kworker for
one bdi...  Apparently that is much more complicated.  Just a
brainstorming idea...

I agree it's a tough thing.

-- 
Thanks,
Jingbo


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04 10:02             ` Miklos Szeredi
@ 2024-06-04 14:13               ` Bernd Schubert
  2024-06-04 16:53                 ` Josef Bacik
  2024-08-22 17:00               ` Joanne Koong
  2024-08-23  3:34               ` Jingbo Xu
  2 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2024-06-04 14:13 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jingbo Xu, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang, Matthew Wilcox (Oracle),
	linux-mm@kvack.org



On 6/4/24 12:02, Miklos Szeredi wrote:
> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> 
>> Back to the background for the copy: it copies pages to avoid
>> blocking on memory reclaim. Yet with that allocation it in fact increases
>> memory pressure even more. Isn't the right solution to mark those pages
>> as not reclaimable and to avoid blocking on them? That is what the tmp
>> pages do, just not in a beautiful way.
> 
> Copying to the tmp page is the same as marking the pages as
> non-reclaimable and non-syncable.
> 
> Conceptually it would be nice to only copy when there's something
> actually waiting for writeback on the page.
> 
> Note: normally the WRITE request would be copied to userspace along
> with the contents of the pages very soon after starting writeback.
> After this the contents of the page no longer matter, and we can just
> clear writeback without doing the copy.
> 
> But if the request gets stuck in the input queue before being copied
> to userspace, then deadlock can still happen if the server blocks on
> direct reclaim and won't continue with processing the queue.   And
> sync(2) will also block in that case.
>
> So we'd somehow need to handle stuck WRITE requests.   I don't see an
> easy way to do this "on demand", when something actually starts
> waiting on PG_writeback.  Alternatively the page copy could be done
> after a timeout, which is ugly, but much easier to implement.

I think the timeout method would only work if we have already allocated
the pages; under memory pressure page allocation might not work well.
But then this still seems to be a workaround, because we don't take up
any less memory with these copied pages.
I'm going to look into mm/ to see if there isn't a better solution.

> 
> Also splice from the fuse dev would need to copy those pages, but that
> shouldn't be a problem, since it's just moving the copy from one place
> to another.

Ok, at least I need to keep an eye on it so that it doesn't break when
I write a patch.


Thanks,
Bernd


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04 14:13               ` Bernd Schubert
@ 2024-06-04 16:53                 ` Josef Bacik
  2024-06-04 21:39                   ` Bernd Schubert
  0 siblings, 1 reply; 29+ messages in thread
From: Josef Bacik @ 2024-06-04 16:53 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Jingbo Xu, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang, Matthew Wilcox (Oracle),
	linux-mm@kvack.org

On Tue, Jun 04, 2024 at 04:13:25PM +0200, Bernd Schubert wrote:
> 
> 
> On 6/4/24 12:02, Miklos Szeredi wrote:
> > On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> > 
> >> Back to the background for the copy: it copies pages to avoid
> >> blocking on memory reclaim. Yet with that allocation it in fact increases
> >> memory pressure even more. Isn't the right solution to mark those pages
> >> as not reclaimable and to avoid blocking on them? That is what the tmp
> >> pages do, just not in a beautiful way.
> > 
> > Copying to the tmp page is the same as marking the pages as
> > non-reclaimable and non-syncable.
> > 
> > Conceptually it would be nice to only copy when there's something
> > actually waiting for writeback on the page.
> > 
> > Note: normally the WRITE request would be copied to userspace along
> > with the contents of the pages very soon after starting writeback.
> > After this the contents of the page no longer matter, and we can just
> > clear writeback without doing the copy.
> > 
> > But if the request gets stuck in the input queue before being copied
> > to userspace, then deadlock can still happen if the server blocks on
> > direct reclaim and won't continue with processing the queue.   And
> > sync(2) will also block in that case.
> > 
> > So we'd somehow need to handle stuck WRITE requests.   I don't see an
> > easy way to do this "on demand", when something actually starts
> > waiting on PG_writeback.  Alternatively the page copy could be done
> > after a timeout, which is ugly, but much easier to implement.
> 
> I think the timeout method would only work if we have already allocated
> the pages, under memory pressure page allocation might not work well.
> But then this still seems to be a workaround, because we don't take any
> less memory with these copied pages.
> I'm going to look into mm/ if there isn't a better solution.

I've thought a bit about this, and I still don't have a good solution, so I'm
going to throw out my random thoughts and see if it helps us get to a good spot.

1. Generally we are moving away from GFP_NOFS/GFP_NOIO to instead use
   memalloc_*_save/memalloc_*_restore, so instead the process is marked as
   being in these contexts.  We could do something similar for FUSE, though
   this gets hairy
   with things that async off request handling to other threads (which is all of
   the FUSE file systems we have internally).  We'd need to have some way to
   apply this to an entire process group, but this could be a workable solution.

2. Per-request timeouts.  This is something we're planning on tackling for other
   reasons, but it could fit nicely here to say "if this fuse fs has a
   per-request timeout, skip the copy".  That way we at least have an upper
   bound on how long we would be "deadlocked".  I don't love this approach
   because it's still a deadlock until the timeout elapses, but it's an idea.

3. Since we're limiting writeout per the BDI, we could just say FUSE is special,
   only one memory reclaim related writeout at a time.  We flag when we're doing
   a write via memory reclaim, and then if we try to trigger writeout via memory
   reclaim again we simply reject it to avoid the deadlock.  This has the
   downside of making it so non-fuse related things that may be triggering
   direct reclaim through FUSE means they'll reclaim something else, and if the
   dirty pages from FUSE are the ones causing the problem we could spin a bunch
   evicting pages that we don't care about and thrashing a bit.

As I said all of these have downsides, I think #1 is probably the most workable,
but I haven't thought about it super thoroughly. Thanks,

Josef


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04 16:53                 ` Josef Bacik
@ 2024-06-04 21:39                   ` Bernd Schubert
  2024-06-04 22:16                     ` Josef Bacik
  0 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2024-06-04 21:39 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Miklos Szeredi, Jingbo Xu, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang, Matthew Wilcox (Oracle),
	linux-mm@kvack.org



On 6/4/24 18:53, Josef Bacik wrote:
> On Tue, Jun 04, 2024 at 04:13:25PM +0200, Bernd Schubert wrote:
>>
>>
>> On 6/4/24 12:02, Miklos Szeredi wrote:
>>> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>>
>>>> Back to the background for the copy, so it copies pages to avoid
>>>> blocking on memory reclaim. With that allocation it in fact increases
>>>> memory pressure even more. Isn't the right solution to mark those pages
>>>> as not reclaimable and to avoid blocking on it? Which is what the tmp
>>>> pages do, just not in beautiful way.
>>>
>>> Copying to the tmp page is the same as marking the pages as
>>> non-reclaimable and non-syncable.
>>>
>>> Conceptually it would be nice to only copy when there's something
>>> actually waiting for writeback on the page.
>>>
>>> Note: normally the WRITE request would be copied to userspace along
>>> with the contents of the pages very soon after starting writeback.
>>> After this the contents of the page no longer matter, and we can just
>>> clear writeback without doing the copy.
>>>
>>> But if the request gets stuck in the input queue before being copied
>>> to userspace, then deadlock can still happen if the server blocks on
>>> direct reclaim and won't continue with processing the queue.   And
>>> sync(2) will also block in that case.
>>>
>>> So we'd somehow need to handle stuck WRITE requests.   I don't see an
>>> easy way to do this "on demand", when something actually starts
>>> waiting on PG_writeback.  Alternatively the page copy could be done
>>> after a timeout, which is ugly, but much easier to implement.
>>
>> I think the timeout method would only work if we have already allocated
>> the pages, under memory pressure page allocation might not work well.
>> But then this still seems to be a workaround, because we don't take any
>> less memory with these copied pages.
>> I'm going to look into mm/ if there isn't a better solution.
> 
> I've thought a bit about this, and I still don't have a good solution, so I'm
> going to throw out my random thoughts and see if it helps us get to a good spot.
> 
> 1. Generally we are moving away from GFP_NOFS/GFP_NOIO to instead use
>    memalloc_*_save/memalloc_*_restore, so instead the process is marked being in
>    these contexts.  We could do something similar for FUSE, tho this gets hairy
>    with things that async off request handling to other threads (which is all of
>    the FUSE file systems we have internally).  We'd need to have some way to
>    apply this to an entire process group, but this could be a workable solution.
> 

I'm not sure how either of the two (GFP_ and memalloc_) would work for
userspace allocations.
Wouldn't we basically need to have a feature to disable memory
allocations for fuse userspace tasks? Hmm, maybe through mem_cgroup.
Although even then, the file system might depend on other kernel
resources (backend file system or block device or even network) that
might do allocations on their own without the knowledge of the fuse server.

> 2. Per-request timeouts.  This is something we're planning on tackling for other
>    reasons, but it could fit nicely here to say "if this fuse fs has a
>    per-request timeout, skip the copy".  That way we at least know we're upper
>    bound on how long we would be "deadlocked".  I don't love this approach
>    because it's still a deadlock until the timeout elapsed, but it's an idea.

Hmm, how do we know "this fuse fs has a per-request timeout"? I don't
think we could trust initialization flags set by userspace.

> 
> 3. Since we're limiting writeout per the BDI, we could just say FUSE is special,
>    only one memory reclaim related writeout at a time.  We flag when we're doing
>    a write via memory reclaim, and then if we try to trigger writeout via memory
>    reclaim again we simply reject it to avoid the deadlock.  This has the
>    downside of making it so non-fuse related things that may be triggering
>    direct reclaim through FUSE means they'll reclaim something else, and if the
>    dirty pages from FUSE are the ones causing the problem we could spin a bunch
>    evicting pages that we don't care about and thrashing a bit.


Isn't that what we have right now? Reclaim basically ignores fuse tmp pages.


Thanks,
Bernd



* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04 21:39                   ` Bernd Schubert
@ 2024-06-04 22:16                     ` Josef Bacik
  2024-06-05  5:49                       ` Amir Goldstein
  0 siblings, 1 reply; 29+ messages in thread
From: Josef Bacik @ 2024-06-04 22:16 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Jingbo Xu, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang, Matthew Wilcox (Oracle),
	linux-mm@kvack.org

On Tue, Jun 04, 2024 at 11:39:17PM +0200, Bernd Schubert wrote:
> 
> 
> On 6/4/24 18:53, Josef Bacik wrote:
> > On Tue, Jun 04, 2024 at 04:13:25PM +0200, Bernd Schubert wrote:
> >>
> >>
> >> On 6/4/24 12:02, Miklos Szeredi wrote:
> >>> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> >>>
> >>>> Back to the background for the copy, so it copies pages to avoid
> >>>> blocking on memory reclaim. With that allocation it in fact increases
> >>>> memory pressure even more. Isn't the right solution to mark those pages
> >>>> as not reclaimable and to avoid blocking on it? Which is what the tmp
> >>>> pages do, just not in beautiful way.
> >>>
> >>> Copying to the tmp page is the same as marking the pages as
> >>> non-reclaimable and non-syncable.
> >>>
> >>> Conceptually it would be nice to only copy when there's something
> >>> actually waiting for writeback on the page.
> >>>
> >>> Note: normally the WRITE request would be copied to userspace along
> >>> with the contents of the pages very soon after starting writeback.
> >>> After this the contents of the page no longer matter, and we can just
> >>> clear writeback without doing the copy.
> >>>
> >>> But if the request gets stuck in the input queue before being copied
> >>> to userspace, then deadlock can still happen if the server blocks on
> >>> direct reclaim and won't continue with processing the queue.   And
> >>> sync(2) will also block in that case.
> >>>
> >>> So we'd somehow need to handle stuck WRITE requests.   I don't see an
> >>> easy way to do this "on demand", when something actually starts
> >>> waiting on PG_writeback.  Alternatively the page copy could be done
> >>> after a timeout, which is ugly, but much easier to implement.
> >>
> >> I think the timeout method would only work if we have already allocated
> >> the pages, under memory pressure page allocation might not work well.
> >> But then this still seems to be a workaround, because we don't take any
> >> less memory with these copied pages.
> >> I'm going to look into mm/ if there isn't a better solution.
> > 
> > I've thought a bit about this, and I still don't have a good solution, so I'm
> > going to throw out my random thoughts and see if it helps us get to a good spot.
> > 
> > 1. Generally we are moving away from GFP_NOFS/GFP_NOIO to instead use
> >    memalloc_*_save/memalloc_*_restore, so instead the process is marked being in
> >    these contexts.  We could do something similar for FUSE, tho this gets hairy
> >    with things that async off request handling to other threads (which is all of
> >    the FUSE file systems we have internally).  We'd need to have some way to
> >    apply this to an entire process group, but this could be a workable solution.
> > 
> 
> I'm not sure how either of of both (GFP_ and memalloc_) would work for
> userspace allocations.
> Wouldn't we basically need to have a feature to disable memory
> allocations for fuse userspace tasks? Hmm, maybe through mem_cgroup.
> Although even then, the file system might depend on other kernel
> resources (backend file system or block device or even network) that
> might do allocations on their own without the knowledge of the fuse server.
> 

Basically, we would only invoke this in the case that we're handling a request
under memory pressure, and then any allocation would automatically have
GFP_NOFS protection because it's flagged at the task level.

Again, there are a lot of problems with this, like how we set it for the task,
how it works for threads, etc.

> > 2. Per-request timeouts.  This is something we're planning on tackling for other
> >    reasons, but it could fit nicely here to say "if this fuse fs has a
> >    per-request timeout, skip the copy".  That way we at least know we're upper
> >    bound on how long we would be "deadlocked".  I don't love this approach
> >    because it's still a deadlock until the timeout elapsed, but it's an idea.
> 
> Hmm, how do we know "this fuse fs has a per-request timeout"? I don't
> think we could trust initialization flags set by userspace.
> 

It would be controlled by the kernel.  So at init time the fuse file system says
"my command timeout is 30 minutes."  Then the kernel enforces this by having a
per-request timeout, and once that 30 minutes elapses we cancel the request and
EIO it.  User space doesn't do anything beyond telling the kernel what its
timeout is, so this would be safe.

> > 
> > 3. Since we're limiting writeout per the BDI, we could just say FUSE is special,
> >    only one memory reclaim related writeout at a time.  We flag when we're doing
> >    a write via memory reclaim, and then if we try to trigger writeout via memory
> >    reclaim again we simply reject it to avoid the deadlock.  This has the
> >    downside of making it so non-fuse related things that may be triggering
> >    direct reclaim through FUSE means they'll reclaim something else, and if the
> >    dirty pages from FUSE are the ones causing the problem we could spin a bunch
> >    evicting pages that we don't care about and thrashing a bit.
> 
> 
> Isn't that what we have right now? Reclaim basically ignores fuse tmp pages.

Yes, but I'd extend it to no longer have tmp pages and tie it to the BDI
instead; my goal is to get rid of all the excess copying.  Thanks,

Josef


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04 22:16                     ` Josef Bacik
@ 2024-06-05  5:49                       ` Amir Goldstein
  2024-06-05 15:35                         ` Josef Bacik
  0 siblings, 1 reply; 29+ messages in thread
From: Amir Goldstein @ 2024-06-05  5:49 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Bernd Schubert, Miklos Szeredi, Jingbo Xu,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	lege.wang, Matthew Wilcox (Oracle), linux-mm@kvack.org

On Wed, Jun 5, 2024 at 1:17 AM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Tue, Jun 04, 2024 at 11:39:17PM +0200, Bernd Schubert wrote:
> >
> >
> > On 6/4/24 18:53, Josef Bacik wrote:
> > > On Tue, Jun 04, 2024 at 04:13:25PM +0200, Bernd Schubert wrote:
> > >>
> > >>
> > >> On 6/4/24 12:02, Miklos Szeredi wrote:
> > >>> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> > >>>
> > >>>> Back to the background for the copy, so it copies pages to avoid
> > >>>> blocking on memory reclaim. With that allocation it in fact increases
> > >>>> memory pressure even more. Isn't the right solution to mark those pages
> > >>>> as not reclaimable and to avoid blocking on it? Which is what the tmp
> > >>>> pages do, just not in beautiful way.
> > >>>
> > >>> Copying to the tmp page is the same as marking the pages as
> > >>> non-reclaimable and non-syncable.
> > >>>
> > >>> Conceptually it would be nice to only copy when there's something
> > >>> actually waiting for writeback on the page.
> > >>>
> > >>> Note: normally the WRITE request would be copied to userspace along
> > >>> with the contents of the pages very soon after starting writeback.
> > >>> After this the contents of the page no longer matter, and we can just
> > >>> clear writeback without doing the copy.
> > >>>
> > >>> But if the request gets stuck in the input queue before being copied
> > >>> to userspace, then deadlock can still happen if the server blocks on
> > >>> direct reclaim and won't continue with processing the queue.   And
> > >>> sync(2) will also block in that case.
> > >>>
> > >>> So we'd somehow need to handle stuck WRITE requests.   I don't see an
> > >>> easy way to do this "on demand", when something actually starts
> > >>> waiting on PG_writeback.  Alternatively the page copy could be done
> > >>> after a timeout, which is ugly, but much easier to implement.
> > >>
> > >> I think the timeout method would only work if we have already allocated
> > >> the pages, under memory pressure page allocation might not work well.
> > >> But then this still seems to be a workaround, because we don't take any
> > >> less memory with these copied pages.
> > >> I'm going to look into mm/ if there isn't a better solution.
> > >
> > > I've thought a bit about this, and I still don't have a good solution, so I'm
> > > going to throw out my random thoughts and see if it helps us get to a good spot.
> > >
> > > 1. Generally we are moving away from GFP_NOFS/GFP_NOIO to instead use
> > >    memalloc_*_save/memalloc_*_restore, so instead the process is marked being in
> > >    these contexts.  We could do something similar for FUSE, tho this gets hairy
> > >    with things that async off request handling to other threads (which is all of
> > >    the FUSE file systems we have internally).  We'd need to have some way to
> > >    apply this to an entire process group, but this could be a workable solution.
> > >
> >
> > I'm not sure how either of of both (GFP_ and memalloc_) would work for
> > userspace allocations.
> > Wouldn't we basically need to have a feature to disable memory
> > allocations for fuse userspace tasks? Hmm, maybe through mem_cgroup.
> > Although even then, the file system might depend on other kernel
> > resources (backend file system or block device or even network) that
> > might do allocations on their own without the knowledge of the fuse server.
> >
>
> Basically that only in the case that we're handling a request from memory
> pressure we would invoke this, and then any allocation would automatically have
> gfp_nofs protection because it's flagged at the task level.
>
> Again there's a lot of problems with this, like how do we set it for the task,
> how does it work for threads etc.
>
> > > 2. Per-request timeouts.  This is something we're planning on tackling for other
> > >    reasons, but it could fit nicely here to say "if this fuse fs has a
> > >    per-request timeout, skip the copy".  That way we at least know we're upper
> > >    bound on how long we would be "deadlocked".  I don't love this approach
> > >    because it's still a deadlock until the timeout elapsed, but it's an idea.
> >
> > Hmm, how do we know "this fuse fs has a per-request timeout"? I don't
> > think we could trust initialization flags set by userspace.
> >
>
> It would be controlled by the kernel.  So at init time the fuse file system says
> "my command timeout is 30 minutes."  Then the kernel enforces this by having a
> per-request timeout, and once that 30 minutes elapses we cancel the request and
> EIO it.  User space doesn't do anything beyond telling the kernel what it's
> timeout is, so this would be safe.
>

Maybe that would be better configured by the mounter, similar to nfs -o timeo,
and maybe consider opting in to returning ETIMEDOUT in this case.
At least nfsd will pass that error to the nfs client, and the nfs client will
retry.

Different applications (or network protocols) handle timeouts differently,
so the timeout and error seems like a decision for the admin/mounter not
for the fuse server, although there may be a fuse fs that would want to
set the default timeout, as if to request the kernel to be its watchdog
(i.e. do not expect me to take more than 30 min to handle any request).

Thanks,
Amir.


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-05  5:49                       ` Amir Goldstein
@ 2024-06-05 15:35                         ` Josef Bacik
  0 siblings, 0 replies; 29+ messages in thread
From: Josef Bacik @ 2024-06-05 15:35 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Bernd Schubert, Miklos Szeredi, Jingbo Xu,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	lege.wang, Matthew Wilcox (Oracle), linux-mm@kvack.org

On Wed, Jun 05, 2024 at 08:49:48AM +0300, Amir Goldstein wrote:
> On Wed, Jun 5, 2024 at 1:17 AM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Tue, Jun 04, 2024 at 11:39:17PM +0200, Bernd Schubert wrote:
> > >
> > >
> > > On 6/4/24 18:53, Josef Bacik wrote:
> > > > On Tue, Jun 04, 2024 at 04:13:25PM +0200, Bernd Schubert wrote:
> > > >>
> > > >>
> > > >> On 6/4/24 12:02, Miklos Szeredi wrote:
> > > >>> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> > > >>>
> > > >>>> Back to the background for the copy, so it copies pages to avoid
> > > >>>> blocking on memory reclaim. With that allocation it in fact increases
> > > >>>> memory pressure even more. Isn't the right solution to mark those pages
> > > >>>> as not reclaimable and to avoid blocking on it? Which is what the tmp
> > > >>>> pages do, just not in beautiful way.
> > > >>>
> > > >>> Copying to the tmp page is the same as marking the pages as
> > > >>> non-reclaimable and non-syncable.
> > > >>>
> > > >>> Conceptually it would be nice to only copy when there's something
> > > >>> actually waiting for writeback on the page.
> > > >>>
> > > >>> Note: normally the WRITE request would be copied to userspace along
> > > >>> with the contents of the pages very soon after starting writeback.
> > > >>> After this the contents of the page no longer matter, and we can just
> > > >>> clear writeback without doing the copy.
> > > >>>
> > > >>> But if the request gets stuck in the input queue before being copied
> > > >>> to userspace, then deadlock can still happen if the server blocks on
> > > >>> direct reclaim and won't continue with processing the queue.   And
> > > >>> sync(2) will also block in that case.
> > > >>>
> > > >>> So we'd somehow need to handle stuck WRITE requests.   I don't see an
> > > >>> easy way to do this "on demand", when something actually starts
> > > >>> waiting on PG_writeback.  Alternatively the page copy could be done
> > > >>> after a timeout, which is ugly, but much easier to implement.
> > > >>
> > > >> I think the timeout method would only work if we have already allocated
> > > >> the pages, under memory pressure page allocation might not work well.
> > > >> But then this still seems to be a workaround, because we don't take any
> > > >> less memory with these copied pages.
> > > >> I'm going to look into mm/ if there isn't a better solution.
> > > >
> > > > I've thought a bit about this, and I still don't have a good solution, so I'm
> > > > going to throw out my random thoughts and see if it helps us get to a good spot.
> > > >
> > > > 1. Generally we are moving away from GFP_NOFS/GFP_NOIO to instead use
> > > >    memalloc_*_save/memalloc_*_restore, so instead the process is marked being in
> > > >    these contexts.  We could do something similar for FUSE, tho this gets hairy
> > > >    with things that async off request handling to other threads (which is all of
> > > >    the FUSE file systems we have internally).  We'd need to have some way to
> > > >    apply this to an entire process group, but this could be a workable solution.
> > > >
> > >
> > > I'm not sure how either of of both (GFP_ and memalloc_) would work for
> > > userspace allocations.
> > > Wouldn't we basically need to have a feature to disable memory
> > > allocations for fuse userspace tasks? Hmm, maybe through mem_cgroup.
> > > Although even then, the file system might depend on other kernel
> > > resources (backend file system or block device or even network) that
> > > might do allocations on their own without the knowledge of the fuse server.
> > >
> >
> > Basically that only in the case that we're handling a request from memory
> > pressure we would invoke this, and then any allocation would automatically have
> > gfp_nofs protection because it's flagged at the task level.
> >
> > Again there's a lot of problems with this, like how do we set it for the task,
> > how does it work for threads etc.
> >
> > > > 2. Per-request timeouts.  This is something we're planning on tackling for other
> > > >    reasons, but it could fit nicely here to say "if this fuse fs has a
> > > >    per-request timeout, skip the copy".  That way we at least know we're upper
> > > >    bound on how long we would be "deadlocked".  I don't love this approach
> > > >    because it's still a deadlock until the timeout elapsed, but it's an idea.
> > >
> > > Hmm, how do we know "this fuse fs has a per-request timeout"? I don't
> > > think we could trust initialization flags set by userspace.
> > >
> >
> > It would be controlled by the kernel.  So at init time the fuse file system says
> > "my command timeout is 30 minutes."  Then the kernel enforces this by having a
> > per-request timeout, and once that 30 minutes elapses we cancel the request and
> > EIO it.  User space doesn't do anything beyond telling the kernel what it's
> > timeout is, so this would be safe.
> >
> 
> Maybe that would be better to configure by mounter, similar to nfs -otimeo
> and maybe consider opt-in to returning ETIMEDOUT in this case.
> At least nfsd will pass that error to nfs client and nfs client will retry.
> 
> Different applications (or network protocols) handle timeouts differently,
> so the timeout and error seems like a decision for the admin/mounter not
> for the fuse server, although there may be a fuse fs that would want to
> set the default timeout, as if to request the kernel to be its watchdog
> (i.e. do not expect me to take more than 30 min to handle any request).

Oh yeah, for sure.  I'm just saying that for the purposes of allowing the FUSE
daemon to be a little riskier with system resources, we base it off of whether
it opts in to command timeouts.

My plans are to have it be able to be set by the fuse daemon, or externally by a
sysadmin via sysfs.  Thanks,

Josef


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04 10:02             ` Miklos Szeredi
  2024-06-04 14:13               ` Bernd Schubert
@ 2024-08-22 17:00               ` Joanne Koong
  2024-08-22 21:01                 ` Joanne Koong
  2024-08-23  3:34               ` Jingbo Xu
  2 siblings, 1 reply; 29+ messages in thread
From: Joanne Koong @ 2024-08-22 17:00 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, Jingbo Xu, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang, Matthew Wilcox (Oracle),
	linux-mm@kvack.org

On Tue, Jun 4, 2024 at 3:02 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>
> > Back to the background for the copy, so it copies pages to avoid
> > blocking on memory reclaim. With that allocation it in fact increases
> > memory pressure even more. Isn't the right solution to mark those pages
> > as not reclaimable and to avoid blocking on it? Which is what the tmp
> > pages do, just not in beautiful way.
>
> Copying to the tmp page is the same as marking the pages as
> non-reclaimable and non-syncable.
>
> Conceptually it would be nice to only copy when there's something
> actually waiting for writeback on the page.
>
> Note: normally the WRITE request would be copied to userspace along
> with the contents of the pages very soon after starting writeback.
> After this the contents of the page no longer matter, and we can just
> clear writeback without doing the copy.
>
> But if the request gets stuck in the input queue before being copied
> to userspace, then deadlock can still happen if the server blocks on
> direct reclaim and won't continue with processing the queue.   And
> sync(2) will also block in that case.

Why doesn't it suffice to just check if the page is being reclaimed
and do the tmp page allocation only if it's under reclaim?

>
> So we'd somehow need to handle stuck WRITE requests.   I don't see an
> easy way to do this "on demand", when something actually starts
> waiting on PG_writeback.  Alternatively the page copy could be done
> after a timeout, which is ugly, but much easier to implement.
>
> Also splice from the fuse dev would need to copy those pages, but that
> shouldn't be a problem, since it's just moving the copy from one place
> to another.
>
> Thanks,
> Miklos


* Re: [HELP] FUSE writeback performance bottleneck
  2024-08-22 17:00               ` Joanne Koong
@ 2024-08-22 21:01                 ` Joanne Koong
  0 siblings, 0 replies; 29+ messages in thread
From: Joanne Koong @ 2024-08-22 21:01 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, Jingbo Xu, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang, Matthew Wilcox (Oracle),
	linux-mm@kvack.org

On Thu, Aug 22, 2024 at 10:00 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Tue, Jun 4, 2024 at 3:02 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> >
> > > Back to the background for the copy, so it copies pages to avoid
> > > blocking on memory reclaim. With that allocation it in fact increases
> > > memory pressure even more. Isn't the right solution to mark those pages
> > > as not reclaimable and to avoid blocking on it? Which is what the tmp
> > > pages do, just not in beautiful way.
> >
> > Copying to the tmp page is the same as marking the pages as
> > non-reclaimable and non-syncable.
> >
> > Conceptually it would be nice to only copy when there's something
> > actually waiting for writeback on the page.
> >
> > Note: normally the WRITE request would be copied to userspace along
> > with the contents of the pages very soon after starting writeback.
> > After this the contents of the page no longer matter, and we can just
> > clear writeback without doing the copy.
> >
> > But if the request gets stuck in the input queue before being copied
> > to userspace, then deadlock can still happen if the server blocks on
> > direct reclaim and won't continue with processing the queue.   And
> > sync(2) will also block in that case.
>
> Why doesn't it suffice to just check if the page is being reclaimed
> and do the tmp page allocation only if it's under reclaim?

Never mind, Josef explained it to me. I misunderstood what the
PG_reclaim flag does.
>
> >
> > So we'd somehow need to handle stuck WRITE requests.   I don't see an
> > easy way to do this "on demand", when something actually starts
> > waiting on PG_writeback.  Alternatively the page copy could be done
> > after a timeout, which is ugly, but much easier to implement.
> >
> > Also splice from the fuse dev would need to copy those pages, but that
> > shouldn't be a problem, since it's just moving the copy from one place
> > to another.
> >
> > Thanks,
> > Miklos


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04 10:02             ` Miklos Szeredi
  2024-06-04 14:13               ` Bernd Schubert
  2024-08-22 17:00               ` Joanne Koong
@ 2024-08-23  3:34               ` Jingbo Xu
  2024-09-13  0:00                 ` Joanne Koong
  2 siblings, 1 reply; 29+ messages in thread
From: Jingbo Xu @ 2024-08-23  3:34 UTC (permalink / raw)
  To: Miklos Szeredi, Bernd Schubert
  Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	lege.wang, Matthew Wilcox (Oracle), linux-mm@kvack.org



On 6/4/24 6:02 PM, Miklos Szeredi wrote:
> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> 
>> Back to the background for the copy, so it copies pages to avoid
>> blocking on memory reclaim. With that allocation it in fact increases
>> memory pressure even more. Isn't the right solution to mark those pages
>> as not reclaimable and to avoid blocking on it? Which is what the tmp
>> pages do, just not in beautiful way.
> 
> Copying to the tmp page is the same as marking the pages as
> non-reclaimable and non-syncable.
> 
> Conceptually it would be nice to only copy when there's something
> actually waiting for writeback on the page.
> 
> Note: normally the WRITE request would be copied to userspace along
> with the contents of the pages very soon after starting writeback.
> After this the contents of the page no longer matter, and we can just
> clear writeback without doing the copy.

OK this really deviates from my previous understanding of the deadlock
issue.  Previously I thought that *after* the server has received the
WRITE request, i.e. has copied the request and page content to
userspace, the server needs to allocate some memory to handle the WRITE
request, e.g. to make the data persistent on disk, or to send the data
to the remote storage.  It is the memory allocation at this point that
actually triggers direct memory reclaim (on the FUSE dirty page) and
causes a deadlock.  It seems that I misunderstood it.

If that's true, we can clear PG_writeback as soon as the whole request
along with the page content has been copied to userspace, and thus
eliminate the tmp page copying.
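As a rough userspace sketch of that lifecycle (a toy state machine; the names like wb_copied_to_server are invented for illustration, nothing here is actual kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed lifecycle: the PG_writeback analogue is
 * held only until the WRITE request (header + page contents) has been
 * copied to the server via /dev/fuse, instead of until the server
 * replies.  All identifiers are illustrative stand-ins. */
enum req_state { REQ_PENDING, REQ_SENT, REQ_DONE };

struct fuse_wb_req {
	enum req_state state;
	bool page_writeback;	/* stands in for PG_writeback on the folio */
};

/* Start writeback: mark the page and queue the request. */
static void wb_submit(struct fuse_wb_req *r)
{
	r->state = REQ_PENDING;
	r->page_writeback = true;
}

/* The server reads the request from /dev/fuse: once header and data
 * have been copied out, the page content no longer matters, so (per
 * the proposal) writeback can end here, with no tmp page copy. */
static void wb_copied_to_server(struct fuse_wb_req *r)
{
	r->state = REQ_SENT;
	r->page_writeback = false;	/* folio_end_writeback() analogue */
}

/* The server's reply only completes the request, not the writeback. */
static void wb_reply(struct fuse_wb_req *r)
{
	r->state = REQ_DONE;
}
```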

> 
> But if the request gets stuck in the input queue before being copied
> to userspace, then deadlock can still happen if the server blocks on
> direct reclaim and won't continue with processing the queue.   And
> sync(2) will also block in that case.
> 

Hi, Miklos,

Would you please give more details on how "the request can get stuck in
the input queue before being copied to userspace"?  Do you mean the
WRITE requests (submitted from writeback) are still pending in the
background/pending list, waiting to be processed by the server, while at
the same time the server is blocked from processing the queue, either
because it is blocked on direct reclaim (when handling *another*
request), or because it is a malicious server that refuses to process
any request?


-- 
Thanks,
Jingbo


* Re: [HELP] FUSE writeback performance bottleneck
  2024-06-04  7:27       ` Miklos Szeredi
  2024-06-04  7:36         ` Jingbo Xu
@ 2024-09-11  9:32         ` Jingbo Xu
  2024-09-12 23:18           ` Joanne Koong
  1 sibling, 1 reply; 29+ messages in thread
From: Jingbo Xu @ 2024-09-11  9:32 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Josef Bacik, Joanne Koong,
	Dave Chinner

Hi all,

On 6/4/24 3:27 PM, Miklos Szeredi wrote:
> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> 
>> IIUC, there are two sources that may cause deadlock:
>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
>> requests, which in turn triggers direct memory reclaim, and FUSE
>> writeback then - deadlock here
> 
> Yep, see the folio_wait_writeback() call deep in the guts of direct
> reclaim, which sleeps until the PG_writeback flag is cleared.  If that
> happens to be triggered by the writeback in question, then that's a
> deadlock.

After diving deep into the direct reclaim code, there are some insights
that may be helpful.

Back when support for fuse writeback was introduced, i.e. commit
3be5a52b30aa ("fuse: support writable mmap") in v2.6.26, direct reclaim
indeed unconditionally waited for the PG_writeback flag to be cleared.
At that time direct reclaim was implemented in a two-stage style:
stage 1) pass over the LRU list to start parallel writeback
asynchronously, and stage 2) synchronously wait for completion of the
writeback previously started.

This two-stage design and the unconditional waiting for the
PG_writeback flag to be cleared were removed by commit 41ac199 ("mm:
vmscan: do not stall on writeback during memory compaction") in v3.5.

Though the direct reclaim logic has continued to evolve and the waiting
was added back, now the stall will happen only when the reclaim is
triggered from kswapd or a memory cgroup.

Specifically, the stall will only happen under the following conditions
(see shrink_folio_list() for details):
1) the reclaim is performed by kswapd,
2) or it's a user process under a non-root memory cgroup (actually
cgroup_v1) with GFP_IO permitted
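Paraphrased as a predicate (a simplified restatement of the shrink_folio_list() logic described above; the boolean inputs are illustrative stand-ins for kernel state such as current_is_kswapd(), memcg membership, and gfp flags, not real kernel symbols):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified paraphrase of when shrink_folio_list() may stall waiting
 * for PG_writeback.  Inputs are stand-ins for kernel state. */
static bool may_stall_on_writeback(bool is_kswapd,
				   bool in_non_root_memcg_v1,
				   bool gfp_io_allowed)
{
	if (is_kswapd)
		return true;	/* condition 1: kswapd */
	/* condition 2: user process under a non-root (v1) memory
	 * cgroup with GFP_IO permitted */
	return in_non_root_memcg_v1 && gfp_io_allowed;
}
```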

Thus the potential deadlock actually does not exist (if I'm not wrong) if:
1) cgroup is not enabled,
2) or cgroup_v2 is actually used,
3) or (memory cgroup is enabled and attached upon cgroup_v1, but) the
fuse server actually resides under the root cgroup,
4) or (the fuse server resides under a non-root memory cgroup_v1, but)
the fuse server advertises itself as a PR_IO_FLUSHER[1]
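For condition 4, a fuse server opts in via prctl(2); a minimal sketch (PR_SET_IO_FLUSHER is a real API since Linux 5.6 and requires CAP_SYS_RESOURCE; the helper name here is made up):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57	/* since Linux 5.6 */
#endif

/* Mark the calling thread as an IO flusher, so allocations made while
 * it services IO are treated like GFP_NOIO and reclaim will not
 * recurse into filesystem writeback.  Needs CAP_SYS_RESOURCE. */
static int become_io_flusher(void)
{
	if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) != 0) {
		perror("PR_SET_IO_FLUSHER");
		return -errno;
	}
	return 0;
}
```

A fuse daemon would call this once per request-handling thread before reading /dev/fuse.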


Then we could consider adding a new feature bit indicating that any one
of the above conditions is met and thus the fuse server is safe from
the potential deadlock inside direct reclaim.  When this feature bit is
set, the kernel side could bypass the temp page copying when doing
writeback.


As for condition 4 (PR_IO_FLUSHER), there was a concern from
Miklos[2].  I think the new feature bit could be disabled by default,
and enabled only when the fuse server itself guarantees that it runs in
a safe deployment.  Even when it's enabled by mistake or by a malicious
fuse server, and thus causes a deadlock, maybe the sysadmin could still
abort the connection through the abort sysctl knob?


Just some insights and brainstorm here.


[1] https://lore.kernel.org/all/Zl4%2FOAsMiqB4LO0e@dread.disaster.area/
[2]
https://lore.kernel.org/all/CAJfpegvYpWuTbKOm1hoySHZocY+ki07EzcXBUX8kZx92T8W6uQ@mail.gmail.com/



-- 
Thanks,
Jingbo


* Re: [HELP] FUSE writeback performance bottleneck
  2024-09-11  9:32         ` Jingbo Xu
@ 2024-09-12 23:18           ` Joanne Koong
  2024-09-13  3:35             ` Jingbo Xu
  0 siblings, 1 reply; 29+ messages in thread
From: Joanne Koong @ 2024-09-12 23:18 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Miklos Szeredi, Bernd Schubert, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Josef Bacik, Dave Chinner

On Wed, Sep 11, 2024 at 2:32 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> Hi all,
>
> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
> > On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >
> >> IIUC, there are two sources that may cause deadlock:
> >> 1) the fuse server needs memory allocation when processing FUSE_WRITE
> >> requests, which in turn triggers direct memory reclaim, and FUSE
> >> writeback then - deadlock here
> >
> > Yep, see the folio_wait_writeback() call deep in the guts of direct
> > reclaim, which sleeps until the PG_writeback flag is cleared.  If that
> > happens to be triggered by the writeback in question, then that's a
> > deadlock.
>
> After diving deep into the direct reclaim code, there are some insights
> may be helpful.
>
> Back to the time when the support for fuse writeback is introduced, i.e.
> commit 3be5a52b30aa ("fuse: support writable mmap") since v2.6.26, the
> direct reclaim indeed unconditionally waits for PG_writeback flag being
> cleared.  At that time the direct reclaim is implemented in a two-stage
> style, stage 1) pass over the LRU list to start parallel writeback
> asynchronously, and stage 2) synchronously wait for completion of the
> writeback previously started.
>
> This two-stage design and the unconditionally waiting for PG_writeback
> flag being cleared is removed by commit 41ac199 ("mm: vmscan: do not
> stall on writeback during memory compaction") since v3.5.
>
> Though the direct reclaim logic continues to evolve and the waiting is
> added back, now the stall will happen only when the direct reclaim is
> triggered from kswapd or memory cgroup.
>
> Specifically the stall will only happen in following certain conditions
> (see shrink_folio_list() for details):
> 1) kswapd
> 2) or it's a user process under a non-root memory cgroup (actually
> cgroup_v1) with GFP_IO permitted
>
> Thus the potential deadlock does not exist actually (if I'm not wrong) if:
> 1) cgroup is not enabled
> 2) or cgroup_v2 is actually used
> 3) or (memory cgroup is enabled and is attached upon cgroup_v1) the fuse
> server actually resides under the root cgroup
> 4) or (the fuse server resides under a non-root memory cgroup_v1), but
> the fuse server advertises itself as a PR_IO_FLUSHER[1]
>
>
> Then we could considering adding a new feature bit indicating that any
> one of the above condition is met and thus the fuse server is safe from
> the potential deadlock inside direct reclaim.  When this feature bit is
> set, the kernel side could bypass the temp page copying when doing
> writeback.
>

Hi Jingbo, thanks for sharing your analysis of this.

Having the temp page copying gated on the conditions you mentioned
above seems a bit brittle to me. My understanding is that the mm code
for when it decides to stall or not stall can change anytime in the
future, in which case that seems like it could automatically break our
precondition assumptions. Additionally, if I'm understanding it
correctly, we also would need to know if the writeback is being
triggered from reclaim by kswapd - is there even a way in the kernel
to check that?

I'm wondering if there's some way we could tell if a folio is under
reclaim when we're writing it back. I'm not familiar yet with the
reclaim code, but my initial thoughts were whether it'd be possible to
repurpose the PG_reclaim flag, or perhaps use the fact that the folio
is not on any lru list as an indication that it's being reclaimed. We
could then just use the temp page in those cases, and skip the temp
page otherwise.

Could you also point me to where in the reclaim code we end up
invoking the writeback callback? I see pageout() calls ->writepage()
but I'm not seeing where we invoke ->writepages().


Thanks,
Joanne

>
> As for the condition 4 (PR_IO_FLUSHER), there was a concern from
> Miklos[2].  I think the new feature bit could be disabled by default,
> and enabled only when the fuse server itself guarantees that it is in a
> safe distribution condition.  Even when it's enabled either by a mistake
> or a malicious fuse server, and thus causes a deadlock, maybe the
> sysadmin could still abort the connection through the abort sysctl knob?
>
>
> Just some insights and brainstorm here.
>
>
> [1] https://lore.kernel.org/all/Zl4%2FOAsMiqB4LO0e@dread.disaster.area/
> [2]
> https://lore.kernel.org/all/CAJfpegvYpWuTbKOm1hoySHZocY+ki07EzcXBUX8kZx92T8W6uQ@mail.gmail.com/
>
>
>
> --
> Thanks,
> Jingbo


* Re: [HELP] FUSE writeback performance bottleneck
  2024-08-23  3:34               ` Jingbo Xu
@ 2024-09-13  0:00                 ` Joanne Koong
  2024-09-13  1:25                   ` Jingbo Xu
  0 siblings, 1 reply; 29+ messages in thread
From: Joanne Koong @ 2024-09-13  0:00 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Miklos Szeredi, Bernd Schubert, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang, Matthew Wilcox (Oracle),
	linux-mm@kvack.org

On Thu, Aug 22, 2024 at 8:34 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> On 6/4/24 6:02 PM, Miklos Szeredi wrote:
> > On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> >
> >> Back to the background for the copy, so it copies pages to avoid
> >> blocking on memory reclaim. With that allocation it in fact increases
> >> memory pressure even more. Isn't the right solution to mark those pages
> >> as not reclaimable and to avoid blocking on it? Which is what the tmp
> >> pages do, just not in beautiful way.
> >
> > Copying to the tmp page is the same as marking the pages as
> > non-reclaimable and non-syncable.
> >
> > Conceptually it would be nice to only copy when there's something
> > actually waiting for writeback on the page.
> >
> > Note: normally the WRITE request would be copied to userspace along
> > with the contents of the pages very soon after starting writeback.
> > After this the contents of the page no longer matter, and we can just
> > clear writeback without doing the copy.
>
> OK this really deviates from my previous understanding of the deadlock
> issue.  Previously I thought *after* the server has received the WRITE
> request, i.e. has copied the request and page content to userspace, the
> server needs to allocate some memory to handle the WRITE request, e.g.
> make the data persistent on disk, or send the data to the remote
> storage.  It is the memory allocation at this point that actually
> triggers a memory direct reclaim (on the FUSE dirty page) and causes a
> deadlock.  It seems that I misunderstand it.

I think your previous understanding is correct (or if not, then my
understanding of this is incorrect too lol).
The first write request makes it to userspace, and while the server is
in the middle of handling it, memory reclaim is triggered and pages
need to be written back. This leads to a SECOND write request
(e.g. writing back the reclaimed pages), but this second write request
will never be copied out to userspace, because the server is stuck
handling the first write request, waiting for the writeback bits of
the reclaimed pages to be unset; and those bits can only be unset once
the pages have been copied out to userspace, which only happens when
the server reads /dev/fuse for the next request.
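This wait-for cycle can be modeled as a toy graph (purely illustrative; the actors and the helper are invented for the sketch, and a cycle in the wait-for graph means deadlock):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the single-threaded deadlock: each actor is blocked on
 * at most one other actor.  SERVER is the fuse daemon; WB_REQ is the
 * second WRITE request sitting in the /dev/fuse input queue. */
enum actor { SERVER, WB_REQ, NACTORS };

/* waits_on[a] is the actor that 'a' is blocked on, or -1 if runnable.
 * Returns true when following the chain from 'start' loops back. */
static bool has_cycle(const int waits_on[NACTORS], int start)
{
	int cur = start;

	for (int i = 0; i <= NACTORS; i++) {
		if (waits_on[cur] < 0)
			return false;	/* someone can make progress */
		cur = waits_on[cur];
		if (cur == start)
			return true;	/* wait-for cycle: deadlock */
	}
	return true;
}
```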

>
> If that's true, we can clear PF_writeback as long as the whole request
> along with the page content has already been copied to userspace, and
> thus eliminate the tmp page copying.
>

I think the problem is that on a single-threaded server,  the pages
will not be copied out to userspace for the second request (aka
writing back the dirty reclaimed pages) since the server is stuck on
the first request.

> >
> > But if the request gets stuck in the input queue before being copied
> > to userspace, then deadlock can still happen if the server blocks on
> > direct reclaim and won't continue with processing the queue.   And
> > sync(2) will also block in that case.
> >
>
> Hi, Miklos,
>
> Would you please give more details on how "the request can get stuck in
> the input queue before being copied userspace"?  Do you mean the WRITE
> requests (submitted from writeback) are still pending in the
> background/pending list, waiting to be processed by the server, while at
> the same time the server gets blocked from processing the queue, either
> due to the server is blocked on direct reclaim (when handling *another*
> request), or it's a malicious server and refuses to process any request?
>
>
> --
> Thanks,
> Jingbo
>


* Re: [HELP] FUSE writeback performance bottleneck
  2024-09-13  0:00                 ` Joanne Koong
@ 2024-09-13  1:25                   ` Jingbo Xu
  0 siblings, 0 replies; 29+ messages in thread
From: Jingbo Xu @ 2024-09-13  1:25 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, Bernd Schubert, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, lege.wang, Matthew Wilcox (Oracle),
	linux-mm@kvack.org



On 9/13/24 8:00 AM, Joanne Koong wrote:
> On Thu, Aug 22, 2024 at 8:34 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> On 6/4/24 6:02 PM, Miklos Szeredi wrote:
>>> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>>
>>>> Back to the background for the copy, so it copies pages to avoid
>>>> blocking on memory reclaim. With that allocation it in fact increases
>>>> memory pressure even more. Isn't the right solution to mark those pages
>>>> as not reclaimable and to avoid blocking on it? Which is what the tmp
>>>> pages do, just not in beautiful way.
>>>
>>> Copying to the tmp page is the same as marking the pages as
>>> non-reclaimable and non-syncable.
>>>
>>> Conceptually it would be nice to only copy when there's something
>>> actually waiting for writeback on the page.
>>>
>>> Note: normally the WRITE request would be copied to userspace along
>>> with the contents of the pages very soon after starting writeback.
>>> After this the contents of the page no longer matter, and we can just
>>> clear writeback without doing the copy.
>>
>> OK this really deviates from my previous understanding of the deadlock
>> issue.  Previously I thought *after* the server has received the WRITE
>> request, i.e. has copied the request and page content to userspace, the
>> server needs to allocate some memory to handle the WRITE request, e.g.
>> make the data persistent on disk, or send the data to the remote
>> storage.  It is the memory allocation at this point that actually
>> triggers a memory direct reclaim (on the FUSE dirty page) and causes a
>> deadlock.  It seems that I misunderstand it.
> 
> I think your previous understanding is correct (or if not, then my
> understanding of this is incorrect too lol).
> The first write request makes it to userspace and when the server is
> in the middle of handling it, a memory reclaim is triggered where
> pages need to be written back. This leads to a SECOND write request
> (eg writing back the pages that are reclaimed) but this second write
> request will never be copied out to userspace because the server is
> stuck handling the first write request and waiting for the page
> reclaim bits of the reclaimed pages to be unset, but those reclaim
> bits can only be unset when the pages have been copied out to
> userspace, which only happens when the server reads /dev/fuse for the
> next request.

Right, that's true.

> 
>>
>> If that's true, we can clear PF_writeback as long as the whole request
>> along with the page content has already been copied to userspace, and
>> thus eliminate the tmp page copying.
>>
> 
> I think the problem is that on a single-threaded server,  the pages
> will not be copied out to userspace for the second request (aka
> writing back the dirty reclaimed pages) since the server is stuck on
> the first request.

Agreed.


-- 
Thanks,
Jingbo


* Re: [HELP] FUSE writeback performance bottleneck
  2024-09-12 23:18           ` Joanne Koong
@ 2024-09-13  3:35             ` Jingbo Xu
  2024-09-13 20:55               ` Joanne Koong
  0 siblings, 1 reply; 29+ messages in thread
From: Jingbo Xu @ 2024-09-13  3:35 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, Bernd Schubert, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Josef Bacik, Dave Chinner



On 9/13/24 7:18 AM, Joanne Koong wrote:
> On Wed, Sep 11, 2024 at 2:32 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> Hi all,
>>
>> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
>>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>>> IIUC, there are two sources that may cause deadlock:
>>>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
>>>> requests, which in turn triggers direct memory reclaim, and FUSE
>>>> writeback then - deadlock here
>>>
>>> Yep, see the folio_wait_writeback() call deep in the guts of direct
>>> reclaim, which sleeps until the PG_writeback flag is cleared.  If that
>>> happens to be triggered by the writeback in question, then that's a
>>> deadlock.
>>
>> After diving deep into the direct reclaim code, there are some insights
>> may be helpful.
>>
>> Back to the time when the support for fuse writeback is introduced, i.e.
>> commit 3be5a52b30aa ("fuse: support writable mmap") since v2.6.26, the
>> direct reclaim indeed unconditionally waits for PG_writeback flag being
>> cleared.  At that time the direct reclaim is implemented in a two-stage
>> style, stage 1) pass over the LRU list to start parallel writeback
>> asynchronously, and stage 2) synchronously wait for completion of the
>> writeback previously started.
>>
>> This two-stage design and the unconditionally waiting for PG_writeback
>> flag being cleared is removed by commit 41ac199 ("mm: vmscan: do not
>> stall on writeback during memory compaction") since v3.5.
>>
>> Though the direct reclaim logic continues to evolve and the waiting is
>> added back, now the stall will happen only when the direct reclaim is
>> triggered from kswapd or memory cgroup.
>>
>> Specifically the stall will only happen in following certain conditions
>> (see shrink_folio_list() for details):
>> 1) kswapd
>> 2) or it's a user process under a non-root memory cgroup (actually
>> cgroup_v1) with GFP_IO permitted
>>
>> Thus the potential deadlock does not exist actually (if I'm not wrong) if:
>> 1) cgroup is not enabled
>> 2) or cgroup_v2 is actually used
>> 3) or (memory cgroup is enabled and is attached upon cgroup_v1) the fuse
>> server actually resides under the root cgroup
>> 4) or (the fuse server resides under a non-root memory cgroup_v1), but
>> the fuse server advertises itself as a PR_IO_FLUSHER[1]
>>
>>
>> Then we could considering adding a new feature bit indicating that any
>> one of the above condition is met and thus the fuse server is safe from
>> the potential deadlock inside direct reclaim.  When this feature bit is
>> set, the kernel side could bypass the temp page copying when doing
>> writeback.
>>
> 
> Hi Jingbo, thanks for sharing your analysis of this.
> 
> Having the temp page copying gated on the conditions you mentioned
> above seems a bit brittle to me. My understanding is that the mm code
> for when it decides to stall or not stall can change anytime in the
> future, in which case that seems like it could automatically break our
> precondition assumptions.

So this is why PR_IO_FLUSHER was introduced, which is specifically for
user space components playing a role in the IO stack, e.g. fuse
daemons, tcmu/nbd daemons, etc.  PR_IO_FLUSHER offers a guarantee
similar to GFP_NOIO, but for user space components.  At least we can
rely on the assumption that mm will keep taking PR_IO_FLUSHER into
account.

The limitation of the PR_IO_FLUSHER approach is that, as pointed out by
Miklos[1], there may be multiple components or services involved in
servicing the fuse requests, and the kernel side has no effective way
to check that all services in the whole chain have set PR_IO_FLUSHER.


> Additionally, if I'm understanding it
> correctly, we also would need to know if the writeback is being
> triggered from reclaim by kswapd - is there even a way in the kernel
> to check that?

Nope.  What I meant in the previous email is that kswapd can get
stalled during reclaim, while a normal process, e.g. the fuse server,
may not get stalled under certain conditions, e.g. when it explicitly
advertises PR_IO_FLUSHER.

> 
> I'm wondering if there's some way we could tell if a folio is under
> reclaim when we're writing it back. I'm not familiar yet with the
> reclaim code, but my initial thoughts were whether it'd be possible to
> purpose the PG_reclaim flag or perhaps if the folio is not on any lru
> list, as an indication that it's being reclaimed. We could then just
> use the temp page in those cases, and skip the temp page otherwise.

That is a good idea but I'm afraid it doesn't work.  Explained below.

> 
> Could you also point me to where in the reclaim code we end up
> invoking the writeback callback? I see pageout() calls ->writepage()
> but I'm not seeing where we invoke ->writepages().

Yes, direct reclaim would end up calling ->writepage() to write back
the dirty page.  ->writepages() is only called in the normal writeback
routine, e.g. when triggered from balance_dirty_pages().

Also FYI, FUSE has removed ->writepage() since commit e1c420a ("fuse:
Remove fuse_writepage"); it now relies on ->migrate_folio() for memory
compaction and on the normal writeback routine (triggered from
balance_dirty_pages()) under memory pressure.

Thus I'm afraid the approach of doing the temp page copying only for
writeback submitted from the direct reclaim code doesn't actually work.
That's because when doing direct reclaim, the process not only waits
for completion of the writeback submitted from direct reclaim (e.g.
marked with PG_reclaim, by ->writepage), but may also wait for
writeback submitted from the normal writeback routine (without
PG_reclaim marked, by ->writepages). See commit c3b94f4 ("memcg:
further prevent OOM with too many dirty pages").



[1]
https://lore.kernel.org/all/CAJfpegvYpWuTbKOm1hoySHZocY+ki07EzcXBUX8kZx92T8W6uQ@mail.gmail.com/

-- 
Thanks,
Jingbo


* Re: [HELP] FUSE writeback performance bottleneck
  2024-09-13  3:35             ` Jingbo Xu
@ 2024-09-13 20:55               ` Joanne Koong
  2024-10-11 23:08                 ` Joanne Koong
  0 siblings, 1 reply; 29+ messages in thread
From: Joanne Koong @ 2024-09-13 20:55 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Miklos Szeredi, Bernd Schubert, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Josef Bacik, Dave Chinner

On Thu, Sep 12, 2024 at 8:35 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> On 9/13/24 7:18 AM, Joanne Koong wrote:
> > On Wed, Sep 11, 2024 at 2:32 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >> Hi all,
> >>
> >> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
> >>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>
> >>>> IIUC, there are two sources that may cause deadlock:
> >>>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
> >>>> requests, which in turn triggers direct memory reclaim, and FUSE
> >>>> writeback then - deadlock here
> >>>
> >>> Yep, see the folio_wait_writeback() call deep in the guts of direct
> >>> reclaim, which sleeps until the PG_writeback flag is cleared.  If that
> >>> happens to be triggered by the writeback in question, then that's a
> >>> deadlock.
> >>
> >> After diving deep into the direct reclaim code, there are some insights
> >> may be helpful.
> >>
> >> Back to the time when the support for fuse writeback is introduced, i.e.
> >> commit 3be5a52b30aa ("fuse: support writable mmap") since v2.6.26, the
> >> direct reclaim indeed unconditionally waits for PG_writeback flag being
> >> cleared.  At that time the direct reclaim is implemented in a two-stage
> >> style, stage 1) pass over the LRU list to start parallel writeback
> >> asynchronously, and stage 2) synchronously wait for completion of the
> >> writeback previously started.
> >>
> >> This two-stage design and the unconditionally waiting for PG_writeback
> >> flag being cleared is removed by commit 41ac199 ("mm: vmscan: do not
> >> stall on writeback during memory compaction") since v3.5.
> >>
> >> Though the direct reclaim logic continues to evolve and the waiting is
> >> added back, now the stall will happen only when the direct reclaim is
> >> triggered from kswapd or memory cgroup.
> >>
> >> Specifically the stall will only happen in following certain conditions
> >> (see shrink_folio_list() for details):
> >> 1) kswapd
> >> 2) or it's a user process under a non-root memory cgroup (actually
> >> cgroup_v1) with GFP_IO permitted
> >>
> >> Thus the potential deadlock does not exist actually (if I'm not wrong) if:
> >> 1) cgroup is not enabled
> >> 2) or cgroup_v2 is actually used
> >> 3) or (memory cgroup is enabled and is attached upon cgroup_v1) the fuse
> >> server actually resides under the root cgroup
> >> 4) or (the fuse server resides under a non-root memory cgroup_v1), but
> >> the fuse server advertises itself as a PR_IO_FLUSHER[1]
> >>
> >>
> >> Then we could considering adding a new feature bit indicating that any
> >> one of the above condition is met and thus the fuse server is safe from
> >> the potential deadlock inside direct reclaim.  When this feature bit is
> >> set, the kernel side could bypass the temp page copying when doing
> >> writeback.
> >>
> >
> > Hi Jingbo, thanks for sharing your analysis of this.
> >
> > Having the temp page copying gated on the conditions you mentioned
> > above seems a bit brittle to me. My understanding is that the mm code
> > for when it decides to stall or not stall can change anytime in the
> > future, in which case that seems like it could automatically break our
> > precondition assumptions.
>
> So this is why PR_IO_FLUSHER is introduced here, which is specifically
> for user space components playing a role in IO stack, e.g. fuse daemon,
> tcmu/nbd daemon, etc.  PR_IO_FLUSHER offers guarantee similar to
> GFP_NOIO, but for user space components.  At least we can rely on the
> assumption that mm would take PR_IO_FLUSHER into account.
>
> The limitation of the PR_IO_FLUSHER approach is that, as pointed by
> Miklos[1], there may be multiple components or services involved to
> service the fuse requests, and the kernel side has no effective way to
> check if all services in the whole chain have set PR_IO_FLUSHER.
>

Right, so doesn't that still bring us back to the original problem? If
we gate this on any one of the conditions being enough to bypass
needing the temp page, then if the conditions change anytime in the
future in the mm code, this would automatically open up the potential
deadlock in fuse as a byproduct. That dependency seems a bit brittle to
me.

The other alternatives seem to be:
* adding a timer to writeback requests [1] where, if the pages have not
been copied out to userspace within a certain amount of time, the
handler copies those pages to temporary pages and immediately clears
writeback on the pages. The timer is canceled as soon as the pages are
copied out to userspace.
* (not sure how possible this is) adding some way to tag pages being
reclaimed/balanced (I saw your comment below about the
->migrate_folio() call, which I need to look more into)

The timeout option seems like the most promising one. I don't think
the code would be that ugly.
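A rough userspace model of the timer idea (all names are hypothetical; in the kernel the timer would presumably be armed at writeback submission and canceled when the request is copied out on /dev/fuse):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed timeout fallback: writeback normally ends
 * when the request is copied to userspace; if a deadline passes first,
 * fall back to copying into a tmp page and end writeback early. */
struct wb_req {
	bool writeback_set;	/* PG_writeback analogue */
	bool used_tmp_page;	/* did we fall back to the copy? */
	int  deadline;		/* illustrative timeout, in ticks */
};

static void wb_start(struct wb_req *r, int deadline)
{
	r->writeback_set = true;
	r->used_tmp_page = false;
	r->deadline = deadline;
}

/* Server copied the request out: cancel the timer implicitly and end
 * writeback with no extra copy (the common, fast path). */
static void wb_copied_out(struct wb_req *r)
{
	r->writeback_set = false;
}

/* Timer tick: once the deadline expires with the request still queued,
 * copy to a tmp page and end writeback so reclaim/sync can proceed. */
static void wb_tick(struct wb_req *r, int now)
{
	if (r->writeback_set && now >= r->deadline) {
		r->used_tmp_page = true;	/* copy_highpage() analogue */
		r->writeback_set = false;
	}
}
```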

Curious to hear your thoughts on this. Are there any other
alternatives you think could work here?


[1] https://lore.kernel.org/all/CAJfpegt_mEYOeeTo2bWS3iJfC38t5bf29mzrxK68dhMptrgamg@mail.gmail.com/

>
> > Additionally, if I'm understanding it
> > correctly, we also would need to know if the writeback is being
> > triggered from reclaim by kswapd - is there even a way in the kernel
> > to check that?
>
> Nope.  What I mean in the previous email is that, kswapd can get stalled
> in direct reclaim, while the normal process, e.g. the fuse server, may
> not get stalled in certain condition, e.g. explicitly advertising
> PR_IO_FLUSHER.
>

Gotcha. I just took a look at shrink_folio_list() and now I see the
"current_is_kswapd()" check.

> >
> > I'm wondering if there's some way we could tell if a folio is under
> > reclaim when we're writing it back. I'm not familiar yet with the
> > reclaim code, but my initial thoughts were whether it'd be possible to
> > purpose the PG_reclaim flag or perhaps if the folio is not on any lru
> > list, as an indication that it's being reclaimed. We could then just
> > use the temp page in those cases, and skip the temp page otherwise.
>
> That is a good idea but I'm afraid it doesn't works.  Explained below.
>
> >
> > Could you also point me to where in the reclaim code we end up
> > invoking the writeback callback? I see pageout() calls ->writepage()
> > but I'm not seeing where we invoke ->writepages().
>
> Yes, the direct reclaim would end up calling ->writepage() to writeback
> the dirty page.  ->writepages() is only called in normal writeback
> routine, e.g. when triggered from balance_dirty_page().
>
> Also FYI FUSE has removed ->writepage() since commit e1c420a ("fuse:
> Remove fuse_writepage"), and now it relies on ->migrate_folio(), i.e.
> memory compacting and the normal writeback routine (triggered from
> balance_dirty_page()) in low memory.
>
> Thus I'm afraid the approach of doing the temp page copying only for
> writeback from the direct reclaim code doesn't actually work.  That's
> because when doing direct reclaim, the process not only waits for
> completion of the writeback submitted from direct reclaim (e.g. marked
> with PG_reclaim, by ->writepage), but may also wait for writeback
> submitted from the normal writeback routine (without PG_reclaim marked,
> by ->writepages).  See commit c3b94f4 ("memcg: further prevent OOM with
> too many dirty pages").
>

Thanks for the explanation! This is very helpful. The reliance on
->migrate_folio() for reclaim is the piece I was missing.

>
>
> [1]
> https://lore.kernel.org/all/CAJfpegvYpWuTbKOm1hoySHZocY+ki07EzcXBUX8kZx92T8W6uQ@mail.gmail.com/
>
> --
> Thanks,
> Jingbo

Thanks,
Joanne

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [HELP] FUSE writeback performance bottleneck
  2024-09-13 20:55               ` Joanne Koong
@ 2024-10-11 23:08                 ` Joanne Koong
  2024-10-14  1:57                   ` Jingbo Xu
  0 siblings, 1 reply; 29+ messages in thread
From: Joanne Koong @ 2024-10-11 23:08 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Miklos Szeredi, Bernd Schubert, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Josef Bacik, Dave Chinner

On Fri, Sep 13, 2024 at 1:55 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Sep 12, 2024 at 8:35 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >
> > On 9/13/24 7:18 AM, Joanne Koong wrote:
> > > On Wed, Sep 11, 2024 at 2:32 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
> > >>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > >>>
> > >>>> IIUC, there are two sources that may cause deadlock:
> > >>>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
> > >>>> requests, which in turn triggers direct memory reclaim, and FUSE
> > >>>> writeback then - deadlock here
> > >>>
> > >>> Yep, see the folio_wait_writeback() call deep in the guts of direct
> > >>> reclaim, which sleeps until the PG_writeback flag is cleared.  If that
> > >>> happens to be triggered by the writeback in question, then that's a
> > >>> deadlock.
> > >>
> > >> After diving deep into the direct reclaim code, there are some insights
> > >> that may be helpful.
> > >>
> > >> Back when support for fuse writeback was introduced, i.e. in
> > >> commit 3be5a52b30aa ("fuse: support writable mmap") since v2.6.26,
> > >> direct reclaim indeed unconditionally waited for the PG_writeback
> > >> flag to be cleared.  At that time direct reclaim was implemented in a
> > >> two-stage style: stage 1) pass over the LRU list to start parallel
> > >> writeback asynchronously, and stage 2) synchronously wait for
> > >> completion of the writeback previously started.
> > >>
> > >> This two-stage design, and the unconditional waiting for the
> > >> PG_writeback flag to be cleared, was removed by commit 41ac199 ("mm:
> > >> vmscan: do not stall on writeback during memory compaction") in v3.5.
> > >>
> > >> Though the direct reclaim logic has continued to evolve and the
> > >> waiting was added back, the stall now happens only when direct reclaim
> > >> is triggered from kswapd or a memory cgroup.
> > >>
> > >> Specifically the stall will only happen under the following conditions
> > >> (see shrink_folio_list() for details):
> > >> 1) kswapd
> > >> 2) or it's a user process under a non-root memory cgroup (actually
> > >> cgroup_v1) with GFP_IO permitted
> > >>
> > >> Thus the potential deadlock does not actually exist (if I'm not wrong) if:
> > >> 1) cgroup is not enabled
> > >> 2) or cgroup_v2 is actually used
> > >> 3) or (memory cgroup is enabled and attached upon cgroup_v1) the fuse
> > >> server actually resides under the root cgroup
> > >> 4) or (the fuse server resides under a non-root memory cgroup_v1) but
> > >> the fuse server advertises itself as a PR_IO_FLUSHER[1]
> > >>
> > >>
> > >> Then we could consider adding a new feature bit indicating that any
> > >> one of the above conditions is met and thus the fuse server is safe
> > >> from the potential deadlock inside direct reclaim.  When this feature
> > >> bit is set, the kernel side could bypass the temp page copying when
> > >> doing writeback.
> > >>
> > >
> > > Hi Jingbo, thanks for sharing your analysis of this.
> > >
> > > Having the temp page copying gated on the conditions you mentioned
> > > above seems a bit brittle to me. My understanding is that the mm code
> > > that decides whether or not to stall can change at any time in the
> > > future, which could silently break our precondition assumptions.
> >
> > So this is why PR_IO_FLUSHER was introduced, specifically for user
> > space components playing a role in the IO stack, e.g. fuse daemons,
> > tcmu/nbd daemons, etc.  PR_IO_FLUSHER offers a guarantee similar to
> > GFP_NOIO, but for user space components.  At least we can rely on the
> > assumption that mm will take PR_IO_FLUSHER into account.
> >
> > The limitation of the PR_IO_FLUSHER approach is that, as pointed out by
> > Miklos[1], there may be multiple components or services involved in
> > servicing the fuse requests, and the kernel side has no effective way to
> > check whether all services in the whole chain have set PR_IO_FLUSHER.
> >
>
> Right, so doesn't that still bring us back to the original problem?  If
> we gate bypassing the temp page on any one of those conditions being
> met, then any future change to those conditions in the mm code would
> automatically reopen the potential deadlock in fuse as a byproduct.
> That dependency seems a bit brittle to me.
>

Hi Jingbo,

I had some talks with Josef about this during/after LPC and he came up
with the idea of adding a flag to 'struct address_space_operations' to
indicate that a folio under writeback should be skipped during reclaim
when we hit that 3rd case of legacy cgroupv1 encountering a folio that
has been marked for reclaim.  IMO this seems like the most elegant
solution, and it allows us to remove the temporary folio and the rb tree
entirely.  I sent out a patch for this, which I cc'ed you on:
https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@gmail.com/
The benchmarks show roughly a 20% improvement in throughput for 4k block
size writes and a 40% improvement for 1M block size writes.  I'm curious
to see whether this speeds up writeback for you as well on the workloads
you're running.


Thanks,
Joanne

> The other alternatives seem to be:
> * adding a timer to writeback requests [1], where if the pages have not
> been copied out to userspace within a certain amount of time, the
> handler copies those pages to temporary pages and immediately clears
> writeback on them. The timer is canceled as soon as the pages are
> copied out to userspace.
> * (not sure how possible this is) adding some way to tag pages being
> reclaimed/balanced (I saw your comment below about the
> ->migrate_folio() call, which I need to look more into)
>
> The timeout option seems like the most promising one. I don't think
> the code would be that ugly.
>
> Curious to hear your thoughts on this. Are there any other
> alternatives you think could work here?
>
>
> [1] https://lore.kernel.org/all/CAJfpegt_mEYOeeTo2bWS3iJfC38t5bf29mzrxK68dhMptrgamg@mail.gmail.com/
>
> >
> > > Additionally, if I'm understanding it
> > > correctly, we also would need to know if the writeback is being
> > > triggered from reclaim by kswapd - is there even a way in the kernel
> > > to check that?
> >
> > Nope.  What I meant in the previous email is that kswapd can get stalled
> > in direct reclaim, while a normal process, e.g. the fuse server, may
> > not get stalled under certain conditions, e.g. when it explicitly
> > advertises PR_IO_FLUSHER.
> >
>
> Gotcha. I just took a look at shrink_folio_list() and now I see the
> "current_is_kswapd()" check.
>
> > >
> > > I'm wondering if there's some way we could tell if a folio is under
> > > reclaim when we're writing it back. I'm not familiar yet with the
> > > reclaim code, but my initial thoughts were whether it'd be possible to
> > > purpose the PG_reclaim flag or perhaps if the folio is not on any lru
> > > list, as an indication that it's being reclaimed. We could then just
> > > use the temp page in those cases, and skip the temp page otherwise.
> >
> > That is a good idea but I'm afraid it doesn't work.  Explained below.
> >
> > >
> > > Could you also point me to where in the reclaim code we end up
> > > invoking the writeback callback? I see pageout() calls ->writepage()
> > > but I'm not seeing where we invoke ->writepages().
> >
> > Yes, direct reclaim would end up calling ->writepage() to write back
> > the dirty page.  ->writepages() is only called in the normal writeback
> > routine, e.g. when triggered from balance_dirty_pages().
> >
> > Also FYI, FUSE removed ->writepage() in commit e1c420a ("fuse:
> > Remove fuse_writepage"); it now relies on ->migrate_folio(), i.e.
> > memory compaction, and on the normal writeback routine (triggered from
> > balance_dirty_pages()) under low memory.
> >
> > Thus I'm afraid the approach of doing the temp page copying only for
> > writeback from the direct reclaim code doesn't actually work.  That's
> > because when doing direct reclaim, the process not only waits for
> > completion of the writeback submitted from direct reclaim (e.g. marked
> > with PG_reclaim, by ->writepage), but may also wait for writeback
> > submitted from the normal writeback routine (without PG_reclaim marked,
> > by ->writepages).  See commit c3b94f4 ("memcg: further prevent OOM with
> > too many dirty pages").
> >
>
> Thanks for the explanation! This is very helpful. The reliance on
> ->migrate_folio() for reclaim is the piece I was missing.
>
> >
> >
> > [1]
> > https://lore.kernel.org/all/CAJfpegvYpWuTbKOm1hoySHZocY+ki07EzcXBUX8kZx92T8W6uQ@mail.gmail.com/
> >
> > --
> > Thanks,
> > Jingbo
>
> Thanks,
> Joanne

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [HELP] FUSE writeback performance bottleneck
  2024-10-11 23:08                 ` Joanne Koong
@ 2024-10-14  1:57                   ` Jingbo Xu
  0 siblings, 0 replies; 29+ messages in thread
From: Jingbo Xu @ 2024-10-14  1:57 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, Bernd Schubert, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Josef Bacik, Dave Chinner



On 10/12/24 7:08 AM, Joanne Koong wrote:
> On Fri, Sep 13, 2024 at 1:55 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>> On Thu, Sep 12, 2024 at 8:35 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>> On 9/13/24 7:18 AM, Joanne Koong wrote:
>>>> On Wed, Sep 11, 2024 at 2:32 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
>>>>>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>
>>>>>>> IIUC, there are two sources that may cause deadlock:
>>>>>>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
>>>>>>> requests, which in turn triggers direct memory reclaim, and FUSE
>>>>>>> writeback then - deadlock here
>>>>>>
>>>>>> Yep, see the folio_wait_writeback() call deep in the guts of direct
>>>>>> reclaim, which sleeps until the PG_writeback flag is cleared.  If that
>>>>>> happens to be triggered by the writeback in question, then that's a
>>>>>> deadlock.
>>>>>
>>>>> After diving deep into the direct reclaim code, there are some insights
>>>>> that may be helpful.
>>>>>
>>>>> Back when support for fuse writeback was introduced, i.e. in
>>>>> commit 3be5a52b30aa ("fuse: support writable mmap") since v2.6.26,
>>>>> direct reclaim indeed unconditionally waited for the PG_writeback
>>>>> flag to be cleared.  At that time direct reclaim was implemented in a
>>>>> two-stage style: stage 1) pass over the LRU list to start parallel
>>>>> writeback asynchronously, and stage 2) synchronously wait for
>>>>> completion of the writeback previously started.
>>>>>
>>>>> This two-stage design, and the unconditional waiting for the
>>>>> PG_writeback flag to be cleared, was removed by commit 41ac199 ("mm:
>>>>> vmscan: do not stall on writeback during memory compaction") in v3.5.
>>>>>
>>>>> Though the direct reclaim logic has continued to evolve and the
>>>>> waiting was added back, the stall now happens only when direct reclaim
>>>>> is triggered from kswapd or a memory cgroup.
>>>>>
>>>>> Specifically the stall will only happen under the following conditions
>>>>> (see shrink_folio_list() for details):
>>>>> 1) kswapd
>>>>> 2) or it's a user process under a non-root memory cgroup (actually
>>>>> cgroup_v1) with GFP_IO permitted
>>>>>
>>>>> Thus the potential deadlock does not actually exist (if I'm not wrong) if:
>>>>> 1) cgroup is not enabled
>>>>> 2) or cgroup_v2 is actually used
>>>>> 3) or (memory cgroup is enabled and attached upon cgroup_v1) the fuse
>>>>> server actually resides under the root cgroup
>>>>> 4) or (the fuse server resides under a non-root memory cgroup_v1) but
>>>>> the fuse server advertises itself as a PR_IO_FLUSHER[1]
>>>>>
>>>>>
>>>>> Then we could consider adding a new feature bit indicating that any
>>>>> one of the above conditions is met and thus the fuse server is safe
>>>>> from the potential deadlock inside direct reclaim.  When this feature
>>>>> bit is set, the kernel side could bypass the temp page copying when
>>>>> doing writeback.
>>>>>
>>>>
>>>> Hi Jingbo, thanks for sharing your analysis of this.
>>>>
>>>> Having the temp page copying gated on the conditions you mentioned
>>>> above seems a bit brittle to me. My understanding is that the mm code
>>>> that decides whether or not to stall can change at any time in the
>>>> future, which could silently break our precondition assumptions.
>>>
>>> So this is why PR_IO_FLUSHER was introduced, specifically for user
>>> space components playing a role in the IO stack, e.g. fuse daemons,
>>> tcmu/nbd daemons, etc.  PR_IO_FLUSHER offers a guarantee similar to
>>> GFP_NOIO, but for user space components.  At least we can rely on the
>>> assumption that mm will take PR_IO_FLUSHER into account.
>>>
>>> The limitation of the PR_IO_FLUSHER approach is that, as pointed out by
>>> Miklos[1], there may be multiple components or services involved in
>>> servicing the fuse requests, and the kernel side has no effective way to
>>> check whether all services in the whole chain have set PR_IO_FLUSHER.
>>>
>>
>> Right, so doesn't that still bring us back to the original problem?  If
>> we gate bypassing the temp page on any one of those conditions being
>> met, then any future change to those conditions in the mm code would
>> automatically reopen the potential deadlock in fuse as a byproduct.
>> That dependency seems a bit brittle to me.
>>
> 
> Hi Jingbo,
> 
> I had some talks with Josef about this during/after LPC and he came up
> with the idea of adding a flag to 'struct address_space_operations' to
> indicate that a folio under writeback should be skipped during reclaim
> when we hit that 3rd case of legacy cgroupv1 encountering a folio that
> has been marked for reclaim.  IMO this seems like the most elegant
> solution, and it allows us to remove the temporary folio and the rb tree
> entirely.  I sent out a patch for this, which I cc'ed you on:
> https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@gmail.com/
> The benchmarks show roughly a 20% improvement in throughput for 4k block
> size writes and a 40% improvement for 1M block size writes.  I'm curious
> to see whether this speeds up writeback for you as well on the workloads
> you're running.
> 

Yeah, I just saw your post this morning.  It would be best if the mm
folks are happy with the modification to the memory reclaim code.

I haven't tested your patch yet; I will test it later.  But at least my
previous test of dropping the temp page copying showed a ~100% 1M write
bandwidth improvement (4GB/s -> 9GB/s), though it was tested on top of
Bernd's passthrough_hp [1], which bypasses all buffer copying (i.e. the
daemon replies immediately without copying the data buffer).

[1] https://github.com/libfuse/libfuse/pull/807

-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2024-10-14  1:57 UTC | newest]

Thread overview: 29+ messages
2024-06-03  6:17 [HELP] FUSE writeback performance bottleneck Jingbo Xu
2024-06-03 14:43 ` Bernd Schubert
2024-06-03 15:19   ` Miklos Szeredi
2024-06-03 15:32     ` Bernd Schubert
2024-06-03 22:10     ` Dave Chinner
2024-06-04  7:20       ` Miklos Szeredi
2024-06-04  1:57     ` Jingbo Xu
2024-06-04  7:27       ` Miklos Szeredi
2024-06-04  7:36         ` Jingbo Xu
2024-06-04  9:32           ` Bernd Schubert
2024-06-04 10:02             ` Miklos Szeredi
2024-06-04 14:13               ` Bernd Schubert
2024-06-04 16:53                 ` Josef Bacik
2024-06-04 21:39                   ` Bernd Schubert
2024-06-04 22:16                     ` Josef Bacik
2024-06-05  5:49                       ` Amir Goldstein
2024-06-05 15:35                         ` Josef Bacik
2024-08-22 17:00               ` Joanne Koong
2024-08-22 21:01                 ` Joanne Koong
2024-08-23  3:34               ` Jingbo Xu
2024-09-13  0:00                 ` Joanne Koong
2024-09-13  1:25                   ` Jingbo Xu
2024-06-04 12:24             ` Jingbo Xu
2024-09-11  9:32         ` Jingbo Xu
2024-09-12 23:18           ` Joanne Koong
2024-09-13  3:35             ` Jingbo Xu
2024-09-13 20:55               ` Joanne Koong
2024-10-11 23:08                 ` Joanne Koong
2024-10-14  1:57                   ` Jingbo Xu
