public inbox for linux-fsdevel@vger.kernel.org
From: Joanne Koong <joannelkoong@gmail.com>
To: Bernd Schubert <bernd@bsbernd.com>
Cc: Horst Birthelmer <horst@birthelmer.com>,
	Miklos Szeredi <miklos@szeredi.hu>,
	 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	 Horst Birthelmer <hbirthelmer@ddn.com>
Subject: Re: [PATCH] fuse: when copying a folio delay the mark dirty until the end
Date: Wed, 18 Mar 2026 18:32:25 -0700	[thread overview]
Message-ID: <CAJnrk1aWgWxzh0ETFkZXvJcfHFAC8KESEBLXa2S=ZqDfSSssvA@mail.gmail.com> (raw)
In-Reply-To: <60103445-0d45-427c-aa00-2fa79207b129@bsbernd.com>

On Wed, Mar 18, 2026 at 2:52 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>
> Hi Joanne,
>
> On 3/18/26 22:19, Joanne Koong wrote:
> > On Wed, Mar 18, 2026 at 7:03 AM Horst Birthelmer <horst@birthelmer.de> wrote:
> >>
> >> Hi Joanne,
> >>
> >> I wonder, would something like this help for large folios?
> >
> > Hi Horst,
> >
> > I don't think it's likely that the pages backing the userspace buffer
> > are large folios, so I think this may actually add extra overhead with
> > the extra folio_test_dirty() check.
> >
> > From what I've seen, the main cost that dwarfs everything else for
> > writes/reads is the actual IO, the context switches, and the memcpys.
> > I think compared to these things, the set_page_dirty_lock() cost is
> > negligible and pretty much undetectable.
>
>
> A little bit of background here. We see in CPU flame graphs that the spin
> lock taken in lock_request() and unlock_request() takes about the same
> amount of CPU time as the memcpy. Interestingly, only on Intel, not on
> AMD CPUs. Note that we are running with our custom page pinning, which
> just takes the pages from an array, so iov_iter_get_pages2() is not used.
>
> The reason for that unlock/lock is documented at the end of
> Documentation/filesystems/fuse/fuse.rst as the "Kamikaze file system".
> Well, we don't have that, so for now these checks are modified in our
> branches to avoid the lock, although that is not upstreamable. The right
> solution here is to extract an array of pages and do that unlock/lock
> per pagevec.
>
> Next in the flame graph is set_page_dirty_lock(), which also
> takes as much CPU time as the memcpy. Again, on Intel CPUs only.
> In combination with the above pagevec method, I think the right solution
> is to iterate over the pages, store the last folio, and then mark it
> dirty once per folio.

Thanks for the background context. The Intel vs. AMD difference is
interesting. The approaches you mention sound reasonable. Are you able
to share the flame graph, or is this easily reproducible using fio on the
passthrough_hp server?


> Also, I disagree that the userspace buffers are not likely to be large
> folios; see commit
> 59ba47b6be9cd0146ef9a55c6e32e337e11e7625 ("fuse: Check for large folio
> with SPLICE_F_MOVE"). Horst in particular persistently runs into it when
> doing xfstests with recent kernels. I think the issue came up the first time

I think that's because xfstests uses /tmp for scratch space, so the

    "This is easily reproducible (on 6.19) with
    CONFIG_TRANSPARENT_HUGEPAGE_SHMEM_HUGE_ALWAYS=y
    CONFIG_TRANSPARENT_HUGEPAGE_TMPFS_HUGE_ALWAYS=y"

triggers it, but on production workloads I don't think it's likely that
those source pages are backed by shmem/tmpfs or already exist in the
page cache as a large folio, as the server has no control over that.
I also don't think most applications use splice, though maybe I'm
wrong here.

For non-splice, even if the user sets
"/sys/kernel/mm/transparent_hugepage/enabled" to 'always', or we
madvise the buffer allocation for huge pages in libfuse, that has a
2 MB granularity requirement. It also depends on the user's system
having explicitly raised the max pages limit through the sysctl, since
the kernel fuse max pages limit is 256 (1 MB) by default. I don't
think that is common on most servers.

Thanks,
Joanne

> with 3.18ish.
>
> One can further enforce that by setting
> "/sys/kernel/mm/transparent_hugepage/enabled" to 'always', which is
> what I did when I tested the above commit. And actually, that points
> out that libfuse allocations should do the madvise. I'm going to do
> that in the next few days, maybe tomorrow.
>
>
> Thanks,
> Bernd


Thread overview: 14+ messages
2026-03-16 15:16 [PATCH] fuse: when copying a folio delay the mark dirty until the end Horst Birthelmer
2026-03-16 17:29 ` Joanne Koong
2026-03-16 20:02   ` Horst Birthelmer
2026-03-16 22:06     ` Joanne Koong
2026-03-18 14:03       ` Horst Birthelmer
2026-03-18 21:19         ` Joanne Koong
2026-03-18 21:52           ` Bernd Schubert
2026-03-19  1:32             ` Joanne Koong [this message]
2026-03-19  4:27               ` Darrick J. Wong
2026-03-20 17:24                 ` Joanne Koong
2026-03-19  8:32               ` Horst Birthelmer
2026-03-20 17:18                 ` Joanne Koong
2026-03-26  6:35 ` kernel test robot
2026-03-26 15:05   ` [LTP] " Cyril Hrubis
