linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Joanne Koong <joannelkoong@gmail.com>
To: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: miklos@szeredi.hu, akpm@linux-foundation.org,
	 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	shakeel.butt@linux.dev,  david@redhat.com,
	bernd.schubert@fastmail.fm, ziy@nvidia.com,  jlayton@kernel.org,
	kernel-team@meta.com,  Miklos Szeredi <mszeredi@redhat.com>
Subject: Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
Date: Mon, 14 Apr 2025 13:24:24 -0700	[thread overview]
Message-ID: <CAJnrk1Y4TPoTrWPz-SDG9rFiH44w-uC_hfiENnVLFkDAeyctSA@mail.gmail.com> (raw)
In-Reply-To: <CAJnrk1a_YL-Dg4HeVWnmwUVH2tCN2MYu30kiV5KSv4mkezWOZg@mail.gmail.com>

On Thu, Apr 10, 2025 at 9:11 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >
> > On 4/10/25 11:07 PM, Joanne Koong wrote:
> > > On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > >>
> > >>
> > >>
> > >> On 4/10/25 7:47 AM, Joanne Koong wrote:
> > >>>   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > >>>>
> > >>>> Hi Joanne,
> > >>>>
> > >>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
> > >>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
> > >>>>> ("fuse: support writable mmap")), a temp page is allocated for every
> > >>>>> dirty page to be written back, the contents of the dirty page are copied over
> > >>>>> to the temp page, and the temp page gets handed to the server to write back.
> > >>>>>
> > >>>>> This is done so that writeback may be immediately cleared on the dirty page,
> > >>>>> and this in turn is done in order to mitigate the following deadlock scenario
> > >>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
> > >>>>> * single-threaded FUSE server is in the middle of handling a request
> > >>>>>   that needs a memory allocation
> > >>>>> * memory allocation triggers direct reclaim
> > >>>>> * direct reclaim waits on a folio under writeback
> > >>>>> * the FUSE server can't write back the folio since it's stuck in
> > >>>>>   direct reclaim
> > >>>>>
> > >>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> > >>>>> the situations described above, FUSE writeback does not need to use
> > >>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
> > >>>>>
> > >>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> > >>>>> and removes the temporary pages + extra copying and the internal rb
> > >>>>> tree.
> > >>>>>
> > >>>>> fio benchmarks --
> > >>>>> (using averages observed from 10 runs, throwing away outliers)
> > >>>>>
> > >>>>> Setup:
> > >>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> > >>>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> > >>>>>
> > >>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> > >>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> > >>>>>
> > >>>>>         bs =  1k          4k            1M
> > >>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> > >>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> > >>>>> % diff        -3%          23%         45%
> > >>>>>
> > >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > >>>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> > >>>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> > >>>>
> > >>>
> > >>> Hi Jingbo,
> > >>>
> > >>> Thanks for sharing your analysis for this.
> > >>>
> > >>>> Overall this patch LGTM.
> > >>>>
> > >>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
> > >>>> also unneeded then, at least the DIRECT IO routine (i.e.
> > >>>
> > >>> I took a look at fi->writectr and fi->queued_writes and my
> > >>> understanding is that we do still need this. For example, for
> > >>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
> > >>> prevent concurrent writeback or else the setattr request and the
> > >>> writeback request could race which would result in a mismatch between
> > >>> the file's reported size and the actual data written to disk.
> > >>
> > >> I haven't looked into the truncate routine yet.  I will see it later.
> > >>
> > >>>
> > >>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
> > >>>> because after removing the temp page, the DIRECT IO routine has already
> > >>>> been waiting for all inflight WRITE requests, see
> > >>>>
> > >>>> # DIRECT read
> > >>>> generic_file_read_iter
> > >>>>   kiocb_write_and_wait
> > >>>>     filemap_write_and_wait_range
> > >>>
> > >>> Where do you see generic_file_read_iter() getting called for direct io reads?
> > >>
> > >> # DIRECT read
> > >> fuse_file_read_iter
> > >>   fuse_cache_read_iter
> > >>     generic_file_read_iter
> > >>       kiocb_write_and_wait
> > >>        filemap_write_and_wait_range
> > >>       a_ops->direct_IO(),i.e. fuse_direct_IO()
> > >>
> > >
> > > Oh I see, I thought files opened with O_DIRECT automatically call the
> > > .direct_IO handler for reads/writes but you're right, it first goes
> > > through .read_iter / .write_iter handlers, and the .direct_IO handler
> > > only gets invoked through generic_file_read_iter() /
> > > generic_file_direct_write() in mm/filemap.c
> > >
> > > There's two paths for direct io in FUSE:
> > > a) fuse server sets fi->direct_io = true when a file is opened, which
> > > will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
> > > b) fuse server doesn't set fi->direct_io = true, but the client opens
> > > the file with O_DIRECT
> > >
> > > We only go through the stack trace you listed above for the b) case.
> > > For the a) case, we'll hit
> > >
> > >         if (ff->open_flags & FOPEN_DIRECT_IO)
> > >                 return fuse_direct_read_iter(iocb, to);
> > >
> > > and
> > >
> > >         if (ff->open_flags & FOPEN_DIRECT_IO)
> > >                 return fuse_direct_write_iter(iocb, from);
> > >
> > > which will invoke fuse_direct_IO() / fuse_direct_io() without going
> > > through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
> > > kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
> > > above.
> > >
> > > So for the a) case I think we'd still need the fuse_sync_writes() in
> > > case there's still pending writeback.
> > >
> > > Do you agree with this analysis or am I missing something here?
> >
> > Yeah, that's true.  But instead of calling fuse_sync_writes(), we can
> > call filemap_wait_range() or something similar here.
> >
>
> Agreed. Actually, the more I look at this, the more I think we can
> replace all fuse_sync_writes() and get rid of it entirely. Right now,
> fuse_sync_writes() is called in:

This is nontrivial so I'll leave this optimization to be done as a
separate future patchset so as to not slow this one down.

Thanks,
Joanne

>
> fuse_fsync():
>         /*
>          * Start writeback against all dirty pages of the inode, then
>          * wait for all outstanding writes, before sending the FSYNC
>          * request.
>          */
>         err = file_write_and_wait_range(file, start, end);
>         if (err)
>                 goto out;
>
>         fuse_sync_writes(inode);
>
>         /*
>          * Due to implementation of fuse writeback
>          * file_write_and_wait_range() does not catch errors.
>          * We have to do this directly after fuse_sync_writes()
>          */
>         err = file_check_and_advance_wb_err(file);
>         if (err)
>                 goto out;
>
>
>       We can get rid of the fuse_sync_writes() and
> file_check_and_advance_wb_err() entirely since now without temp pages,
> the file_write_and_wait_range() call actually ensures that writeback
> is completed
>
>
>
> fuse_writeback_range():
>         static int fuse_writeback_range(struct inode *inode, loff_t
> start, loff_t end)
>         {
>                 int err =
> filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
>
>                 if (!err)
>                         fuse_sync_writes(inode);
>
>                 return err;
>         }
>
>
>       We can replace fuse_writeback_range() entirely with
> filemap_write_and_wait_range().
>
>
>
> fuse_direct_io():
>         if (fopen_direct_io && fc->direct_io_allow_mmap) {
>                 res = filemap_write_and_wait_range(mapping, pos, pos +
> count - 1);
>                 if (res) {
>                         fuse_io_free(ia);
>                         return res;
>                 }
>         }
>         if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
> count - 1))) {
>                 if (!write)
>                         inode_lock(inode);
>                 fuse_sync_writes(inode);
>                 if (!write)
>                         inode_unlock(inode);
>         }
>
>
>        I think this can just replaced with
>                 if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
>                         res = filemap_write_and_wait_range(mapping,
> pos, pos + count - 1);
>                         if (res) {
>                                 fuse_io_free(ia);
>                                 return res;
>                         }
>                 }
>        since for the !fopen_direct_io case, it will already go through
> filemap_write_and_wait_range(), as you mentioned in your previous
> message. I think this also fixes a bug (?) in the original code - in
> the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
> still need to write out dirty pages first, which we don't currently
> do.
>
>
> What do you think?
>
> Thanks for all your careful review on this patchset throughout all of
> its iterations.
>
> >
> >
> > >> filp_close()
> > >>    filp_flush()
> > >>        filp->f_op->flush()
> > >>            fuse_flush()
> > >>              write_inode_now
> > >>                 writeback_single_inode(WB_SYNC_ALL)
> > >>                   do_writepages
> > >>                     # flush dirty page
> > >>                   filemap_fdatawait
> > >>                     # wait for WRITE completion
> > >
> > > Nice. I missed that write_inode_now() will invoke filemap_fdatawait().
> > > This seems pretty straightforward. I'll remove the fuse_sync_writes()
> > > call in fuse_flush() when I send out v8.
> > >
> > > The direct io one above is less straight-forward. I won't add that to
> > > v8 but that can be done in a separate future patch when we figure that
> > > out.
> >
> > Thanks for keep working on this. Appreciated.
> >
> > --
> > Thanks,
> > Jingbo

  reply	other threads:[~2025-04-14 20:24 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-04 18:14 [PATCH v7 0/3] fuse: remove temp page copies in writeback Joanne Koong
2025-04-04 18:14 ` [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
2025-04-04 19:13   ` David Hildenbrand
2025-04-04 20:09     ` Joanne Koong
2025-04-04 20:13       ` David Hildenbrand
2025-04-09 22:05         ` Shakeel Butt
2025-04-09 23:48           ` Joanne Koong
2025-04-04 18:14 ` [PATCH v7 2/3] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
2025-04-04 18:14 ` [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2025-04-09  2:43   ` Jingbo Xu
2025-04-09 23:47     ` Joanne Koong
2025-04-10  2:12       ` Jingbo Xu
2025-04-10 15:07         ` Joanne Koong
2025-04-10 15:11           ` Jingbo Xu
2025-04-10 16:11             ` Joanne Koong
2025-04-14 20:24               ` Joanne Koong [this message]
2025-04-15  7:49               ` Jingbo Xu
2025-04-15 15:59                 ` Joanne Koong
2025-04-16  1:40                   ` Jingbo Xu
2025-04-16 16:43                     ` Joanne Koong
2025-04-16 18:05                       ` Bernd Schubert
2025-04-14 16:21 ` [PATCH v7 0/3] fuse: remove temp page copies in writeback Jeff Layton
2025-04-14 20:28   ` Joanne Koong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJnrk1Y4TPoTrWPz-SDG9rFiH44w-uC_hfiENnVLFkDAeyctSA@mail.gmail.com \
    --to=joannelkoong@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=bernd.schubert@fastmail.fm \
    --cc=david@redhat.com \
    --cc=jefflexu@linux.alibaba.com \
    --cc=jlayton@kernel.org \
    --cc=kernel-team@meta.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=miklos@szeredi.hu \
    --cc=mszeredi@redhat.com \
    --cc=shakeel.butt@linux.dev \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).