linux-mm.kvack.org archive mirror
From: Joanne Koong <joannelkoong@gmail.com>
To: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: Shakeel Butt <shakeel.butt@linux.dev>,
	David Hildenbrand <david@redhat.com>, Zi Yan <ziy@nvidia.com>,
	 miklos@szeredi.hu, linux-fsdevel@vger.kernel.org,
	jefflexu@linux.alibaba.com,  josef@toxicpanda.com,
	linux-mm@kvack.org, kernel-team@meta.com,
	 Matthew Wilcox <willy@infradead.org>,
	Oscar Salvador <osalvador@suse.de>,
	 Michal Hocko <mhocko@kernel.org>
Subject: Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
Date: Mon, 30 Dec 2024 09:52:40 -0800
Message-ID: <CAJnrk1aoKB_uMqjtdM7omj2ZEJ08es3pfdkzku9PmQg8vx=9zQ@mail.gmail.com>
In-Reply-To: <934dc31b-e38a-4506-a2eb-59a67f544305@fastmail.fm>

On Fri, Dec 27, 2024 at 12:32 PM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
> On 12/27/24 21:08, Joanne Koong wrote:
> > On Thu, Dec 26, 2024 at 12:13 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >>
> >> On Tue, Dec 24, 2024 at 01:37:49PM +0100, David Hildenbrand wrote:
> >>> On 23.12.24 23:14, Shakeel Butt wrote:
> >>>> On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote:
> >>>> [...]
> >>>>>
> >>>>> Yes, so I can see fuse
> >>>>>
> >>>>> (1) Breaking memory reclaim (memory cannot get freed up)
> >>>>>
> >>>>> (2) Breaking page migration (memory cannot be migrated)
> >>>>>
> >>>>> Due to (1) we might experience bigger memory pressure in the system I guess.
> >>>>> A handful of these pages don't really hurt; I have no idea how bad having
> >>>>> many of these pages can be. But yes, inherently we cannot throw away the
> >>>>> data as long as it is dirty without causing harm. (Maybe we could move it to
> >>>>> some other cache, like swap/zswap, but that smells like a big and
> >>>>> complicated project.)
> >>>>>
> >>>>> Due to (2) we turn pages that are supposed to be movable into unmovable
> >>>>> pages, possibly for a long time. Even a *single* such page will mean that
> >>>>> CMA allocations / memory unplug can start failing.
> >>>>>
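A rough sketch of the migration-side behavior being debated here, modeled
on the writeback handling in migrate_folio_unmap();
mapping_writeback_indeterminate() is the helper this series proposes, and
the surrounding code is illustrative, not the exact patch:

static int writeback_check(struct folio *src, enum migrate_mode mode)
{
	if (!folio_test_writeback(src))
		return 0;
	/* Existing rule: only MIGRATE_SYNC waits for writeback. */
	if (mode != MIGRATE_SYNC)
		return -EBUSY;
	/* Proposed rule: never wait when writeback completion is under
	 * the fuse daemon's control and may never happen. */
	if (src->mapping && mapping_writeback_indeterminate(src->mapping))
		return -EBUSY;
	folio_wait_writeback(src);
	return 0;
}
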
> >>>>> We have similar situations with page pinning. With things like O_DIRECT, our
> >>>>> assumption/experience so far is that it will only take a couple of seconds
> >>>>> max, and retry loops are sufficient to handle it. That's why only long-term
> >>>>> pinning ("indeterminate", e.g., vfio) migrates these pages out of
> >>>>> ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.
> >>>>>
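On the GUP side, the rule David describes boils down to the FOLL_LONGTERM
flag; a kernel-internal sketch (the helper and flags are real, the wrapper
function is illustrative):

static int pin_one_long_term(unsigned long addr, struct page **page)
{
	/* With FOLL_LONGTERM, GUP first migrates the page out of
	 * ZONE_MOVABLE/MIGRATE_CMA before pinning, since a "short-term"
	 * promise from user space cannot be trusted. O_DIRECT-style
	 * short-term pins omit the flag and are handled by migration
	 * retry loops instead. */
	return pin_user_pages_fast(addr, 1, FOLL_WRITE | FOLL_LONGTERM, page);
}
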
> >>>>>
> >>>>> The biggest concern I have is that timeouts, while likely reasonable in many
> >>>>> scenarios, might not be desirable even for some sane workloads, and the
> >>>>> default on all systems will be "no timeout", leaving the clueless admin of
> >>>>> each and every system out there that might support fuse to make the decision.
> >>>>>
> >>>>> I might have misunderstood something, in which case I am very sorry, but we
> >>>>> also don't want CMA allocations to start failing simply because a network
> >>>>> connection is down for a couple of minutes such that a fuse daemon cannot
> >>>>> make progress.
> >>>>>
> >>>>
> >>>> I think you have valid concerns but these are not new and not unique to
> >>>> fuse. Any filesystem with a potential arbitrary stall can have similar
> >>>> issues. The arbitrary stall can be caused by network issues or some
> >>>> faulty local storage.
> >>>
> >>> What concerns me more is that this can be triggered even by unprivileged
> >>> user space, and that there is no default protection as far as I understood,
> >>> because timeouts cannot be set universally to sane defaults.
> >>>
> >>> Again, please correct me if I got that wrong.
> >>>
> >>
> >> Let's route this question to FUSE folks. More specifically: can an
> >> unprivileged process create a mount point backed by itself, create a
> >> lot of dirty (bound by cgroup) and writeback pages on it, and leave the
> >> writeback pages in that state forever?
> >>
> >>>
> >>> BTW, I just looked at NFS out of interest, in particular
> >>> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> >>> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> >>> whereby the TCP default one seems to be around 60s (* retrans?), and the
> >>> privileged user that mounts it can set higher ones. I guess one could run
> >>> into similar writeback issues?
> >>
> >> Yes, I think so.
> >>
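For reference, the NFS bailout David spotted has roughly this shape
(loosely modeled on the error path around nfs_page_async_flush();
illustrative only, not the actual NFS code):

static void writeback_bailout(struct writeback_control *wbc,
			      struct folio *folio)
{
	/* On a fatal error / exhausted retries: hand the data back to
	 * the dirty lists and stop blocking writeback waiters. */
	folio_redirty_for_writepage(wbc, folio);
	folio_end_writeback(folio);
}
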
> >>>
> >>> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> >>
> >> I feel like INDETERMINATE in the name is the main cause of confusion.
> >> So, let me explain why it is required (but later I will tell you how it
> >> can be avoided). The FUSE thread which is actively handling writeback of
> >> a given folio can cause a memory allocation, either through a syscall or a
> >> page fault. That memory allocation can trigger global reclaim synchronously,
> >> and in cgroup-v1 that FUSE thread can end up waiting on writeback of the
> >> same folio whose writeback it is supposed to end, causing a deadlock. So,
> >> AS_WRITEBACK_INDETERMINATE is used just to avoid this deadlock.
> >>
> >> The in-kernel filesystems avoid this situation through the use of GFP_NOFS
> >> allocations. A userspace fs can use a similar approach, namely
> >> prctl(PR_SET_IO_FLUSHER, 1), to avoid this situation. However, I have been
> >> told that it is hard to use: it is a per-thread flag and has to be set
> >> for all the threads handling writeback, which can be error prone if the
> >> threadpool is dynamic. Second, it is very coarse, in that all the
> >> allocations from those threads (e.g. page faults) become NOFS, which
> >> makes the userspace fs very unreliable on a highly utilized machine, as
> >> NOFS cannot reclaim potentially a lot of memory and cannot trigger oom-kill.
> >>
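A minimal userspace sketch of the PR_SET_IO_FLUSHER approach described
above (available since Linux 5.6, needs CAP_SYS_RESOURCE; the function
name is illustrative):

#include <stdio.h>
#include <sys/prctl.h>

/* Must be called in *every* thread that services writeback requests;
 * with a dynamic threadpool this is easy to miss, which is the
 * error-prone part mentioned above. */
static void mark_io_flusher(void)
{
	if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
		perror("prctl(PR_SET_IO_FLUSHER)");
}
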
> >>> Not
> >>> sure if I grasped all details about NFS and writeback and when it would
> >>> redirty+end writeback, and if there is some other handling in there.
> >>>
> >> [...]
> >>>>
> >>>> Please note that such filesystems are mostly used in environments like
> >>>> data centers or hyperscalers and usually have more advanced mechanisms to
> >>>> handle and avoid situations like long delays. For such environments,
> >>>> network unavailability is a larger issue than some CMA allocation
> >>>> failure. My point is: let's not assume the disastrous situation is normal
> >>>> and overcomplicate the solution.
> >>>
> >>> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used
> >>> for movable allocations.
> >>>
> >>> Mechanisms that possibly turn these folios unmovable for a
> >>> long/indeterminate time must either fail or migrate these folios out of
> >>> these regions, otherwise we start violating the very semantics for which
> >>> ZONE_MOVABLE/MIGRATE_CMA was added in the first place.
> >>>
> >>> Yes, there are corner cases where we cannot guarantee movability (e.g., OOM
> >>> when allocating a migration destination), but these are not cases that can
> >>> be triggered by (unprivileged) user space easily.
> >>>
> >>> That's why FOLL_LONGTERM pinning does exactly that: even if user space would
> >>> promise that this is really only "short-term", we will treat it as "possibly
> >>> forever", because it's under user-space control.
> >>>
> >>>
> >>> Instead of having more subsystems violate these semantics because
> >>> "performance" ... I would hope we would do better. Maybe it's an issue for
> >>> NFS as well ("at least" only for privileged user space)? In which case,
> >>> again, I would hope we would do better.
> >>>
> >>>
> >>> Anyhow, I'm hoping there will be more feedback from other MM folks, but
> >>> likely right now a lot of people are out (just like I should ;) ).
> >>>
> >>> If I end up being the only one with these concerns, then likely people can
> >>> feel free to ignore them. ;)
> >>
> >> I agree we should do better but IMHO it should be an iterative process.
> >> I think your concerns are valid, so let's push the discussion towards
> >> resolving those concerns. I think the concerns can be resolved by better
> >> handling of the lifetime of folios under writeback. The amount of such
> >> folios is already handled through the existing dirty throttling mechanism.
> >>
> >> We should start with a baseline, i.e. the distribution of the lifetime of
> >> folios under writeback for traditional storage devices (spinning disks and
> >> SSDs), as we don't want an unrealistic goal for ourselves. I think this data
> >> will drive the appropriate timeout values (if we decide a timeout-based
> >> approach is the right one).
> >>
> >> At the moment we have a timeout-based approach to limit the lifetime of
> >> folios under writeback. Any other ideas?
> >
> > I don't see any other approach that would handle splice, other than
> > modifying the splice code to prevent the underlying buf->page from
> > being migrated while it's being copied out, which seems non-viable to
> > consider. The other alternatives I see are to either a) do the extra
> > temp page copying for splice and "abort" the writeback if migration is
> > triggered or b) gate this to only apply to servers running as
> > privileged. I assume the majority of use cases do use splice, in which
> > case a) would be pointless and would make the internal logic more
> > complicated (e.g. we would still need the rb tree and would now need to
> > check writeback against the folio writeback state or the rb tree,
> > etc). I'm not sure how useful this would be either if this is just
> > gated to privileged servers.
>
>
> I'm not so sure that the majority of unprivileged servers actually use splice.
> Try this patch and then run an unprivileged process.
>
> diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
> index ee0b3b1d0470..adebfbc03d4c 100644
> --- a/lib/fuse_lowlevel.c
> +++ b/lib/fuse_lowlevel.c
> @@ -3588,6 +3588,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
>                         res = fcntl(llp->pipe[0], F_SETPIPE_SZ, bufsize);
>                         if (res == -1) {
>                                 llp->can_grow = 0;
> +                               fuse_log(FUSE_LOG_ERR, "cannot grow pipe\n");
>                                 res = grow_pipe_to_max(llp->pipe[0]);
>                                 if (res > 0)
>                                         llp->size = res;
> @@ -3678,6 +3679,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
>
>         } else {
>                 /* Don't overwrite buf->mem, as that would cause a leak */
> +               fuse_log(FUSE_LOG_WARNING, "Using splice\n");
>                 buf->fd = tmpbuf.fd;
>                 buf->flags = tmpbuf.flags;
>         }
> @@ -3687,6 +3689,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
>
>  fallback:
>  #endif
> +       fuse_log(FUSE_LOG_WARNING, "Splice fallback\n");
>         if (!buf->mem) {
>                 buf->mem = buf_alloc(se->bufsize, internal);
>                 if (!buf->mem) {
>
>
> And then run this again after
> sudo sysctl -w fs.pipe-max-size=1052672
>
> (Please don't change '/proc/sys/fs/fuse/max_pages_limit'
> from default).
>
> And now we would need to know how many users either limit
> max-pages + header to fit the default pipe-max-size (1MB) or
> increase max_pages_limit. Given there is no warning in
> libfuse about the fallback from splice to buffer copy, I doubt
> many people know about it - who would change system
> defaults without that knowledge?
>
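To see where that sysctl matters, here is a minimal standalone demo of the
F_SETPIPE_SZ step from the patched code above (a sketch; 1052672 is the
value from the sysctl example, not a libfuse constant):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int p[2];
	if (pipe(p))
		return 1;
	/* Fails with EPERM for unprivileged callers once the request
	 * exceeds fs.pipe-max-size (default 1048576 bytes). */
	int got = fcntl(p[0], F_SETPIPE_SZ, 1052672);
	if (got == -1)
		perror("F_SETPIPE_SZ");
	else
		printf("pipe grown to %d bytes\n", got);
	return 0;
}
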

My concern is that this would break backwards compatibility for the
rare subset of users who use their own custom library instead of
libfuse, who expect splice to work as-is and might not have this
in-built fallback to buffer copies.


Thanks,
Joanne

>
> And then, I still doubt that copy-to-tmp-page-and-splice
> is any faster than no-tmp-page-copy-but-copy-to-lib-fuse-buffer.
> Especially as the tmp page copy is single-threaded, I think.
> But it needs to be benchmarked.
>
>
> Thanks,
> Bernd
>
>
>


Thread overview: 124+ messages
2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
2024-11-22 23:23 ` [PATCH v6 1/5] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
2024-11-22 23:23 ` [PATCH v6 2/5] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
2024-11-22 23:23 ` [PATCH v6 3/5] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_INDETERMINATE mappings Joanne Koong
2024-11-22 23:23 ` [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with " Joanne Koong
2024-12-19 13:05   ` David Hildenbrand
2024-12-19 14:19     ` Zi Yan
2024-12-19 15:08       ` Zi Yan
2024-12-19 15:39         ` David Hildenbrand
2024-12-19 15:47           ` Zi Yan
2024-12-19 15:50             ` David Hildenbrand
2024-12-19 15:43     ` Shakeel Butt
2024-12-19 15:47       ` David Hildenbrand
2024-12-19 15:53         ` Shakeel Butt
2024-12-19 15:55           ` Zi Yan
2024-12-19 15:56             ` Bernd Schubert
2024-12-19 16:00               ` Zi Yan
2024-12-19 16:02                 ` Zi Yan
2024-12-19 16:09                   ` Bernd Schubert
2024-12-19 16:14                     ` Zi Yan
2024-12-19 16:26                       ` Shakeel Butt
2024-12-19 16:31                         ` David Hildenbrand
2024-12-19 16:53                           ` Shakeel Butt
2024-12-19 16:22             ` Shakeel Butt
2024-12-19 16:29               ` David Hildenbrand
2024-12-19 16:40                 ` Shakeel Butt
2024-12-19 16:41                   ` David Hildenbrand
2024-12-19 17:14                     ` Shakeel Butt
2024-12-19 17:26                       ` David Hildenbrand
2024-12-19 17:30                         ` Bernd Schubert
2024-12-19 17:37                           ` Shakeel Butt
2024-12-19 17:40                             ` Bernd Schubert
2024-12-19 17:44                             ` Joanne Koong
2024-12-19 17:54                               ` Shakeel Butt
2024-12-20 11:44                                 ` David Hildenbrand
2024-12-20 12:15                                   ` Bernd Schubert
2024-12-20 14:49                                     ` David Hildenbrand
2024-12-20 15:26                                       ` Bernd Schubert
2024-12-20 18:01                                       ` Shakeel Butt
2024-12-21  2:28                                         ` Jingbo Xu
2024-12-21 16:23                                           ` David Hildenbrand
2024-12-22  2:47                                             ` Jingbo Xu
2024-12-24 11:32                                               ` David Hildenbrand
2024-12-21 16:18                                         ` David Hildenbrand
2024-12-23 22:14                                           ` Shakeel Butt
2024-12-24 12:37                                             ` David Hildenbrand
2024-12-26 15:11                                               ` Zi Yan
2024-12-26 20:13                                               ` Shakeel Butt
2024-12-26 22:02                                                 ` Bernd Schubert
2024-12-27 20:08                                                 ` Joanne Koong
2024-12-27 20:32                                                   ` Bernd Schubert
2024-12-30 17:52                                                     ` Joanne Koong [this message]
2024-12-30 10:16                                                 ` David Hildenbrand
2024-12-30 18:38                                                   ` Joanne Koong
2024-12-30 19:52                                                     ` David Hildenbrand
2024-12-30 20:11                                                       ` Shakeel Butt
2025-01-02 18:54                                                         ` Joanne Koong
2025-01-03 20:31                                                           ` David Hildenbrand
2025-01-06 10:19                                                             ` Miklos Szeredi
2025-01-06 18:17                                                               ` Shakeel Butt
2025-01-07  8:34                                                                 ` David Hildenbrand
2025-01-07 18:07                                                                   ` Shakeel Butt
2025-01-09 11:22                                                                     ` David Hildenbrand
2025-01-10 20:28                                                                       ` Jeff Layton
2025-01-10 21:13                                                                         ` David Hildenbrand
2025-01-10 22:00                                                                           ` Shakeel Butt
2025-01-13 15:27                                                                             ` David Hildenbrand
2025-01-13 21:44                                                                               ` Jeff Layton
2025-01-14  8:38                                                                                 ` Miklos Szeredi
2025-01-14  9:40                                                                                   ` Miklos Szeredi
2025-01-14  9:55                                                                                     ` Bernd Schubert
2025-01-14 10:07                                                                                       ` Miklos Szeredi
2025-01-14 18:07                                                                                         ` Joanne Koong
2025-01-14 18:58                                                                                           ` Miklos Szeredi
2025-01-14 19:12                                                                                             ` Joanne Koong
2025-01-14 20:00                                                                                               ` Miklos Szeredi
2025-01-14 20:29                                                                                               ` Jeff Layton
2025-01-14 21:40                                                                                                 ` Bernd Schubert
2025-01-23 16:06                                                                                                   ` Pavel Begunkov
2025-01-14 20:51                                                                                         ` Joanne Koong
2025-01-24 12:25                                                                                           ` David Hildenbrand
2025-01-14 15:49                                                                                     ` Jeff Layton
2025-01-24 12:29                                                                                       ` David Hildenbrand
2025-01-28 10:16                                                                                         ` Miklos Szeredi
2025-01-14 15:44                                                                                   ` Jeff Layton
2025-01-14 18:58                                                                                     ` Joanne Koong
2025-01-10 23:11                                                                           ` Jeff Layton
2025-01-10 20:16                                                                   ` Jeff Layton
2025-01-10 20:20                                                                     ` David Hildenbrand
2025-01-10 20:43                                                                       ` Jeff Layton
2025-01-10 21:00                                                                         ` David Hildenbrand
2025-01-10 21:07                                                                           ` Jeff Layton
2025-01-10 21:21                                                                             ` David Hildenbrand
2025-01-07 16:15                                                                 ` Miklos Szeredi
2025-01-08  1:40                                                                   ` Jingbo Xu
2024-12-30 20:04                                                     ` Shakeel Butt
2025-01-02 19:59                                                       ` Joanne Koong
2025-01-02 20:26                                                         ` Zi Yan
2024-12-20 21:01                                       ` Joanne Koong
2024-12-21 16:25                                         ` David Hildenbrand
2024-12-21 21:59                                           ` Bernd Schubert
2024-12-23 19:00                                             ` Joanne Koong
2024-12-26 22:44                                               ` Bernd Schubert
2024-12-27 18:25                                                 ` Joanne Koong
2024-12-19 17:55                         ` Joanne Koong
2024-12-19 18:04                           ` Bernd Schubert
2024-12-19 18:11                             ` Shakeel Butt
2024-12-20  7:55                     ` Jingbo Xu
2025-04-02 21:34     ` Joanne Koong
2025-04-03  3:31       ` Jingbo Xu
2025-04-03  9:18         ` David Hildenbrand
2025-04-03  9:25           ` Bernd Schubert
2025-04-03  9:35             ` Christian Brauner
2025-04-03 19:09           ` Joanne Koong
2025-04-03 20:44             ` David Hildenbrand
2025-04-03 22:04               ` Joanne Koong
2024-11-22 23:23 ` [PATCH v6 5/5] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2024-11-25  9:46   ` Jingbo Xu
2024-12-12 21:55 ` [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
2024-12-13 11:52 ` Miklos Szeredi
2024-12-13 16:47   ` Shakeel Butt
2024-12-18 17:37     ` Joanne Koong
2024-12-18 17:44       ` Shakeel Butt
2024-12-18 17:53         ` Joanne Koong
